Sum of Squares Notes:
\[ S_{xx} = \sum^n_{i=1}(x_i-\bar{x})^2,\quad s_{x}^2 = \frac{S_{xx}}{n-1}\] \[ S_{yy} = \sum^n_{i=1}(y_i-\bar{y})^2,\quad s_{y}^2 = \frac{S_{yy}}{n-1}\] \[ S_{xy} = \sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y}),\quad s_{xy} = \frac{S_{xy}}{n-1}\]
Simple Linear Regression
- Simple linear regression is a basic statistical learning model that assumes a linear relationship between a single independent variable and the dependent variable.
\[ Y_i = \beta_0 + \beta_1x_i +\epsilon_i\]
- In order to estimate this model, we take a parametric approach where we estimate each beta. This is done by minimising the least squares criterion, also known as the residual sum of squares (RSS), i.e. minimising the residuals, the vertical distances between the fitted line and the data points. \[ \min_{\beta_0,\beta_1}\left[\sum^n_{i=1}\left(y_i-(\hat{\beta}_0+\hat{\beta}_1x_i)\right)^2\right] \]
- This results in the least square estimates: \[\hat{\beta}_1 = \frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2} \] \[\hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]
Simple Linear Regression: Assumptions and Inference
Under the weak assumptions: \[ \mathbb{E}(\epsilon_i|X=\textbf{x})=0,\quad\mathbb{V}(\epsilon_i|X=\textbf{x})=\sigma^2,\quad\mathbb{C}ov(\epsilon_i,\epsilon_j|X=\textbf{x})=0\]
- We have the unbiased estimators: \[ \mathbb{E}[\hat{\beta}_0|X=\textbf{x}]=\beta_0,\quad\mathbb{E}[\hat{\beta}_1|X=\textbf{x}]=\beta_1\]
- Also noting the unbiased estimator of \(\sigma^2\): \[ s^2 = \frac{\sum^n_{i=1}\hat{\epsilon}_i^2}{n-2}=\frac{RSS}{n-2}=RSE^2\]
- Variance/Covariance of parameters: \[ \mathbb{V}(\hat{\beta}_0|X=\textbf{x})=\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}\right),\quad\mathbb{V}(\hat{\beta}_1|X=\textbf{x})=\frac{\sigma^2}{\sum^n_{i=1}(x_i-\bar{x})^2}\] \[\mathbb{C}ov(\hat{\beta}_0,\hat{\beta}_1|X=\textbf{x})=-\frac{\bar{x}\sigma^2}{\sum^n_{i=1}(x_i-\bar{x})^2}\]
Under the strong assumption: \[ \epsilon|X=\textbf{x}\sim^{iid}N(0,\sigma^2)\]
We can find the MLE estimates using the likelihood of \(\epsilon\). The estimates of \(\beta_0\) and \(\beta_1\) are the same as with least squares, while the MLE of \(\sigma^2\) divides the RSS by \(n\) rather than \(n-2\) \[ \ell(\mathbf{y},\beta_0,\beta_1,\sigma) = -n\log(\sqrt{2\pi}\sigma)-\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i-(\beta_0+\beta_1x_i))^2\]
Test statistic for each beta: \[ \frac{\hat{\beta}_i-\tilde{\beta}_i}{\hat{SE}(\hat{\beta}_i)} \sim t_{n-2}\]
We often test whether the variable is significant, which is equivalent to the hypothesis test: \[ H_0: \beta_i=0,\quad H_1:\beta_i\neq 0\]
Here are the relevant rejection regions:
Rejection Rules
Partitioning the variability
- We can assess how well a linear model explains the trend in the data by partitioning the sources of variability.
SST is the total variability under the simplest possible model, which just averages the data and makes no use of \(X\)
SSE is the total variability remaining after introducing the effects of \(X\)
SSM is the total variability explained because of the knowledge of \(X\) \[ \underbrace{\sum^n_{i=1}(y_i-\bar{y})^2}_{SST}=\underbrace{\sum^n_{i=1}(y_i-\hat{y}_i)^2}_{SSE}+\underbrace{\sum^n_{i=1}(\hat{y}_i-\bar{y})^2}_{SSM} \]
The ANOVA table for the linear regression is as follows:
ANOVA Table
- We can compute the ANOVA table for a model in R with the anova() function, which can also be used to compare 2 different models, as sketched below:
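A sketch of calls that could produce output like the two tables below; the object names fit.full and fit.simple are assumptions:
fit.full = lm(Sales ~ TV + Radio + Newspaper, data = advertising)
fit.simple = lm(Sales ~ TV, data = advertising)
anova(fit.full)                 # sequential ANOVA table for a single model
anova(fit.simple, fit.full)     # partial F-test comparing the two nested models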
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## TV 1 4512.4 4512.4 1634.2115 <2e-16 ***
## Radio 1 502.3 502.3 181.9255 <2e-16 ***
## Newspaper 1 0.0 0.0 0.0034 0.9538
## Residuals 196 541.2 2.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: Sales ~ TV
## Model 2: Sales ~ TV + Radio + Newspaper
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 1043.5
## 2 196 541.2 2 502.35 90.964 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- A measure of model fit is \(R^2\), the proportion of the total variation in the \(y_i\)’s that is explained by the variable \(x\) in a linear regression model: \[ R^2 = \left(\frac{S_{xy}}{\sqrt{S_{xx}\cdot S_{yy}}}\right)^2=\frac{SSM}{SST}=1-\frac{SSE}{SST}\]
F Test
\[ F = \frac{SSM/1}{RSS/(n-2)}\sim F_{1,n-2}\]
Mean Response
- When predicting at an out-of-sample value of the regressor, \(x_0\), we want to predict the response \(Y\) associated with it. The mean of this response is:
\[ \mathbb{E}[Y|x_0] = \mathbb{E}[\beta_0+\beta_1x|x=x_0] = \beta_0+\beta_1x_0\]
- Resulting in the unbiased estimator for the mean being the fitted value of \(y_0\)
\[ \hat{y}_0=\hat{\beta}_0+\hat{\beta}_1x_0\]
- With variance:
\[ \mathbb{V}(\hat{y}_0) = \left(\frac{1}{n}+\frac{(\bar{x}-x_0)^2}{S_{xx}}\right)\sigma^2 = SE(\hat{y}_0)^2 \]
- Under the strong assumptions we have:
\[ \hat{y}_0\sim N\left(\beta_0+\beta_1x_0,\ SE(\hat{y}_0)^2\right) \]
\[ \frac{\hat{y}_0-(\beta_0+\beta_1x_0)}{SE(\hat{y}_0)} \sim t_{n-2}\]
- In R we can predict the mean response with a confidence interval:
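A sketch of a call that could produce output like the line below; the fitted object regressorsimple and the new value TV = 200 are assumptions:
predict(regressorsimple, newdata = data.frame(TV = 200), interval = 'confidence')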
## fit lwr upr
## 1 18.06778 17.69139 18.44416
Individual Response
- A prediction interval is an interval for the actual value of a \(Y_i\) rather than its mean. We consider the error in the prediction:
\[ \mathbb{E}[Y_i-\hat{y}_i|\mathbf{X}=\mathbf{x}, X=x_i] = 0 \] \[ \mathbb{V}(Y_i-\hat{y}_i|\mathbf{X}=\mathbf{x}, X=x_i) = \sigma^2\left(1+\frac{1}{n}+\frac{(\bar{x}-x_i)^2}{S_{xx}}\right) \]
- Where we can utilise the following results for inference of the prediction interval:
\[ (Y_i-\hat{y}_i|\mathbf{X}=\mathbf{x}, X=x_i) \sim N\left(0, \sigma^2\left(1+\frac{1}{n}+\frac{(\bar{x}-x_i)^2}{S_{xx}}\right)\right)\] \[ \frac{Y_i-\hat{y}_i}{s\sqrt{1+\frac{1}{n}+\frac{(\bar{x}-x_i)^2}{S_{xx}}}}\sim t_{n-2}\]
- In R we can predict an individual response with a prediction interval:
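A sketch of the corresponding call; the same assumed object and new value as above, now with a prediction interval:
predict(regressorsimple, newdata = data.frame(TV = 200), interval = 'prediction')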
## fit lwr upr
## 1 18.06778 13.52491 22.61065
Plotting with CI and PI
library(ggplot2)   # for ggplot(), geom_point(), stat_smooth(), etc.
pred.int = predict(regressorsimple, interval = 'prediction')   # fit, lwr, upr for each observation
(ggplot(advertising, aes(TV, Sales))+
  geom_point()+
  stat_smooth(method = lm)+                                            # fitted line with confidence band
  geom_line(aes(y = pred.int[,2]), color = 'red', linetype = 'dashed')+  # lower prediction bound
  geom_line(aes(y = pred.int[,3]), color = 'red', linetype = 'dashed')+  # upper prediction bound
  theme_classic())
Multiple Linear Regression
Many of the same ideas extend to the multi-dimensional case, i.e. linear regression with \(p>1\) predictors, where each \(\beta_j\) is the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed.
\[ Y=\beta_0 +\beta_1X_1 + ... + \beta_pX_p + \epsilon \]
- We often write it in the following matrix notation form, where \(\mathbf{X}\) is of size \((n,p+1)\):
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\]
To fit the model in R we use lm(), as sketched below:
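A minimal sketch, assuming the data frame is named advertising; the object name regressor is also an assumption (the formula matches the summary output shown further below):
regressor = lm(Sales ~ ., data = advertising)   # regress Sales on all remaining columns (TV, Radio, Newspaper)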
- We can also see what is stored in the fitted object using the names() command:
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
- Weak assumptions are similar to simple:
\[\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}\] \[ \mathbb{C}ov(\boldsymbol{\epsilon})=\sigma^2\mathbf{I}\]
- Strong assumptions are similar to simple:
\[ \epsilon_i|\mathbf{X}=\mathbf{x}\sim^{i.i.d} N(0,\sigma^2) \]
Least Squares Estimates:
\[ RSS = \sum^n_{i=1}\hat{\epsilon}^2_i = ||\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}}||^2\] \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
In R these can be seen in the summary() output or by accessing the fitted object directly:
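For example (assuming the fitted object regressor from above):
regressor$coefficients    # least squares estimates; equivalently coef(regressor)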
## (Intercept) TV Radio Newspaper
## 4.6251240788 0.0544457803 0.1070012282 0.0003356579
With corresponding prediction vector:
\[ \hat{\mathbf{y}}=\mathbf{X}\hat{\boldsymbol{\beta}}\]
Properties of LSE are similar to simple:
\[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\] \[ \mathbb{V}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\]
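A sketch of extracting the estimated standard errors (the square roots of the diagonal of the estimated covariance matrix), assuming the fitted object regressor:
sqrt(diag(vcov(regressor)))   # estimated standard errors of the coefficient estimates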
## (Intercept) TV Radio Newspaper
## 0.307501165 0.001375188 0.008489563 0.005788056
With the unbiased estimator of \(\sigma^2\)
\[ s^2 = \frac{||\mathbf{y}-\hat{\mathbf{y}}||^2}{n-p-1} \]
## [1] 0.1186925
- With the following result used for inference on \(\beta_j\):
\[ \frac{\hat{\beta}_j-\beta_j}{SE(\hat{\beta}_j)}\sim t_{n-p-1}\]
F Test tests the hypothesis:
\[ H_0 : \beta_1 = ... = \beta_p = 0\] \[ H_a: \text{ at least one } \beta_j \text{ is not 0}\] \[ F = \frac{(SST-SSE)/p}{SSE/(n-p-1)}\sim F_{p,n-p-1}\]
- In the case where we want to test a subset of \(q\) predictors, we fit a second model that uses all the variables except those \(q\) predictors, with corresponding residual sum of squares \(RSS_0\)
\[ H_0: \beta_{p-q+1} = ... = \beta_p = 0\] \[ F = \frac{(RSS_0-RSS)/q}{RSS/(n-p-1)}\sim F_{q,n-p-1}\]
To see the results of many of these hypothesis tests, as well as other statistics, we can use the summary() command:
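For example (assuming the fitted object regressor from above):
summary(regressor)   # coefficient table, RSE, R-squared and overall F-test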
##
## Call:
## lm(formula = Sales ~ ., data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3034 -0.8244 -0.0008 0.8976 3.7473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6251241 0.3075012 15.041 <2e-16 ***
## TV 0.0544458 0.0013752 39.592 <2e-16 ***
## Radio 0.1070012 0.0084896 12.604 <2e-16 ***
## Newspaper 0.0003357 0.0057881 0.058 0.954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.662 on 196 degrees of freedom
## Multiple R-squared: 0.9026, Adjusted R-squared: 0.9011
## F-statistic: 605.4 on 3 and 196 DF, p-value: < 2.2e-16
Possible Extensions/Considerations to the Linear Model:
Qualitative Predictors
- Qualitative predictors are factors with \(k\) unordered levels. To handle them we create \(k-1\) dummy variables that act as binary indicators.
- A simple way to write the factor model (with 2 different factors) is \(y_{ijk}=\mu+\alpha_i+\beta_j+\epsilon_{ijk}\), with:
- \(\mu\) as the grand mean
- \(\alpha_i\) is the effect for the ith factor 1 group
- \(\beta_j\) is the effect for the jth factor 2 group
- We can look at the 2 factor example with the planes data set, with 2 factors (Paper and Plane) and response Distance.
planes = read.table('planes.txt', header = T)[c(1,2,4)]
planes$Paper = as.factor(planes$Paper)
planes$Plane = as.factor(planes$Plane)
- We can look at the different levels of the two factors and a plot of the global mean:
- To fit a model that consists of only factors, the function aov() is preferred:
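A sketch of a call that could produce the table below; the object name planes.aov is an assumption:
planes.aov = aov(Distance ~ Paper + Plane, data = planes)   # main effects of the two factors
summary(planes.aov)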
## Df Sum Sq Mean Sq F value Pr(>F)
## Paper 1 1718721 1718721 0.678 0.425
## Plane 1 385641 385641 0.152 0.703
## Residuals 13 32947925 2534456
- A simple model for an interaction between two factors:
\[ y_{ijk}=\mu+\alpha_i+\beta_j+\underbrace{(\alpha\beta)_{ij}}_{\text{Interaction}}+\epsilon_{ijk}\]
- In R, interaction effects are represented by including products of the dummy variables for each of the factors as additional predictors
- We can explore interactions between factors in an interaction plot:
  - Parallel lines are indicative of no interaction
- To fit a model with interaction terms we use the : operator
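A sketch of a call that could produce the table below; the object name planes.int is an assumption:
planes.int = aov(Distance ~ Plane:Paper, data = planes)   # interaction term only
summary(planes.int)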
## Df Sum Sq Mean Sq F value Pr(>F)
## Plane:Paper 3 25491258 8497086 10.66 0.00106 **
## Residuals 12 9561029 796752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Removing the Additive Assumption
One of the most important assumptions in linear regression is that the relationship between the predictors and the response is additive and linear.
- The additive assumption means that the effect of changes in a predictor \(X_j\) on the response \(Y\) is independent of other predictors.
One way to relax this assumption is to allow for interaction between predictors. For example:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon\]
- Which can be interpreted as the following: \[ Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1+\beta_2X_2+\epsilon\]
If the p-value for the interaction term is significant, this indicates that the true relationship is likely not additive.
The hierarchical principle states that if we include an interaction term, we should also include the main effects of the predictors involved, even if they have low significance.
Possible Issues with a Linear Model:
- A lot of these issues can be found through inspection of the diagnostic plots:
Non Linearity of response-predictor relationship
Linear regression assumes there is a straight-line relationship between the \(\mathbf{X}\) and \(\mathbf{y}\).
Residual plots are useful for identifying non-linearity.
- For a simple linear model we can plot the residuals \(y_i-\hat{y}_i\) against the predictor \(x_i\)
- For higher dimension models the residuals \(y_i-\hat{y}_i\) are plotted against the fitted values \(\hat{y}_i\)
If these plots exhibit a pattern, it is a strong indication of non-linearity in the data. In the following, a quadratic fit exhibits lower residuals, indicating the fit is closer to quadratic than linear.
Residual Plot
- If there is indication of a non-linear association, a simple approach is to use a non-linear transformation of the predictors, such as \(\log(X)\).
One other alternative to get further information is to see if the predictor/response relationship of each variable is linear, using the pairs() function.
Correlation of Error Terms
- An important assumption of the linear regression model is that the error terms are uncorrelated, i.e \(\mathbb{C}ov(\epsilon_i,\epsilon_j|X=\textbf{x})=0\).
- Correlation of errors can occur as a result of correlated data points, or in a time series (or any experiment where data are collected sequentially) where the timing of the experiment influences the error. In this case there would be noticeable tracking in the residuals over time:
- Correlation between error terms is especially prevalent in time series data
Tracking residuals
- Correlated error terms result in lower estimated standard errors and therefore overconfidence in the model.
Non-constant Variance (heteroscedasticity) of error terms
- Another important assumption of the linear regression model is the constant variance of error terms, i.e \(\mathbb{V}(\epsilon_i)=\sigma^2\)
- We can identify non-constant variance in the errors in residual plots:
- A common problem is the magnitude of residuals increasing with fitted values, resulting in the funnel shape seen in the residual plot.
- One possible solution is to transform the response using a concave function, such as \(\log(Y)\) or \(\sqrt{Y}\), in order to achieve a greater amount of shrinkage of the larger responses.
Decreasing variance
Outliers
- An outlier is a point for which \(y_i\) is far from the value predicted by the model.
- Outliers can be found through inspection of the residual plot:
Outlier
A commonly used measure of leverage is the diagonal of the hat matrix, \(h_{ii}\), which also appears in the standardisation formulas below:
\[ \mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\]
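In R the leverages \(h_{ii}\) (the diagonal of \(\mathbf{H}\)) can be obtained with hatvalues(); a sketch, assuming the fitted object is named regressor:
head(hatvalues(regressor), 5)   # leverage of the first five observations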
## 1 2 3 4 5
## 0.02520285 0.01941823 0.03922616 0.01660967 0.02350883
In the case of simple linear regression:
\[ h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum^n_{i'=1}(x_{i'}-\bar{x})^2}\]
We can utilise different types of residuals to determine what is truly an ‘outlier’:
Raw residuals
\[ e = y - X\hat{\beta} \]
## 1 2 3 4 5
## 0.8790279 -0.8682477 1.5037910 -0.8124465 2.2558631
Standardised residuals are expected to fall in the range \((-3,3)\)
\[ r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}} \]
## 1 2 3 4 5
## 0.5357895 -0.5276554 0.9232633 -0.4930381 1.3738118
Studentised residuals can be used for formal testing for outliers as we know the distribution
\[ t_i = \frac{e_i}{\hat{\sigma}_{-i}\sqrt{1-h_{ii}}}\sim t_{n-p-2}\]
## 1 2 3 4 5
## 0.5348127 -0.5266819 0.9229141 -0.4920840 1.3769483
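All three types of residuals can be computed in R; a sketch, assuming the fitted lm object is named regressor:
head(residuals(regressor), 5)   # raw residuals e_i
head(rstandard(regressor), 5)   # standardised residuals r_i
head(rstudent(regressor), 5)    # studentised residuals t_i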
- Outliers don’t tend to have a major impact on the least squares fit itself, but they do inflate the RSE, resulting in larger standard errors (and therefore wider CIs) and a lower \(R^2\), which damages the interpretability of the fit.
High leverage points
- High leverage points are datapoints with an unusual value for \(x_i\)
- High leverage points can be inspected through leverage plots that utilise statistics such as Cook’s distance:
- The rule of thumb is to further investigate if \(D_i\) is greater than \(F_{0.5;p,n-p}\)
- In R we can find Cook’s distance:
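A sketch of the call, assuming the fitted object is named regressor:
head(cooks.distance(regressor), 5)   # Cook's distance for the first five observations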
## 1 2 3 4 5
## 0.001855512 0.001378372 0.008700530 0.001026446 0.011359448
In the plots we can see that in figures 1 and 3 observation 41 has extreme leverage and looks visibly off in the fitted plot. Figure 2 shows a different scenario in the case of two predictors, where the combination of values is unusual even though each predictor on its own is unremarkable:
Leverage Diagnostic
- A rule of thumb is that a datapoint with leverage \(h_{ii}>2(p+1)/n\) is considered a high leverage point, since \((p+1)/n\) is always the average leverage of a datapoint.
High leverage points impact the least squares fit heavily and may invalidate the entire fit.
Collinearity
- Collinearity refers to the situation where two or more predictor variables are closely related to one another.
- Between two variables, we can compute the correlation or plot the predictors against each other to determine the collinearity, often with the pairs() command. Here is a single example:
Not collinear v Collinear
- It is possible for collinearity to exist between three or more variables even if no pair exhibits high collinearity. This is referred to as multicollinearity.
- We therefore compute the VIF (variance inflation factor), the ratio of the variance of \(\hat{\beta}_j\) when fitting the full model divided by its variance when \(X_j\) is fitted on its own: \[ VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}} \]
- \(R^2_{X_j|X_{-j}}\) is from the regression of \(X_j\) onto all other predictors
- The smallest possible value is 1
- Values above 5 often indicates a problematic amount of collinearity.
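A sketch of the commands that could produce the VIF output below; vif() comes from the car package and regressor is the assumed fitted object:
library(car)     # provides vif()
vif(regressor)   # variance inflation factor for each predictor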
## TV Radio Newspaper
## 1.004611 1.144952 1.145187
- If collinearity is present, there are a few methods to handle it:
- Drop one of the problematic variables from the regression
- Combine the collinear variables into a single predictor, for example by taking a standardised average of the two predictors
- Collinearity/multicollinearity can pose problems as it can be difficult to separate out the individual effects of collinear variables on the response. It can also reduce the accuracy of the estimates of the regression coefficients, resulting in an increased standard error for \(\beta_j\)
- The power of the hypothesis test, i.e. the probability of correctly detecting a non-zero coefficient, also suffers, as the p-value may be inflated for a collinear predictor.
Confounding Effects
- Another consideration for models is when a variable \(C\), which may be unmeasurable, influences \(X\) and \(Y\) but \(X\) does not influence \(Y\) directly.
- An example of this is \(X\) = age and \(Y\) = probability of an accident. Being older does not itself make you a better driver, but age tends to come with experience, which does influence \(Y\). As experience cannot be measured, \(X\) is used as a proxy for it, even though it isn’t age itself directly influencing \(Y\).
- In this situation if \(C\) is measurable you can add it to the model, but it is often unobservable. In these cases you need to alter the interpretation of the coefficients for the indirect effect.
KNN Regression
- KNN regression is an example of non-parametric regression, where no assumptions about the form of \(f(X)\) are made, allowing for great flexibility in fitting the model. However, these methods lose interpretability as a result.
- Parametric approaches will perform better if the parametric form is close to the true form of \(f(X)\). Consider the following example where both a linear regression and a KNN regression are fit to a non-linear \(f(X)\), with the MSE plotted as a function of \(1/K\):
KNN v Linear
- For KNN we specify a value \(K\) for the number of nearest neighbours (from the training set) to consider. Given a new observation \(x_0\), the \(K\) nearest neighbours, which constitute the set \(\mathcal{N}_0\), have their responses averaged to create the prediction \(\hat{f}(x_0)\):
\[ \hat{f}(x_0) = \frac{1}{K}\sum_{x_i\in\mathcal{N}_0}y_i\]
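As a concrete illustration, here is a minimal base-R sketch of this prediction rule for a single predictor; the function and argument names are hypothetical:
# predict f(x0) by averaging the responses of the K nearest training points
knn.predict = function(x0, x, y, K) {
  nbrs = order(abs(x - x0))[1:K]   # indices of the K training x's closest to x0
  mean(y[nbrs])                    # average their responses
}
# e.g. knn.predict(150, advertising$TV, advertising$Sales, K = 9)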
- When fitting this kind of regression, we need to consider which \(K\) to use. This decision comes down to the bias/variance tradeoff.
- A small value of \(K\) provides the most flexible fit, which has low bias but high variance
- A high value of \(K\) provides a smoother and less variable fit, but this results in higher bias.
- Here is an example of \(K = 1\) v \(K=9\)
K=1 v K=9
- While KNN may be an attractive approach due to its non parametric nature, it tends to perform worse than linear regression and other parametric methods in higher dimensions. This is because in higher dimensions there is effectively a reduction in sample size and it is harder to find nearby neighbours. This is called the curse of dimensionality.
- As a general rule, parametric methods tend to outperform non-parametric methods when there is a small number of observations per predictor.