Sum of Squares Notes:
\[ S_{xx} = \sum^n_{i=1}(x_i-\bar{x})^2,\quad s_{x}^2 = \frac{S_{xx}}{n-1}\] \[ S_{yy} = \sum^n_{i=1}(y_i-\bar{y})^2,\quad s_{y}^2 = \frac{S_{yy}}{n-1}\] \[ S_{xy} = \sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y}),\quad s_{xy} = \frac{S_{xy}}{n-1}\]
Simple Linear Regression
- Simple linear regression is a basic statistical learning model that assumes a linear relationship between a single independent variable and the dependent variable.
\[ Y_i = \beta_0 + \beta_1x_i +\epsilon_i\]
- In order to estimate this model, we take a parametric approach where we estimate each beta. This is done by minimising the least squares criterion, also known as the residual sum of squares (RSS), i.e. minimising the residuals, the vertical distances between the fitted line and the data points. \[ \min_{\beta_0,\beta_1}\left[\sum^n_{i=1}\left(y_i-(\hat{\beta}_0+\hat{\beta}_1x_i)\right)^2\right] \]
- This results in the least square estimates: \[\hat{\beta}_1 = \frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2} \] \[\hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]
Simple Linear Regression: Assumptions and Inference
Under the weak assumptions: \[ \mathbb{E}(\epsilon_i|X=\textbf{x})=0,\quad\mathbb{V}(\epsilon_i|X=\textbf{x})=\sigma^2,\quad\mathbb{C}ov(\epsilon_i,\epsilon_j|X=\textbf{x})=0\]
- We have the unbiased estimators: \[ \mathbb{E}[\hat{\beta}_0|X=\textbf{x}]=\beta_0,\quad\mathbb{E}[\hat{\beta}_1|X=\textbf{x}]=\beta_1\]
- Also noting the unbiased estimator of \(\sigma^2\): \[ s^2 = \frac{\sum^n_{i=1}\hat{\epsilon}_i^2}{n-2}=\frac{RSS}{n-2}=RSE^2\]
- Variance/Covariance of parameters: \[ \mathbb{V}(\hat{\beta}_0|X=\textbf{x})=\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}\right),\quad\mathbb{V}(\hat{\beta}_1|X=\textbf{x})=\frac{\sigma^2}{\sum^n_{i=1}(x_i-\bar{x})^2}\] \[\mathbb{C}ov(\hat{\beta}_0,\hat{\beta}_1|X=\textbf{x})=-\frac{\bar{x}\sigma^2}{\sum^n_{i=1}(x_i-\bar{x})^2}\]
Under the strong assumption: \[ \epsilon|X=\textbf{x}\sim^{iid}N(0,\sigma^2)\]
We can find the MLE estimates using the likelihood of \(\epsilon\). The estimates of \(\beta_0\) and \(\beta_1\) are the same as with least squares, while the MLE of \(\sigma^2\) divides the RSS by \(n\) rather than \(n-2\) \[ \ell(\mathbf{y},\beta_0,\beta_1,\sigma) = -n\log(\sqrt{2\pi}\sigma)-\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i-(\beta_0+\beta_1x_i))^2\]
Test statistic for each beta: \[ \frac{\hat{\beta}_i-\tilde{\beta}_i}{\hat{SE}(\hat{\beta}_i)} \sim t_{n-2}\]
We often test whether the variable is significant, which is equivalent to the hypothesis test: \[ H_0: \beta_i=0,\quad H_1:\beta_i\neq 0\]
Here are the relevant rejection regions:
Rejection Rules
Partitioning the variability
- We can assess how well a linear model explains the trend in the data by partitioning the sources of variability.
SST is the total variability under the simplest possible model, which just averages the data and makes no use of \(X\)
SSE is the total variability remaining after introducing the effects of \(X\)
SSM is the total variability explained because of the knowledge of \(X\) \[ \underbrace{\sum^n_{i=1}(y_i-\bar{y})^2}_{SST}=\underbrace{\sum^n_{i=1}(y_i-\hat{y}_i)^2}_{SSE}+\underbrace{\sum^n_{i=1}(\hat{y}_i-\bar{y})^2}_{SSM} \]
The ANOVA table for the linear regression is as follows:
ANOVA Table
- We can compute the ANOVA table for a model in R with the anova() function, which can also be used to compare 2 different models, as sketched below:
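A sketch of calls that could produce output like the two tables below; the object names fit.full and fit.simple are assumptions:
fit.full = lm(Sales ~ TV + Radio + Newspaper, data = advertising)
fit.simple = lm(Sales ~ TV, data = advertising)
anova(fit.full)                 # sequential ANOVA table for a single model
anova(fit.simple, fit.full)     # partial F-test comparing the two nested models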
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## TV 1 4512.4 4512.4 1634.2115 <2e-16 ***
## Radio 1 502.3 502.3 181.9255 <2e-16 ***
## Newspaper 1 0.0 0.0 0.0034 0.9538
## Residuals 196 541.2 2.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: Sales ~ TV
## Model 2: Sales ~ TV + Radio + Newspaper
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 1043.5
## 2 196 541.2 2 502.35 90.964 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- A measure of model fit is \(R^2\), the proportion of the total variation in the \(y_i\)’s that is explained by the variable \(x\) in a linear regression model: \[ R^2 = \left(\frac{S_{xy}}{\sqrt{S_{xx}\cdot S_{yy}}}\right)^2=\frac{SSM}{SST}=1-\frac{SSE}{SST}\]
F Test
\[ F = \frac{SSM/1}{RSS/(n-2)}\sim F_{1,n-2}\]
Mean Response
- When predicting at an out-of-sample value of the regressor, \(x_0\), we want to predict the response \(Y\) associated with it. The mean of this response is:
\[ \mathbb{E}[Y|x_0] = \mathbb{E}[\beta_0+\beta_1x|x=x_0] = \beta_0+\beta_1x_0\]
- Resulting in the unbiased estimator for the mean being the fitted value of \(y_0\)
\[ \hat{y}_0=\hat{\beta}_0+\hat{\beta}_1x_0\]
- With variance:
\[ \mathbb{V}(\hat{y}_0) = \left(\frac{1}{n}+\frac{(\bar{x}-x_0)^2}{S_{xx}}\right)\sigma^2 = SE(\hat{y}_0)^2 \]
- Under the strong assumptions we have:
\[ \hat{y}_0\sim N\left(\beta_0+\beta_1x_0,\ SE(\hat{y}_0)^2\right) \]
\[ \frac{\hat{y}_0-(\beta_0+\beta_1x_0)}{SE(\hat{y}_0)} \sim t_{n-2}\]
- In R we can predict the mean response with a confidence interval:
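A sketch of a call that could produce output like the line below; the fitted object regressorsimple and the new value TV = 200 are assumptions:
predict(regressorsimple, newdata = data.frame(TV = 200), interval = 'confidence')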
## fit lwr upr
## 1 18.06778 17.69139 18.44416
Individual Response
- A prediction interval is an interval for the actual value of a \(Y_i\) rather than its mean. We consider the error in the prediction:
\[ \mathbb{E}[Y_i-\hat{y}_i|\mathbf{X}=\mathbf{x}, X=x_i] = 0 \] \[ \mathbb{V}(Y_i-\hat{y}_i|\mathbf{X}=\mathbf{x}, X=x_i) = \sigma^2\left(1+\frac{1}{n}+\frac{(\bar{x}-x_i)^2}{S_{xx}}\right) \]
- Where we can utilise the following results for inference of the prediction interval:
\[ (Y_i-\hat{y}_i|\mathbf{X}=\mathbf{x}, X=x_i) \sim N\left(0, \sigma^2\left(1+\frac{1}{n}+\frac{(\bar{x}-x_i)^2}{S_{xx}}\right)\right)\] \[ \frac{Y_i-\hat{y}_i}{s\sqrt{1+\frac{1}{n}+\frac{(\bar{x}-x_i)^2}{S_{xx}}}}\sim t_{n-2}\]
- In R we can predict an individual response with a prediction interval:
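A sketch of the corresponding call; the same assumed object and new value as above, now with a prediction interval:
predict(regressorsimple, newdata = data.frame(TV = 200), interval = 'prediction')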
## fit lwr upr
## 1 18.06778 13.52491 22.61065
Plotting with CI and PI
library(ggplot2)   # for ggplot(), geom_point(), stat_smooth(), etc.
pred.int = predict(regressorsimple, interval = 'prediction')   # fit, lwr, upr for each observation
(ggplot(advertising, aes(TV, Sales))+
  geom_point()+
  stat_smooth(method = lm)+                                            # fitted line with confidence band
  geom_line(aes(y = pred.int[,2]), color = 'red', linetype = 'dashed')+  # lower prediction bound
  geom_line(aes(y = pred.int[,3]), color = 'red', linetype = 'dashed')+  # upper prediction bound
  theme_classic())
Multiple Linear Regression
Many of the same ideas extend to the multi-dimensional case, i.e. linear regression with \(p>1\) predictors, where each \(\beta_j\) is the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed.
\[ Y=\beta_0 +\beta_1X_1 + ... + \beta_pX_p + \epsilon \]
- We often write it in the following matrix notation form, where \(\mathbf{X}\) is of size \((n,p+1)\):
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\]
To fit the model in R we use lm(), as sketched below:
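A minimal sketch, assuming the data frame is named advertising; the object name regressor is also an assumption (the formula matches the summary output shown further below):
regressor = lm(Sales ~ ., data = advertising)   # regress Sales on all remaining columns (TV, Radio, Newspaper)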
- We can also see what is stored in the fitted object using the names() command:
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
- Weak assumptions are similar to simple:
\[\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}\] \[ \mathbb{C}ov(\boldsymbol{\epsilon})=\sigma^2\mathbf{I}\]
- Strong assumptions are similar to simple:
\[ \epsilon_i|\mathbf{X}=\mathbf{x}\sim^{i.i.d} N(0,\sigma^2) \]
Least Squares Estimates:
\[ RSS = \sum^n_{i=1}\hat{\epsilon}^2_i = ||\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}}||^2\] \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
In R these can be seen in the summary() output or by accessing the fitted object directly:
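For example (assuming the fitted object regressor from above):
regressor$coefficients    # least squares estimates; equivalently coef(regressor)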
## (Intercept) TV Radio Newspaper
## 4.6251240788 0.0544457803 0.1070012282 0.0003356579
With corresponding prediction vector:
\[ \hat{\mathbf{y}}=\mathbf{X}\hat{\boldsymbol{\beta}}\]
Properties of LSE are similar to simple:
\[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\] \[ \mathbb{V}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\]
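A sketch of extracting the estimated standard errors (the square roots of the diagonal of the estimated covariance matrix), assuming the fitted object regressor:
sqrt(diag(vcov(regressor)))   # estimated standard errors of the coefficient estimates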
## (Intercept) TV Radio Newspaper
## 0.307501165 0.001375188 0.008489563 0.005788056
With the unbiased estimator of \(\sigma^2\)
\[ s^2 = \frac{||\mathbf{y}-\hat{\mathbf{y}}||^2}{n-p-1} \]
## [1] 0.1186925
- With the following result used for inference on \(\beta_j\):
\[ \frac{\hat{\beta}_j-\beta_j}{SE(\hat{\beta}_j)}\sim t_{n-p-1}\]
F Test tests the hypothesis:
\[ H_0 : \beta_1 = ... = \beta_p = 0\] \[ H_a: \text{ at least one } \beta_j \text{ is not 0}\] \[ F = \frac{(SST-SSE)/p}{SSE/(n-p-1)}\sim F_{p,n-p-1}\]
- In the case where we want to test a subset of \(q\) predictors, we fit a second model that uses all the variables except those \(q\) predictors, with corresponding residual sum of squares \(RSS_0\)
\[ H_0: \beta_{p-q+1} = ... = \beta_p = 0\] \[ F = \frac{(RSS_0-RSS)/q}{RSS/(n-p-1)}\sim F_{q,n-p-1}\]
To see the results of many of these hypothesis tests, as well as other statistics, we can use the summary() command:
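For example (assuming the fitted object regressor from above):
summary(regressor)   # coefficient table, RSE, R-squared and overall F-test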
##
## Call:
## lm(formula = Sales ~ ., data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3034 -0.8244 -0.0008 0.8976 3.7473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6251241 0.3075012 15.041 <2e-16 ***
## TV 0.0544458 0.0013752 39.592 <2e-16 ***
## Radio 0.1070012 0.0084896 12.604 <2e-16 ***
## Newspaper 0.0003357 0.0057881 0.058 0.954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.662 on 196 degrees of freedom
## Multiple R-squared: 0.9026, Adjusted R-squared: 0.9011
## F-statistic: 605.4 on 3 and 196 DF, p-value: < 2.2e-16
Possible Extensions/Considerations to the Linear Model:
Qualitative Predictors
- Qualitative predictors are factors with \(k\) unordered levels. To handle them we create \(k-1\) dummy variables that act as binary indicators.
- A simple way to write the factor model (with 2 different factors) is \(y_{ijk}=\mu+\alpha_i+\beta_j+\epsilon_{ijk}\), with:
- \(\mu\) as the grand mean
- \(\alpha_i\) is the effect for the ith factor 1 group
- \(\beta_j\) is the effect for the jth factor 2 group
- We can look at the 2 factor example with the planes data set, with 2 factors (Paper and Plane) and response Distance.
planes = read.table('planes.txt', header = T)[c(1,2,4)]
planes$Paper = as.factor(planes$Paper)
planes$Plane = as.factor(planes$Plane)
- We can look at the different levels of the two factors and a plot of the global mean:
- To fit a model that consists of only factors, the function aov() is preferred:
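A sketch of a call that could produce the table below; the object name planes.aov is an assumption:
planes.aov = aov(Distance ~ Paper + Plane, data = planes)   # main effects of the two factors
summary(planes.aov)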
## Df Sum Sq Mean Sq F value Pr(>F)
## Paper 1 1718721 1718721 0.678 0.425
## Plane 1 385641 385641 0.152 0.703
## Residuals 13 32947925 2534456
- A simple model for an interaction between two factors:
\[ y_{ijk}=\mu+\alpha_i+\beta_j+\underbrace{(\alpha\beta)_{ij}}_{\text{Interaction}}+\epsilon_{ijk}\]
- In R, interaction effects are represented by including products of the dummy variables for each of the factors as additional predictors
- We can explore interactions between factors in an interaction plot:
  - Parallel lines are indicative of no interaction
- To fit a model with interaction terms we use the : operator
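A sketch of a call that could produce the table below; the object name planes.int is an assumption:
planes.int = aov(Distance ~ Plane:Paper, data = planes)   # interaction term only
summary(planes.int)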
## Df Sum Sq Mean Sq F value Pr(>F)
## Plane:Paper 3 25491258 8497086 10.66 0.00106 **
## Residuals 12 9561029 796752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Removing the Additive Assumption
One of the most important assumptions in linear regression is that the relationship between the predictors and the response is additive and linear.
- The additive assumption means that the effect of changes in a predictor \(X_j\) on the response \(Y\) is independent of other predictors.
One way to relax this assumption is to allow for interaction between predictors. For example:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon\]
- Which can be interpreted as the following: \[ Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1+\beta_2X_2+\epsilon\]
If the p-value for the interaction term is significant, this indicates that the true relationship is likely not additive.
The hierarchical principle states that if we include an interaction term, we should also include the main effects of the predictors involved, even if they have low significance.
Possible Issues with a Linear Model:
- A lot of these issues can be found through inspection of the diagnostic plots:
Non Linearity of response-predictor relationship
Linear regression assumes there is a straight-line relationship between the \(\mathbf{X}\) and \(\mathbf{y}\).
Residual plots are useful for identifying non-linearity.
- For a simple linear model we can plot the residuals \(y_i-\hat{y}_i\) against the predictor \(x_i\)
- For higher dimension models the residuals \(y_i-\hat{y}_i\) are plotted against the fitted values \(\hat{y}_i\)
If these plots exhibit a pattern, it is a strong indication of non-linearity in the data. In the following, a quadratic fit exhibits lower residuals, indicating the fit is closer to quadratic than linear.
Residual Plot
- If there is indication of a non-linear association, a simple approach is to use a non-linear transformation of the predictors, such as \(\log(X)\).
One other alternative to get further information is to see if the predictor/response relationship of each variable is linear, using the pairs() function.
Correlation of Error Terms
- An important assumption of the linear regression model is that the error terms are uncorrelated, i.e \(\mathbb{C}ov(\epsilon_i,\epsilon_j|X=\textbf{x})=0\).
- Correlation of errors can occur as a result of correlated data points, or in a time series (or any experiment where data are collected sequentially) where the timing of the experiment influences the error. In this case there would be noticeable tracking in the residuals over time:
- Correlation between error terms is especially prevalent in time series data
Tracking residuals
- Correlated error terms result in lower estimated standard errors and therefore overconfidence in the model.
Non-constant Variance (heteroscedasticity) of error terms
- Another important assumption of the linear regression model is the constant variance of error terms, i.e \(\mathbb{V}(\epsilon_i)=\sigma^2\)
- We can identify non-constant variance in the errors in residual plots:
- A common problem is the magnitude of residuals increasing with fitted values, resulting in the funnel shape seen in the residual plot.
- One possible solution is to transform the response using a concave function, such as \(\log(Y)\) or \(\sqrt{Y}\), in order to achieve a greater amount of shrinkage of the larger responses.
Decreasing variance
Outliers
- An outlier is a point for which \(y_i\) is far from the value predicted by the model.
- Outliers can be found through inspection of the residual plot:
Outlier
A commonly used measure of leverage is the diagonal of the hat matrix, \(h_{ii}\), which also appears in the standardisation formulas below:
\[ \mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\]
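In R the leverages \(h_{ii}\) (the diagonal of \(\mathbf{H}\)) can be obtained with hatvalues(); a sketch, assuming the fitted object is named regressor:
head(hatvalues(regressor), 5)   # leverage of the first five observations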
## 1 2 3 4 5
## 0.02520285 0.01941823 0.03922616 0.01660967 0.02350883
In the case of simple linear regression:
\[ h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum^n_{i'=1}(x_{i'}-\bar{x})^2}\]
We can utilise different types of residuals to determine what is truly an ‘outlier’:
Raw residuals
\[ e = y - X\hat{\beta} \]
## 1 2 3 4 5
## 0.8790279 -0.8682477 1.5037910 -0.8124465 2.2558631
Standardised residuals are expected to fall in the range \((-3,3)\)
\[ r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}} \]
## 1 2 3 4 5
## 0.5357895 -0.5276554 0.9232633 -0.4930381 1.3738118
Studentised residuals can be used for formal testing for outliers as we know the distribution
\[ t_i = \frac{e_i}{\hat{\sigma}_{-i}\sqrt{1-h_{ii}}}\sim t_{n-p-2}\]
## 1 2 3 4 5
## 0.5348127 -0.5266819 0.9229141 -0.4920840 1.3769483
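All three types of residuals can be computed in R; a sketch, assuming the fitted lm object is named regressor:
head(residuals(regressor), 5)   # raw residuals e_i
head(rstandard(regressor), 5)   # standardised residuals r_i
head(rstudent(regressor), 5)    # studentised residuals t_i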
- Outliers don’t tend to have a major impact on the least squares fit itself, but they do inflate the RSE, resulting in larger standard errors (and therefore wider CIs) and a lower \(R^2\), which damages the interpretability of the fit.
High leverage points
- High leverage points are datapoints with an unusual value for \(x_i\)
- High leverage points can be inspected through leverage plots that utilise statistics such as Cook’s distance:
- The rule of thumb is to further investigate if \(D_i\) is greater than \(F_{0.5;p,n-p}\)
- In R we can find Cook’s distance:
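A sketch of the call, assuming the fitted object is named regressor:
head(cooks.distance(regressor), 5)   # Cook's distance for the first five observations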
## 1 2 3 4 5
## 0.001855512 0.001378372 0.008700530 0.001026446 0.011359448
In the plots we can see that in figures 1 and 3 observation 41 has extreme leverage and looks visibly off in the fitted plot. Figure 2 shows a different scenario in the case of two predictors, where the combination of values is unusual even though each predictor on its own is unremarkable:
Leverage Diagnostic
- A rule of thumb is that a datapoint with leverage \(h_{ii}>2(p+1)/n\) is considered a high leverage point, since \((p+1)/n\) is always the average leverage of a datapoint.
High leverage points impact the least squares fit heavily and may invalidate the entire fit.
Collinearity
- Collinearity refers to the situation where two or more predictor variables are closely related to one another.
- Between two variables, we can compute the correlation or plot the predictors against each other to determine the collinearity, often with the pairs() command. Here is a single example:
Not collinear v Collinear
- It is possible for collinearity to exist between three or more variables even if no pair exhibits high collinearity. This is referred to as multicollinearity.
- We therefore compute the VIF (variance inflation factor), the ratio of the variance of \(\hat{\beta}_j\) when fitting the full model divided by its variance when \(X_j\) is fitted on its own: \[ VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}} \]
- \(R^2_{X_j|X_{-j}}\) is from the regression of \(X_j\) onto all other predictors
- The smallest possible value is 1
- Values above 5 often indicates a problematic amount of collinearity.
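A sketch of the commands that could produce the VIF output below; vif() comes from the car package and regressor is the assumed fitted object:
library(car)     # provides vif()
vif(regressor)   # variance inflation factor for each predictor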
## TV Radio Newspaper
## 1.004611 1.144952 1.145187
- If collinearity is present, there are a few methods to handle it:
- Drop one of the problematic variables from the regression
- Combine the collinear variables into a single predictor, for example by taking a standardised average of the two predictors
- Collinearity/multicollinearity can pose problems as it can be difficult to separate out the individual effects of collinear variables on the response. It can also reduce the accuracy of the estimates of the regression coefficients, resulting in an increased standard error for \(\beta_j\)
- The power of the hypothesis test, i.e. the probability of correctly detecting a non-zero coefficient, also suffers, as the p-value may be inflated for a collinear predictor.
Confounding Effects
- Another consideration for models is when a variable \(C\), which may be unmeasurable, influences \(X\) and \(Y\) but \(X\) does not influence \(Y\) directly.
- An example of this is \(X\) = age and \(Y\) = probability of an accident. Being older does not itself make you a better driver, but age tends to come with experience, which does influence \(Y\). As experience cannot be measured, \(X\) is used as a proxy for it, even though it isn’t age itself directly influencing \(Y\).
- In this situation if \(C\) is measurable you can add it to the model, but it is often unobservable. In these cases you need to alter the interpretation of the coefficients for the indirect effect.
KNN Regression
- KNN regression is an example of non-parametric regression, where no assumptions about the form of \(f(X)\) are made, allowing for great flexibility in fitting the model. However, these methods lose interpretability as a result.
- Parametric approaches will perform better if the parametric form is close to the true form of \(f(X)\). Consider the following example where both a linear regression and a KNN regression are fit to a non-linear \(f(X)\), with the MSE plotted as a function of \(1/K\):
KNN v Linear
- For KNN we specify a value \(K\) for the number of nearest neighbours (from the training set) to consider. Given a new observation \(x_0\), the \(K\) nearest neighbours, which constitute the set \(\mathcal{N}_0\), have their responses averaged to create the prediction \(\hat{f}(x_0)\):
\[ \hat{f}(x_0) = \frac{1}{K}\sum_{x_i\in\mathcal{N}_0}y_i\]
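As a concrete illustration, here is a minimal base-R sketch of this prediction rule for a single predictor; the function and argument names are hypothetical:
# predict f(x0) by averaging the responses of the K nearest training points
knn.predict = function(x0, x, y, K) {
  nbrs = order(abs(x - x0))[1:K]   # indices of the K training x's closest to x0
  mean(y[nbrs])                    # average their responses
}
# e.g. knn.predict(150, advertising$TV, advertising$Sales, K = 9)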
- When fitting this kind of regression, we need to consider which \(K\) to use. This decision comes down to the bias/variance tradeoff.
- A small value of \(K\) provides the most flexible fit, which has low bias but high variance
- A high value of \(K\) provides a smoother and less variable fit, but this results in higher bias.
- Here is an example of \(K = 1\) v \(K=9\)
K=1 v K=9
- While KNN may be an attractive approach due to its non parametric nature, it tends to perform worse than linear regression and other parametric methods in higher dimensions. This is because in higher dimensions there is effectively a reduction in sample size and it is harder to find nearby neighbours. This is called the curse of dimensionality.
- As a general rule, parametric methods tend to outperform non-parametric methods when there is a small number of observations per predictor.