Linear regression is a simple approach to supervised learning. It assumes that the dependence of \(Y\) on \(X_1, X_2, ..., X_p\) is linear.
True regression functions are never linear!
Consider the Advertising data.
Questions we might ask:
Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Which media contribute to sales?
How accurately can we predict future sales?
Is the relationship linear?
Is there synergy among the advertising media?
\[Y = \beta_0 + \beta_1 X + \epsilon\]
where \(\beta_0\) and \(\beta_1\) are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and \(\epsilon\) is the error term.
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\).
The hat symbol denotes an estimated value.
Let \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) be the prediction for \(Y\) based on the ith value of \(X\). Then \(e_i = y_i - \hat{y}_i\) represents the ith residual.
We define the residual sum of squares (RSS) as
\[RSS = e_1^2 + e_2^2 + ... + e_n^2\]
\[RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + ... + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2\]
The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS. The minimizing values can be shown to be
\[\hat{\beta}_1 = \frac{\sum_{i = 1}^{n} (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i =1}^n (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
where \(\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_i\) and \(\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i\) are the sample means.
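These closed-form estimates are easy to verify numerically. A minimal sketch, using the Boston data from the MASS package (the same data as in the lab below), with lstat as \(x\) and medv as \(y\):

library(MASS)                                   # provides the Boston data
x <- Boston$lstat; y <- Boston$medv
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)                                 # agrees with coef(lm(medv ~ lstat, data = Boston))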
The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
\[SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i = 1}^{n}(x_i - \bar{x})^2}, \quad SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}\right]\]
where \(\sigma^2 = Var(\epsilon)\)
These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form
\[ \hat{\beta}_1 \pm 2 * SE(\hat{\beta}_1)\]
That is, there is approximately a 95% chance that the interval
\[ \left[\hat{\beta}_1 - 2 * SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 * SE(\hat{\beta}_1)\right]\]
will contain the true value of \(\beta_1\) (under a scenario where we got repeated samples like the present sample)
For the advertising data, the 95% confidence interval for \(\beta_1\) is \(\left[0.042, 0.053\right]\)
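In R, confint() computes these intervals directly. A minimal sketch (the Advertising data is not loaded here, so the Boston fit from the lab below is used as a stand-in):

library(MASS)                                   # provides the Boston data
fit <- lm(medv ~ lstat, data = Boston)
confint(fit, level = 0.95)                      # exact t-based 95% intervals, as in the lab's confint(fit1)
est <- coef(fit)["lstat"]
se  <- summary(fit)$coefficients["lstat", "Std. Error"]
est + c(-2, 2) * se                             # the approximate "estimate +/- 2 SE" interval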
\(H_0\) - There is no relationship between \(X\) and \(Y\) versus the alternative hypothesis
\(H_A\) - There is some relationship between \(X\) and \(Y\)
\[H_0: \beta_1 = 0\]
versus
\[H_A: \beta_1 \ne 0\]
since if \(\beta_1 = 0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) is not associated with \(Y\).
To test the null hypothesis, we compute a t-statistic
\[t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}\]
This will have a t-distribution with \(n - 2\) degrees of freedom, assuming \(\beta_1 = 0\).
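The t-statistic and its two-sided p-value can be reproduced by hand and checked against the summary() table; a sketch using the Boston fit:

library(MASS)                                   # provides the Boston data
fit  <- lm(medv ~ lstat, data = Boston)
se   <- summary(fit)$coefficients["lstat", "Std. Error"]
tval <- unname(coef(fit)["lstat"]) / se         # (beta1_hat - 0) / SE(beta1_hat)
pval <- 2 * pt(abs(tval), df = nobs(fit) - 2, lower.tail = FALSE)
c(t = tval, p = pval)                           # matches the lstat row of summary(fit)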
To assess the overall accuracy of the model, we compute the residual standard error
\[RSE = \sqrt {\frac{1}{n - 2}RSS} = \sqrt{\frac{1}{n - 2} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}\]
where the residual sum of squares is \(RSS = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\).
The \(R^2\), or fraction of variance explained, is
\[R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\] where \(TSS = \sum_{i = 1}^n (y_i - \bar{y})^2\) is the total sum of squares.
It can be shown that in the simple linear regression setting \(R^2 = r^2\), where \(r\) is the sample correlation between \(X\) and \(Y\):
\[ r = \frac{\sum_{i =1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i = 1}^n (y_i - \bar{y})^2 }}\]
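A quick numerical check that \(R^2 = r^2\) in the simple regression setting, again using the Boston data as a stand-in:

library(MASS)                                   # provides the Boston data
fit <- lm(medv ~ lstat, data = Boston)
summary(fit)$r.squared                          # R^2 reported by the model
cor(Boston$lstat, Boston$medv)^2                # squared sample correlation -- the same number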
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon\]
We interpret \(\beta_j\) as the **average** effect on \(Y\) of a one unit increase in \(X_j\), holding all other predictors fixed. In the advertising example, the model becomes
\[ sales = \beta_0 + \beta_1 * TV + \beta_2 * radio + \beta_3 * newspaper + \epsilon\]
From “Data Analysis and Regression” (Mosteller and Tukey, 1977):
a regression coefficient \(\beta_j\) estimates the expected change in \(Y\) per unit change in \(X_j\), with all other predictors held fixed. But predictors usually change together!
Example: \(Y\) = total amount of change in your pocket; \(X_1 = \) # of coins; \(X_2 = \) # of pennies, nickels and dimes. By itself, the regression coefficient of \(Y\) on \(X_2\) will be \(> 0\). But how about with \(X_1\) in the model?
\(Y = \) number of tackles by a football player in a season; \(W\) and \(H\) are his weight and height. Fitted regression model is \(\hat{Y} = b_0 + .50W - .10H\). How do we interpret \(\hat{\beta}_2 < 0\)?
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta_2} x_2 + ... + \hat{\beta}_p x_p\]
\[ RSS = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\] \[ RSS = \sum_{i = 1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - ... - \hat{\beta}_p x_{ip})^2\]
This is done using standard statistical software. The values \(\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p\) that minimize RSS are the multiple least squares regression coefficient estimates.
Is at least one of predictors \(X_1, X_2, ..., X_p\) useful in predicting the response?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
For the first question, we can use the F-statistic
\[ F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n-p-1}} \approx F_{p,n -p - 1}\]
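The F-statistic can likewise be assembled from TSS and RSS and compared with the value lm() reports; a sketch for a two-predictor Boston fit:

library(MASS)                                   # provides the Boston data
fit <- lm(medv ~ lstat + age, data = Boston)
rss <- sum(residuals(fit)^2)
tss <- sum((Boston$medv - mean(Boston$medv))^2)
n <- nobs(fit); p <- 2                          # two predictors in this model
((tss - rss) / p) / (rss / (n - p - 1))         # matches summary(fit)$fstatistic["value"] (about 309 here)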
The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
However, we often can’t examine all possible models, since there are \(2^p\) of them; for example, when \(p = 40\) there are over a billion models!
Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.
Begin with the null model – a model that contains an intercept but no predictors
Fit \(p\) simple linear regressions and add to the null model the variable that results in the lowest RSS
Add to that model the variable that results in the lowest RSS among all two-variable models.
Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold. A rough sketch of this procedure is given below.
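A minimal, illustrative sketch of this RSS-based forward search (not the step() or leaps::regsubsets() implementations), written for a data frame with a numeric response such as Boston:

library(MASS)                                   # provides the Boston data
forward_select <- function(data, response, nsteps = 3) {
  predictors <- setdiff(names(data), response)
  selected <- character(0)
  for (k in seq_len(nsteps)) {
    rss <- sapply(setdiff(predictors, selected), function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)      # RSS of the candidate model
    })
    selected <- c(selected, names(which.min(rss)))   # keep the variable with lowest RSS
  }
  selected
}
forward_select(Boston, "medv")                  # first three variables chosen greedily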
Start with all variables in the model
Remove the variable with the largest p-value – that is, the variable that is the least statistically significant.
The new (p - 1) - variable model is fit, and the variable with the largest p-value is removed.
Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some significance threshold; a comparable sketch follows.
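A comparably rough sketch of backward elimination driven by p-values; it assumes all predictors are numeric (so each coefficient row matches a variable name) and is only illustrative:

library(MASS)                                   # provides the Boston data
backward_select <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
    if (max(pvals) < alpha) break                       # every remaining term is significant
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
  fit
}
summary(backward_select(Boston, "medv"))        # should drop age and indus, matching fit4 in the lab below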
Example: investigate differences in credit card balance between males and females - ignoring the other variables. We create a new variable
\[x_i = \cases{1 & if ith person is female \\ 0 & if ith person is male} \]
Resulting model:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \cases{\beta_0 + \beta_1 + \epsilon_i & if ith person is female \\ \beta_0 + \epsilon_i & if ith person is male}\]
With a qualitative variable that has more than two levels, such as ethnicity, we create additional dummy variables. For example, the first could be
\[ x_{i1} = \cases{1 & if ith person is Asian \\ 0 & if ith person is not Asian}\] and the second could be
\[x_{i2} = \cases{1 & if ith person is Caucasian \\ 0 & if ith person is not Caucasian}\]
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \cases{\beta_0 + \beta_1 + \epsilon_i & if ith person is Asian \\ \beta_0 + \beta_2 + \epsilon_i & if ith person is Caucasian \\ \beta_0 + \epsilon_i & if ith person is African American}\]
There will always be one fewer dummy variable than the number of levels. The level with no dummy variable – African American in this example – is known as the baseline.
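In R, this coding happens automatically when a factor enters a model. A small sketch showing that model.matrix() builds one fewer dummy column than there are levels (the eth vector here is made up purely for illustration):

eth <- factor(c("African American", "Asian", "Caucasian", "Asian", "African American"))
eth <- relevel(eth, ref = "African American")   # choose the baseline level explicitly
model.matrix(~ eth)                             # intercept plus dummies for Asian and Caucasian only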
Removing the additive assumption: interactions and nonlinearity
\[ \hat{sales} = \beta_0 + \beta_1 * TV + \beta_2 * radio + \beta_3 * newspaper\] states that the average effect on sales of a one-unit increase in TV is always \(\beta_1\), regardless of the amount spent on radio.
But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.
When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model.
But when advertising is split between the two media, then the model tends to underestimate sales.
The model takes the form
\[ sales = \beta_0 + \beta_1 * TV + \beta_2 * radio + \beta_3 * (radio * TV) + \epsilon = \beta_0 + (\beta_1 + \beta_3 * radio)*TV + \beta_2 * radio + \epsilon\]
The results of fitting this model suggest that interactions are important:
The p-value for the interaction term TV x radio is extremely low, indicating that there is strong evidence for \(H_A : \beta_3 \ne 0\)
The \(R^2\) for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.
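A sketch of fitting this interaction model. The Advertising data is not bundled with the ISLR package, so this assumes Advertising.csv has been downloaded from the book's website and that its columns are named TV, radio and sales as in the text:

Advertising <- read.csv("Advertising.csv")      # assumed local copy of the book's data set
fit_main <- lm(sales ~ TV + radio, data = Advertising)
fit_int  <- lm(sales ~ TV * radio, data = Advertising)       # expands to TV + radio + TV:radio
c(summary(fit_main)$r.squared, summary(fit_int)$r.squared)   # roughly 0.897 vs 0.968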
Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
The hierarchy principle: If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
The rationale for this principle is that interactions are hard to interpret in a model without main effects –their meaning is changed
Specifically, the interaction terms also contain main effects, if the model has no main effect terms
Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative)
Without an interaction term, the model takes the form.
\[ balance \approx \beta_0 + \beta_1 * income + \cases{\beta_2 & if ith person is a student \\ 0 & if ith person is not a student} = \beta_1 * income + \cases{\beta_0 + \beta_2 & if ith person is a student \\ \beta_0 & if ith person is not a student}\]
With interactions, it takes the form
\[ balance \approx \beta_0 + \beta_1 * income + \cases{\beta_2 + \beta_3*income & if ith person is a student \\ 0 & if ith person is not a student} = \cases{(\beta_0 + \beta_2) + (\beta_1 + \beta_3) * income & if ith person is a student \\ \beta_0 + \beta_1 * income & if ith person is not a student}\]
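The Credit data set is included in the ISLR package, so both versions of this model can be fit directly; a minimal sketch (note the variables are capitalized as Balance, Income and Student in the package):

library(ISLR)                                   # provides the Credit data
fit_add <- lm(Balance ~ Income + Student, data = Credit)     # parallel lines
fit_int <- lm(Balance ~ Income * Student, data = Credit)     # student-specific slope
coef(fit_int)                                   # Income:StudentYes is the change in slope for students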
The figure suggests that
\[ mpg = \beta_0 + \beta_1 * horsepower + \beta_2 * horsepower^2 + \epsilon\]
may provide a better fit.
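This quadratic model is easy to fit with the Auto data from the ISLR package; a sketch, using I() to protect the squared term inside the formula:

library(ISLR)                                   # provides the Auto data
fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(fit_lin, fit_quad)                        # F-test comparing the linear and quadratic fits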
### Libraries needed for linear regression
library(MASS)
library(ISLR)
### Simple linear regression
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
?Boston
plot(medv~lstat,Boston)
fit1=lm(medv~lstat,data=Boston)
fit1
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(fit1)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
abline(fit1,col="red")
names(fit1)
## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual" "xlevels" "call" "terms" "model"
confint(fit1)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
predict(fit1,data.frame(lstat=c(5,10,15)),interval="confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
### Multiple linear regression
fit2=lm(medv~lstat+age,data=Boston)
summary(fit2)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
fit3=lm(medv~.,Boston)
summary(fit3)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4594884 5.1034588 7.144 3.28e-12 ***
## crim -0.1080114 0.0328650 -3.287 0.001087 **
## zn 0.0464205 0.0137275 3.382 0.000778 ***
## indus 0.0205586 0.0614957 0.334 0.738288
## chas 2.6867338 0.8615798 3.118 0.001925 **
## nox -17.7666112 3.8197437 -4.651 4.25e-06 ***
## rm 3.8098652 0.4179253 9.116 < 2e-16 ***
## age 0.0006922 0.0132098 0.052 0.958229
## dis -1.4755668 0.1994547 -7.398 6.01e-13 ***
## rad 0.3060495 0.0663464 4.613 5.07e-06 ***
## tax -0.0123346 0.0037605 -3.280 0.001112 **
## ptratio -0.9527472 0.1308268 -7.283 1.31e-12 ***
## black 0.0093117 0.0026860 3.467 0.000573 ***
## lstat -0.5247584 0.0507153 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit3)
fit4=update(fit3,~.-age-indus)
summary(fit4)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
### Nonlinear terms and Interactions
fit5=lm(medv~lstat*age,Boston)
summary(fit5)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
fit6=lm(medv~lstat +I(lstat^2),Boston); summary(fit6)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
attach(Boston)
par(mfrow=c(1,1))
plot(medv~lstat)
points(lstat,fitted(fit6),col="red",pch=20)
fit7=lm(medv~poly(lstat,4))
points(lstat,fitted(fit7),col="blue",pch=20)
plot(1:20,1:20,pch=1:20,cex=2)
### Qualitative predictors
fix(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population" "Price" "ShelveLoc" "Age" "Education" "Urban" "US"
summary(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000 Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0 No :118 No :142
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0 Yes:282 Yes:258
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000 Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635 Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000 Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
fit1=lm(Sales~.+Income:Advertising+Age:Price,Carseats)
summary(fit1)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
### Writing R functions
regplot=function(x,y){
fit=lm(y~x)
plot(x,y)
abline(fit,col="red")
}
attach(Carseats)
regplot(Price,Sales)
regplot=function(x,y,...){
fit=lm(y~x)
plot(x,y,...)
abline(fit,col="red")
}
regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20)