Linear regression is a simple approach to supervised learning. It assumes that the dependence of \(Y\) on \(X_1, X_2, ..., X_p\) is linear.
True regression functions are never linear!
Consider the Advertising data.
Questions we might ask:
Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Which media contribute to sales?
How accurately can we predict future sales?
Is the relationship linear?
Is there synergy among the advertising media?
\[Y = \beta_0 + \beta_1 X + \epsilon\]
where \(\beta_0\) and \(\beta_1\) are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and \(\epsilon\) is the error term.
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\).
The hat symbol denotes an estimated value.
Let \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) be the prediction for \(Y\) based on the ith value of \(X\). Then \(e_i = y_i - \hat{y}_i\) represents the ith residual.
We define the residual sum of squares (RSS) as
\[RSS = e_1^2 + e_2^2 + ... + e_n^2\]
\[RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + ... + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2\]
The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS. The minimizing values can be shown to be
\[\hat{\beta}_1 = \frac{\sum_{i = 1}^{n} (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i =1}^n (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
where \(\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_i\) and \(\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i\) are the sample means.
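These closed-form estimates are easy to verify numerically. A minimal sketch, using the Boston data from the MASS package (the same data as in the lab below), with lstat as \(x\) and medv as \(y\):

library(MASS)                                   # provides the Boston data
x <- Boston$lstat; y <- Boston$medv
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)                                 # agrees with coef(lm(medv ~ lstat, data = Boston))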
The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
\[SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i = 1}^{n}(x_i - \bar{x})^2}, \quad SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i = 1}^{n} (x_i - \bar{x})^2}\right]\]
where \(\sigma^2 = Var(\epsilon)\)
These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form
\[ \hat{\beta}_1 \pm 2 * SE(\hat{\beta}_1)\]
That is, there is approximately a 95% chance that the interval
\[ \left[\hat{\beta}_1 - 2 * SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 * SE(\hat{\beta}_1)\right]\]
will contain the true value of \(\beta_1\) (under a scenario where we got repeated samples like the present sample)
For the advertising data, the 95% confidence interval for \(\beta_1\) is \(\left[0.042, 0.053\right]\)
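In R, confint() computes these intervals directly. A minimal sketch (the Advertising data is not loaded here, so the Boston fit from the lab below is used as a stand-in):

library(MASS)                                   # provides the Boston data
fit <- lm(medv ~ lstat, data = Boston)
confint(fit, level = 0.95)                      # exact t-based 95% intervals, as in the lab's confint(fit1)
est <- coef(fit)["lstat"]
se  <- summary(fit)$coefficients["lstat", "Std. Error"]
est + c(-2, 2) * se                             # the approximate "estimate +/- 2 SE" interval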
\(H_0\) - There is no relationship between \(X\) and \(Y\) versus the alternative hypothesis
\(H_A\) - There is some relationship between \(X\) and \(Y\)
\[H_0: \beta_1 = 0\]
versus
\[H_A: \beta_1 \ne 0\]
since if \(\beta_1 = 0\) then the model reduces to \(Y = \beta_0 + \epsilon\), and \(X\) is not associated with \(Y\).
To test the null hypothesis, we compute a t-statistic
\[t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}\]
This will have a t-distribution with \(n - 2\) degrees of freedom, assuming \(\beta_1 = 0\).
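The t-statistic and its two-sided p-value can be reproduced by hand and checked against the summary() table; a sketch using the Boston fit:

library(MASS)                                   # provides the Boston data
fit  <- lm(medv ~ lstat, data = Boston)
se   <- summary(fit)$coefficients["lstat", "Std. Error"]
tval <- unname(coef(fit)["lstat"]) / se         # (beta1_hat - 0) / SE(beta1_hat)
pval <- 2 * pt(abs(tval), df = nobs(fit) - 2, lower.tail = FALSE)
c(t = tval, p = pval)                           # matches the lstat row of summary(fit)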
To assess the overall accuracy of the model, we compute the residual standard error
\[RSE = \sqrt {\frac{1}{n - 2}RSS} = \sqrt{\frac{1}{n - 2} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}\]
where the residual sum of squares is \(RSS = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\).
The \(R^2\), or fraction of variance explained, is
\[R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\] where \(TSS = \sum_{i = 1}^n (y_i - \bar{y})^2\) is the total sum of squares.
It can be shown that in the simple linear regression setting \(R^2 = r^2\), where \(r\) is the sample correlation between \(X\) and \(Y\):
\[ r = \frac{\sum_{i =1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i = 1}^n (y_i - \bar{y})^2 }}\]
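A quick numerical check that \(R^2 = r^2\) in the simple regression setting, again using the Boston data as a stand-in:

library(MASS)                                   # provides the Boston data
fit <- lm(medv ~ lstat, data = Boston)
summary(fit)$r.squared                          # R^2 reported by the model
cor(Boston$lstat, Boston$medv)^2                # squared sample correlation -- the same number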
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon\]
We interpret \(\beta_j\) as the **average** effect on \(Y\) of a one unit increase in \(X_j\), holding all other predictors fixed. In the advertising example, the model becomes
\[ sales = \beta_0 + \beta_1 * TV + \beta_2 * radio + \beta_3 * newspaper + \epsilon\]
From “Data Analysis and Regression” (Mosteller and Tukey, 1977):
a regression coefficient \(\beta_j\) estimates the expected change in \(Y\) per unit change in \(X_j\), with all other predictors held fixed. But predictors usually change together!
Example: \(Y\) = total amount of change in your pocket; \(X_1 = \) # of coins; \(X_2 = \) # of pennies, nickels and dimes. By itself, the regression coefficient of \(Y\) on \(X_2\) will be \(> 0\). But how about with \(X_1\) in the model?
\(Y = \) number of tackles by a football player in a season; \(W\) and \(H\) are his weight and height. Fitted regression model is \(\hat{Y} = b_0 + .50W - .10H\). How do we interpret \(\hat{\beta}_2 < 0\)?
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta_2} x_2 + ... + \hat{\beta}_p x_p\]
\[ RSS = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\] \[ RSS = \sum_{i = 1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - ... - \hat{\beta}_p x_{ip})^2\]
This is done using standard statistical software. The values \(\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p\) that minimize RSS are the multiple least squares regression coefficient estimates.
Is at least one of predictors \(X_1, X_2, ..., X_p\) useful in predicting the response?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
For the first question, we can use the F-statistic
\[ F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n-p-1}} \approx F_{p,n -p - 1}\]
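The F-statistic can likewise be assembled from TSS and RSS and compared with the value lm() reports; a sketch for a two-predictor Boston fit:

library(MASS)                                   # provides the Boston data
fit <- lm(medv ~ lstat + age, data = Boston)
rss <- sum(residuals(fit)^2)
tss <- sum((Boston$medv - mean(Boston$medv))^2)
n <- nobs(fit); p <- 2                          # two predictors in this model
((tss - rss) / p) / (rss / (n - p - 1))         # matches summary(fit)$fstatistic["value"] (about 309 here)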
The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
However, we often can’t examine all possible models, since there are \(2^p\) of them; for example, when \(p = 40\) there are over a billion models!
Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.
Begin with the null model – a model that contains an intercept but no predictors
Fit \(p\) simple linear regressions and add to the null model the variable that results in the lowest RSS
Add to that model the variable that results in the lowest RSS among all two-variable models.
Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold. A rough sketch of this procedure is given below.
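A minimal, illustrative sketch of this RSS-based forward search (not the step() or leaps::regsubsets() implementations), written for a data frame with a numeric response such as Boston:

library(MASS)                                   # provides the Boston data
forward_select <- function(data, response, nsteps = 3) {
  predictors <- setdiff(names(data), response)
  selected <- character(0)
  for (k in seq_len(nsteps)) {
    rss <- sapply(setdiff(predictors, selected), function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)      # RSS of the candidate model
    })
    selected <- c(selected, names(which.min(rss)))   # keep the variable with lowest RSS
  }
  selected
}
forward_select(Boston, "medv")                  # first three variables chosen greedily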
Start with all variables in the model
Remove the variable with the largest p-value – that is, the variable that is the least statistically significant.
The new (p - 1) - variable model is fit, and the variable with the largest p-value is removed.
Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some significance threshold; a comparable sketch follows.
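A comparably rough sketch of backward elimination driven by p-values; it assumes all predictors are numeric (so each coefficient row matches a variable name) and is only illustrative:

library(MASS)                                   # provides the Boston data
backward_select <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
    if (max(pvals) < alpha) break                       # every remaining term is significant
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
  fit
}
summary(backward_select(Boston, "medv"))        # should drop age and indus, matching fit4 in the lab below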
Example: investigate differences in credit card balance between males and females - ignoring the other variables. We create a new variable
\[x_i = \cases{1 & if ith person is female \\ 0 & if ith person is male} \]
Resulting model:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \cases{\beta_0 + \beta_1 + \epsilon_i & if ith person is female \\ \beta_0 + \epsilon_i & if ith person is male}\]
With a qualitative variable that has more than two levels, such as ethnicity, we create additional dummy variables. For example, the first could be
\[ x_{i1} = \cases{1 & if ith person is Asian \\ 0 & if ith person is not Asian}\] and the second could be
\[x_{i2} = \cases{1 & if ith person is Caucasian \\ 0 & if ith person is not Caucasian}\]
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \cases{\beta_0 + \beta_1 + \epsilon_i & if ith person is Asian \\ \beta_0 + \beta_2 + \epsilon_i & if ith person is Caucasian \\ \beta_0 + \epsilon_i & if ith person is African American}\]
There will always be one fewer dummy variable than the number of levels. The level with no dummy variable – African American in this example – is known as the baseline.
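In R, this coding happens automatically when a factor enters a model. A small sketch showing that model.matrix() builds one fewer dummy column than there are levels (the eth vector here is made up purely for illustration):

eth <- factor(c("African American", "Asian", "Caucasian", "Asian", "African American"))
eth <- relevel(eth, ref = "African American")   # choose the baseline level explicitly
model.matrix(~ eth)                             # intercept plus dummies for Asian and Caucasian only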
Removing the additive assumption: interactions and nonlinearity
\[ \hat{sales} = \beta_0 + \beta_1 * TV + \beta_2 * radio + \beta_3 * newspaper\] states that the average effect on sales of a one-unit increase in TV is always \(\beta_1\), regardless of the amount spent on radio.
But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.
When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model.
But when advertising is split between the two media, then the model tends to underestimate sales.
The model takes the form
\[ sales = \beta_0 + \beta_1 * TV + \beta_2 * radio + \beta_3 * (radio * TV) + \epsilon = \beta_0 + (\beta_1 + \beta_3 * radio)*TV + \beta_2 * radio + \epsilon\]
The results of fitting this model suggest that interactions are important:
The p-value for the interaction term TV x radio is extremely low, indicating that there is strong evidence for \(H_A : \beta_3 \ne 0\)
The \(R^2\) for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.
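A sketch of fitting this interaction model. The Advertising data is not bundled with the ISLR package, so this assumes Advertising.csv has been downloaded from the book's website and that its columns are named TV, radio and sales as in the text:

Advertising <- read.csv("Advertising.csv")      # assumed local copy of the book's data set
fit_main <- lm(sales ~ TV + radio, data = Advertising)
fit_int  <- lm(sales ~ TV * radio, data = Advertising)       # expands to TV + radio + TV:radio
c(summary(fit_main)$r.squared, summary(fit_int)$r.squared)   # roughly 0.897 vs 0.968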
Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
The hierarchy principle: If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
The rationale for this principle is that interactions are hard to interpret in a model without main effects –their meaning is changed
Specifically, the interaction terms also contain main effects, if the model has no main effect terms
Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative)
Without an interaction term, the model takes the form.
\[ balance \approx \beta_0 + \beta_1 * income + \cases{\beta_2 & if ith person is a student \\ 0 & if ith person is not a student} = \beta_1 * income + \cases{\beta_0 + \beta_2 & if ith person is a student \\ \beta_0 & if ith person is not a student}\]
With interactions, it takes the form
\[ balance \approx \beta_0 + \beta_1 * income + \cases{\beta_2 + \beta_3*income & if ith person is a student \\ 0 & if ith person is not a student} = \cases{(\beta_0 + \beta_2) + (\beta_1 + \beta_3) * income & if ith person is a student \\ \beta_0 + \beta_1 * income & if ith person is not a student}\]
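The Credit data set is included in the ISLR package, so both versions of this model can be fit directly; a minimal sketch (note the variables are capitalized as Balance, Income and Student in the package):

library(ISLR)                                   # provides the Credit data
fit_add <- lm(Balance ~ Income + Student, data = Credit)     # parallel lines
fit_int <- lm(Balance ~ Income * Student, data = Credit)     # student-specific slope
coef(fit_int)                                   # Income:StudentYes is the change in slope for students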
The figure suggests that
\[ mpg = \beta_0 + \beta_1 * horsepower + \beta_2 * horsepower^2 + \epsilon\]
may provide a better fit.
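This quadratic model is easy to fit with the Auto data from the ISLR package; a sketch, using I() to protect the squared term inside the formula:

library(ISLR)                                   # provides the Auto data
fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(fit_lin, fit_quad)                        # F-test comparing the linear and quadratic fits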
### Libraries needed for linear regression
library(MASS)
library(ISLR)
### Simple linear regression
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
?Boston
plot(medv~lstat,Boston)
fit1=lm(medv~lstat,data=Boston)
fit1
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(fit1)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
abline(fit1,col="red")
names(fit1)
## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual" "xlevels" "call" "terms" "model"
confint(fit1)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
predict(fit1,data.frame(lstat=c(5,10,15)),interval="confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
### Multiple linear regression
fit2=lm(medv~lstat+age,data=Boston)
summary(fit2)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
fit3=lm(medv~.,Boston)
summary(fit3)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4594884 5.1034588 7.144 3.28e-12 ***
## crim -0.1080114 0.0328650 -3.287 0.001087 **
## zn 0.0464205 0.0137275 3.382 0.000778 ***
## indus 0.0205586 0.0614957 0.334 0.738288
## chas 2.6867338 0.8615798 3.118 0.001925 **
## nox -17.7666112 3.8197437 -4.651 4.25e-06 ***
## rm 3.8098652 0.4179253 9.116 < 2e-16 ***
## age 0.0006922 0.0132098 0.052 0.958229
## dis -1.4755668 0.1994547 -7.398 6.01e-13 ***
## rad 0.3060495 0.0663464 4.613 5.07e-06 ***
## tax -0.0123346 0.0037605 -3.280 0.001112 **
## ptratio -0.9527472 0.1308268 -7.283 1.31e-12 ***
## black 0.0093117 0.0026860 3.467 0.000573 ***
## lstat -0.5247584 0.0507153 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit3)
fit4=update(fit3,~.-age-indus)
summary(fit4)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
### Nonlinear terms and Interactions
fit5=lm(medv~lstat*age,Boston)
summary(fit5)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
fit6=lm(medv~lstat +I(lstat^2),Boston); summary(fit6)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
attach(Boston)
par(mfrow=c(1,1))
plot(medv~lstat)
points(lstat,fitted(fit6),col="red",pch=20)
fit7=lm(medv~poly(lstat,4))
points(lstat,fitted(fit7),col="blue",pch=20)
plot(1:20,1:20,pch=1:20,cex=2)
### Qualitative predictors
fix(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population" "Price" "ShelveLoc" "Age" "Education" "Urban" "US"
summary(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000 Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0 No :118 No :142
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0 Yes:282 Yes:258
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000 Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635 Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000 Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
fit1=lm(Sales~.+Income:Advertising+Age:Price,Carseats)
summary(fit1)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
### Writing R functions
regplot=function(x,y){
fit=lm(y~x)
plot(x,y)
abline(fit,col="red")
}
attach(Carseats)
regplot(Price,Sales)
regplot=function(x,y,...){
fit=lm(y~x)
plot(x,y,...)
abline(fit,col="red")
}
regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20)