Linear regression is a very simple approach for supervised learning and a useful tool for predicting a quantitative response.

3.1 Simple Linear Regression

Simple linear regression predicts a quantitative response Y on the basis of a single predictor variable X, and assumes a linear relationship between X and Y.

Regression Equation:

Y ≈ B0 + B1(X)

B0 and B1 represent the intercept and slope terms in the linear model. They are the model coefficients or parameters.

3.1.1 Estimating the Coefficients

Fitting a line is typically done by minimizing the least squares criterion.

Residual: the difference between the ith observed response value and the ith response value that is predicted by our linear model.

Thus, the Residual Sum of Squares (RSS) is simply the sum of the squared residuals.

RSS = e1^2 + e2^2 + … + en^2
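A minimal sketch of this computation, using R's built-in cars data rather than any dataset from the text: the RSS is just the sum of the squared residuals of a fitted model.

lm.rss=lm(dist~speed,data=cars)
sum(residuals(lm.rss)^2)  # e1^2 + e2^2 + ... + en^2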

3.1.2 Assessing the Accuracy of the Coefficient Estimates

Approximating f as a linear function, we can write this relationship as:

Y = B0 + B1(X) + E

Where: B0 is the intercept term, the expected value of Y when X = 0,

and

B1 is the slope, the average increase in Y associated with a one-unit increase in X.

The error is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y, and there may be measurement error.

"An unbiased estimator does not systematically over- or under-estimate the true parameter.

The standard error tells us approximately how much our estimate of a given parameter (a mean, a coefficient, etc.) differs, on average, from the true value.

For the sample mean, it is equal to the standard deviation divided by the square root of the sample size.

Thus, the simplest and most effective way to reduce the standard error is to increase the sample size.

Standard errors are used to compute confidence intervals.
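As a quick sketch (again on the built-in cars data, not an example from the text): an approximate 95% confidence interval is the estimate plus or minus two standard errors, which closely matches R's exact t-based confint().

lm.se=lm(dist~speed,data=cars)
est=coef(summary(lm.se))["speed","Estimate"]
se=coef(summary(lm.se))["speed","Std. Error"]
c(est-2*se,est+2*se)  # rough 95% CI built from the standard error
confint(lm.se)["speed",]  # exact interval for comparison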

P-value:

“Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response.”

3.1.3 Assessing the Accuracy of the Model

“How well does the model fit the data?”

"The quality of a linear regression fit is typically assessed using two related quantities:

  1. Residual Standard Error
  2. R2 Statistic

Residual Standard Error

The RSE is an estimate of the standard deviation of E. Roughly speaking, it is the average amount that the response will deviate from the true regression line.

It is a measure of lack of fit of the model to the data.

R2 Statistic

The R2 statistic takes the form of a proportion–the proportion of variance explained.

R2 = 1 - (RSS/TSS)

where TSS, the total sum of squares, measures the total variance in the response Y.
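A short sketch of this formula on the built-in cars data, checking the hand computation against summary()'s reported value:

lm.r2=lm(dist~speed,data=cars)
rss=sum(residuals(lm.r2)^2)
tss=sum((cars$dist-mean(cars$dist))^2)  # total sum of squares
1-rss/tss  # matches summary(lm.r2)$r.squared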

3.2 Multiple Linear Regression

“Extends the simple linear regression model so that it can directly accommodate multiple predictors.”

The multiple linear regression model:

Y = B0 + B1X1 + B2X2 + … + BpXp + E

Where Xj represents the jth predictor and Bj quantifies the association between that variable and the response. We interpret Bj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed.

3.2.1 Estimating the Regression Coefficients

3.2.2 Some Important Questions

  1. Is at least one of the predictors X1, X2, …, Xp useful in predicting the response?
  2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

One: Is There a Relationship Between the Response and Predictors?

Are all the regression coefficients zero?

H0: B1 = B2 = … = Bp = 0

Ha: at least one Bj is non-zero.

This hypothesis test is performed by computing the F-statistic.

If there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. If Ha is true, then we expect F to be greater than 1.
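A small sketch of where this statistic lives in R's output, using the built-in mtcars data (an illustration, not an example from the text):

lm.f=lm(mpg~wt+hp,data=mtcars)
summary(lm.f)$fstatistic  # F value plus its numerator/denominator degrees of freedom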

Two: Deciding on Important Variables

Three classical approaches for model selection:

  1. Forward selection: begin with an intercept-only model and add variables one by one, at each step choosing the addition that gives the lowest RSS.

  2. Backward selection: start with all variables in the model and repeatedly remove the one with the largest p-value, stopping when a rule is met (for example, when all remaining p-values fall below some threshold).

  3. Mixed selection: a combination of the two methods above.

Three: Model Fit

What is happening with the R2 and the RSE?

Four: Predictions

Three sorts of uncertainty are associated with predicting the response Y based on our estimated parameters:

  1. The coefficients are only estimates of the true population coefficients, and thus the least squares line is only an estimate of the true population regression line.

  2. There is also potentially reducible error associated with model selection, referred to as model bias.

  3. There is always random error e in the model.

3.3 Other Considerations in the Regression Model

3.3.1 Qualitative Predictors

Predictors with Only Two Levels

A dummy variable is used (a variable that takes on only two possible numerical values).

For example:

xi = 1 if ith person is female and xi = 0 if ith person is male

Resulting in the model

yi = B0 + B1xi + Ei

For males (xi = 0) the B1 term drops out, so B0 is the average response for males and B1 is the average difference between females and males.
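A minimal sketch with made-up data (the variable names here are invented for illustration) showing how R codes a two-level factor as a single 0/1 dummy:

set.seed(1)
sex=factor(sample(c("male","female"),100,replace=TRUE))
y=ifelse(sex=="female",5,3)+rnorm(100)
contrasts(sex)  # "female" is the baseline level; "sexmale" is the 0/1 dummy
coef(lm(y~sex))  # intercept ~ female mean (5); sexmale ~ male-female gap (-2)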

Qualitative Predictors with More than Two Levels

“When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the ethnicity variable we create two dummy variables.”

For example:

xi1 = 1 if ith person is Asian and 0 if ith person is not Asian,

and the second could be

xi2 = 1 if ith person is Caucasian and 0 if ith person is not Caucasian.

The level with no dummy variable is the baseline.
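As a short sketch of this coding (a toy factor, not the text's data):

eth=factor(c("African American","Asian","Caucasian"))
contrasts(eth)  # two dummy columns; the first level is the baseline with all zeros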

3.3.2 Extensions of the Linear Model

Two important assumptions of the linear regression model:

  1. The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors.

  2. The linear assumption states that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj.

Removing the Additive Assumption

There may be synergistic cases, where an increase in one variable amplifies the effect (slope) of another. Including an interaction term allows us to incorporate this in linear models.
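A quick sketch on the built-in mtcars data: in R's formula syntax, x1*x2 expands to x1 + x2 + x1:x2, where x1:x2 is the interaction term.

coef(lm(mpg~wt*hp,data=mtcars))  # wt:hp is the interaction coefficient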

Non-linear relationships

You can use polynomials in linear models to better capture relationships with some curvature.

Polynomial regression
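For instance (a sketch on the built-in cars data), a quadratic fit is still a linear model in its coefficients:

summary(lm(dist~poly(speed,2),data=cars))  # polynomial regression of degree 2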

3.3.3 Potential Problems

  1. Non-linearity of the response-predictor relationships
  2. Correlation of error terms
  3. Non-constant variance of error terms
  4. Outliers
  5. High-leverage points
  6. Collinearity

1. Non-linearity of the Data

Residual plots are a useful graphical tool for identifying non-linearity.

2. Correlation of Error Terms

“If there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors… In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.”

Correlated errors frequently occur in the context of time series data.

3. Non-constant Variance of Error Terms

Heteroscedasticity: non-constant variances in the errors. A funnel shape in the residual plot is an easy way to identify its presence.

Heteroscedasticity occurs when the magnitude of the residuals tends to increase with the fitted values.

4. Outliers

An outlier is a point for which yi is far from the value predicted by the model.

Plotting the studentized residuals helps identify outliers.

Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.

It is important to remember that outliers are observations for which the response yi is unusual given the predictor xi.
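A minimal sketch of this check on the built-in cars data, using the |rstudent| > 3 rule of thumb above:

lm.out=lm(dist~speed,data=cars)
which(abs(rstudent(lm.out))>3)  # indices of possible outliers, if any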

5. High Leverage Points

Observations with high leverage have an unusual value for xi.
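A short sketch, using the common rule of thumb (an assumption, not stated in these notes) that leverage well above the average value (p + 1)/n deserves a look:

lm.lev=lm(dist~speed,data=cars)
lev=hatvalues(lm.lev)
which(lev>2*mean(lev))  # observations with unusually high leverage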

6. Collinearity

Refers to the situation in which two or more predictor variables are closely related to one another.

The presence of collinearity can make it difficult to separate out the individual effects of collinear variables on the response.

The power of the hypothesis test–the probability of correctly detecting a non-zero coefficient–is reduced by collinearity.

Multicollinearity can occur even if no single pair of variables is particularly highly correlated, because collinearity can exist among three or more variables.

The Variance Inflation Factor (VIF) is a better way of assessing multicollinearity.

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
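A hand-rolled sketch of the definition VIFj = 1/(1 − R²j), where R²j comes from regressing the jth predictor on all the others (shown on mtcars; the car package's vif() computes the same quantity):

r2=summary(lm(wt~hp+disp,data=mtcars))$r.squared  # regress wt on the other predictors
1/(1-r2)  # VIF for wt; values above 5 or 10 are worrying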

Two simple solutions:

  1. Drop one of the problematic variables from the regression.

  2. Combine the collinear variables together into a single predictor.

3.4 The Marketing Plan

(Basically just a review of everything above)

3.5 Comparison of Linear Regression with K-Nearest Neighbors

K-nearest neighbors regression (KNN regression).
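A hedged sketch, assuming the FNN package is installed (it is not used elsewhere in these notes): its knn.reg() function performs KNN regression by averaging the responses of the k nearest training points.

library(FNN)
knn.out=knn.reg(train=as.matrix(cars$speed),y=cars$dist,k=5)
head(knn.out$pred)  # leave-one-out fitted values with k = 5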

3.6 Lab: Linear Regression

3.6.1 Libraries

library(MASS)

Attaching package: ‘MASS’

The following object is masked _by_ ‘.GlobalEnv’:

    Boston
library(ISLR)

Attaching package: ‘ISLR’

The following objects are masked _by_ ‘.GlobalEnv’:

    Auto, College
3.6.2 Simple Linear Regression

fix(Boston)
names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"   
 [5] "nox"     "rm"      "age"     "dis"    
 [9] "rad"     "tax"     "ptratio" "black"  
[13] "lstat"   "medv"   
lm.fit=lm(medv~lstat,Boston)
attach(Boston)
lm.fit

Call:
lm(formula = medv ~ lstat, data = Boston)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  
summary(lm.fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value
(Intercept) 34.55384    0.56263   61.41
lstat       -0.95005    0.03873  -24.53
            Pr(>|t|)    
(Intercept)   <2e-16 ***
lstat         <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
names(lm.fit)
 [1] "coefficients"  "residuals"    
 [3] "effects"       "rank"         
 [5] "fitted.values" "assign"       
 [7] "qr"            "df.residual"  
 [9] "xlevels"       "call"         
[11] "terms"         "model"        
confint(lm.fit)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505
predict(lm.fit,data.frame(lstat=c(5,10,15)),interval="confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
predict(lm.fit,data.frame(lstat=c(5,10,15)),interval="prediction")
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846
plot(lstat,medv)
abline(lm.fit)

plot(lstat,medv)
abline(lm.fit)
abline(lm.fit,lwd=3)
abline(lm.fit,lwd=3,col="red")

plot(lstat,medv,col="red")

plot(lstat,medv,pch=20)

plot(lstat,medv,pch="+")

plot(1:20,1:20,pch=1:20)

par(mfrow=c(2,2))
plot(lm.fit)

plot(predict(lm.fit),residuals(lm.fit))

plot(predict(lm.fit),rstudent(lm.fit))

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))
375 
375 

3.6.3 Multiple Linear Regression

lm.fit=lm(medv~lstat+age,data=Boston)
summary(lm.fit)

Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value
(Intercept) 33.22276    0.73085  45.458
lstat       -1.03207    0.04819 -21.416
age          0.03454    0.01223   2.826
            Pr(>|t|)    
(Intercept)  < 2e-16 ***
lstat        < 2e-16 ***
age          0.00491 ** 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,    Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
lm.fit=lm(medv~.,data=Boston)
summary(lm.fit)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value
(Intercept)  3.646e+01  5.103e+00   7.144
crim        -1.080e-01  3.286e-02  -3.287
zn           4.642e-02  1.373e-02   3.382
indus        2.056e-02  6.150e-02   0.334
chas         2.687e+00  8.616e-01   3.118
nox         -1.777e+01  3.820e+00  -4.651
rm           3.810e+00  4.179e-01   9.116
age          6.922e-04  1.321e-02   0.052
dis         -1.476e+00  1.995e-01  -7.398
rad          3.060e-01  6.635e-02   4.613
tax         -1.233e-02  3.760e-03  -3.280
ptratio     -9.527e-01  1.308e-01  -7.283
black        9.312e-03  2.686e-03   3.467
lstat       -5.248e-01  5.072e-02 -10.347
            Pr(>|t|)    
(Intercept) 3.28e-12 ***
crim        0.001087 ** 
zn          0.000778 ***
indus       0.738288    
chas        0.001925 ** 
nox         4.25e-06 ***
rm           < 2e-16 ***
age         0.958229    
dis         6.01e-13 ***
rad         5.07e-06 ***
tax         0.001112 ** 
ptratio     1.31e-12 ***
black       0.000573 ***
lstat        < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
lm.fit1=lm(medv~.-age,data=Boston)
summary(lm.fit1)

Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 

Coefficients:
              Estimate Std. Error t value
(Intercept)  36.436927   5.080119   7.172
crim         -0.108006   0.032832  -3.290
zn            0.046334   0.013613   3.404
indus         0.020562   0.061433   0.335
chas          2.689026   0.859598   3.128
nox         -17.713540   3.679308  -4.814
rm            3.814394   0.408480   9.338
dis          -1.478612   0.190611  -7.757
rad           0.305786   0.066089   4.627
tax          -0.012329   0.003755  -3.283
ptratio      -0.952211   0.130294  -7.308
black         0.009321   0.002678   3.481
lstat        -0.523852   0.047625 -10.999
            Pr(>|t|)    
(Intercept) 2.72e-12 ***
crim        0.001075 ** 
zn          0.000719 ***
indus       0.737989    
chas        0.001863 ** 
nox         1.97e-06 ***
rm           < 2e-16 ***
dis         5.03e-14 ***
rad         4.75e-06 ***
tax         0.001099 ** 
ptratio     1.10e-12 ***
black       0.000544 ***
lstat        < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16

3.6.4 Interaction terms

summary(lm(medv~lstat*age,data=Boston))

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value
(Intercept) 36.0885359  1.4698355  24.553
lstat       -1.3921168  0.1674555  -8.313
age         -0.0007209  0.0198792  -0.036
lstat:age    0.0041560  0.0018518   2.244
            Pr(>|t|)    
(Intercept)  < 2e-16 ***
lstat       8.78e-16 ***
age           0.9711    
lstat:age     0.0252 *  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

3.6.5 Non-linear Transformations of the Predictors

lm.fit2=lm(medv~lstat+I(lstat^2))
summary(lm.fit)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value
(Intercept)  3.646e+01  5.103e+00   7.144
crim        -1.080e-01  3.286e-02  -3.287
zn           4.642e-02  1.373e-02   3.382
indus        2.056e-02  6.150e-02   0.334
chas         2.687e+00  8.616e-01   3.118
nox         -1.777e+01  3.820e+00  -4.651
rm           3.810e+00  4.179e-01   9.116
age          6.922e-04  1.321e-02   0.052
dis         -1.476e+00  1.995e-01  -7.398
rad          3.060e-01  6.635e-02   4.613
tax         -1.233e-02  3.760e-03  -3.280
ptratio     -9.527e-01  1.308e-01  -7.283
black        9.312e-03  2.686e-03   3.467
lstat       -5.248e-01  5.072e-02 -10.347
            Pr(>|t|)    
(Intercept) 3.28e-12 ***
crim        0.001087 ** 
zn          0.000778 ***
indus       0.738288    
chas        0.001925 ** 
nox         4.25e-06 ***
rm           < 2e-16 ***
age         0.958229    
dis         6.01e-13 ***
rad         5.07e-06 ***
tax         0.001112 ** 
ptratio     1.31e-12 ***
black       0.000573 ***
lstat        < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
lm.fit=lm(medv~lstat,data=Boston)
anova(lm.fit,lm.fit2)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    504 19472                                 
2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
par(mfrow=c(2,2))
plot(lm.fit2)

lm.fit5=lm(medv~poly(lstat,5))
summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5))

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5433  -3.1039  -0.7052   2.0844  27.1153 

Coefficients:
                 Estimate Std. Error t value
(Intercept)       22.5328     0.2318  97.197
poly(lstat, 5)1 -152.4595     5.2148 -29.236
poly(lstat, 5)2   64.2272     5.2148  12.316
poly(lstat, 5)3  -27.0511     5.2148  -5.187
poly(lstat, 5)4   25.4517     5.2148   4.881
poly(lstat, 5)5  -19.2524     5.2148  -3.692
                Pr(>|t|)    
(Intercept)      < 2e-16 ***
poly(lstat, 5)1  < 2e-16 ***
poly(lstat, 5)2  < 2e-16 ***
poly(lstat, 5)3 3.10e-07 ***
poly(lstat, 5)4 1.42e-06 ***
poly(lstat, 5)5 0.000247 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,    Adjusted R-squared:  0.6785 
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
summary(lm(medv~log(rm),data=Boston))

Call:
lm(formula = medv ~ log(rm), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.487  -2.875  -0.104   2.837  39.816 

Coefficients:
            Estimate Std. Error t value
(Intercept)  -76.488      5.028  -15.21
log(rm)       54.055      2.739   19.73
            Pr(>|t|)    
(Intercept)   <2e-16 ***
log(rm)       <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared:  0.4358,    Adjusted R-squared:  0.4347 
F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16

3.6.6 Qualitative Predictors

fix(Carseats)
names(Carseats)
 [1] "Sales"       "CompPrice"   "Income"     
 [4] "Advertising" "Population"  "Price"      
 [7] "ShelveLoc"   "Age"         "Education"  
[10] "Urban"       "US"         
lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats)
summary(lm.fit)

Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9208 -0.7503  0.0177  0.6754  3.3413 

Coefficients:
                     Estimate Std. Error
(Intercept)         6.5755654  1.0087470
CompPrice           0.0929371  0.0041183
Income              0.0108940  0.0026044
Advertising         0.0702462  0.0226091
Population          0.0001592  0.0003679
Price              -0.1008064  0.0074399
ShelveLocGood       4.8486762  0.1528378
ShelveLocMedium     1.9532620  0.1257682
Age                -0.0579466  0.0159506
Education          -0.0208525  0.0196131
UrbanYes            0.1401597  0.1124019
USYes              -0.1575571  0.1489234
Income:Advertising  0.0007510  0.0002784
Price:Age           0.0001068  0.0001333
                   t value Pr(>|t|)    
(Intercept)          6.519 2.22e-10 ***
CompPrice           22.567  < 2e-16 ***
Income               4.183 3.57e-05 ***
Advertising          3.107 0.002030 ** 
Population           0.433 0.665330    
Price              -13.549  < 2e-16 ***
ShelveLocGood       31.724  < 2e-16 ***
ShelveLocMedium     15.531  < 2e-16 ***
Age                 -3.633 0.000318 ***
Education           -1.063 0.288361    
UrbanYes             1.247 0.213171    
USYes               -1.058 0.290729    
Income:Advertising   2.698 0.007290 ** 
Price:Age            0.801 0.423812    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8719 
F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16
attach(Carseats)
contrasts(ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1

3.7 Exercises

Conceptual

  1. The null hypothesis H0: Bj = 0 states that there is no relationship between the response (sales) and the corresponding predictor (TV, radio, or newspaper advertising). The coefficient Bj represents the slope: the change in Y given a one-unit change in Xj. Thus, if the slope is 0, there is no relationship.

3a.

Female: 85 + 20(GPA) + .07(IQ) + .01(GPA:IQ) -10(GPA)

Male: 50 + 20(GPA) + .07(IQ) + .01(GPA:IQ)

If GPA is 3.5, Males and Females have the same starting salary. If below 3.5, females will have a higher starting salary. If above 3.5, males will have a higher starting salary.

3b.

Female: 85 + 20(4) + .07(110) + .01(4 x 110) - 10(4) = 137.1
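Checking that arithmetic in R (just a verification of the numbers above):

85+20*4+.07*110+.01*(4*110)-10*4  # 137.1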

3c.

False: the magnitude of a coefficient simply reflects the slope (and scale) of the relationship between the predictor and the response. Evidence for or against an interaction effect comes from the p-value, and a large p-value would indicate little evidence of an interaction effect.

4a.

We would expect the training Residual Sum of Squares for the cubic regression to be lower, because the more flexible model can fit the training data more closely (the linear model is nested within the cubic one), even when the true relationship is linear.

4b.

Again, it depends on the true form of the data, but in the case that the cubic regression in the training data “overfit” the data, then we would expect it to have low applicability to the real “test” data and thus would expect the RSS to be higher for the cubic regression in this case.

4c. Again, we would expect the cubic regression to have a lower RSS.

4d. It depends on how close to linear the true model is. If the model is closer to linear, then the linear regression is better. However, if the true model is not very close to linear at all then the cubic regression is much more appropriate.

  1. Note that in the case of simple linear regression, the least squares line always passes through the point of means of the predictor and the response.
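A quick sketch verifying this on the built-in cars data: the fitted value at the mean of the predictor equals the mean of the response.

lm.mean=lm(dist~speed,data=cars)
predict(lm.mean,data.frame(speed=mean(cars$speed)))-mean(cars$dist)  # ~ 0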

Applied

8a.

lm.auto=lm(mpg~horsepower,data=Auto)
summary(lm.auto)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value
(Intercept) 39.935861   0.717499   55.66
horsepower  -0.157845   0.006446  -24.49
            Pr(>|t|)    
(Intercept)   <2e-16 ***
horsepower    <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. Yes: the large F-statistic and its near-zero p-value indicate a statistically significant relationship between horsepower and mpg.

  2. There is an inverse relationship between the predictor and response: an increase in horsepower of 100 leads to a reduction in mpg of about 15.8. Our R-squared tells us that about 60% of the variation in mpg can be explained by horsepower.

  3. Negative

predict(lm.auto,data.frame(horsepower=c(98)),interval="confidence")
       fit      lwr      upr
1 24.46708 23.97308 24.96108
predict(lm.auto,data.frame(horsepower=c(98)),interval="prediction")
       fit     lwr      upr
1 24.46708 14.8094 34.12476
attach(Auto)
The following objects are masked from Auto (pos = 3):

    acceleration, cylinders, displacement,
    horsepower, mpg, name, origin, weight,
    year
plot(horsepower,mpg)
abline(lm.auto,col="red")

par(mfrow=c(2,2))
plot(lm.auto)

Examining the Residuals vs Fitted plot demonstrates that our residuals track our fitted values, an indication of non-linearity in the data. We could potentially resolve this by adding a horsepower^2 term, since the scatterplot of the relationship is clearly non-linear. We also appear to have some high leverage points, which can be fleshed out further via the studentized residuals, as the plots below show.

plot(predict(lm.auto),rstudent(lm.auto),col=ifelse(rstudent(lm.auto)>=3,"red","black"))
text(predict(lm.auto),rstudent(lm.auto),labels=ifelse(rstudent(lm.auto)>=3,names(rstudent(lm.auto)),""),pos=4)

plot(hatvalues(lm.auto),col=ifelse(hatvalues(lm.auto)>.028,"red","black"))
text(hatvalues(lm.auto),labels=ifelse(hatvalues(lm.auto)>.028,names(hatvalues(lm.auto)),""),cex=.7,pos=4)

pairs(~.,data=Auto)

cor(Auto[,-9])
                    mpg  cylinders displacement
mpg           1.0000000 -0.7776175   -0.8051269
cylinders    -0.7776175  1.0000000    0.9508233
displacement -0.8051269  0.9508233    1.0000000
horsepower   -0.7784268  0.8429834    0.8972570
weight       -0.8322442  0.8975273    0.9329944
acceleration  0.4233285 -0.5046834   -0.5438005
year          0.5805410 -0.3456474   -0.3698552
origin        0.5652088 -0.5689316   -0.6145351
             horsepower     weight acceleration
mpg          -0.7784268 -0.8322442    0.4233285
cylinders     0.8429834  0.8975273   -0.5046834
displacement  0.8972570  0.9329944   -0.5438005
horsepower    1.0000000  0.8645377   -0.6891955
weight        0.8645377  1.0000000   -0.4168392
acceleration -0.6891955 -0.4168392    1.0000000
year         -0.4163615 -0.3091199    0.2903161
origin       -0.4551715 -0.5850054    0.2127458
                   year     origin
mpg           0.5805410  0.5652088
cylinders    -0.3456474 -0.5689316
displacement -0.3698552 -0.6145351
horsepower   -0.4163615 -0.4551715
weight       -0.3091199 -0.5850054
acceleration  0.2903161  0.2127458
year          1.0000000  0.1815277
origin        0.1815277  1.0000000
lm.auto2=lm(mpg~.-name,data=Auto)
summary(lm.auto2)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value
(Intercept)  -17.218435   4.644294  -3.707
cylinders     -0.493376   0.323282  -1.526
displacement   0.019896   0.007515   2.647
horsepower    -0.016951   0.013787  -1.230
weight        -0.006474   0.000652  -9.929
acceleration   0.080576   0.098845   0.815
year           0.750773   0.050973  14.729
origin         1.426141   0.278136   5.127
             Pr(>|t|)    
(Intercept)   0.00024 ***
cylinders     0.12780    
displacement  0.00844 ** 
horsepower    0.21963    
weight        < 2e-16 ***
acceleration  0.41548    
year          < 2e-16 ***
origin       4.67e-07 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Yes, excluding cylinders, horsepower, and acceleration. The F-statistic and its p-value indicate a significant relationship between the predictors and the response. The model accounts for about 82% of the variation in mpg.

  2. Holding the other predictors fixed, average mpg increases by about 7.5 every ten years (the year coefficient is 0.75).

par(mfrow=c(2,2))
plot(lm.auto2)

The data appear to be pretty linear. The residuals are fairly evenly distributed on both sides of the fitted line, so there does not appear to be evidence of correlation among sequential residuals. We may have some outliers and high-leverage points, which are labeled in the plots.

lm.auto3=lm(mpg~.-name+horsepower:acceleration-cylinders-displacement,data=Auto)
summary(lm.auto3)

Call:
lm(formula = mpg ~ . - name + horsepower:acceleration - cylinders - 
    displacement, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9104 -1.8819 -0.0609  1.8141 12.4503 

Coefficients:
                          Estimate Std. Error
(Intercept)             -3.150e+01  4.692e+00
horsepower               1.127e-01  2.035e-02
weight                  -4.515e-03  4.994e-04
acceleration             9.307e-01  1.511e-01
year                     7.601e-01  4.793e-02
origin                   1.151e+00  2.459e-01
horsepower:acceleration -1.123e-02  1.530e-03
                        t value Pr(>|t|)    
(Intercept)              -6.715 6.77e-11 ***
horsepower                5.538 5.67e-08 ***
weight                   -9.041  < 2e-16 ***
acceleration              6.159 1.84e-09 ***
year                     15.856  < 2e-16 ***
origin                    4.680 3.98e-06 ***
horsepower:acceleration  -7.339 1.29e-12 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.141 on 385 degrees of freedom
Multiple R-squared:  0.8405,    Adjusted R-squared:  0.838 
F-statistic: 338.1 on 6 and 385 DF,  p-value: < 2.2e-16
lm.auto4=lm(mpg~horsepower+I(horsepower^2),data=Auto)
summary(lm.auto4)

Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7135  -2.5943  -0.0859   2.2868  15.8961 

Coefficients:
                  Estimate Std. Error t value
(Intercept)     56.9000997  1.8004268   31.60
horsepower      -0.4661896  0.0311246  -14.98
I(horsepower^2)  0.0012305  0.0001221   10.08
                Pr(>|t|)    
(Intercept)       <2e-16 ***
horsepower        <2e-16 ***
I(horsepower^2)   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,    Adjusted R-squared:  0.686 
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.auto4)

attach(Carseats)

10a.

summary(Carseats)
     Sales          CompPrice  
 Min.   : 0.000   Min.   : 77  
 1st Qu.: 5.390   1st Qu.:115  
 Median : 7.490   Median :125  
 Mean   : 7.496   Mean   :125  
 3rd Qu.: 9.320   3rd Qu.:135  
 Max.   :16.270   Max.   :175  
     Income        Advertising    
 Min.   : 21.00   Min.   : 0.000  
 1st Qu.: 42.75   1st Qu.: 0.000  
 Median : 69.00   Median : 5.000  
 Mean   : 68.66   Mean   : 6.635  
 3rd Qu.: 91.00   3rd Qu.:12.000  
 Max.   :120.00   Max.   :29.000  
   Population        Price        ShelveLoc  
 Min.   : 10.0   Min.   : 24.0   Bad   : 96  
 1st Qu.:139.0   1st Qu.:100.0   Good  : 85  
 Median :272.0   Median :117.0   Medium:219  
 Mean   :264.8   Mean   :115.8               
 3rd Qu.:398.5   3rd Qu.:131.0               
 Max.   :509.0   Max.   :191.0               
      Age          Education    Urban    
 Min.   :25.00   Min.   :10.0   No :118  
 1st Qu.:39.75   1st Qu.:12.0   Yes:282  
 Median :54.50   Median :14.0            
 Mean   :53.32   Mean   :13.9            
 3rd Qu.:66.00   3rd Qu.:16.0            
 Max.   :80.00   Max.   :18.0            
   US     
 No :142  
 Yes:258  
          
          
          
          
contrasts(ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1
lm.cars=lm(Sales~Price+Urban+US,data=Carseats)
summary(lm.cars)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value
(Intercept) 13.043469   0.651012  20.036
Price       -0.054459   0.005242 -10.389
UrbanYes    -0.021916   0.271650  -0.081
USYes        1.200573   0.259042   4.635
            Pr(>|t|)    
(Intercept)  < 2e-16 ***
Price        < 2e-16 ***
UrbanYes       0.936    
USYes       4.86e-06 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  1. Price has a negative relationship with Sales: a one-unit increase in price is associated with a decrease in sales of about 0.054 (in thousands of units). UrbanYes indicates the store is in an urban area; although not statistically significant, the estimate suggests urban sales are about 0.02 lower than the non-urban baseline. USYes indicates the store is in the US; stores in the US sell about 1.2 (thousand units) more than the non-US baseline.

  2. Sales = 13.04 - .05(Price) - .02(UrbanYes) + 1.2(USYes); if Urban and US are both “No”, then Sales = 13.04 - .05(Price).

  3. Price and USYes

lm.cars2=lm(Sales~Price+US,data=Carseats)
summary(lm.cars2)

Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value
(Intercept) 13.03079    0.63098  20.652
Price       -0.05448    0.00523 -10.416
USYes        1.19964    0.25846   4.641
            Pr(>|t|)    
(Intercept)  < 2e-16 ***
Price        < 2e-16 ***
USYes       4.71e-06 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  1. After removing “Urban” we note almost no change in the R-squared value, demonstrating that Urban was not useful in explaining the variation in sales. Furthermore, the newer model has a larger F-statistic, and its residual standard error actually decreased (although not by much).

confint(lm.cars2)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632
par(mfrow=c(2,2))
plot(lm.cars2)

The plots look pretty good. No evident issues of non-linearity or correlated residuals, but the plot of residuals vs leverage demonstrates that we likely have some high leverage points.

set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)
lm.equat=lm(y~x+0)
summary(lm.equat)

Call:
lm(formula = y ~ x + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9154 -0.6472 -0.1771  0.5056  2.3109 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x   1.9939     0.1065   18.73   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9586 on 99 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7776 
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
lm.equat2=lm(x~y+0)
summary(lm.equat2)

Call:
lm(formula = x ~ y + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8699 -0.2368  0.1030  0.2858  0.8938 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
y  0.39111    0.02089   18.73   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4246 on 99 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7776 
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
  1. The results are the same except for the estimates of the coefficients and the residuals. They have the same R-squareds, F-statistics and p-values.

lm.equat=lm(y~x)
summary(lm.equat)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8768 -0.6138 -0.1395  0.5394  2.3462 

Coefficients:
            Estimate Std. Error t value
(Intercept) -0.03769    0.09699  -0.389
x            1.99894    0.10773  18.556
            Pr(>|t|)    
(Intercept)    0.698    
x             <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9628 on 98 degrees of freedom
Multiple R-squared:  0.7784,    Adjusted R-squared:  0.7762 
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
lm.equat2=lm(x~y)
summary(lm.equat2)

Call:
lm(formula = x ~ y)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90848 -0.28101  0.06274  0.24570  0.85736 

Coefficients:
            Estimate Std. Error t value
(Intercept)  0.03880    0.04266    0.91
y            0.38942    0.02099   18.56
            Pr(>|t|)    
(Intercept)    0.365    
y             <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4249 on 98 degrees of freedom
Multiple R-squared:  0.7784,    Adjusted R-squared:  0.7762 
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
  1. (Skipped)

set.seed(1)
x=rnorm(100,0,1)
eps=rnorm(100,0,.25)
y=-1+.5*x+eps
length(y)
[1] 100

B0 = -1 and B1 = .5

scatter.smooth(x,y)

The relationship is linear, with scatter around the line introduced by the eps noise term.

lm.equat3=lm(y~x)
summary(lm.equat3)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46921 -0.15344 -0.03487  0.13485  0.58654 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
x            0.49973    0.02693   18.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2407 on 98 degrees of freedom
Multiple R-squared:  0.7784,    Adjusted R-squared:  0.7762 
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

The predicted values of B0 and B1 are almost identical to the actual values.

coef(lm.equat3)
(Intercept)           x 
 -1.0094232   0.4997349 
plot(x,y,pch=20)
abline(coef=c(-1,0.5),col="red")
abline(coef(lm.equat3),col="green")
legend("bottomright",legend=c("Population Regression","Least Square"),col=c("red","green"),lty=c(1,1))

lm.equat4=lm(y~x+I(x^2))
summary(lm.equat4)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4913 -0.1563 -0.0322  0.1451  0.5675 

Coefficients:
            Estimate Std. Error t value
(Intercept) -0.98582    0.02941 -33.516
x            0.50429    0.02700  18.680
I(x^2)      -0.02973    0.02119  -1.403
            Pr(>|t|)    
(Intercept)   <2e-16 ***
x             <2e-16 ***
I(x^2)         0.164    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2395 on 97 degrees of freedom
Multiple R-squared:  0.7828,    Adjusted R-squared:  0.7784 
F-statistic: 174.8 on 2 and 97 DF,  p-value: < 2.2e-16

No: the quadratic term is not statistically significant (p = 0.164), the R-squared barely improves, and the F-statistic decreases. The fit is still significant overall, but the quadratic term does not help.

set.seed(1)
x1=runif(100)
x2=0.5*x1+rnorm(100)/10
y=2+2*x1+.3*x2+rnorm(100)

The regression coefficients: B0 = 2, B1 = 2, B2 = .3

cor(x1,x2)
[1] 0.8351212
plot(x1,x2,col="black")

lm.equat4=lm(y~x1+x2)
summary(lm.equat4)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8311 -0.7273 -0.0537  0.6338  2.3359 

Coefficients:
            Estimate Std. Error t value
(Intercept)   2.1305     0.2319   9.188
x1            1.4396     0.7212   1.996
x2            1.0097     1.1337   0.891
            Pr(>|t|)    
(Intercept) 7.61e-15 ***
x1            0.0487 *  
x2            0.3754    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.056 on 97 degrees of freedom
Multiple R-squared:  0.2088,    Adjusted R-squared:  0.1925 
F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

B0 = 2.13, B1 = 1.4, B2 = 1

These estimated coefficients differ from their population values; the coefficient on x2 is furthest off, and it is not statistically significant. We can reject the null H0: B1 = 0, but not the null H0: B2 = 0.

lm.equat5=lm(y~x1)
summary(lm.equat5)

Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.89495 -0.66874 -0.07785  0.59221  2.45560 

Coefficients:
            Estimate Std. Error t value
(Intercept)   2.1124     0.2307   9.155
x1            1.9759     0.3963   4.986
            Pr(>|t|)    
(Intercept) 8.27e-15 ***
x1          2.66e-06 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.055 on 98 degrees of freedom
Multiple R-squared:  0.2024,    Adjusted R-squared:  0.1942 
F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
lm.equat6=lm(y~x2)
summary(lm.equat6)

Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.62687 -0.75156 -0.03598  0.72383  2.44890 

Coefficients:
            Estimate Std. Error t value
(Intercept)   2.3899     0.1949   12.26
x2            2.8996     0.6330    4.58
            Pr(>|t|)    
(Intercept)  < 2e-16 ***
x2          1.37e-05 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.072 on 98 degrees of freedom
Multiple R-squared:  0.1763,    Adjusted R-squared:  0.1679 
F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
  1. The results obtained in (c)-(e) do not contradict each other: a variable can be significant on its own yet not significant in the presence of a correlated variable. Because x1 and x2 are highly correlated, x2 provides little new information that x1 does not already account for when the two are included together.

  2. (Skipped)

15.

attach(Boston)
The following objects are masked from Boston (pos = 3):

    age, black, chas, crim, dis, indus,
    lstat, medv, nox, ptratio, rad, rm,
    tax, zn
names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"   
 [5] "nox"     "rm"      "age"     "dis"    
 [9] "rad"     "tax"     "ptratio" "black"  
[13] "lstat"   "medv"   
lm.b1=lm(crim~zn)
summary(lm.b1)

Call:
lm(formula = crim ~ zn)

Residuals:
   Min     1Q Median     3Q    Max 
-4.429 -4.222 -2.620  1.250 84.523 

Coefficients:
            Estimate Std. Error t value
(Intercept)  4.45369    0.41722  10.675
zn          -0.07393    0.01609  -4.594
            Pr(>|t|)    
(Intercept)  < 2e-16 ***
zn          5.51e-06 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.435 on 504 degrees of freedom
Multiple R-squared:  0.04019,   Adjusted R-squared:  0.03828 
F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06
lm.b2=lm(crim~indus)
summary(lm.b2)

Call:
lm(formula = crim ~ indus)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.972  -2.698  -0.736   0.712  81.813 

Coefficients:
            Estimate Std. Error t value
(Intercept) -2.06374    0.66723  -3.093
indus        0.50978    0.05102   9.991
            Pr(>|t|)    
(Intercept)  0.00209 ** 
indus        < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.866 on 504 degrees of freedom
Multiple R-squared:  0.1653,    Adjusted R-squared:  0.1637 
F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b3=lm(crim~chas)
summary(lm.b3)

Call:
lm(formula = crim ~ chas)

Residuals:
   Min     1Q Median     3Q    Max 
-3.738 -3.661 -3.435  0.018 85.232 

Coefficients:
            Estimate Std. Error t value
(Intercept)   3.7444     0.3961   9.453
chas         -1.8928     1.5061  -1.257
            Pr(>|t|)    
(Intercept)   <2e-16 ***
chas           0.209    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.597 on 504 degrees of freedom
Multiple R-squared:  0.003124,  Adjusted R-squared:  0.001146 
F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
lm.b4=lm(crim~nox)
summary(lm.b4)

Call:
lm(formula = crim ~ nox)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.371  -2.738  -0.974   0.559  81.728 

Coefficients:
            Estimate Std. Error t value
(Intercept)  -13.720      1.699  -8.073
nox           31.249      2.999  10.419
            Pr(>|t|)    
(Intercept) 5.08e-15 ***
nox          < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.81 on 504 degrees of freedom
Multiple R-squared:  0.1772,    Adjusted R-squared:  0.1756 
F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b5=lm(crim~rm)
summary(lm.b5)

Call:
lm(formula = crim ~ rm)

Residuals:
   Min     1Q Median     3Q    Max 
-6.604 -3.952 -2.654  0.989 87.197 

Coefficients:
            Estimate Std. Error t value
(Intercept)   20.482      3.365   6.088
rm            -2.684      0.532  -5.045
            Pr(>|t|)    
(Intercept) 2.27e-09 ***
rm          6.35e-07 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.401 on 504 degrees of freedom
Multiple R-squared:  0.04807,   Adjusted R-squared:  0.04618 
F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07
lm.b6=lm(crim~age)
summary(lm.b6)

Call:
lm(formula = crim ~ age)

Residuals:
   Min     1Q Median     3Q    Max 
-6.789 -4.257 -1.230  1.527 82.849 

Coefficients:
            Estimate Std. Error t value
(Intercept) -3.77791    0.94398  -4.002
age          0.10779    0.01274   8.463
            Pr(>|t|)    
(Intercept) 7.22e-05 ***
age         2.85e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.057 on 504 degrees of freedom
Multiple R-squared:  0.1244,    Adjusted R-squared:  0.1227 
F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16
lm.b7=lm(crim~dis)
summary(lm.b7)

Call:
lm(formula = crim ~ dis)

Residuals:
   Min     1Q Median     3Q    Max 
-6.708 -4.134 -1.527  1.516 81.674 

Coefficients:
            Estimate Std. Error t value
(Intercept)   9.4993     0.7304  13.006
dis          -1.5509     0.1683  -9.213
            Pr(>|t|)    
(Intercept)   <2e-16 ***
dis           <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.965 on 504 degrees of freedom
Multiple R-squared:  0.1441,    Adjusted R-squared:  0.1425 
F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b8=lm(crim~rad)
summary(lm.b8)

Call:
lm(formula = crim ~ rad)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.164  -1.381  -0.141   0.660  76.433 

Coefficients:
            Estimate Std. Error t value
(Intercept) -2.28716    0.44348  -5.157
rad          0.61791    0.03433  17.998
            Pr(>|t|)    
(Intercept) 3.61e-07 ***
rad          < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.718 on 504 degrees of freedom
Multiple R-squared:  0.3913,    Adjusted R-squared:   0.39 
F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b9=lm(crim~tax)
summary(lm.b9)

Call:
lm(formula = crim ~ tax)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.513  -2.738  -0.194   1.065  77.696 

Coefficients:
             Estimate Std. Error t value
(Intercept) -8.528369   0.815809  -10.45
tax          0.029742   0.001847   16.10
            Pr(>|t|)    
(Intercept)   <2e-16 ***
tax           <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.997 on 504 degrees of freedom
Multiple R-squared:  0.3396,    Adjusted R-squared:  0.3383 
F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b10=lm(crim~ptratio)
summary(lm.b10)

Call:
lm(formula = crim ~ ptratio)

Residuals:
   Min     1Q Median     3Q    Max 
-7.654 -3.985 -1.912  1.825 83.353 

Coefficients:
            Estimate Std. Error t value
(Intercept) -17.6469     3.1473  -5.607
ptratio       1.1520     0.1694   6.801
            Pr(>|t|)    
(Intercept) 3.40e-08 ***
ptratio     2.94e-11 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.24 on 504 degrees of freedom
Multiple R-squared:  0.08407,   Adjusted R-squared:  0.08225 
F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11
lm.b11=lm(crim~black)
summary(lm.b11)

Call:
lm(formula = crim ~ black)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.756  -2.299  -2.095  -1.296  86.822 

Coefficients:
             Estimate Std. Error t value
(Intercept) 16.553529   1.425903  11.609
black       -0.036280   0.003873  -9.367
            Pr(>|t|)    
(Intercept)   <2e-16 ***
black         <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.946 on 504 degrees of freedom
Multiple R-squared:  0.1483,    Adjusted R-squared:  0.1466 
F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b12=lm(crim~lstat)
summary(lm.b12)

Call:
lm(formula = crim ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.925  -2.822  -0.664   1.079  82.862 

Coefficients:
            Estimate Std. Error t value
(Intercept) -3.33054    0.69376  -4.801
lstat        0.54880    0.04776  11.491
            Pr(>|t|)    
(Intercept) 2.09e-06 ***
lstat        < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.664 on 504 degrees of freedom
Multiple R-squared:  0.2076,    Adjusted R-squared:  0.206 
F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16
lm.b13=lm(crim~medv)
summary(lm.b13)

Call:
lm(formula = crim ~ medv)

Residuals:
   Min     1Q Median     3Q    Max 
-9.071 -4.022 -2.343  1.298 80.957 

Coefficients:
            Estimate Std. Error t value
(Intercept) 11.79654    0.93419   12.63
medv        -0.36316    0.03839   -9.46
            Pr(>|t|)    
(Intercept)   <2e-16 ***
medv          <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.934 on 504 degrees of freedom
Multiple R-squared:  0.1508,    Adjusted R-squared:  0.1491 
F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

The only simple linear regression model that failed to generate a statistically significant association between the predictor and the response was the relationship between chas and crim.

par(mfrow=c(2,2))
plot(zn,crim)
plot(indus,crim)
plot(nox,crim)
plot(rm,crim)

lm.Boston=lm(crim~.,data=Boston)
summary(lm.Boston)

Call:
lm(formula = crim ~ ., data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-9.924 -2.120 -0.353  1.019 75.051 

Coefficients:
              Estimate Std. Error t value
(Intercept)  17.033228   7.234903   2.354
zn            0.044855   0.018734   2.394
indus        -0.063855   0.083407  -0.766
chas         -0.749134   1.180147  -0.635
nox         -10.313535   5.275536  -1.955
rm            0.430131   0.612830   0.702
age           0.001452   0.017925   0.081
dis          -0.987176   0.281817  -3.503
rad           0.588209   0.088049   6.680
tax          -0.003780   0.005156  -0.733
ptratio      -0.271081   0.186450  -1.454
black        -0.007538   0.003673  -2.052
lstat         0.126211   0.075725   1.667
medv         -0.198887   0.060516  -3.287
            Pr(>|t|)    
(Intercept) 0.018949 *  
zn          0.017025 *  
indus       0.444294    
chas        0.525867    
nox         0.051152 .  
rm          0.483089    
age         0.935488    
dis         0.000502 ***
rad         6.46e-11 ***
tax         0.463793    
ptratio     0.146611    
black       0.040702 *  
lstat       0.096208 .  
medv        0.001087 ** 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.439 on 492 degrees of freedom
Multiple R-squared:  0.454, Adjusted R-squared:  0.4396 
F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

Our F-statistic (31.47) and its p-value indicate a significant model, but we can do better: after accounting for everything in the model we have explained only about 44% of the variation in crim. Using a 95% confidence threshold, we note that the following variables are not statistically significant: indus, chas, nox, rm, age, tax, ptratio, and lstat.

In the first set of models, only chas failed to show a statistically significant relationship between the predictor and the response. However, when all of our variables are included in one model, many more are ruled out as viable predictors. This signals a great deal of collinearity: many of our variables are highly correlated and thus become redundant when explaining the variation in crim.

---
title: "Chapter_3_Linear_Regression"
output: html_notebook
---

__*Linear regression*__, a very simple approach for supervised learning. A useful tool for predicting a quantitative response.


#### __3.1 Simple Linear Regression__

- Predicts a quantitative response *Y* on the basis of a single predictor variable *X*.

- Assumes a linear relationship between *X* and *Y*.

__Regression Equation:__

*Y* ≈ *B*~0~ + *B*~1~(X)

*B*~0~ and *B*~1~ represent the *intercept* and *slope* terms in the linear model. They are the model *coefficients* or *parameters*.


#### __3.1.1 Estimating the Coefficients__

Fitting a line is typically done by minimizing the *least squares* criterion.

*Residual*-- is the difference between the *i*th observed response value and the *i*th response value that is predicted by our linear model.

Thus, the __*Residual Sum of Squares (RSS)*__ is simply the sum of the squared residuals.

__RSS__ = *e*^2^~1~ + *e*^2^~2~ + ... + *e*^2^~*n*~

#### __3.1.2 Assessing the Accuracy of the Coefficient Estimates__

Approximating *f* as a linear function, we can write this relationship as:

*Y* = *B*~0~ + *B*~1~(X) + *E*

Where:
__*B*~0~__ is the intercept term--the expected value of *Y* when *X* = 0.

and

__*B*~1~__ is the slope--the average increase in *Y* associated with a one-unit increase in *X*.

__The error is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in *Y*, and there may be measurement error.__

__"An unbiased estimator does not *systematically* over- or under-estimate the true parameter.__

The __*standard error*__ tells us approximately how much our estimate of a given parameter, mean, etc.is an over/under estimation.

It is equal to the standard deviation divided by the sample size.

Thus, the simplest and most effective way to reduce the standard error is to increase the sample size.

*Standard errors* are used to compute __confidence intervals__.

__*P-value*__:

> "Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response."

#### __3.1.3 Assessing the Accuracy of the Model__

"How well does the model fit the data?"

"The quality of a linear regression fit is typically assessed using two related quantities:

1. __Residual Standard Error__
2. __R^2^ Statistic__


__Residual Standard Error__

Is an estimate of the standard deviation of *E*. Roughly speaking, it is the average amount that the response will deviate from the true regression line. 

It is a measure of *lack of fit* of the model to the data.

__R^2^ Statistic__

The *R*^2^ statistic takes the form of a *proportion*--the proportion of variance explained.

__*R*^2^ = 1 - (RSS/TSS)__


#### __3.2 Multiple Linear Regression__

"Extends the simple linear regression model so that it can directly accomodate multiple predictors."

The multiple linear regression model:

__*Y = B~0~ + B~1~X~1~ + B~2~X~2~ + ... B~p~X~p~ + E*__

__Where__ *X~j~* represents the *j*th predictor and *B~j~* quantifies the association between that variable and the response. We interpret *B~j~* as the *average* effect on *Y* of a one unit increase in *X~j~*, *holding all other predictors fixed*.

#### __3.2.1 Estimating the Regression Coefficients__

#### __3.2.2 Some Important Questions__

1. Is at least one of the predictors *X~1~, X~2~, ..., X~p~* useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?*
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

__One: Is There a Relationship Between the Response and Predictors?__

Are all the regression coefficients zero?

H~o~: *B~1~ = B~2~ = ... = B~p~ =* 0

H~a~: at least one *B~j~* is non-zero.

__This hypothesis test is performed by computing the *F-statistic*,

If there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. If *H~a~* is true, then we expect *F* to be greater than 1.

__Two: Deciding on Important Variables__

__Three Classical Approaches for model selection__

* *Forward selection*. Begin with the null model (intercept only), then add the variable whose inclusion gives the lowest RSS; continue adding variables one at a time until a stopping rule is reached.

* *Backward selection*. Start with all of the variables in the model, and remove the variable with the largest p-value. Refit, and continue removing variables until a stopping rule is reached--for example, until all remaining variables have a p-value below some threshold (see the sketch after this list).

* *Mixed selection*. A combination of the above two methods: add variables as in forward selection, but remove any variable whose p-value rises above a threshold as new variables come in.
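
A minimal sketch of a single backward-selection step on the Boston data (assuming MASS is loaded; this is one illustrative step, not the book's full procedure):

```{r}
# Sketch of one backward-selection step: fit the full model, find the
# predictor with the largest p-value, and drop it
library(MASS)
fit <- lm(medv ~ ., data = Boston)
pvals <- summary(fit)$coefficients[-1, 4]   # p-values, excluding the intercept
worst <- names(which.max(pvals))
fit2 <- update(fit, as.formula(paste(". ~ . -", worst)))
```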

__Three: Model Fit__

_What is happening with the R^2^ and the RSE?_

__Four: Predictions__

*Three sorts of uncertainty are associated with predicting the response Y based on our estimated parameters:*

1. The coefficients are only estimates of the true population coefficients, so the least squares line is only an *estimate of the true population regression line*.

2. There is also potentially reducible error associated with model selection, referred to as *model bias*.

3. There is always random error *E* in the model (the irreducible error).

#### __3.3 Other Considerations in the Regression Model__

__3.3.1 Qualitative Predictors__

*Predictors with Only Two Levels*

A *dummy variable* is used (a variable that takes on only two possible numerical values).

__For example:__

*x~i~* = 1 if *i*th person is female
and
*x~i~* = 0 if *i*th person is male

__Resulting in the model__

*y~i~ = B~0~ + B~1~x~i~ + E~i~*

For males (*x~i~* = 0) the dummy term drops out, so the model reduces to *y~i~ = B~0~ + E~i~*; males are the baseline level.
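
A quick sketch with a hypothetical factor showing how R builds the dummy coding:

```{r}
# Sketch with a hypothetical factor: R builds the 0/1 dummy automatically,
# and contrasts() shows which level is coded 1 (the other is the baseline)
gender <- factor(c("male", "female", "female", "male"))
contrasts(gender)
```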

__Qualitative Predictors with More than Two Levels__

> "When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the *ethnicity* variable we create two dummy variables."

For example:

*x~i1~* = 1 if *i*th person is Asian,
and 0 if *i*th person is not Asian;

and the second could be

*x~i2~* = 1 if *i*th person is Caucasian,
and 0 if *i*th person is not Caucasian.

The level with no dummy variable is the *baseline*.
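
And a sketch of the multi-level case (hypothetical data):

```{r}
# Sketch: a three-level factor gets two dummy columns; the level with no
# dummy (here the alphabetically first) is the baseline
ethnicity <- factor(c("Asian", "Caucasian", "African American", "Asian"))
contrasts(ethnicity)
```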

#### __3.3.2 Extensions of the Linear Model__

Two important assumptions of the linear regression model:

1. The *additive assumption* means that the effect of changes in a predictor X~j~ on the response Y is independent of the values of the other predictors. 

2. The *linear assumption* states that the change in the response Y due to a one-unit change in X~j~ is constant, regardless of the value of *X~j~*. 

__Removing the Additive Assumption__

There may be *synergistic* cases, where increasing one variable changes the effect (the slope) of another variable on the response. Adding an *interaction term* allows us to incorporate this in linear models (see the sketch below).
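
A sketch with simulated data, where the true model contains an interaction:

```{r}
# Sketch with simulated data: y depends on x1, x2, and their product, so the
# slope of x1 changes with the value of x2; x1*x2 expands to x1 + x2 + x1:x2
set.seed(3)
x1 <- rnorm(200)
x2 <- rnorm(200)
y <- 1 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(200)
coef(lm(y ~ x1 * x2))
```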

__Non-linear relationships__

You can use polynomial terms in linear models to better capture relationships with some curvature (see the sketch below).

*Polynomial regression*
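
A sketch of a quadratic fit on simulated data:

```{r}
# Sketch with simulated data: adding I(x^2) captures curvature while the
# model remains linear in the coefficients
set.seed(4)
x <- runif(100, -2, 2)
y <- 1 + x - 2 * x^2 + rnorm(100)
coef(lm(y ~ x + I(x^2)))
```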

#### __3.3.3 Potential Problems__

1. Non-linearity of the response-predictor relationships
2. Correlation of error terms
3. Non-constant variance of error terms
4. Outliers
5. High-leverage points
6. Collinearity

__1. Non-linearity of the Data__

*Residual plots* are a useful graphical tool for identifying non-linearity.

__2. Correlation of Error Terms__

"If there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors... In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model."

Correlated error terms frequently occur in the context of *time series* data.

__3. Non-constant Variance of Error Terms__

*Heteroscedasticity*: non-constant variance in the errors. A *funnel shape* in the residual plot is an easy way to identify its presence.

Heteroscedasticity occurs when the magnitude of the residuals tends to increase with the fitted values. One remedy is to transform the response using a concave function such as log *Y* or √*Y* (see the sketch below).
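
A sketch with simulated heteroscedastic data, comparing residual plots before and after a log transform of the response:

```{r}
# Sketch with simulated heteroscedastic data: the raw fit shows a funnel in
# the residual plot, while modeling log(y) stabilizes the variance
set.seed(7)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.2))
par(mfrow = c(1, 2))
plot(fitted(lm(y ~ x)), residuals(lm(y ~ x)), main = "y: funnel shape")
plot(fitted(lm(log(y) ~ x)), residuals(lm(log(y) ~ x)), main = "log(y)")
```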

__4. Outliers__ 

An *outlier* is a point for which *y~i~* is far from the value predicted by the model.

Plotting the *studentized residuals* helps identify outliers.

-Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.

It is important to remember that outliers are observations for which the response *y~i~* is unusual given the predictor *x~i~*.

__5. High Leverage Points__

Observations with *high leverage* have an unusual value for *x~i~*.

__6. Collinearity__

Refers to the situation in which two or more predictor variables are closely related to one another. 

The presence of collinearity can make it difficult to separate out the individual effects of collinear variables on the response.

The *power* of the hypothesis test--the probability of correctly detecting a *non-zero* coefficient--is reduced by collinearity.

*Multicollinearity* can exist among three or more variables even when no single pair of variables has a particularly high correlation.

The *Variance Inflation Factor* __(VIF)__ is a better way of assessing multicollinearity than inspecting the correlation matrix, because it detects these multi-variable relationships (see the sketch below).

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
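
A sketch computing VIFs for a Boston regression, assuming the car package is installed:

```{r}
# Sketch: VIFs for a Boston regression via the car package (assumes car is
# installed); values above 5-10 flag problematic collinearity
library(car)
library(MASS)
vif(lm(medv ~ ., data = Boston))
```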

__Two simple solutions:__

1. Drop one of the problematic variables from the regression.

2. Combine the collinear variables together into a single predictor. 

#### __3.4 The Marketing Plan__

(Basically just a review of everything above)

#### __3.5 Comparison of Linear Regression with K-Nearest Neighbors__

K-nearest neighbors regression (KNN regression) is a *non-parametric* alternative to linear regression: it predicts the response at a point *x*~0~ by averaging the responses of the *K* training observations closest to *x*~0~. Linear regression tends to outperform KNN when the true relationship is close to linear or when there are few observations per predictor; KNN tends to win when the relationship is substantially non-linear and plenty of data are available.
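
A sketch of KNN regression on simulated data, assuming the FNN package (and its knn.reg function) is installed:

```{r}
# Sketch of KNN regression with simulated data using knn.reg from the FNN
# package (assumed installed): each prediction averages the K nearest
# training responses
library(FNN)
set.seed(5)
x  <- matrix(rnorm(100), ncol = 1)
y  <- 2 * x[, 1] + rnorm(100)
x0 <- matrix(seq(-2, 2, length.out = 5), ncol = 1)
knn.reg(train = x, test = x0, y = y, k = 5)$pred
```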



#### __3.6 Lab: Linear Regression__

__3.6.1 Libraries__

```{r}
library(MASS)
library(ISLR)
```

```{r}
fix(Boston)
names(Boston)
```

__3.6.2 Simple Linear Regression__

```{r}
lm.fit=lm(medv~lstat,data=Boston)
attach(Boston)
lm.fit
```

```{r}
summary(lm.fit)
```

```{r}
names(lm.fit)
```

```{r}
confint(lm.fit)
```

```{r}
predict(lm.fit,data.frame(lstat=c(5,10,15)),interval="confidence")
```

```{r}
predict(lm.fit,data.frame(lstat=c(5,10,15)),interval="prediction")
```

```{r}
plot(lstat,medv)
abline(lm.fit)
```

```{r}
plot(lstat,medv)
abline(lm.fit)
abline(lm.fit,lwd=3)
abline(lm.fit,lwd=3,col="red")
plot(lstat,medv,col="red")
plot(lstat,medv,pch=20)
plot(lstat,medv,pch="+")
plot(1:20,1:20,pch=1:20)
```

```{r}
par(mfrow=c(2,2))
plot(lm.fit)
```

```{r}
plot(predict(lm.fit),residuals(lm.fit))
plot(predict(lm.fit),rstudent(lm.fit))
```

```{r}
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
```

#### __3.6.3 Multiple Linear Regression__

```{r}
lm.fit=lm(medv~lstat+age,data=Boston)
summary(lm.fit)
```

```{r}
lm.fit=lm(medv~.,data=Boston)
summary(lm.fit)
```

```{r}
lm.fit1=lm(medv~.-age,data=Boston)
summary(lm.fit1)
```

#### __3.6.4 Interaction terms__

```{r}
summary(lm(medv~lstat*age,data=Boston))
```

#### __3.6.5 Non-linear Transformations of the Predictors__

```{r}
lm.fit2=lm(medv~lstat+I(lstat^2),data=Boston)
summary(lm.fit2)
```

```{r}
lm.fit=lm(medv~lstat,data=Boston)
anova(lm.fit,lm.fit2)
```

```{r}
par(mfrow=c(2,2))
plot(lm.fit2)
```

```{r}
lm.fit5=lm(medv~poly(lstat,5),data=Boston)
summary(lm.fit5)
```

```{r}
summary(lm(medv~log(rm),data=Boston))
```

#### __3.6.6 Qualitative Predictors__

```{r}
fix(Carseats)
names(Carseats)
```

```{r}
lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats)
summary(lm.fit)
```

```{r}
attach(Carseats)
contrasts(ShelveLoc)
```

#### __3.7 Exercises__

*Conceptual*

1. The null hypothesis __H~0~:__ *B~j~* = 0 states that there is no relationship between the response (sales) and the corresponding advertising predictor (TV, radio, or newspaper), holding the others fixed. The coefficient *B~j~* represents the slope--the change in Y given a one-unit change in X~j~. Thus, if the slope equals 0, changing that predictor is associated with no change in sales.

2. 

3a.

__Female:__ 85 + 20(GPA) + .07(IQ) + .01(GPA × IQ) - 10(GPA)

__Male:__ 50 + 20(GPA) + .07(IQ) + .01(GPA × IQ)

With a GPA of 3.5, males and females have the same predicted starting salary. Below a GPA of 3.5, females have the higher starting salary; above 3.5, males do.

3b.

__Female:__ 85 + 20(4) + .07(110) + .01(4 × 110) - 10(4) = 137.1, i.e. a predicted starting salary of $137,100.
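
A quick check of the arithmetic, written out with the full model (including the gender dummy of 35 and the GPA-gender interaction):

```{r}
# Check the 3b arithmetic:
# salary = 50 + 20*GPA + .07*IQ + 35*Gender + .01*GPA*IQ - 10*GPA*Gender
50 + 20*4 + .07*110 + 35*1 + .01*4*110 - 10*4*1
```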

3c.

False. A small coefficient tells us only the magnitude of the interaction's effect, not the strength of the evidence for it--that is judged by the p-value (or t-statistic) on the interaction term. And since GPA × IQ takes large values, even a small coefficient can correspond to a meaningful effect.

4a.

We would expect the training RSS for the cubic regression to be lower (or at worst equal), because the more flexible model can always fit the training data at least as closely--even when the true relationship is linear.

4b.

For test RSS, since the true relationship is linear, the cubic regression is likely to overfit the training data, so we would expect its test RSS to be higher than that of the linear regression.

4c. Again, we would expect the cubic regression to have the lower training RSS; its extra flexibility lets it follow the non-linearity.

4d. There is not enough information to tell: it depends on how far from linear the true model is. If it is close to linear, the linear regression may have the lower test RSS; if it is far from linear, the cubic regression may.

5.

6. __Note__ in the case of simple linear regression, the least squares line always passes through the means of the predictor and the response.
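
A quick numerical check on simulated data:

```{r}
# Quick check: the fitted line passes through (mean(x), mean(y)),
# so the difference below is ~0 up to rounding
set.seed(6)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
predict(fit, data.frame(x = mean(x))) - mean(y)
```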

7.

__*Applied*__


8a.
```{r}
lm.auto=lm(mpg~horsepower,data=Auto)
summary(lm.auto)
```

i. Yes--the large F-statistic and its tiny p-value indicate a relationship between horsepower and mpg.

ii. The relationship is fairly strong and inverse: an increase in horsepower of 100 is associated with a reduction in mpg of about 15.8, and the R^2^ tells us that about 60% of the variation in mpg is explained by horsepower.

iii. Negative

iv.


```{r}
predict(lm.auto,data.frame(horsepower=c(98)),interval="confidence")
predict(lm.auto,data.frame(horsepower=c(98)),interval="prediction")
```

```{r}
attach(Auto)
```
```{r}
plot(horsepower,mpg)
abline(lm.auto,col="red")
```

```{r}
par(mfrow=c(2,2))
plot(lm.auto)
```

Examining the Residuals vs Fitted plot shows that the residuals track the fitted values, which indicates non-linearity in the data. We could potentially resolve this by adding horsepower^2^ to the model, since the scatterplot of the relationship shows curvature. We also appear to have some high leverage points that can be fleshed out further via the studentized residuals. The plot below flags observations whose studentized residuals are 3 or greater.

```{r}
plot(predict(lm.auto),rstudent(lm.auto),
     col=ifelse(rstudent(lm.auto)>=3,"red","black"))
text(predict(lm.auto),rstudent(lm.auto),
     labels=ifelse(rstudent(lm.auto)>=3,names(rstudent(lm.auto)),""),
     pos=4)
```

```{r}
plot(hatvalues(lm.auto),
     col=ifelse(hatvalues(lm.auto)>.028,"red","black"))
text(hatvalues(lm.auto),
     labels=ifelse(hatvalues(lm.auto)>.028,names(hatvalues(lm.auto)),""),
     cex=.7,pos=4)
```

9a.

```{r}
pairs(Auto)
```

9b.

```{r}
cor(Auto[,-9])
```

9c.

```{r}
lm.auto2=lm(mpg~.-name,data=Auto)
summary(lm.auto2)
```

i. Yes--the F-statistic is large with a tiny p-value, indicating a relationship between the predictors and mpg.

ii. Displacement, weight, year, and origin have statistically significant coefficients; cylinders, horsepower, and acceleration do not. The model accounts for about 82% of the variation in mpg.

iii. The coefficient on year is about 0.75, so average mpg increases by roughly 7.5 every ten years, holding the other predictors fixed.

d.

```{r}
par(mfrow=c(2,2))
plot(lm.auto2)
```

The fit appears reasonably linear. The residuals are fairly evenly distributed on both sides of the fitted line, so there is no obvious evidence of correlation among sequential residuals. We may have some outliers and high leverage points, which are labeled in the plots.

e.

```{r}
lm.auto3=lm(mpg~.-name+horsepower:acceleration-cylinders-displacement,data=Auto)
summary(lm.auto3)
```

f.

```{r}
lm.auto4=lm(mpg~horsepower+I(horsepower^2),data=Auto)
summary(lm.auto4)
```

```{r}
par(mfrow=c(2,2))
plot(lm.auto4)
```

```{r}
attach(Carseats)
```

10a.

```{r}
summary(Carseats)
contrasts(ShelveLoc)
```

```{r}
lm.cars=lm(Sales~Price+Urban+US,data=Carseats)
summary(lm.cars)
```

b. Price has a negative relationship with Sales: since Sales is measured in thousands of units, a one-dollar increase in price is associated with roughly 54 fewer units sold. UrbanYes indicates whether the store is in an urban area; its coefficient is not statistically significant, but taken at face value it suggests urban stores sell about 22 fewer units than the non-urban baseline. USYes indicates whether the store is in the US; its coefficient suggests US stores sell about 1,200 more units than stores outside the US.

c. Sales = 13.04 - .05(Price) - .02(UrbanYes) + 1.2(USYes). __If__ Urban and US are both "No", then Sales = 13.04 - .05(Price).

d. Price and USYes

e.

```{r}
lm.cars2=lm(Sales~Price+US,data=Carseats)
summary(lm.cars2)
```

f. After removing Urban, there is almost no change in the R^2^ value, confirming that Urban was not useful in explaining variation in Sales. Furthermore, the F-statistic is larger and the residual standard error decreased slightly in the smaller model.

g. 
```{r}
confint(lm.cars2)
```

h.
```{r}
par(mfrow=c(2,2))
plot(lm.cars2)
```

The plots look pretty good: no evident issues of non-linearity or correlated residuals, but the residuals-vs-leverage plot suggests some high leverage points.

11.

```{r}
set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)
```

a.
```{r}
lm.equat=lm(y~x+0)
summary(lm.equat)
```

b.
```{r}
lm.equat2=lm(x~y+0)
summary(lm.equat2)
```

c. The two regressions produce the same R^2^, F-statistic, t-statistic, and p-values; only the coefficient estimates and residuals differ.

d.

e. 

f.

```{r}
lm.equat=lm(y~x)
summary(lm.equat)
```

```{r}
lm.equat2=lm(x~y)
summary(lm.equat2)
```

12. (Skipped)

13. 

a.

```{r}
set.seed(1)
x=rnorm(100,0,1)
```

b.

```{r}
eps=rnorm(100,0,.25)
```

c.

```{r}
y=-1+.5*x+eps
```

```{r}
length(y)
```

__*B~0~*__ = -1 and __*B~1~*__ = .5

d.
```{r}
scatter.smooth(x,y)
```

The relationship looks linear, with scatter around the line introduced by the noise term eps.

e.
```{r}
lm.equat3=lm(y~x)
summary(lm.equat3)
```

The estimated values of B~0~ and B~1~ are almost identical to the true values.

f.
```{r}
coef(lm.equat3)
```

```{r}
plot(x,y,pch=20)
abline(coef=c(-1,0.5),col="red")
abline(coef(lm.equat3),col="green")
legend("bottomright",legend=c("Population Regression","Least Square"),col=c("red","green"),lty=c(1,1))
```

g.

```{r}
lm.equat4=lm(y~x+I(x^2))
summary(lm.equat4)
```

No--the quadratic term is not statistically significant: the R^2^ barely changes and the F-statistic decreases. The model remains significant overall, but the fit is not meaningfully better.

14.

```{r}
set.seed(1)
x1=runif(100)
x2=0.5*x1+rnorm(100)/10
y=2+2*x1+.3*x2+rnorm(100)
```

a.

The regression coefficients:
__*B~0~*__ = 2 __*B~1~*__ = 2 __*B~2~*__ = .3

b.

```{r}
cor(x1,x2)
plot(x1,x2,col="black")
```

c.
```{r}
lm.equat4=lm(y~x1+x2)
summary(lm.equat4)
```

__*B~0~*__ = 2.13 __*B~1~*__ = 1.4 __*B~2~*__ = 1

These estimated coefficients differ from their population values--most notably the coefficient on x2, which is far from 0.3 and not statistically significant. We can reject the null __H~0~: *B~1~* = 0__, but not the null __H~0~: *B~2~* = 0__.

d.

```{r}
lm.equat5=lm(y~x1)
summary(lm.equat5)
```

e.
```{r}
lm.equat6=lm(y~x2)
summary(lm.equat6)
```

f. The results obtained in (c)-(e) do not contradict each other. Because x1 and x2 are highly correlated (collinear), each is significant on its own, but when both are included together, x2 provides little information beyond what x1 already accounts for, so its coefficient is no longer significant.

g. (Skipped)

__15.__

a.

```{r}
attach(Boston)
names(Boston)
```

```{r}
lm.b1=lm(crim~zn)
summary(lm.b1)
```

```{r}
lm.b2=lm(crim~indus)
summary(lm.b2)
```

```{r}
lm.b3=lm(crim~chas)
summary(lm.b3)
```

```{r}
lm.b4=lm(crim~nox)
summary(lm.b4)
```

```{r}
lm.b5=lm(crim~rm)
summary(lm.b5)
```

```{r}
lm.b6=lm(crim~age)
summary(lm.b6)
```

```{r}
lm.b7=lm(crim~dis)
summary(lm.b7)
```

```{r}
lm.b8=lm(crim~rad)
summary(lm.b8)
```

```{r}
lm.b9=lm(crim~tax)
summary(lm.b9)
```

```{r}
lm.b10=lm(crim~ptratio)
summary(lm.b10)
```

```{r}
lm.b11=lm(crim~black)
summary(lm.b11)
```

```{r}
lm.b12=lm(crim~lstat)
summary(lm.b12)
```


```{r}
lm.b13=lm(crim~medv)
summary(lm.b13)
```

The only simple linear regression model that failed to generate a statistically significant association between the predictor and the response was the relationship between chas and crim.

```{r}
par(mfrow=c(2,2))
plot(zn,crim)
plot(indus,crim)
plot(nox,crim)
plot(rm,crim)
```

b.
```{r}
lm.Boston=lm(crim~.,data=Boston)
summary(lm.Boston)
```

The F-statistic (31.47) has a tiny p-value, so the model as a whole is significant, but it explains only about 44% of the variation in crim. Using a 5% significance threshold, the following variables are not statistically significant: indus, chas, nox, rm, age, tax, ptratio, and lstat.

c.

In the simple regressions, only chas failed to provide a statistically significant relationship between the predictor and the response. However, when all of the variables are included in one model, many more are excluded as viable predictors. This signals a great deal of collinearity: many of the variables are highly correlated and thus become redundant when explaining variation in crim.
