Assignment 2

Applied Questions

Question 9

This question involves the use of multiple linear regression on the Auto data set.

9a

Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR)  # assumption: Auto is loaded from the ISLR package
plot(Auto)

9b

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[, names(Auto) !="name"])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

library(corrplot)  # provides corrplot()
corrauto <- cor(Auto[, names(Auto) != "name"])
corrplot(corrauto, method = "number")
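As a small sketch of the same idea, the qualitative column can be dropped without naming it by keeping only the numeric columns, and the strongest pairwise correlations can then be listed. The built-in mtcars data is used as a stand-in so the sketch runs without Auto; the added `model` column and the names `numericOnly`, `corrMat`, and `strongest` are illustrative assumptions, not part of the assignment.

```r
# Stand-in data: mtcars ships with base R. A character column is added
# to mimic the qualitative `name` column in Auto (assumption for illustration).
cars <- mtcars
cars$model <- rownames(mtcars)

# Keep only numeric columns, so qualitative variables are dropped automatically.
numericOnly <- cars[, sapply(cars, is.numeric)]
corrMat <- cor(numericOnly)

# List each pair once (upper triangle), ordered by absolute correlation.
idx <- which(upper.tri(corrMat), arr.ind = TRUE)
strongest <- data.frame(
  var1 = rownames(corrMat)[idx[, "row"]],
  var2 = colnames(corrMat)[idx[, "col"]],
  r    = corrMat[idx]
)
strongest <- strongest[order(-abs(strongest$r)), ]
head(strongest, 5)
```

Sorting by absolute value keeps strong negative correlations, such as those between mpg and weight in the matrix above, near the top of the list.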

9c

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. What does the coefficient for the year variable suggest?

mpgReg <- lm(mpg~. -name, data = Auto)
summary(mpgReg)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

RESPONSE:

Yes, there is a relationship between the predictors and the response variable ‘mpg’. The F-statistic of 252.4 has a p-value below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero. The model's R-squared of .8215 means that 82.15% of the variance in ‘mpg’ is explained by the predictor variables. Some predictors are more significant than others: the output flags significance with the ‘***’ codes, and we can reach the same conclusions from the p-values, where a p-value above .05 is typically taken as not statistically significant.
Based on the p-values, ‘displacement’, ‘weight’, ‘year’, and ‘origin’ have a statistically significant relationship with the response variable ‘mpg’, while ‘cylinders’, ‘horsepower’, and ‘acceleration’ do not. The intercept is also flagged as significant, but it is not a predictor, so it is not considered here.
Recall that the coefficients are estimates of the unknown parameters of the true relationship, where β0 is the intercept and each βj is the slope for predictor Xj: the average change in Y associated with a one-unit increase in Xj when all other predictors are held constant. The true relationship is generally not known, so we use these estimates to make our inferences. The coefficient for ‘year’ is about .75, which suggests that ‘mpg’ increases by about .75 on average for each additional model year, with all other variables held constant; that is, newer cars tend to be more fuel efficient.
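The three parts of this answer can also be read off the fitted model programmatically. The following is a minimal sketch on simulated data (the variables `x1`, `x2`, and `y` are made up for illustration and are not the Auto data), showing where the overall F-test, the per-predictor p-values, and the slopes live in the summary object.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                       # unrelated to y by construction
y  <- 3 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
s   <- summary(fit)

# (i) Overall F-test: is at least one predictor related to the response?
f <- s$fstatistic                    # value, numerator df, denominator df
overallP <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# (ii) Per-predictor t-tests: pull the p-value column, dropping the intercept.
pvals <- s$coefficients[-1, "Pr(>|t|)"]
significant <- names(pvals)[pvals < 0.05]

# (iii) A slope is the average change in y per one-unit increase in that
# predictor, holding the other predictors fixed.
coef(fit)["x1"]
```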

9d

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
plot(mpgReg)

The following are some observations that the diagnostic plots can tell us.

The upper left graph plots the residuals against the fitted values. The curved pattern suggests a non-linear relationship that the linear fit does not capture.
The upper right graph is the Q-Q plot, which lets us assess the distribution of the residuals. They look roughly normal but may be right skewed.
The bottom left graph seems to show a violation of the homoscedasticity (constant variance) assumption made in a linear regression. The residuals form a funnel-like shape, with the variance of the residuals growing as the fitted values increase.
The bottom right graph helps us spot outliers and high-leverage points that could affect our estimated regression line. There appear to be a few outliers and one point with clearly high leverage (observation 14).
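The visual reading of the leverage plot can be backed up with numbers. Below is a minimal sketch on simulated data with one planted high-leverage point; the cutoffs (three times the average leverage, and an absolute studentized residual above 3) are common rules of thumb assumed here, not hard thresholds.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 10                           # plant one point far from the other x values
y[n] <- 1 + 2 * x[n]                 # it sits on the true line: high leverage, not an outlier

fit <- lm(y ~ x)
p   <- length(coef(fit)) - 1         # number of predictors

lev   <- hatvalues(fit)              # leverage statistic h_i
rstud <- rstudent(fit)               # studentized residuals

# Rules of thumb (assumptions): leverage well above the average (p + 1) / n,
# and |studentized residual| greater than 3.
highLeverage <- which(lev > 3 * (p + 1) / n)
outliers     <- which(abs(rstud) > 3)
highLeverage
```

The same two calls could be applied to mpgReg to check whether observation 14 stands out numerically as well as visually.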

9e

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

mpgReg1 <- lm(mpg~. -name+displacement:weight, data = Auto)
summary(mpgReg1)

## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16

Adding the interaction of ‘displacement’ by ‘weight’, we see that the interaction effect is significant, and the R-squared rises from .8215 to .8588.

mpgReg2 <- lm(mpg~. -name+year:origin, data = Auto)
summary(mpgReg2)

## 
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6072 -2.0439 -0.0596  1.7121 12.3368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.492e+00  9.044e+00   0.939 0.348353    
## cylinders    -5.042e-01  3.192e-01  -1.579 0.115082    
## displacement  1.567e-02  7.530e-03   2.081 0.038060 *  
## horsepower   -1.399e-02  1.364e-02  -1.025 0.305786    
## weight       -6.352e-03  6.449e-04  -9.851  < 2e-16 ***
## acceleration  9.185e-02  9.766e-02   0.941 0.347546    
## year          4.189e-01  1.125e-01   3.723 0.000226 ***
## origin       -1.405e+01  4.699e+00  -2.989 0.002978 ** 
## year:origin   1.989e-01  6.030e-02   3.298 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8228 
## F-statistic: 227.9 on 8 and 383 DF,  p-value: < 2.2e-16

The interaction of ‘year’ by ‘origin’ is significant, but less strongly so than the ‘displacement’ by ‘weight’ interaction (p = 0.0011 versus p < 2e-16), and the R-squared is little changed at .8264.

mpgReg3 <- lm(mpg~. -name + cylinders*displacement, data = Auto)
summary(mpgReg3)

## 
## Call:
## lm(formula = mpg ~ . - name + cylinders * displacement, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6081  -1.7833  -0.0465   1.6821  12.2617 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.7096590  4.6858582  -0.578 0.563426    
## cylinders              -2.6962123  0.4094916  -6.584 1.51e-10 ***
## displacement           -0.0774797  0.0141535  -5.474 7.96e-08 ***
## horsepower             -0.0476026  0.0133736  -3.559 0.000418 ***
## weight                 -0.0052339  0.0006253  -8.370 1.10e-15 ***
## acceleration            0.0597997  0.0918038   0.651 0.515188    
## year                    0.7594500  0.0473354  16.044  < 2e-16 ***
## origin                  0.7087399  0.2736917   2.590 0.009976 ** 
## cylinders:displacement  0.0136081  0.0017209   7.907 2.84e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.089 on 383 degrees of freedom
## Multiple R-squared:  0.8465, Adjusted R-squared:  0.8433 
## F-statistic: 264.1 on 8 and 383 DF,  p-value: < 2.2e-16

As with the other two, the interaction of ‘cylinders’ by ‘displacement’ is statistically significant.
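The two formula operators differ only in what they expand to: a:b adds just the product term, while a*b is shorthand for a + b + a:b. Because the models above already include every main effect through the dot, either operator gives the same fit there. A minimal sketch on simulated data (the variables `a`, `b`, and `y` are made up for illustration):

```r
set.seed(2)
n <- 300
a <- runif(n, 0, 10)
b <- runif(n, 0, 10)
y <- 1 + a + b + 0.5 * a * b + rnorm(n)

mainOnly  <- lm(y ~ a + b)
withColon <- lm(y ~ a + b + a:b)     # a:b adds only the product term
withStar  <- lm(y ~ a * b)           # a*b expands to a + b + a:b

# Same model, written two ways.
all.equal(coef(withColon), coef(withStar))

# Partial F-test for the interaction; with one added term it agrees with
# the t-test on a:b in summary(withStar).
anova(mainOnly, withStar)
```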

9f

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

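# Drop the qualitative name column so every remaining column can be transformed.
AutoTransform <- subset(Auto, select = -name)
plot(Auto)

# Scatterplot matrix of the log-transformed variables.
plot(log(AutoTransform))

# Scatterplot matrix of the square-root-transformed variables.
plot(sqrt(AutoTransform))

# Scatterplot matrix of the squared variables.
plot(AutoTransform^2)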

par(mfrow = c(2,2))
AutoLog <- lm(log(mpg) ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = Auto)
plot(AutoLog)

From the transformations of the variables, it seems that the log transformation of ‘mpg’ gives a more linear relationship and seems to fix the heteroscedasticity we saw earlier; that is, the residuals of the log model now appear homoscedastic (of roughly equal variance).
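One rough way to put a number on the funnel shape is the rank correlation between the absolute residuals and the fitted values. This is only a sketch on simulated data with multiplicative noise, and the helper `funnel()` is a hypothetical name introduced here; it is an informal check, not a formal test of constant variance.

```r
set.seed(3)
n <- 1000
x <- runif(n, 0, 10)
y <- exp(2 + 0.15 * x + rnorm(n, sd = 0.4))   # multiplicative noise: spread grows with the mean

rawFit <- lm(y ~ x)
logFit <- lm(log(y) ~ x)

# Rough funnel check: rank correlation between |residual| and fitted value.
# Near zero suggests roughly constant variance; clearly positive suggests a funnel.
funnel <- function(fit) {
  cor(abs(resid(fit)), fitted(fit), method = "spearman")
}
c(raw = funnel(rawFit), log = funnel(logFit))
```

Applied to mpgReg and AutoLog, the same helper would give a quick numeric companion to the scale-location plots.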

Question 10

This question should be answered using the Carseats data set.

10a

Fit a multiple regression model to predict Sales using Price, Urban, and US.

data("Carseats")
salesRegression <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(salesRegression)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10b

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

RESPONSE:

Intercept: This is the expected value of Sales when Price = 0 and the store is neither urban nor in the US (the baseline levels), about 13.043 thousand units.
Price: A 1 unit increase in Price is associated with a decrease of about 54.459 units in sales, given all other predictors are held constant. Note that the Carseats data set defines Sales as unit sales (in thousands), so the coefficient of -0.054459 is in thousands of units.
UrbanYes: This is one of the qualitative predictors. Urban is a factor with levels No and Yes indicating whether the store is in an urban or rural location, coded with urban as 1 and rural as 0. On average, sales are about 21.916 units lower for urban stores, all else held constant, but with a p-value of 0.936 this difference is not statistically significant.
USYes: The other qualitative predictor. US is a factor with levels No and Yes indicating whether the store is in the US, coded with Yes as 1 and No as 0. On average, sales are about 1,200 units higher for stores in the US, all else held constant.
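The 0/1 coding described here is what R builds behind the scenes for a factor. A minimal sketch with a hypothetical four-row data frame (the values in `stores` are made up; only the factor coding matters):

```r
# Hypothetical mini data set; only the factor coding matters here.
stores <- data.frame(
  Price = c(100, 120, 90, 110),
  Urban = factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes")),
  US    = factor(c("Yes", "Yes", "No", "No"), levels = c("No", "Yes"))
)

# Treatment contrasts: the first level ("No") is the baseline and gets 0.
contrasts(stores$Urban)

# The design matrix lm() would use: UrbanYes and USYes are 0/1 columns.
X <- model.matrix(~ Price + Urban + US, data = stores)
X
```

This is why the coefficient rows are labelled UrbanYes and USYes: each is the shift in expected Sales for the Yes level relative to the No baseline.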

10c

Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.043469 - 0.054459 × Price - 0.021916 × Urban + 1.200573 × US + ε

where Urban = 1 if the store is in an urban location, else Urban = 0

where US = 1 if the store is in the US, else US = 0
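To see the equation at work, the fitted coefficients can be plugged in by hand. This is a small sketch; the function name `predictSales` and the example price of 100 are assumptions chosen for illustration.

```r
# Coefficients copied from the summary(salesRegression) output in 10a.
b <- c(intercept = 13.043469, price = -0.054459, urban = -0.021916, us = 1.200573)

# Predicted Sales (in thousands of units); urban and us are 0/1 indicators.
predictSales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price + b["urban"] * urban + b["us"] * us)
}

predictSales(price = 100, urban = 1, us = 1)   # urban US store
predictSales(price = 100, urban = 0, us = 0)   # rural non-US store
```

At the same price, the two predictions differ by exactly the sum of the Urban and US coefficients, since both indicators switch from 1 to 0.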

10d

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

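RESPONSE: Price and US (USYes). Both have p-values well below .05, while the p-value of 0.936 for UrbanYes gives no evidence against H0.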

10e

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

salesRegression2 <- lm(Sales ~ Price + US, data = Carseats)
summary(salesRegression2)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10f

How well do the models in (a) and (e) fit the data?

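RESPONSE: There is not much change in how well the two models fit the data. The R-squared is 0.2393 for both, so each explains only about 24% of the variance in Sales. The smaller model is marginally better on adjusted R-squared (0.2354 versus 0.2335) and residual standard error (2.469 versus 2.472).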
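The small gap in adjusted R-squared follows directly from the penalty for the number of predictors. A minimal sketch using only the rounded figures printed in the two summaries (so the last digit may differ slightly from R's output); `adjR2` is a helper name introduced here.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjR2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400                          # 396 residual df + 3 predictors + intercept
adjR2(0.2393, n, p = 3)           # model (a): Price, Urban, US
adjR2(0.2393, n, p = 2)           # model (e): Price, US
```

With the R-squared essentially unchanged, dropping Urban removes one penalty term, which is why the smaller model scores slightly higher.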

10g

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(salesRegression2)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
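These intervals are the estimate plus or minus a t quantile times the standard error. As a check, here is a sketch that rebuilds two of them from the rounded values printed in the summary for model (e), so small differences in the last digits are expected.

```r
# Estimates and standard errors copied from summary(salesRegression2).
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)
df  <- 397                                  # residual degrees of freedom

tcrit <- qt(0.975, df)                      # about 1.97 for 397 df
ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci
```

Neither interval contains zero, which matches the conclusion in 10d that both coefficients are significant at the 5% level.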

10h

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(salesRegression2)

RESPONSE: Yes. In the bottom right graph of the diagnostic plots (residuals versus leverage), there appears to be at least one high-leverage point in our data.

Question 12

This problem involves simple linear regression without an intercept.

12a

Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

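RESPONSE: The two estimates are the same when Σ x_i^2 = Σ y_i^2.

By (3.38), the no-intercept estimate for the regression of Y onto X is β̂ = Σ x_i y_i / Σ x_i^2, and for the regression of X onto Y it is Σ x_i y_i / Σ y_i^2. The numerators are identical, so the estimates agree exactly when the denominators do (or, trivially, when Σ x_i y_i = 0 and both are zero).

For comparison, in the model with an intercept the estimates are

B_0 = ybar - B_1 xbar

B_1 = r (sy / sx), where r is the correlation coefficient

so the slopes of y = B_0 + B_1 x and x = B_0 + B_1 y match when sx = sy, for example when both variables are standardized so that sx = sy = 1.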

12b

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
x <- 1:100
sum(x^2)

## [1] 338350

y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)  # roughly 4 * sum(x^2), since y is about 2 * x

fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.223590 -0.062560  0.004426  0.058507  0.230926 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x 2.0001514  0.0001548   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16

summary(fit.X)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115418 -0.029231 -0.002186  0.031322  0.111795 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y 5.00e-01   3.87e-05   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16

12c

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x <- 1:100
y <- 100:1
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

summary(fit.X)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
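The matching estimates can be confirmed from the closed-form no-intercept formula. In this sketch, `betaYonX`, `betaXonY`, and the permuted vector `yPerm` are names introduced for illustration; the permutation example is an extra assumption, not part of the assignment.

```r
x <- 1:100
y <- 100:1

# Closed-form no-intercept estimates: sum(x * y) / sum(x^2) and sum(x * y) / sum(y^2).
betaYonX <- sum(x * y) / sum(x^2)
betaXonY <- sum(x * y) / sum(y^2)

sum(x^2) == sum(y^2)                 # same values in a different order
c(betaYonX, betaXonY)                # both about 0.5075

# Any reordering of x works too (extra example, not from the assignment).
set.seed(1)
yPerm <- sample(x)
all.equal(unname(coef(lm(yPerm ~ x + 0))), unname(coef(lm(x ~ yPerm + 0))))
```

Because y is just x in reverse order, the two sums of squares are identical, so the shared numerator is divided by the same denominator in both regressions.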

Assignment 2

Joe Negrete

6/8/2020
