The KNN classifier is a method for predicting qualitative responses. It estimates the conditional distribution of Y given X and then assigns an observation to the class with the highest estimated probability. Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, denoted N0, and then estimates the conditional probability for class j as the fraction of points in N0 whose response equals j.
KNN regression is a non-parametric method for predicting quantitative responses. Given a value for K and a prediction point x0, it first identifies the K training observations that are closest to x0, denoted N0, and then estimates f(x0) as the average of the training responses in N0.
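As a rough illustration (not part of the original solutions), a minimal base-R sketch of both ideas could look like the following; the helper name knn_predict and the toy data are made up for this example.
# Minimal KNN sketch: find the K nearest training points to x0, then either take the
# majority class (classification) or the mean response (regression).
knn_predict <- function(x_train, y_train, x0, K = 3, type = c("class", "reg")) {
  type <- match.arg(type)
  d <- sqrt(colSums((t(as.matrix(x_train)) - x0)^2))  # Euclidean distances to x0
  N0 <- order(d)[1:K]                                 # indices of the K nearest neighbours
  if (type == "class") {
    names(which.max(table(y_train[N0])))              # class with the highest estimated probability
  } else {
    mean(y_train[N0])                                 # average of the training responses in N0
  }
}
set.seed(1)
x <- rnorm(50); y <- 2 * x + rnorm(50)
knn_predict(x, y, x0 = 0.5, K = 5, type = "reg")      # KNN regression estimate at x0 = 0.5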
Auto data set. A scatterplot matrix including all of the variables in the data set:
library(ISLR)   # provides the Auto and Carseats data sets (harmless if already attached)
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lmAuto <- lm(mpg~.-name, data=Auto)
summary(lmAuto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Null hypothesis (H0): the model is not useful, i.e., β1 = β2 = β3 = β4 = β5 = β6 = β7 = 0.
Alternative hypothesis (H1): the model is useful, i.e., at least one βj is not 0 (there is a relationship between the predictors and the response).
Since the F-statistic p-value (< 2.2e-16) in the model summary above is far smaller than the significance level (0.05), there is evidence to reject the null hypothesis; that is, there is a relationship between the predictors and the response (at least one βj is not 0).
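As a quick cross-check (not part of the original output), the overall F-test p-value can be recovered from the fitted model object:
fstat <- summary(lmAuto)$fstatistic                          # value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)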
From the model summary above, we can also check each predictor's t-statistic p-value. The p-values for displacement, weight, year and origin are much smaller than the significance level (0.05), so for these predictors we can reject the null hypothesis of no relationship with the response. This indicates that displacement, weight, year and origin have a statistically significant relationship to the response.
On average, mpg is predicted to increase by 0.750773 when year increases by one unit, holding all other predictors fixed; that is, a car's mileage increases by about 0.75 mpg per model year.
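A small sketch of how this interpretation can be checked numerically; the newcars values below are arbitrary illustrative inputs, not taken from the data.
# Two otherwise-identical hypothetical cars, one model year apart.
newcars <- data.frame(cylinders = 4, displacement = 100, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77), origin = 1)
diff(predict(lmAuto, newcars))   # equals the year coefficient, about 0.75 mpg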
par(mfrow=c(2,2))
plot(lmAuto)
From the Normal Q-Q plot, we see a wide right tail that curves upward, and in the scale-location plot (sqrt of |standardized residuals|) a few points lie above 2.0, indicating a few outliers; both suggest a violation of the normality assumption. The residuals-vs-fitted plot shows a slight pattern, which indicates mild non-linearity in the data. There is also one high-leverage point (observation 14).
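The flagged high-leverage observation can be located directly from the fitted model (a quick check, not part of the original output):
which.max(hatvalues(lmAuto))   # index of the observation with the largest leverage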
It is easy to include interaction terms in a linear model using the lm() function. For example, if a and b are predictors and y is the response, the syntax a:b tells R to include an interaction term between a and b, while a*b simultaneously includes a, b, and the interaction a:b as predictors; it is shorthand for a + b + a:b.
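An illustrative sketch of this shorthand using two of the Auto predictors (fit1 and fit2 are throwaway names for this example, not part of the original analysis):
# These two calls fit the same model: horsepower*weight expands to the main
# effects plus the interaction term.
fit1 <- lm(mpg ~ horsepower * weight, data = Auto)
fit2 <- lm(mpg ~ horsepower + weight + horsepower:weight, data = Auto)
all.equal(coef(fit1), coef(fit2))   # TRUE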
From the scatterplot matrix in section 9(a), we can see that displacement and weight are highly correlated (and each also has a significant individual relationship with mpg). So we will try an interaction term between displacement and weight along with all the individual predictors (that is, add this interaction term to the linear regression from section 9(c)), as below.
lmautointeraction1 = lm(mpg ~ . - name + displacement:weight,data=Auto)
summary(lmautointeraction1)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
From the above model, the interaction term's p-value is much smaller than the significance level (0.05), indicating that the displacement:weight interaction has a significant relationship with mpg. We can also see that R2 has increased relative to the model in section 9(c), which further suggests the interaction term improves the model.
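As a further check (not in the original write-up), a partial F-test comparing the two nested models tells the same story:
anova(lmAuto, lmautointeraction1)   # F-test for adding displacement:weight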
Let’s also try an interaction between the other significant predictors, year and origin.
lmautointeraction2 = lm(mpg ~ . - name + year:origin,data=Auto)
summary(lmautointeraction2)
##
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6072 -2.0439 -0.0596 1.7121 12.3368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.492e+00 9.044e+00 0.939 0.348353
## cylinders -5.042e-01 3.192e-01 -1.579 0.115082
## displacement 1.567e-02 7.530e-03 2.081 0.038060 *
## horsepower -1.399e-02 1.364e-02 -1.025 0.305786
## weight -6.352e-03 6.449e-04 -9.851 < 2e-16 ***
## acceleration 9.185e-02 9.766e-02 0.941 0.347546
## year 4.189e-01 1.125e-01 3.723 0.000226 ***
## origin -1.405e+01 4.699e+00 -2.989 0.002978 **
## year:origin 1.989e-01 6.030e-02 3.298 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8228
## F-statistic: 227.9 on 8 and 383 DF, p-value: < 2.2e-16
From the above model, the interaction term's p-value is again much smaller than the significance level (0.05), indicating that the year:origin interaction has a significant relationship with mpg. However, R2 increases only slightly compared to the model in section 9(c), so this interaction contributes less to the model than the displacement:weight interaction.
From the linear regression model in section 9(c), cylinders, horsepower and acceleration do not have a significant relationship with mpg, judging by their t-statistic p-values. So let's try some transformations of those variables and refit the linear regression with them.
lmTransformed = lm(mpg ~ . - name + log(horsepower) + sqrt(cylinders) + I(acceleration^2),data=Auto)
summary(lmTransformed)
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower) + sqrt(cylinders) +
## I(acceleration^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3131 -1.6627 -0.0949 1.5201 12.1563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.710e+01 1.481e+01 6.556 1.80e-10 ***
## cylinders 2.149e+00 2.365e+00 0.909 0.363972
## displacement -6.094e-03 7.265e-03 -0.839 0.402078
## horsepower 1.500e-01 2.621e-02 5.725 2.10e-08 ***
## weight -3.255e-03 6.665e-04 -4.884 1.54e-06 ***
## acceleration -1.219e+00 5.523e-01 -2.207 0.027929 *
## year 7.430e-01 4.523e-02 16.427 < 2e-16 ***
## origin 8.697e-01 2.538e-01 3.427 0.000676 ***
## log(horsepower) -2.443e+01 2.910e+00 -8.394 9.34e-16 ***
## sqrt(cylinders) -1.029e+01 1.113e+01 -0.925 0.355777
## I(acceleration^2) 2.717e-02 1.619e-02 1.678 0.094077 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.951 on 381 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.857
## F-statistic: 235.4 on 10 and 381 DF, p-value: < 2.2e-16
Even after applying a few transformations to the variables that were not significant in the section 9(c) model, only horsepower (via its log transformation) now shows a significant relationship with mpg. Let's check this with a plot.
plot(log(Auto$horsepower), Auto$mpg)
The plot above indicates that the log transformation of horsepower has an approximately linear, negative relationship with mpg (which is also reflected in the negative coefficient of -2.443e+01).
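A quick supplementary check (not part of the original output): a simple regression of mpg on log(horsepower) alone also shows the strong negative fit.
summary(lm(mpg ~ log(horsepower), data = Auto))$coefficients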
Carseats data set. A multiple regression model to predict Sales using Price, Urban, and US:
lmSales1 <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lmSales1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
lmSales1$coefficients
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
Since the t-statistic p-values of Price and US (Yes = 1, No = 0) in the model from section (a) are much smaller than the significance level, these predictors have a significant linear relationship with Sales.
Let's check the coefficients; they can be interpreted as below (Sales is recorded in thousands of units, and the dummy coding is confirmed in the sketch after this list):
Price: on average, a 1 dollar increase in Price decreases Sales by about 54.46 units, holding all other predictors fixed.
Urban (Yes = 1, No = 0): on average, unit Sales at an urban store are about 21.92 lower than at a rural store, holding all other predictors fixed.
US (Yes = 1, No = 0): on average, unit Sales at a US store are about 1,200.57 higher than at a non-US store, holding all other predictors fixed.
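The dummy coding R uses for the two qualitative predictors (Yes mapped to 1) can be confirmed directly; this check is not part of the original output.
contrasts(Carseats$Urban)
contrasts(Carseats$US)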
In general, the model can be written as:
\[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} + 1.200573 \times \text{US} \]
where Urban = 1 for an urban store and 0 for a rural store, and US = 1 for a US store and 0 for a non-US store. Substituting the dummy values gives the following sub-models:
Urban = 1: \[ \widehat{\text{Sales}} = (13.043469 - 0.021916) - 0.054459 \times \text{Price} + 1.200573 \times \text{US} \]
Urban = 0: \[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} + 1.200573 \times \text{US} \]
US = 1: \[ \widehat{\text{Sales}} = (13.043469 + 1.200573) - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} \]
US = 0: \[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} \]
Since the t-statistic p-values of Price and US (Yes = 1) in the section (a) summary are much smaller than the significance level (0.05), these predictors have a significant linear relationship with Sales; that is, we have evidence to reject the null hypothesis for them.
For Urban, the t-statistic p-value in the section (a) summary is much larger than the significance level (0.05), so we do not have evidence to reject the null hypothesis; there is no significant relationship between Sales and Urban. We can therefore remove Urban from the model in section (a).
lmUpdatedModel <- lm(Sales ~ Price + US, data = Carseats)
summary(lmUpdatedModel)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have an R2 of 23.93%, the proportion of the variance in Sales that they explain. Comparing the adjusted R2 of models (a) and (e), however, there is a slight increase from 23.35% to 23.54% when Urban is dropped, which indicates that the extra predictor Urban does not improve the fit. So adding variables or predictors does not always increase a model's goodness of fit; in this case, the model defined in (e) is better than the model defined in (a).
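A quick way (not shown in the original output) to pull out the adjusted R2 values quoted above:
summary(lmSales1)$adj.r.squared        # model (a), with Urban
summary(lmUpdatedModel)$adj.r.squared  # model (e), without Urban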
The 95% confidence intervals for Price and US (Yes = 1) coefficients are below:
confint(lmUpdatedModel)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
For instance, the 95% confidence interval for Price is between -0.06475984 and -0.04419543.
par(mfrow = c(2, 2))
plot(lmUpdatedModel)
From the Residuals vs Leverage plot, there is some evidence of a few outliers (standardized residuals greater than 2 or less than -2), and also of a few high-leverage points whose leverage is well above the average leverage (p + 1)/n = (2 + 1)/400 = 0.0075.
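These can also be counted directly from the fitted model (a supplementary check, not part of the original output):
sum(abs(rstudent(lmUpdatedModel)) > 2)                      # potential outliers
sum(hatvalues(lmUpdatedModel) > (2 + 1) / nrow(Carseats))   # leverage above (p + 1)/n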
sprintf("The number of observations in Carseats = %d", nrow(Carseats))
## [1] "The number of observations in Carseats = 400"
Using the same equation from (3.38),
The Coefficient estimate for the regression of Y onto X is
\[ \hat{\beta}_a = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}x_j^2} \]
The Coefficient estimate for the regression of X onto Y is
\[ \hat{\beta}_b = \frac{\sum_{i = 1}^{n}y_ix_i}{\sum_{j = 1}^{n}y_j^2} = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}y_j^2} \]
Setting the two coefficient estimates equal to each other, the condition resolves to:
\[ \hat{\beta}_a = \hat{\beta}_b \iff \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}x_j^2} = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}y_j^2} \iff \sum_{j = 1}^{n}x_j^2 = \sum_{j = 1}^{n}y_j^2 \]
So the coefficient estimate for the regression of Y onto X will be the same as the coefficient estimate for the regression of X onto Y when the following condition is met:
\[ \sum_{j = 1}^{n}x_j^2 = \sum_{j = 1}^{n}y_j^2 \]
Randomly generate 100 observations for X and construct Y from X with an error term.
#Random 100 observations for X
set.seed(1)
X <- rnorm(100)
Y <- 3 * X + rnorm(100, sd = 3)
sprintf("Sum of X^2^ is = %f", sum(X^2))
## [1] "Sum of X^2^ is = 81.055093"
sprintf("Sum of Y^2^ is = %f", sum(Y^2))
## [1] "Sum of Y^2^ is = 1539.368887"
\[ \sum_{j = 1}^{n}x_j^2 = 81.055093 \]
\[ \sum_{j = 1}^{n}y_j^2 = 1539.368887 \]
It is evident from the above that \[ \hat{\beta}_a \neq \hat{\beta}_b \]
Since the two sums of squares differ, the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X. We can also verify this with the linear regression summaries of Y onto X and X onto Y below.
lmY <- lm(Y ~ X + 0)
summary(lmY)
##
## Call:
## lm(formula = Y ~ X + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7461 -1.9415 -0.5312 1.5167 6.9327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 2.9816 0.3194 9.334 3.1e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.876 on 99 degrees of freedom
## Multiple R-squared: 0.4681, Adjusted R-squared: 0.4627
## F-statistic: 87.13 on 1 and 99 DF, p-value: 3.1e-15
lmX <- lm(X ~ Y + 0)
summary(lmX)
##
## Call:
## lm(formula = X ~ Y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35410 -0.37468 0.09974 0.48799 1.55406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 0.15700 0.01682 9.334 3.1e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6599 on 99 degrees of freedom
## Multiple R-squared: 0.4681, Adjusted R-squared: 0.4627
## F-statistic: 87.13 on 1 and 99 DF, p-value: 3.1e-15
From both regression summaries, we see that the estimated coefficients are different (the coefficient estimate on X is 2.9816 and the coefficient estimate on Y is 0.15700).
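As an extra check (not in the original write-up), these two no-intercept slopes satisfy
\[ \hat{\beta}_a \hat{\beta}_b = \frac{(\sum_{i = 1}^{n}x_iy_i)^2}{\sum_{j = 1}^{n}x_j^2 \sum_{j = 1}^{n}y_j^2} \]
which, for these through-the-origin fits, equals the R2 of 0.4681 reported in both summaries:
coef(lmY) * coef(lmX)   # approximately 0.468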
Randomly generate 100 observations for X and set Y equal to the same observations.
#Random 100 observations for X
set.seed(2)
X <- rnorm(100)
Y <- X
sprintf("Sum of X^2^ is = %f", sum(X^2))
## [1] "Sum of X^2^ is = 133.352149"
sprintf("Sum of Y^2^ is = %f", sum(Y^2))
## [1] "Sum of Y^2^ is = 133.352149"
\[ \sum_{j = 1}^{n}x_j^2 = 133.352149 \]
\[ \sum_{j = 1}^{n}y_j^2 = 133.352149 \]
It is evident from the above that \[ \hat{\beta}_a = \hat{\beta}_b \]
Since the two sums of squares are equal, the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X. We can also verify this with the linear regression summaries of Y onto X and X onto Y below.
lmYSame <- lm(Y ~ X + 0)
summary(lmYSame)
## Warning in summary.lm(lmYSame): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Y ~ X + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.024e-16 -1.308e-17 7.990e-18 4.566e-17 2.532e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 1.000e+00 2.287e-17 4.373e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.913e+33 on 1 and 99 DF, p-value: < 2.2e-16
lmXSame <- lm(X ~ Y + 0)
summary(lmXSame)
## Warning in summary.lm(lmXSame): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = X ~ Y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.024e-16 -1.308e-17 7.990e-18 4.566e-17 2.532e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 1.000e+00 2.287e-17 4.373e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.913e+33 on 1 and 99 DF, p-value: < 2.2e-16
From both regression summaries, we see that the estimated coefficients are the same (the coefficient estimate on X is 1.000e+00 and the coefficient estimate on Y is 1.000e+00).
Comment on the output. For instance: