## corrplot 0.84 loaded
Carefully explain the differences between the KNN classifier and KNN regression methods.
RESPONSE: The KNN classifier solves a classification problem: it predicts which class a qualitative response belongs to by estimating local class probabilities from the K nearest neighbors and assigning the most probable class. KNN regression solves a regression problem: it predicts the value of a quantitative response as the local average of the responses of the K nearest neighbors.
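The contrast can be sketched in a few lines of base R. This is only an illustration for a single numeric predictor: `knn_predict()` and the toy data are hypothetical and are not part of the exercise.

```r
# Minimal base-R sketch of K-nearest neighbours for one numeric predictor.
# knn_predict() is a hypothetical helper written for this illustration.
knn_predict <- function(x_train, y_train, x0, k = 3) {
  # indices of the k training points closest to x0
  nearest <- order(abs(x_train - x0))[seq_len(k)]
  neighbours <- y_train[nearest]
  if (is.factor(neighbours)) {
    # classification: estimated class probabilities, then the most likely class
    probs <- table(neighbours) / k
    names(probs)[which.max(probs)]
  } else {
    # regression: local average of the neighbours' responses
    mean(neighbours)
  }
}

# Toy training data (made up for the example)
x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- factor(c("low", "low", "low", "high", "high", "high"))
y_value <- c(1, 2, 3, 6, 7, 8)

class_pred <- knn_predict(x_train, y_class, x0 = 2.5, k = 3)  # a class label
value_pred <- knn_predict(x_train, y_value, x0 = 2.5, k = 3)  # a number
```

The only difference between the two branches is the last step: a vote over class labels versus an average of numeric responses.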
This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
plot(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[, names(Auto) !="name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
corrauto<-cor(Auto[, names(Auto) !="name"])
corrplot(corrauto, method = "number")
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
mpgReg <- lm(mpg~. -name, data = Auto)
summary(mpgReg)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
RESPONSE:
Yes, there is a relationship between the predictors and the response variable ‘mpg’. The F-statistic of 252.4 has a p-value below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero. Some predictors are more significant than others: the significance codes (‘***’, ‘**’, ‘*’) in the output flag them at a glance, and the same conclusion follows from the p-values, where a predictor with a p-value above .05 is conventionally treated as not statistically significant. The model has an R-squared of .8215, meaning that 82.15% of the variance in ‘mpg’ is explained by the predictor variables.
As noted above, the output flags the significant predictors. The intercept is not a predictor, so it is not considered here. The p-values show that ‘displacement’, ‘weight’, ‘year’, and ‘origin’ have a statistically significant relationship with the response variable ‘mpg’, while ‘cylinders’, ‘horsepower’, and ‘acceleration’ do not.
Recall that the coefficients are estimates of the unknown parameters of the linear model, where β0 is the intercept and each βj is the slope for predictor Xj: the average change in Y associated with a one-unit increase in Xj when all other predictors are held constant. The true relationship is generally not known, so we use these estimates to make our inferences. The coefficient for ‘year’ suggests that ‘mpg’ increases by about .75 for each additional model year with all other variables held constant; that is, newer cars tend to be more fuel efficient.
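The "all other predictors held constant" reading of a coefficient can be checked directly with predict(). The sketch below uses simulated data rather than Auto, so the variable names, sample size, and true coefficients are assumptions chosen only to resemble the exercise.

```r
# Simulated stand-in for the Auto data (all values are made up)
set.seed(1)
n <- 200
sim <- data.frame(year = sample(70:82, n, replace = TRUE),
                  weight = runif(n, 1600, 5000))
sim$mpg <- -15 + 0.75 * sim$year - 0.006 * sim$weight + rnorm(n, sd = 3)

fit <- lm(mpg ~ year + weight, data = sim)

# Two hypothetical cars that differ only by one model year
car_a <- data.frame(year = 75, weight = 3000)
car_b <- data.frame(year = 76, weight = 3000)

# The gap between the two predictions is exactly the 'year' coefficient
gap <- unname(predict(fit, car_b) - predict(fit, car_a))
year_coef <- unname(coef(fit)["year"])
all.equal(gap, year_coef)
```

Because the model is linear, the gap does not depend on the weight chosen for the two cars, only on the one-year difference.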
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(mpgReg)
The diagnostic plots suggest the following.
The upper left graph plots the residuals against the fitted values; the pattern in the residuals suggests a non-linear relationship that the model does not capture.
The upper right graph is the Q-Q plot, which lets us assess the distribution of the residuals; they appear roughly normal but may be right skewed.
The bottom left graph seems to show a violation of the homoscedasticity assumption made in linear regression: the residuals form a funnel-like shape, with their variance increasing as the fitted values increase.
The bottom right graph helps us identify outliers or high-leverage points that could affect the estimated regression line. There are a few outliers and, definitely, one potential high-leverage point (observation 14).
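The visual reading of the residual and leverage plots can be backed up numerically. The helper below is a hypothetical sketch: `flag_unusual()` and its cutoffs (studentized residuals beyond 3 in absolute value, leverage above three times the average (p + 1)/n) are rules of thumb assumed for illustration, not fixed limits. It is demonstrated on the built-in `cars` data with one artificial extreme point.

```r
# flag_unusual() is a hypothetical helper; the cutoffs are rules of thumb.
flag_unusual <- function(fit, resid_cut = 3, lev_mult = 3) {
  lev <- hatvalues(fit)                        # leverage of each observation
  stud <- rstudent(fit)                        # studentized residuals
  avg_lev <- length(coef(fit)) / length(lev)   # average leverage, (p + 1) / n
  list(
    outliers = names(stud)[abs(stud) > resid_cut],
    high_leverage = names(lev)[lev > lev_mult * avg_lev]
  )
}

# Built-in 'cars' data plus one artificial far-out observation (row "51")
demo <- rbind(cars, data.frame(speed = 60, dist = 120))
demo_fit <- lm(dist ~ speed, data = demo)

flags <- flag_unusual(demo_fit)
flags$high_leverage
```

With the Auto data loaded, the same helper could be called as `flag_unusual(mpgReg)` to list the observations behind the points seen in the plots.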
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
mpgReg1 <- lm(mpg~. -name+displacement:weight, data = Auto)
summary(mpgReg1)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
Adding the interaction of ‘displacement’ by ‘weight’, we see that the interaction effect is significant, and the R-squared rises from .8215 to .8588.
mpgReg2 <- lm(mpg~. -name+year:origin, data = Auto)
summary(mpgReg2)
##
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6072 -2.0439 -0.0596 1.7121 12.3368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.492e+00 9.044e+00 0.939 0.348353
## cylinders -5.042e-01 3.192e-01 -1.579 0.115082
## displacement 1.567e-02 7.530e-03 2.081 0.038060 *
## horsepower -1.399e-02 1.364e-02 -1.025 0.305786
## weight -6.352e-03 6.449e-04 -9.851 < 2e-16 ***
## acceleration 9.185e-02 9.766e-02 0.941 0.347546
## year 4.189e-01 1.125e-01 3.723 0.000226 ***
## origin -1.405e+01 4.699e+00 -2.989 0.002978 **
## year:origin 1.989e-01 6.030e-02 3.298 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8228
## F-statistic: 227.9 on 8 and 383 DF, p-value: < 2.2e-16
The interaction effect of ‘year’ by ‘origin’ is statistically significant, but less strongly so than the interaction in the previous model, and the R-squared is little changed (.8264 versus .8215 without the interaction).
mpgReg3 <- lm(mpg~. -name + cylinders*displacement, data = Auto)
summary(mpgReg3)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6081 -1.7833 -0.0465 1.6821 12.2617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7096590 4.6858582 -0.578 0.563426
## cylinders -2.6962123 0.4094916 -6.584 1.51e-10 ***
## displacement -0.0774797 0.0141535 -5.474 7.96e-08 ***
## horsepower -0.0476026 0.0133736 -3.559 0.000418 ***
## weight -0.0052339 0.0006253 -8.370 1.10e-15 ***
## acceleration 0.0597997 0.0918038 0.651 0.515188
## year 0.7594500 0.0473354 16.044 < 2e-16 ***
## origin 0.7087399 0.2736917 2.590 0.009976 **
## cylinders:displacement 0.0136081 0.0017209 7.907 2.84e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.089 on 383 degrees of freedom
## Multiple R-squared: 0.8465, Adjusted R-squared: 0.8433
## F-statistic: 264.1 on 8 and 383 DF, p-value: < 2.2e-16
As with the other two, this interaction of ‘cylinders’ by ‘displacement’ is also statistically significant.
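Two details of the interaction syntax may be worth spelling out: `a * b` expands to `a + b + a:b`, while `a:b` adds only the product term, and a single added interaction can also be tested with a nested-model F-test through anova(). The sketch below uses placeholder variable names and the built-in `mtcars` data as a stand-in for Auto, so the specific variables are assumptions.

```r
# `a * b` is shorthand for `a + b + a:b`
star_terms <- attr(terms(y ~ a * b), "term.labels")
colon_terms <- attr(terms(y ~ a + b + a:b), "term.labels")
identical(star_terms, colon_terms)

# Nested-model F-test on mtcars (a stand-in for the Auto data)
base_fit <- lm(mpg ~ disp + wt, data = mtcars)
inter_fit <- lm(mpg ~ disp * wt, data = mtcars)
cmp <- anova(base_fit, inter_fit)

# For one added term, the F-test p-value matches the t-test p-value
# reported for the interaction in summary()
p_from_anova <- cmp[2, "Pr(>F)"]
p_from_summary <- summary(inter_fit)$coefficients["disp:wt", "Pr(>|t|)"]
```

This is why adding `cylinders*displacement` to a formula that already contains both main effects gives the same fit as adding `cylinders:displacement`.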
Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
AutoTransform <- subset( Auto, select = -name )
plot(Auto)
# Scatterplot matrices of the transformed variables (mpg is already a column, so no second argument is needed)
plot(log(AutoTransform))
plot(sqrt(AutoTransform))
plot(AutoTransform^2)
par(mfrow = c(2,2))
AutoLog <- lm(log(mpg) ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = Auto)
plot(AutoLog)
From the transformation of the variables, it seems that the log transformation of ‘mpg’ creates a more linear relationship and fixes the heteroscedasticity we were seeing earlier; that is, the residuals of the model now appear homoscedastic (equal variance).
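Why a log transformation of the response can remove a funnel shape is easiest to see on data where the noise is multiplicative. The sketch below uses simulated data, not Auto, and `spread_ratio()` is a hypothetical helper that compares the residual spread in the upper and lower halves of the fitted values; a ratio near 1 suggests roughly constant variance.

```r
# Simulated response with multiplicative noise (all values are made up)
set.seed(42)
x <- seq(0, 4, length.out = 400)
y <- exp(1 + 0.5 * x + rnorm(400, sd = 0.3))

# spread_ratio() is a hypothetical helper: residual SD where the fitted
# values are above their median, divided by the residual SD below it.
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

raw_ratio <- spread_ratio(lm(y ~ x))        # funnel: spread grows with the fit
log_ratio <- spread_ratio(lm(log(y) ~ x))   # spread roughly constant
c(raw = raw_ratio, log = log_ratio)
```

On the log scale the noise becomes additive with constant variance, which is the pattern the diagnostic plots for AutoLog are checked against.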
This question should be answered using the Carseats data set.
Fit a multiple regression model to predict Sales using Price, Urban, and US.
data("Carseats")
salesRegression <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(salesRegression)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
RESPONSE:
Intercept: This is the expected value of Sales (about 13.043 thousand units) when Price = 0 and the store is neither in an urban location nor in the US.
Price: A one-unit increase in Price is associated with a decrease in sales of about 54.459 units, with all other predictors held constant. The Carseats data set defines Sales as unit sales in thousands, so the coefficient of -0.054459 corresponds to 54.459 units.
UrbanYes: This qualitative predictor is a factor with levels No and Yes indicating whether the store is in an urban or rural location, coded as 1 for urban and 0 for rural. On average, sales are about 21.916 units lower at urban stores with all other predictors held constant, although the p-value of 0.936 means this difference is not statistically significant.
USYes: The other qualitative predictor is a factor with levels No and Yes indicating whether the store is in the US, coded as 1 for Yes and 0 for No. On average, sales are about 1,200 units higher at US stores with all other predictors held constant.
Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.043469 - 0.054459 × Price - 0.021916 × Urban + 1.200573 × US + ε
where Urban = 1 if the store is in an urban location, else Urban = 0
where US = 1 if the store is in the US, else US = 0
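The 0/1 coding of the qualitative variables can be seen by building the design matrix for a single store. This is a sketch: the coefficients are copied from the summary output above, while the store itself (Price = 100, urban, in the US) is a hypothetical example.

```r
# Coefficients copied from the summary(salesRegression) output
b <- c(13.043469, -0.054459, -0.021916, 1.200573)

# Hypothetical store: Price = 100, urban, located in the US
store <- data.frame(Price = 100,
                    Urban = factor("Yes", levels = c("No", "Yes")),
                    US = factor("Yes", levels = c("No", "Yes")))

# model.matrix() shows the dummy columns R creates for the factors
X <- model.matrix(~ Price + Urban + US, data = store)
colnames(X)   # "(Intercept)" "Price" "UrbanYes" "USYes"

# Predicted sales, in thousands of units
predicted_sales <- as.numeric(X %*% b)
```

For this store the equation gives 13.043469 - 5.4459 - 0.021916 + 1.200573 = 8.776226, or roughly 8,776 units.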
For which of the predictors can you reject the null hypothesis H0 : βj = 0?
RESPONSE: Price and USYes
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
salesRegression2 <- lm(Sales ~ Price + US, data = Carseats)
summary(salesRegression2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data?
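RESPONSE: Neither model fits the data especially well: each explains only about 23.9% of the variance in Sales (R-squared of .2393 in both). Dropping Urban changes the fit very little; the smaller model has a slightly lower residual standard error (2.469 versus 2.472) and a slightly higher adjusted R-squared (.2354 versus .2335), so it is marginally preferable.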
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(salesRegression2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
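These intervals are the estimate plus or minus a t critical value times the standard error. As a rough check, the interval for Price can be rebuilt by hand from the rounded values printed by summary(salesRegression2); because the inputs are rounded, the result should agree with confint() only to about four decimal places.

```r
# Estimate and standard error for Price, copied from summary(salesRegression2)
est <- -0.05448
se <- 0.00523
df <- 397                      # residual degrees of freedom

t_crit <- qt(0.975, df)        # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se
round(ci, 4)                   # -0.0648 -0.0442
```

Since the interval for Price excludes zero, it agrees with rejecting H0 : βj = 0 for that predictor.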
Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(salesRegression2)
RESPONSE: Yes. In the bottom right diagnostic plot (residuals versus leverage), there appears to be one high-leverage point in our data.
This problem involves simple linear regression without an intercept.
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
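RESPONSE: For regression without an intercept, the coefficient estimate for Y onto X is β̂ = Σ xi yi / Σ xi^2, and the estimate for X onto Y is Σ xi yi / Σ yi^2. The numerators are identical, so the two estimates are the same when Σ xi^2 = Σ yi^2 (or, trivially, when Σ xi yi = 0).
For comparison, in regression with an intercept:
Intercept: B_0 = ybar - B_1 xbar
Slope: B_1 = r (sy/sx), where r is the correlation coefficient
so the slopes of y = B_0 + B_1 x and x = B_0 + B_1 y are equal when sx = sy (for example, sx = sy = 1 for standardized variables).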
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)  # roughly four times sum(x^2), so the two estimates differ
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100
y <- 100:1
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
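The two fits share the estimate 0.5075 because y is just x in reverse order, so the sums of squares are equal. A short check of the closed-form estimate for regression through the origin, Σ xi yi divided by the sum of squares of the predictor, against lm() (the variable names below are chosen for this sketch):

```r
x <- 1:100
y <- 100:1

# Closed-form slopes for regression without an intercept
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# The same slopes estimated with lm()
lm_y_on_x <- unname(coef(lm(y ~ x + 0)))
lm_x_on_y <- unname(coef(lm(x ~ y + 0)))

# Equal sums of squares give equal slopes
sum(x^2) == sum(y^2)
c(beta_y_on_x, beta_x_on_y)
```

Any y obtained by permuting x, or by flipping the signs of some of its entries, would satisfy the same condition.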