Assignment 2: Linear Regression

Problem 2:

Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN Classifier - A non-parametric approach used for classification problems, i.e., predicting qualitative responses. The KNN classifier first identifies the K training points nearest to a test observation \(x_0\) (its neighborhood), then estimates the conditional probability \(P(Y = j \mid X = x_0)\) for class j as the fraction of points in the neighborhood whose response equals j, and finally assigns \(x_0\) to the class with the highest estimated probability.

KNN Regression - A non-parametric method used for regression problems, i.e., predicting quantitative responses. KNN regression identifies the K training observations closest to \(x_0\) (denoted \(N_0\)) and then estimates \(f(x_0)\) as the average of the training responses in this “neighborhood.”
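
To make the contrast concrete, below is a minimal sketch in R. It assumes the class and FNN packages are installed; the toy data are purely illustrative.

library(class)  # class::knn() for classification
library(FNN)    # FNN::knn.reg() for regression

set.seed(1)
x.train = matrix(rnorm(100), ncol = 2)  # 50 training points in 2 dimensions
x.test  = matrix(rnorm(10), ncol = 2)   # 5 test points
y.qual  = factor(ifelse(x.train[, 1] + x.train[, 2] > 0, "A", "B"))  # qualitative response
y.quant = x.train[, 1] + rnorm(50)                                   # quantitative response

# Classifier: majority vote among the K nearest neighbors
class::knn(train = x.train, test = x.test, cl = y.qual, k = 5)

# Regression: average of the K nearest training responses
FNN::knn.reg(train = x.train, test = x.test, y = y.quant, k = 5)$pred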

Problem 9:

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
Auto = read.csv("C:/Users/selen/OneDrive/Documents/Summer 2020 - MSDA/DA 6543 Algorithms II/Data/Auto.csv", header=T, na.strings="?")
Auto = na.omit(Auto)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int  33 127 331 337 355
##   ..- attr(*, "names")= chr  "33" "127" "331" "337" ...
  a. Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)

  b. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Upon reviewing the correlation matrix, there appears to be at least a moderate association between almost all of the variables. Notably, there is a consistently strong association among mpg, cylinders, displacement, horsepower, and weight.

auto.cor = cor(Auto[,names(Auto) !="name"])
auto.cor
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
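To make the strongest pairs easier to spot, here is a small sketch that lists the predictor pairs whose absolute correlation exceeds an arbitrarily chosen threshold of 0.85.

cc = auto.cor
cc[upper.tri(cc, diag = TRUE)] = NA  # keep each pair only once
idx = which(abs(cc) > 0.85, arr.ind = TRUE)
data.frame(var1 = rownames(cc)[idx[, 1]],
           var2 = colnames(cc)[idx[, 2]],
           r = round(cc[idx], 3))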
attach(Auto)
  c. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
auto.lm = lm(mpg~.-name, data=Auto)
summary(auto.lm)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  i. Is there a relationship between the predictors and the response?

Looking at the F-statistic’s p-value of < 2.2e-16 (essentially zero), there is strong evidence that at least one of the predictors is associated with our response variable, miles per gallon (mpg). With such a small p-value, we have sufficient evidence to reject \(H_0\): \(\beta_1 = \beta_2 = \dots = \beta_p = 0\) and conclude that at least one predictor is related to the response. Consistent with \(H_a\) being true, the F-statistic of 252.4 is far greater than 1.
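As a quick check (a sketch using the objects above), the overall F-test is equivalent to comparing the fitted model against the intercept-only model:

# The F-statistic and p-value here match the summary() output above
anova(lm(mpg ~ 1, data = Auto), auto.lm)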

  ii. Which predictors appear to have a statistically significant relationship to the response?

Looking at the t-test p-values, it appears that displacement, weight, year, and origin have a statistically significant relationship to the response variable, miles per gallon (mpg), as their p-values are close to 0.

  iii. What does the coefficient for the year variable suggest?

The coefficient for the year variable suggests that, holding all other predictors fixed, cars become more fuel efficient by approximately 0.75 mpg per model year, on average.
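
A small sketch illustrating this interpretation: the predicted mpg for two otherwise-identical cars one model year apart differs by exactly the year coefficient.

nd = Auto[c(1, 1), ]         # duplicate the first observation
nd$year = nd$year + c(0, 1)  # advance one copy by a single model year
diff(predict(auto.lm, newdata = nd))  # ~0.750773 mpg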

  d. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Plot 1: Residuals vs Fitted

  • The residuals exhibit a slight U-shape, which suggests non-linearity in the data.

  • There is evidence of heteroscedasticity (non-constant variance in the errors), as a funnel shape is apparent: the variance of the error terms increases with the fitted values.

  • R flags observations 323, 326, and 327 as unusually large outliers, visible in the top right-hand corner of the plot.

Plot 2: Normal Q-Q

  • The top right of the Q-Q plot suggests that the residual distribution may be slightly right-skewed.

  • As in the Residuals vs Fitted plot, the Q-Q plot identifies observations 323, 326, and 327 as unusually large outliers.

Plot 3: Scale-Location

  • The Scale-Location plot (\(\sqrt{|\text{standardized residuals}|}\) vs fitted values) further suggests that observations 323, 326, and 327 may be unusually large outliers, with values approaching 2 on this scale.

Plot 4: Residuals vs Leverage

  • The Residuals vs Leverage plot particularly identifies observation 14 as having unusually high leverage, though with a low standardized residual.
par(mfrow = c(2, 2))  # arrange all four diagnostic plots in one panel
plot(auto.lm)
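
As a supplementary sketch, the leverage finding can be confirmed numerically via the hat values:

which.max(hatvalues(auto.lm))  # expected to flag observation 14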

  e. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

For this problem, I fit four linear regression models, two without interaction effects and two with them. The goal is to test whether the additive assumption is realistic in this case. To measure the effect of the interaction terms, I use the change in \(R^2\) as an indicator.

The first model uses the predictors from the three most highly correlated pairs (displacement & cylinders, displacement & weight, and displacement & horsepower) without interaction effects.

The second model uses the same predictors, but with interaction effects. In this model, only the interaction term displacement:horsepower is statistically significant; displacement:cylinders and displacement:weight are not.

The third model includes the predictors from the two most highly correlated pairs (displacement & cylinders and displacement & weight) without interaction effects.

The fourth model includes those same predictors with interaction effects. In this model, only the interaction term displacement:weight is statistically significant.

Comparing the \(R^2\) values within each pair of models, introducing interaction terms increases \(R^2\), which suggests it may be best to relax the additive assumption of linear regression by including interaction terms.

auto.lm2 = lm(mpg~displacement + weight + cylinders + horsepower, data = Auto)

auto.lm2i = lm(mpg~displacement + weight + cylinders + horsepower + displacement:cylinders + displacement:weight + displacement:horsepower, data = Auto)

summary(auto.lm2)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders + horsepower, 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5248  -2.7964  -0.3568   2.2577  16.3221 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.7567705  1.5200437  30.102  < 2e-16 ***
## displacement  0.0001389  0.0090099   0.015 0.987709    
## weight       -0.0052772  0.0007166  -7.364 1.08e-12 ***
## cylinders    -0.3932854  0.4095522  -0.960 0.337513    
## horsepower   -0.0428125  0.0128699  -3.327 0.000963 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.242 on 387 degrees of freedom
## Multiple R-squared:  0.7077, Adjusted R-squared:  0.7046 
## F-statistic: 234.2 on 4 and 387 DF,  p-value: < 2.2e-16
summary(auto.lm2i)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders + horsepower + 
##     displacement:cylinders + displacement:weight + displacement:horsepower, 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1308  -2.1597  -0.3652   1.9001  16.9864 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.584e+01  2.569e+00  21.733  < 2e-16 ***
## displacement            -9.524e-02  1.605e-02  -5.935 6.59e-09 ***
## weight                  -3.803e-03  1.589e-03  -2.394   0.0172 *  
## cylinders                3.330e-01  8.190e-01   0.407   0.6845    
## horsepower              -1.844e-01  2.855e-02  -6.460 3.18e-10 ***
## displacement:cylinders   1.569e-03  3.581e-03   0.438   0.6615    
## displacement:weight      4.258e-06  5.555e-06   0.766   0.4439    
## displacement:horsepower  4.238e-04  9.786e-05   4.331 1.90e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.865 on 384 degrees of freedom
## Multiple R-squared:  0.7591, Adjusted R-squared:  0.7547 
## F-statistic: 172.9 on 7 and 384 DF,  p-value: < 2.2e-16
auto.lm3 = lm(mpg~displacement + weight + cylinders, data = Auto)

auto.lm3i = lm(mpg~displacement + weight + cylinders + displacement*cylinders + displacement*weight, data=Auto)

summary(auto.lm3)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5568  -2.8703  -0.3649   2.2708  16.4338 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.3709616  1.4806851  29.967  < 2e-16 ***
## displacement -0.0126740  0.0082501  -1.536    0.125    
## weight       -0.0057079  0.0007139  -7.995  1.5e-14 ***
## cylinders    -0.2677968  0.4130673  -0.648    0.517    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.297 on 388 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.697 
## F-statistic: 300.8 on 3 and 388 DF,  p-value: < 2.2e-16
summary(auto.lm3i)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + cylinders + displacement * 
##     cylinders + displacement * weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2934  -2.5184  -0.3476   1.8399  17.7723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.262e+01  2.237e+00  23.519  < 2e-16 ***
## displacement           -7.351e-02  1.669e-02  -4.403 1.38e-05 ***
## weight                 -9.888e-03  1.329e-03  -7.438 6.69e-13 ***
## cylinders               7.606e-01  7.669e-01   0.992    0.322    
## displacement:cylinders -2.986e-03  3.426e-03  -0.872    0.384    
## displacement:weight     2.128e-05  5.002e-06   4.254 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7237 
## F-statistic: 205.8 on 5 and 386 DF,  p-value: < 2.2e-16
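As a complement to the \(R^2\) comparison, here is a sketch of formal nested-model F-tests of each base model against its interaction counterpart:

anova(auto.lm2, auto.lm2i)  # tests the three interaction terms jointly
anova(auto.lm3, auto.lm3i)  # tests the two interaction terms jointly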
  f. Try a few different transformations of the variables, such as log(X), √X, and X². Comment on your findings.

For this problem, I will transform the least significant variable using the \(\log(x)\), \(\sqrt{x}\), and \(x^2\) transformations.

As you may recall, in the base linear regression model (auto.lm), acceleration had the highest p-value of any predictor (0.41548), so I began my transformations with this variable. With the log and square-root transformations, the p-value worsens, increasing to 0.9368 and 0.70343 respectively. The squared transformation, by contrast, markedly improves the significance of acceleration, with its p-value dropping close to 0.

For the \(x^2\) transformation of acceleration, we can also see that \(R^2\) increases slightly (from 0.8215 to 0.8316).

summary(auto.lm)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
auto.lm4 = lm(mpg~cylinders + displacement + horsepower + weight + log(acceleration) + year + origin, data = Auto)

summary(auto.lm4)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     log(acceleration) + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7774 -2.1790 -0.1636  1.8434 13.1268 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -15.174273   6.443614  -2.355   0.0190 *  
## cylinders          -0.507167   0.323203  -1.569   0.1174    
## displacement        0.019166   0.007595   2.524   0.0120 *  
## horsepower         -0.024622   0.014198  -1.734   0.0837 .  
## weight             -0.006190   0.000676  -9.157  < 2e-16 ***
## log(acceleration)  -0.129499   1.631402  -0.079   0.9368    
## year                0.747224   0.050993  14.654  < 2e-16 ***
## origin              1.428083   0.278370   5.130  4.6e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.331 on 384 degrees of freedom
## Multiple R-squared:  0.8212, Adjusted R-squared:  0.8179 
## F-statistic: 251.9 on 7 and 384 DF,  p-value: < 2.2e-16
auto.lm5 = lm(mpg~cylinders + displacement + horsepower + weight + sqrt(acceleration) + year + origin, data = Auto)

summary(auto.lm5)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     sqrt(acceleration) + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6792 -2.1496 -0.1413  1.8603 13.0920 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.696e+01  5.556e+00  -3.052  0.00243 ** 
## cylinders          -5.022e-01  3.233e-01  -1.553  0.12117    
## displacement        1.966e-02  7.550e-03   2.604  0.00958 ** 
## horsepower         -2.052e-02  1.401e-02  -1.464  0.14395    
## weight             -6.347e-03  6.639e-04  -9.560  < 2e-16 ***
## sqrt(acceleration)  3.086e-01  8.101e-01   0.381  0.70343    
## year                7.490e-01  5.100e-02  14.687  < 2e-16 ***
## origin              1.428e+00  2.783e-01   5.131 4.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.33 on 384 degrees of freedom
## Multiple R-squared:  0.8212, Adjusted R-squared:  0.818 
## F-statistic:   252 on 7 and 384 DF,  p-value: < 2.2e-16
auto.lm6 = lm(mpg~cylinders + displacement + horsepower + weight + acceleration + I(acceleration^2) + year + origin, data = Auto)

summary(auto.lm6)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + I(acceleration^2) + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9680 -1.9266 -0.0124  1.9153 13.2722 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.1088174  6.4930423   0.787   0.4319    
## cylinders         -0.3181584  0.3165577  -1.005   0.3155    
## displacement       0.0090446  0.0076528   1.182   0.2380    
## horsepower        -0.0346411  0.0139094  -2.490   0.0132 *  
## weight            -0.0054113  0.0006719  -8.053 1.03e-14 ***
## acceleration      -2.6374431  0.5758788  -4.580 6.30e-06 ***
## I(acceleration^2)  0.0790472  0.0165131   4.787 2.42e-06 ***
## year               0.7535781  0.0495815  15.199  < 2e-16 ***
## origin             1.3265929  0.2713219   4.889 1.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.237 on 383 degrees of freedom
## Multiple R-squared:  0.8316, Adjusted R-squared:  0.828 
## F-statistic: 236.3 on 8 and 383 DF,  p-value: < 2.2e-16
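To summarize the comparison, here is a one-line sketch collecting adjusted \(R^2\) across the models fit above:

sapply(list(base = auto.lm, log = auto.lm4, sqrt = auto.lm5, squared = auto.lm6),
       function(m) summary(m)$adj.r.squared)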

However, even with the transformation of acceleration, we are still violating some of the linear regression assumptions:

  1. The residuals vs fitted plot indicates heteroskedasticity (non-constant error variance) in the model.

  2. The Q-Q plot indicates some non-normality of the residuals, with skewness visible on the right side of the plot.

Additionally, the same issues of unusually large outliers and high-leverage points persist, as seen in the residual and leverage plots.

To conclude, a better transformation of the data may be required to improve the soundness and fit of our model.

plot(auto.lm6)

detach(Auto)

Problem 10:

This question should be answered using the Carseats data set.

library(ISLR)
fix(Carseats)
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
attach(Carseats)
  a. Fit a multiple regression model to predict Sales using Price, Urban, and US.
car.lm = lm(Sales~Price+Urban+US, data = Carseats)
summary(car.lm)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  b. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
  • Price: The \(\beta\) value for Price of -0.0545 can be interpreted as the average change in Sales for a one-unit increase in Price, holding all other predictors fixed; since Sales is recorded in thousands of units, each $1 increase in Price is associated with roughly 54 fewer carseats sold.

  • UrbanYes: The \(\beta_0\) of 13.043 can be interpreted as the expected Sales for a non-urban, non-US store. \(\beta_0 + \beta_{UrbanYes}\) is the average Sales for carseats in urban areas, and \(\beta_{UrbanYes}\) is the average difference in Sales between urban and non-urban areas, holding all other predictors fixed. The high p-value for UrbanYes (0.936) suggests there is no statistical evidence of a difference in carseat Sales between urban and non-urban areas.

  • USYes: \(\beta_0 + \beta_{USYes}\) is the average Sales for carseats in stores located in the US, and \(\beta_{USYes}\) (1.2006) is the average difference in Sales between US and non-US stores, holding all other predictors fixed.

  c. Write out the model in equation form, being careful to handle the qualitative variables properly.

\[\widehat{Sales} = 13.043469 - (0.054459 \times Price) - (0.021916 \times Urban_{Yes}) + (1.200573 \times US_{Yes})\]

where \(Urban_{Yes} = 1\) if the store is in an urban location and 0 otherwise, and \(US_{Yes} = 1\) if the store is in the US and 0 otherwise.

  d. For which of the predictors can you reject the null hypothesis \(H_0\): \(\beta_j = 0\)?

We can reject the null hypothesis \(H_0\): \(\beta_j = 0\) for the variables Price and USYes, as their p-values are close to zero. These near-zero p-values indicate clear evidence of a relationship between each of Price and US and the response, Sales.

  e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
car.lm2 = lm(Sales~Price+US, data = Carseats)
summary(car.lm2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  f. How well do the models in (a) and (e) fit the data?

The RSE for the model Sales ~ Price + Urban + US is 2.472. In other words, actual carseat Sales at each store deviate from the true regression line by approximately 2,472 carseats, on average (Sales is recorded in thousands of units). The \(R^{2}\) for this model is 0.2393, meaning that 23.93% of the variance in Sales is explained by Price, Urban, and US.

For Sales ~ Price + US, we see a slightly lower RSE of 2.469, meaning actual carseat Sales at each store deviate from the true regression line by approximately 2,469 carseats, on average. Like the model that includes Urban, the \(R^{2}\) is 0.2393: 23.93% of the variance in Sales is explained by the Price and US variables.

Neither model fits the data well, considering the small \(R^{2}\) and high RSE. It may be best to consider a non-linear regression model, interaction terms, or some transformation of the predictor variables.
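
One way to put the RSE in context (a sketch): express it as a percentage error relative to mean Sales.

summary(car.lm2)$sigma / mean(Carseats$Sales) * 100  # roughly 33% error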

  g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(car.lm2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
  h. Is there evidence of outliers or high leverage observations in the model from (e)?

The Residuals vs Fitted, Normal Q-Q, and Scale-Location plots show evidence that observations 51, 69, and 377 may be potential outliers. Observation 368 is a high-leverage point. The Residuals vs Leverage plot also flags observations 26 and 50 as notable points, but I believe this is because those observations have large standardized residuals rather than high leverage.
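
As a numeric check (a sketch), the largest hat values can be compared against the average leverage \((p+1)/n = 3/400 = 0.0075\):

head(sort(hatvalues(car.lm2), decreasing = TRUE), 3)  # observation 368 should top the list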

plot(car.lm2)

detach(Carseats)

Problem 12:

This problem involves simple linear regression without an intercept.

  a. Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimate for the regression of Y onto X is

\[\hat{\beta} = \frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sum_{j=1}^{n}x^{2}_{j}}\]

The coefficient estimate for the regression of X onto Y is

\[\hat{\beta}' = \frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sum_{j=1}^{n}y^{2}_{j}}\]

The two estimates share the same numerator and differ only in their denominators, so the coefficient estimate for the regression of X onto Y equals the coefficient estimate for the regression of Y onto X exactly when the sum of squares of the observed y-values equals the sum of squares of the observed x-values:

\[{\sum_{j=1}^{n}x^{2}_{j}} = {\sum_{j=1}^{n}y^{2}_{j}}\]

  b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

As seen below, the sums of squares of X and Y differ, so the regression of X onto Y and the regression of Y onto X are expected to have different coefficient estimates.

set.seed(12)
x = rnorm(100)
y = 2*x
sum(x^2)
## [1] 74.18993
sum(y^2)
## [1] 296.7597
lm.fit = lm(y~x+1)   # "+ 1" makes the default intercept explicit
lm.fit2 = lm(x~y+0)  # "+ 0" drops the intercept, matching (3.38)
summary(lm.fit)
## Warning in summary.lm(lm.fit): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x + 1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.404e-15 -1.100e-17  5.550e-17  1.754e-16  5.476e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -4.441e-17  7.756e-17 -5.730e-01    0.568    
## x            2.000e+00  9.005e-17  2.221e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.751e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.933e+32 on 1 and 98 DF,  p-value: < 2.2e-16
summary(lm.fit2)
## Warning in summary.lm(lm.fit2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.474e-16 -3.990e-17 -3.100e-18  3.530e-17  4.977e-15 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## y 5.000e-01  2.964e-17 1.687e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.106e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.845e+32 on 1 and 99 DF,  p-value: < 2.2e-16
  c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

As seen below, the sums of squares of X and Y are identical, so we can expect the coefficient estimate for the regression of X onto Y to equal the coefficient estimate for the regression of Y onto X. (Here y is a sign-flipped permutation of x, which leaves the sum of squares unchanged.)

set.seed(12)
x = rnorm(100)
y = - sample(x, 100)
sum(x^2)
## [1] 74.18993
sum(y^2)
## [1] 74.18993
lm.fit.x = lm(y~x+0)
lm.fit.y = lm(x~y+0)
summary(lm.fit.x)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1608 -0.4869  0.1308  0.5857  2.1343 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.04395    0.10041   0.438    0.663
## 
## Residual standard error: 0.8648 on 99 degrees of freedom
## Multiple R-squared:  0.001931,   Adjusted R-squared:  -0.00815 
## F-statistic: 0.1916 on 1 and 99 DF,  p-value: 0.6626
summary(lm.fit.y)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2370 -0.5915 -0.1117  0.4835  2.1114 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  0.04395    0.10041   0.438    0.663
## 
## Residual standard error: 0.8648 on 99 degrees of freedom
## Multiple R-squared:  0.001931,   Adjusted R-squared:  -0.00815 
## F-statistic: 0.1916 on 1 and 99 DF,  p-value: 0.6626