Machine Learning Problem Set 1

Hannah Hon

auto = read.table('https://www-bcf.usc.edu/~gareth/ISL/Auto.data', header = TRUE, na.strings = '?')
auto$origin = factor(auto$origin, 1:3, c('US','Europe','Japan'))
head(auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70     US
## 2  15         8          350        165   3693         11.5   70     US
## 3  18         8          318        150   3436         11.0   70     US
## 4  16         8          304        150   3433         12.0   70     US
## 5  17         8          302        140   3449         10.5   70     US
## 6  15         8          429        198   4341         10.0   70     US
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
fit = lm(mpg ~ horsepower, auto)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
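plot(mpg ~ horsepower, auto)  # scatterplot of the data

abline(fit)                   # overlay the fitted least-squares line

plot(fit)                     # diagnostic plots used in part k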

## a. mpg = 39.935861 - 0.157845 * horsepower
## b. The slope shows that as horsepower increases by 1 unit, mpg decreases by 0.158 units on average.
## c. The standard error of the slope is 0.006446 (0.717 is the standard error of the intercept).
## d. The residual standard error shows that the observed mpg deviates from the predicted mpg by about 4.906 on average.
## e. There is a significant relationship between horsepower and mpg: the p-value of the slope is below 2e-16, far less than 0.05, and the F-test is significant.
## f. 60.59% of the variation in mpg can be explained by horsepower.
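The figures quoted in parts c, e and f can be cross-checked by hand. The sketch below hard-codes values printed elsewhere in this document (it does not refit the model), so small rounding differences from the printed summary are expected. The variable names are illustrative only.

```r
# Values copied from the printed summary(fit) output and from the
# mpg/horsepower entry of the correlation matrix printed by cor(auto[,1:7])
slope    <- -0.157845   # estimate for horsepower
se_slope <-  0.006446   # standard error of that estimate
r_xy     <- -0.7784     # correlation between mpg and horsepower

t_slope <- slope / se_slope   # t value = estimate / standard error
f_stat  <- t_slope^2          # with a single predictor, F equals t squared
r2      <- r_xy^2             # with a single predictor, R-squared equals cor^2

print(round(c(t = t_slope, F = f_stat, R2 = r2), 4))
```

The three results should agree with the printed t value, F-statistic and Multiple R-squared up to rounding of the inputs.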
predict(fit, data.frame(horsepower = 98))
##        1 
## 24.46708
predict(fit, data.frame(horsepower = 98), interval="prediction", level = 0.95)
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
predict(fit, data.frame(horsepower = 98), interval="confidence", level = 0.99)
##        fit      lwr      upr
## 1 24.46708 23.81669 25.11747
confint(fit, 'horsepower', level = 0.90)
##                   5 %       95 %
## horsepower -0.1684719 -0.1472176
## g. 39.935861 - 0.157845 * 98 = 24.467
## h. The 95% prediction interval is (14.8094, 34.12476).
## i. The 99% confidence interval is (23.81669, 25.11747).
## j. The 90% confidence interval for the slope is (-0.1684719, -0.1472176).
## k. The diagnostic plots show non-linearity in the data, but the residuals still look homoscedastic.
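As a rough check of parts g and j, the point prediction and the slope interval can be rebuilt from the printed coefficient table. This is a sketch that assumes the usual t-based interval (estimate plus or minus a t quantile times the standard error); the variable names are made up for the illustration.

```r
# Values copied from the printed summary(fit) output
b0       <- 39.935861   # intercept
b1       <- -0.157845   # slope for horsepower
se_b1    <-  0.006446   # standard error of the slope
df_resid <- 390         # residual degrees of freedom

# Part g: point prediction at horsepower = 98
pred_98 <- b0 + b1 * 98

# Part j: 90% interval, so 5% in each tail
t_crit   <- qt(0.95, df_resid)
ci_slope <- b1 + c(-1, 1) * t_crit * se_b1

print(pred_98)
print(ci_slope)
```

The prediction and confidence intervals for mpg in parts h and i also need the mean and spread of horsepower, which the summary does not print, so they are left to predict().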
Problem 2
plot(auto, pch = '.') 

round(cor(auto[,1:7], use = 'pair'),4)
##                  mpg cylinders displacement horsepower  weight
## mpg           1.0000   -0.7763      -0.8044    -0.7784 -0.8317
## cylinders    -0.7763    1.0000       0.9509     0.8430  0.8970
## displacement -0.8044    0.9509       1.0000     0.8973  0.9331
## horsepower   -0.7784    0.8430       0.8973     1.0000  0.8645
## weight       -0.8317    0.8970       0.9331     0.8645  1.0000
## acceleration  0.4223   -0.5041      -0.5442    -0.6892 -0.4195
## year          0.5815   -0.3467      -0.3698    -0.4164 -0.3079
##              acceleration    year
## mpg                0.4223  0.5815
## cylinders         -0.5041 -0.3467
## displacement      -0.5442 -0.3698
## horsepower        -0.6892 -0.4164
## weight            -0.4195 -0.3079
## acceleration       1.0000  0.2829
## year               0.2829  1.0000
fit = lm(mpg ~ ., auto[,1:8])
plot(fit)

summary(fit)
## 
## Call:
## lm(formula = mpg ~ ., data = auto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders    -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement  2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower   -1.818e-02  1.371e-02  -1.326 0.185488    
## weight       -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration  7.910e-02  9.822e-02   0.805 0.421101    
## year          7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEurope  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapan   2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16
## a. The correlation between mpg and displacement is -0.8044, so there is a strong negative linear relationship between displacement and mpg.
## b. They are highly and negatively correlated, which means that higher displacement is associated with lower mpg.
## c. Yes, because the p-value of the F-statistic is less than 2.2e-16.
## d. Displacement, weight, year, originEurope, and originJapan have statistically significant relationships with mpg (p-values below 0.05).
## e. It suggests that a one-unit increase in year is associated with a 0.777 (7.770e-01) increase in mpg on average, holding other predictors constant.
## f. The slope coefficient for displacement means that as displacement increases by 1 unit, mpg increases by 0.02398 on average, holding other variables constant.
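The overall F-test cited in part c is tied directly to R-squared. The sketch below recomputes the F-statistic and the adjusted R-squared from the printed summary values, assuming the standard formulas for a linear model with an intercept; the variable names are illustrative.

```r
# Values copied from the printed summary(fit) output for the full model
r2       <- 0.8242   # Multiple R-squared
p        <- 8        # number of slope coefficients (origin contributes two dummies)
df_resid <- 383      # residual degrees of freedom

n <- df_resid + p + 1   # observations actually used in the fit

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(c(n = n, F = round(f_stat, 1), adj_R2 = round(adj_r2, 4)))
```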
Problem 3
fit2 = lm(mpg ~ cylinders + displacement+ weight+ year, auto)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + weight + year, 
##     data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.995 -2.270 -0.165  2.053 14.368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -14.076941   4.055159  -3.471 0.000575 ***
## cylinders     -0.289589   0.329225  -0.880 0.379611    
## displacement   0.004973   0.006701   0.742 0.458425    
## weight        -0.006702   0.000572 -11.717  < 2e-16 ***
## year           0.764751   0.050684  15.089  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.436 on 392 degrees of freedom
## Multiple R-squared:  0.8091, Adjusted R-squared:  0.8072 
## F-statistic: 415.5 on 4 and 392 DF,  p-value: < 2.2e-16
library(car)
## Loading required package: carData
vif(fit2)
##    cylinders displacement       weight         year 
##    10.524432    16.406259     7.888061     1.173000
fit3 = lm(mpg ~ cylinders + displacement+ year, auto)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + year, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0801  -2.6445  -0.2925   2.1004  14.9103 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -18.199719   4.688296  -3.882 0.000122 ***
## cylinders     -0.620910   0.380657  -1.631 0.103658    
## displacement  -0.041545   0.006265  -6.632  1.1e-10 ***
## year           0.699324   0.058461  11.962  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.988 on 393 degrees of freedom
## Multiple R-squared:  0.7423, Adjusted R-squared:  0.7403 
## F-statistic: 377.3 on 3 and 393 DF,  p-value: < 2.2e-16
fit4 = lm(mpg ~ cylinders + year, auto)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ cylinders + year, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6462  -2.8847  -0.1399   2.5095  15.6875 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.30285    4.93534  -3.506 0.000507 ***
## cylinders    -3.00405    0.13223 -22.718  < 2e-16 ***
## year          0.75289    0.06098  12.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.2 on 394 degrees of freedom
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.712 
## F-statistic: 490.5 on 2 and 394 DF,  p-value: < 2.2e-16
## a. The coefficient of weight is negative and the coefficient of year is positive; both have p-values less than 0.001, indicating that the two estimated coefficients are significantly different from 0. The coefficient of cylinders is negative and the coefficient of displacement is positive, with p-values of 0.380 and 0.458, indicating that the two estimated coefficients are not significantly different from 0. The value of R2 is 0.8091, which means that 80.91% of the variation in mpg can be explained by cylinders, displacement, weight and year combined.
## b. The VIFs for cylinders and displacement are both larger than 10, which indicates
## high multicollinearity. Multicollinearity inflates the variance of the estimated coefficients,
## so the estimates are imprecise and a predictor can appear insignificant even when it is related to mpg.
## c. The estimated coefficient of displacement became significantly different from 0. Its absolute value increased (0.004973 to 0.041545) and its sign changed from positive to negative, while the coefficient of year decreased (0.764751 to 0.699324). The R-squared is 0.7423 (adjusted 0.7403), which means that 74.23% of the variation in mpg can be explained by cylinders, displacement and year together.
## d. The estimated coefficient of cylinders became significantly different from 0, and its absolute value also increased (0.620910 to 3.00405). The R-squared is 0.7135, which means that 71.35% of the variation in mpg can be explained by cylinders and year together.
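The VIF in part b has a simple definition: VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the other predictors. The sketch below first inverts that formula for the printed VIFs, then computes VIFs by hand in base R. The helper vif_manual is hypothetical (it is not part of car), and the built-in mtcars data is used only as a stand-in so the block runs without downloading the Auto data.

```r
# VIFs copied from the printed vif(fit2) output
vif_reported <- c(cylinders = 10.524432, displacement = 16.406259,
                  weight = 7.888061, year = 1.173000)

# Share of each predictor's variance explained by the other three predictors
r2_implied <- 1 - 1 / vif_reported
print(round(r2_implied, 3))

# Hypothetical helper: regress each column on all the others, apply the formula
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example on mtcars (an assumption, not the Auto data)
vif_demo <- vif_manual(mtcars[, c("cyl", "disp", "wt")])
print(round(vif_demo, 2))
```

For numeric predictors like these, vif_manual should give the same values as car::vif applied to a model with the same predictors.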

Problem 4

set.seed(1)
x1=runif(100)
x2=0.5*x1+rnorm(100)/10
y=2+2*x1+0.3*x2+rnorm(100)
cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)

a = data.frame(y, x1, x2)  # keep the response in the same data frame as the predictors
fit4 <- lm(y ~ ., a)
summary(fit4)
## 
## Call:
## lm(formula = y ~ ., data = a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
## Fitted model: y = 2.1305 + 1.4396 * x1 + 1.0097 * x2 (the simulation used intercept 2, x1 coefficient 2 and x2 coefficient 0.3)
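One way to see what the correlation between x1 and x2 does to the fit is to compare the standard error of the x1 coefficient with and without x2 in the model. The sketch below repeats the simulation above with the same seed; the names fit_both, fit_x1, se_both and se_x1 are introduced only for this illustration.

```r
# Same simulation as above
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit_both <- lm(y ~ x1 + x2)   # both collinear predictors
fit_x1   <- lm(y ~ x1)        # x1 alone

# Standard error of the x1 coefficient in each model
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
se_x1   <- coef(summary(fit_x1))["x1", "Std. Error"]

print(c(both = se_both, x1_only = se_x1))
```

If the standard error is noticeably smaller with x1 alone, that is consistent with collinearity inflating the uncertainty of the individual coefficients in the two-predictor model.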