IEMS 304 Lab 2

training <-read.csv("training.csv", header = TRUE)
test <-read.csv("test.csv", header = TRUE)

y1 = training$y
y2 = test$y

Problem 1

plot(training)

Problem 2

fit1 = lm(y~., data = training)

summary(fit1)

## 
## Call:
## lm(formula = y ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.8630  -7.0947  -0.8172   7.3730  15.2612 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 117.49984  128.02381   0.918   0.3754  
## x1            2.87129    1.27512   2.252   0.0423 *
## x2           28.14110   11.73692   2.398   0.0322 *
## x3           -4.51207    4.24980  -1.062   0.3077  
## x4            0.09924    2.78472   0.036   0.9721  
## x5           -7.47554    4.10461  -1.821   0.0917 .
## x6          -16.02624   16.11081  -0.995   0.3380  
## x7           -0.18452    1.24310  -0.148   0.8843  
## x8            1.21552    7.30754   0.166   0.8705  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.35 on 13 degrees of freedom
## Multiple R-squared:  0.8048, Adjusted R-squared:  0.6846 
## F-statistic: 6.698 on 8 and 13 DF,  p-value: 0.00142

fit1_yhat1 = predict(fit1)
fit1_yhat2 = predict(fit1, test)

plot(y1, fit1_yhat1, col = "red", xlab = "y", ylab = "yhat")
points(y2, fit1_yhat2, col = "blue")

From the above scatter plot of the training and testing data, we can see that there appears to be a moderate relationship between y and yhat. The test data set is represented by red squares as the training data set is represented by the blue outlined triangles. As we can tell, the training data set seems to be more linearly positively correlated since the range is smaller and the points seem to align along a positively increasing line. There are a lot more testing data points for the testing data set. For the testing data set, the range is much larger and the standard deviation appears to be greater. We can also see that there are a few points that could be outliers. Specifically, we would need to analyze the points that are near (110, 230), (230, 200), and (90, 165) to determine if we should take them out of the testing data set.

Problem 3

fit1_predict <-predict(fit1, test, se.fit = T, level = 0.95, interval = "prediction")
lower <-data.frame(fit1_predict$fit)$lwr
upper <- data.frame(fit1_predict$fit)$upr
test3 = matrix(ncol = 1, nrow = 136)
for(i in 1:136){
  if(y2[i] <= upper[i] & y2[i] >= lower[i]) { test3[i] = 1 }
  else { test3[i]=0 }
}
c(test3)

##   [1] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
##  [36] 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1

The code above provides us with a vector that tells us whether the actual y-test values are within the 95% prediction interval of the predicted y test data. A value of 1 signifies that the y-test value is within the prediction interval, while a value of 0 signifies that the y-test value is not within the prediction interval.

The actual response values for the test data seem relatively consistent with the prediction intervals, Except for 14 data points, the data seems consistent with the prediction interval. These 14 exceptions are either below the lower limit given by the prediction interval or greater than the upper limit. Approximately, 10% or 14 out of 136 are inconsistent with the prediction interval. This percentage may decrease as n or the number of data points increases.

boxplot(y2-fit1_yhat2, ylab = "Prediction Error", main="Prediction Errors for Eight Predictor Regression")

Problem 4

influence(fit1)

## $hat
##         1         2         3         4         5         6         7 
## 0.4041407 0.2795710 0.3571954 0.2934941 0.4690378 1.0000000 0.5869309 
##         8         9        10        11        12        13        14 
## 0.2456556 0.2655125 0.3282329 0.5009312 0.4783123 0.6389695 0.5118806 
##        15        16        17        18        19        20        21 
## 0.5043554 0.4520656 0.1467014 0.2916595 0.4381696 0.3608521 0.2040632 
##        22 
## 0.2422688 
## 
## $coefficients
##     (Intercept)          x1          x2          x3           x4
## 1   -22.0686044  0.05527672  -1.7434975  1.17867217  0.087676747
## 2    11.7824018  0.41438709   1.4448364 -2.00274988 -0.640488748
## 3     2.4729111 -0.03179734  -1.3779075  0.49873125  0.545388278
## 4    -1.1625211 -0.02992176  -2.2020554  0.60121279 -0.040178053
## 5  -101.4093446  1.78198789 -14.1496440 -5.33979251 -4.440022916
## 6     0.0000000  0.00000000   0.0000000  0.00000000  0.000000000
## 7   -16.6718542  0.07571375   0.8321214  0.43865994  0.699649840
## 8   -10.3606443 -0.03400395  -0.7779314  0.08488997  0.992518272
## 9     8.4520092 -0.04292018   1.5877360 -0.30902114  0.464929823
## 10   47.1958070 -0.43750322   2.6991114 -0.17836953  0.636658873
## 11  -16.7955347 -0.15660847  -1.2303658  0.48054547  0.737194945
## 12   22.6027850 -0.13233193   1.5367574 -0.37029764  0.005677094
## 13    4.8263265  0.11579012  -1.2621619  0.15758263  0.097202805
## 14  -50.8458045  0.08237629  -4.8009316  2.70392342 -0.009127930
## 15    7.3495162 -0.94985257   7.9797244  3.06800231  0.416448118
## 16  -25.7772018  0.24866756  -0.2706580 -0.91819365 -0.307529430
## 17    0.7513507 -0.04398994   1.6044470 -0.27355734 -0.094813831
## 18  -15.1936523  0.19428686  -1.5003447 -0.10223713 -0.402046006
## 19    5.4725569 -0.12110619   0.0887398  0.08343854 -0.092681066
## 20   67.5864262 -0.47367617   2.8091377 -0.03466579 -0.200270406
## 21   25.4227561 -0.25255525   2.4123581  0.23524796 -0.097537169
## 22   10.9399461 -0.05915243   2.5722446  0.77195640  1.084265853
##             x5           x6           x7         x8
## 1   0.76926431   2.46500373 -0.441365345  1.3738900
## 2  -1.53380373  -3.53967351  0.486596365 -2.6392430
## 3  -0.05377553   1.02070942 -0.264935770  0.3204748
## 4   0.17981917   0.40269451 -0.105547925  0.1631656
## 5   0.75125639 -10.94445876  1.417295874 -1.4908853
## 6   0.00000000   0.00000000  0.000000000  0.0000000
## 7   0.27247449   2.13054313 -0.001443597 -0.1477861
## 8   0.29646358   2.73450334  0.126852044  1.7402116
## 9  -0.11874178  -2.31727259 -0.371197080 -1.9912310
## 10 -0.98906563   0.05947887 -0.153482482  2.7594600
## 11  1.10290663   3.07506536  0.042172199 -0.1705749
## 12 -0.56949700  -1.54908374 -0.063885657 -0.1131421
## 13 -0.61778797  -0.19009763 -0.002454045  0.6907727
## 14  1.91404093   1.72544488 -0.060781891 -2.6861837
## 15  2.10292033   3.21886704 -1.122940070  1.8416164
## 16  0.71326041   0.15713790 -0.443392244  0.9020965
## 17  0.11283438  -0.07451577  0.131687969 -1.2508976
## 18  0.31817191  -1.90551291 -0.080553129 -1.6328815
## 19  0.09057782   0.13937810  0.080492102  0.5586699
## 20 -1.95200416   1.38801901  0.361178275  4.7035307
## 21 -0.44969639  -0.33459066  0.009230422 -0.3825024
## 22 -0.78388664   2.90808393  0.249514100 -1.4287115
## 
## $sigma
##         1         2         3         4         5         6         7 
## 12.642056 12.090358 12.701448 12.717304  9.491406 12.849375 12.798875 
##         8         9        10        11        12        13        14 
## 12.180799 12.306915 12.537277 12.667233 12.819128 12.825359 12.443642 
##        15        16        17        18        19        20        21 
## 11.597460 12.485877 12.545997 12.728824 12.829752 12.081737 12.536888 
##        22 
## 11.810680 
## 
## $wt.res
##          1          2          3          4          5          6 
##   6.147222  12.792572   5.399532   5.350428 -21.863009   0.000000 
##          7          8          9         10         11         12 
##  -2.533808 -12.307860 -10.967046  -7.991864   5.275790  -2.204621 
##         13         14         15         16         17         18 
##  -1.634401  -7.753073 -13.491942   7.781593   8.881985  -5.119546 
##         19         20         21         22 
##  -1.843163  12.115471   8.704525  15.261214

We would mark any point as high leverage if it satisfies the condition h (ii) > 2(k+1)/n. Our h(ii) for this data set would be 9/11 or 0.818. Point 6 would be considered high leverage since it has an h(ii) level of 1. When we look at data point 6, we can see that this person may be high leverage since he is a male that weigh 170 and is 72 inches tall. He appears to be abnormal since he is heavy for his weight.

cooks.distance(fit1)

##           1           2           3           4           5           6 
## 0.031358733 0.064265959 0.018374510 0.012271577 0.579770559         NaN 
##           7           8           9          10          11          12 
## 0.016100690 0.047676898 0.043156997 0.033868524 0.040811989 0.006227502 
##          13          14          15          16          17          18 
## 0.009546962 0.094149938 0.272458519 0.066471726 0.011587979 0.011107352 
##          19          20          21          22 
## 0.003438069 0.094528505 0.017793169 0.071647470

We would mark any point as high influence if it satisfies the condition D(i) > 1 or D(i) > 4/n . Our D(i) for this data set would be 4/22 or 0.18. Points 5 and 15 would be considered high influence since they have D(i) levels of 0.5798 and 0.2725 respectively. In point 5 in the data set, we see that the person is a 135 pound male who is only 68 inches tall. We can assume that he is abnormal since he is short for a male. Point 15 is a 123 pound female who is 67 inches tall. From this, we assume that it is high influence since she is tall for a female.

Problem 5

plot(predict(fit1), rstandard(fit1))

The above standardized residual versus the fitted values plot shows us that there may be a slight increasing trend since the data seems to fan out, but it appears to be independent and random. ### Problem 6

fit2 = lm(y~x1+x2, data = training)

summary(fit2)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.851  -9.814   1.855   7.965  18.782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -74.249     65.003  -1.142  0.26754   
## x1             3.092      1.016   3.044  0.00667 **
## x2            17.866      9.467   1.887  0.07452 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.74 on 19 degrees of freedom
## Multiple R-squared:  0.7421, Adjusted R-squared:  0.715 
## F-statistic: 27.34 on 2 and 19 DF,  p-value: 2.559e-06

fit2_yhat1 = predict(fit2)
fit2_yhat2 = predict(fit2, test)

plot(y1, fit2_yhat1, col = "red", xlab = "y", ylab = "yhat")
points(y2, fit2_yhat2, col = "blue")

From the above scatter plot of the training and testing data, we can see that there still appears to be a moderate relationship between y and yhat. Again, the test data set is represented by red squares as the training data set is represented by the blue outlined triangles. Except for 16 data points, the data seems consistent with the prediction interval. These 16 exceptions are either below the lower limit given by the prediction interval or greater than the upper limit. Approximately, 12% or 16 out of 136 are inconsistent with the prediction interval. This percentage may decrease as n or the number of data points increases.

We would mark any point as high leverage if it satisfies the condition h (ii) > 2(k+1)/n. Our h(ii) for this data set would be 3/11 or 0.273. Point 13 would be considered high leverage since it has an h(ii) level of 0.332, as well as point 15 which has a h(ii) level of 0.277. When we look at data point 13, we can see that this person may be high leverage since he is a male that weigh 135, which is a lot less than all other males and for point 15 we see the same ordeal for a female. For this reason it is evident that these points are high leverage as is validated by the numbers.

We would mark any point as high influence if it satisfies the condition D(i) > 1 or D(i) > 4/n. Our D(i) for this data set would be 4/22 or 0.18. Once again, since no data point is above a 0.18 in influence we can assume that none of the data points are influential.

plot(predict(fit2), rstandard(fit2))

The above standardized residual versus the fitted values plot is very similar to the previous fit and still indicates that there could be a non-constant variance.

boxplot((y2-fit1_yhat2),(y2-fit2_yhat2), names=c("Eight Predictors","Two Predictors"), ylab = "Prediction Error", main="Prediction Errors for Test Data")

The first data set has a lower first quartile, a higher third quartile, and a wider variance. It also appears to have a mildoutlier around 60 and an extreme outlier above 110. The kurtosis of the distribution is larger, which signifies a wider peak. The distribution seems to be relatively normal, but may be mildly skewed right. The second distribution has a low variance and more outliers above the highest value and below the lowest value.

Problem 7

Alt

Problem 8

After this analysis, we would choose to use the two predictor model if we had to choose one to predict weights. From the Partial F analysis, we found that we could not reject that any of the - differed significantly from 0. This tells us that they had very little effect in the prediction of the response weight.

This can be somewhat verified by looking at the two scatterplots of predicted ̂vs true y values. Theoretically, this should be a 45 degree line from the origin (y=x) since the goal of the predictors is to accurately represent the actual value. We see that the two predictor model is significantly less variation in the test data points, indicating that it is a better model. Also, the comparative boxplots confirmed there was less variation in the two predictor model since the body of the plot was much smaller than the 8 predictor model.

When we think about it, this makes sense because predictors 3-8 either intuitively seem like they would not have an effect on weight (ex. last digit of home phone number, will Cubs win World Series) or should be collinear and related to the strong predictor variables (ex. age), thus also not having a large effect on the response.

From this lab, we were able to see why it is important to choose relevant predictors and not just throw in all the data you have. Overfitting is a danger that we must be aware of and by doing so, we will be able to improve the results and effectiveness of our models.