suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("tidyr"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("car"))
suppressPackageStartupMessages(library("gridExtra"))
Write down the \(X\) matrix and the \([X'X]^{-1}\) for this problem.
\[ \begin{align} X &= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} \end{align} \]
\[ (X'X)^{-1} = \frac{1}{\sum x_i^2} \]
Show that the entries of the hat matrix \(H = X[X'X]^{-1}X'\) are \(h_{ij} = \frac{x_ix_j}{\sum_{k=1}^nx_k^2}\), \(1 \le i, j \le n\).
\[(X'X)^{-1} = \frac{1}{\sum_{k=1}^n x_k^2}\]
Since this is a scalar, the \(j\)-th entry of the row vector \((X'X)^{-1}X'\) is \(\frac{x_j}{\sum_{k=1}^n x_k^2}\), and premultiplying by \(X\) gives
\[h_{ij} = \left[X(X'X)^{-1}X'\right]_{ij} = \frac{x_i x_j}{\sum_{k=1}^n x_k^2}, \qquad 1 \le i, j \le n.\]
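As a quick sanity check (a sketch with simulated data, not part of the required derivation), the entry-wise formula can be verified numerically for the no-intercept model:

set.seed(1)                              # simulated x values, for illustration only
x <- rnorm(5)
X <- matrix(x, ncol=1)                   # design matrix for the no-intercept model
H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix computed directly
all.equal(H, outer(x, x) / sum(x^2))     # matches h_ij = x_i * x_j / sum(x_k^2)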
Using part (b), write down the expression for the leverage coefficient \(h_{ii}\), and provide an interpretation of why it measures how unusual \(x_i\) is.
\[h_{ii} = \frac{x_i^2}{\sum_{k=1}^n x_k^2}\]
The leverage \(h_{ii}\) is the fraction of the total predictor magnitude \(\sum_k x_k^2\) contributed by observation \(i\). In this no-intercept model, an \(x_i\) that is unusually far from 0 relative to the other observations contributes a large share of that sum, so its \(h_{ii}\) is close to 1; a typical observation has \(h_{ii}\) near the average leverage, which is \(1/n\) here (and \((k+1)/n\) in general for a model with an intercept and \(k\) predictors), because the leverages sum to the number of fitted parameters.
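This relationship between the leverages and the number of parameters can be checked on the simulated no-intercept example above, which has a single fitted parameter:

sum(diag(H))     # leverages sum to the number of fitted parameters (1 here)
mean(diag(H))    # so the average leverage is 1/n, i.e. (k+1)/n in general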
Explain why an observation with large \(h_{ii}\) would be more likely to be influential (use an appropriate sketch to argue your point).
If a given data point is moved up or down, the corresponding \(\hat{y_i}\) will move proportionally to the change in \(y_i\). The proportionality constant is called the leverage and is given by \(h_{ii}\), the \(i\)-th diagonal entry of the hat matrix.
The leverage of the data point measures the impact that \(y_i\) has on \(\hat{y_i}\).
The further away \(x_i\) is from \(\bar{x}\), the larger \(h_{ii}\), and therefore the more sensitive \(\hat{y_i}\) is to changes in \(y_i\).
So points with very large or very small \(x\) values (compared to \(\bar{x}\)) have more leverage than points with intermediate \(x\) values.
If a point with high leverage also happens to lie far from the least squares line that would be fitted to the remaining data points, then we say it has high influence.
A point with high leverage is more likely to be highly influential because the fit is so sensitive to its \(y\) value: if that value deviates from the trend of the remaining observations, the high-leverage point pulls the fitted line toward itself and shifts the coefficients substantially. Thus points with very high or very low leverage tend to also have high influence, though not necessarily; a high-leverage point that falls near the line fitted to the other points barely changes the fit.
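A sketch along these lines can be generated with simulated data (the numbers below are illustrative only, not from the assignment):

set.seed(2)
sketch <- data.frame(x = c(rnorm(10), 6))        # one point far to the right (high leverage)
sketch$y <- 2 + 3*sketch$x + rnorm(11)
sketch$y[11] <- sketch$y[11] - 10                # pull that point off the trend
ggplot(sketch, aes(x=x, y=y)) + geom_point() +
  geom_smooth(method="lm", se=FALSE, colour="black") +                 # fit with the point
  geom_smooth(data=sketch[-11,], method="lm", se=FALSE, colour="red")  # fit without it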
The sketch above shows the black regression line fitted to all the points and the red line, which shows what the regression line would look like without the high-leverage point.
Make a scatter plot of the two predictors.
prob2 <- read.csv("prob2.csv", header=TRUE)
ggplot(prob2, aes(x=Gravity, y=Moisture)) + geom_point() + geom_text(label=rownames(prob2), hjust=-1)
Which observation appears to be high leverage?
We can see that beam number 4 with gravity of 0.441 and moisture of 8.9 is an outlier in the data set and appears to have high leverage.
Using R, manually calculate the hat matrix H for these data, and obtain the leverage coefficients from H. Verify that these are the same as the leverage coefficients produced by the influence.lm command.
X <- as.matrix(cbind(1, prob2[c("Gravity", "Moisture")]))
H <- X %*% solve(t(X) %*% X) %*% t(X)
diag(H)
## [1] 0.4178935 0.2418666 0.4172806 0.6043904 0.2521824 0.1478688 0.2616385
## [8] 0.1540321 0.3155106 0.1873364
influence(lm(Strength~Gravity+Moisture, data=prob2), do.coef=TRUE)$hat
## 1 2 3 4 5 6 7
## 0.4178935 0.2418666 0.4172806 0.6043904 0.2521824 0.1478688 0.2616385
## 8 9 10
## 0.1540321 0.3155106 0.1873364
Is the same observation found to be high leverage in part (a) identified as high leverage using the rule of thumb from lecture?
Rule of thumb: \(h_{ii} > 2(k+1)/n = 2(2+1)/10 = 0.6\). Only beam 4 exceeds this cutoff, so it is the only high-leverage observation by the rule of thumb, just as in part (a).
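The same comparison can be made directly in R; this simply applies the threshold to the leverages computed above:

which(diag(H) > 2*(2+1)/nrow(prob2))    # observations with leverage above 2(k+1)/n = 0.6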
Find Cook’s distance for each observation.
cooks.distance(lm(Strength~Gravity+Moisture, data=prob2))
## 1 2 3 4 5
## 1.069240e+00 8.723020e-03 9.208950e-03 4.756415e-01 1.238171e-01
## 6 7 8 9 10
## 1.810604e-01 3.418372e-02 1.303120e-02 1.288740e-02 9.905523e-05
Are any observations influential according to the two rules-of-thumb for Cook’s distance given in lecture?
The rules of thumb for identifying influential observations are Cook's distance greater than 1 and greater than \(4/n = 0.4\). Beam 1 exceeds both thresholds and beam 4 exceeds the \(4/n\) threshold, so beams 1 and 4 are identified as highly influential.
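Applying both thresholds directly (a convenience check on the Cook's distances shown above):

cd <- cooks.distance(lm(Strength~Gravity+Moisture, data=prob2))
which(cd > 4/nrow(prob2))    # exceeds 4/n = 0.4: beams 1 and 4
which(cd > 1)                # exceeds 1: beam 1 only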
Fit the model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\) based on all 10 observations and compare it with the fit obtained after omitting the high-leverage observation.
model1 <- lm(Strength~Gravity+Moisture, data=prob2)
summary(model1)
##
## Call:
## lm(formula = Strength ~ Gravity + Moisture, data = prob2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44422 -0.12780 0.05365 0.10521 0.44985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3015 1.8965 5.432 0.000975 ***
## Gravity 8.4947 1.7850 4.759 0.002062 **
## Moisture -0.2663 0.1237 -2.152 0.068394 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2754 on 7 degrees of freedom
## Multiple R-squared: 0.9, Adjusted R-squared: 0.8714
## F-statistic: 31.5 on 2 and 7 DF, p-value: 0.0003163
# omit beams 1 and 4, the observations flagged as influential above
model2 <- lm(Strength~Gravity+Moisture, data=prob2[c(-1,-4),])
summary(model2)
##
## Call:
## lm(formula = Strength ~ Gravity + Moisture, data = prob2[c(-1,
## -4), ])
##
## Residuals:
## 2 3 5 6 7 8 9 10
## 0.09143 -0.02819 -0.18100 0.29226 0.04599 -0.07148 0.06689 -0.21590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.558277 2.953375 2.221 0.0771 .
## Gravity 11.058974 2.349310 4.707 0.0053 **
## Moisture -0.009058 0.187776 -0.048 0.9634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1927 on 5 degrees of freedom
## Multiple R-squared: 0.9586, Adjusted R-squared: 0.942
## F-statistic: 57.87 on 2 and 5 DF, p-value: 0.000349
Does the fit change much? Is this consistent with part (c)?
The fit changes substantially: the Gravity coefficient rises from 8.49 to 11.06, the Moisture coefficient shrinks from \(-0.266\) (\(p \approx 0.07\)) to \(-0.009\) (\(p = 0.96\)) and is no longer even marginally significant, and the residual standard error drops from 0.275 to 0.193. This is consistent with part (c): the omitted beams (1 and 4) are exactly the ones flagged as influential by Cook's distance, so removing them was expected to move the fit noticeably. The comparison below puts the two coefficient vectors side by side.
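A convenience snippet (not required by the problem) to view the two fits together:

# coefficients with all 10 beams vs. without the influential beams 1 and 4
cbind(all_obs = coef(model1), omit_1_4 = coef(model2))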
What R command will automatically tell you how much the coefficients change when an observation is removed from the data set?
influencePlot(model1)
## StudRes Hat CookD
## 1 -3.254079 0.4178935 1.0340406
## 4 -0.961169 0.6043904 0.6896676
influence(model1)$coefficients
## (Intercept) Gravity Moisture
## 1 2.70978830 -1.77234453 -0.193179409
## 2 0.04577956 0.08223499 -0.007837448
## 3 -0.06863025 0.18870882 -0.001792734
## 4 -2.10913390 1.69549658 0.124227858
## 5 -0.36721933 -0.13195343 0.040396826
## 6 -0.64229159 0.74812336 0.032863049
## 7 0.12027205 -0.30818178 0.005015497
## 8 -0.15285801 0.06422939 0.013657326
## 9 0.16688287 -0.25503014 -0.003135174
## 10 0.01292332 -0.00297048 -0.001270209
influencePlot(model2)
## StudRes Hat CookD
## 5 -1.3478301 0.4349312 0.6329696
## 6 2.5694835 0.2611370 0.6056525
## 9 0.4795372 0.5566941 0.3373147
influence(model2)$coefficients
## (Intercept) Gravity Moisture
## 2 0.5211812 -0.2271628 -0.039663013
## 3 0.1442963 -0.2045347 -0.004877316
## 5 -2.1875246 1.2701170 0.153368478
## 6 -2.1320648 1.7763916 0.131052814
## 7 0.1793832 -0.2299714 -0.005630012
## 8 0.5693009 -0.3819653 -0.039646614
## 9 1.1608675 -1.0936896 -0.060269901
## 10 2.1033311 -1.3409835 -0.149585471
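Equivalently, the dfbeta() and dfbetas() helpers in base R's stats package report the same per-observation coefficient changes, so either can be used in place of influence()$coefficients:

dfbeta(model1)     # same coefficient changes as influence(model1)$coefficients
dfbetas(model1)    # standardized version of the same quantities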
Which fitted model would you use to predict the strength of a new wood beam?
In general, the model fitted without the influential observations should be used to predict the strength of a new beam.
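For example, a prediction for a hypothetical new beam would use the refit without the influential observations (the Gravity and Moisture values below are made up for illustration):

new_beam <- data.frame(Gravity=0.5, Moisture=9)            # hypothetical predictor values
predict(model2, newdata=new_beam, interval="prediction")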
prob3 <- read.csv("prob3.csv", header=TRUE)
model2 <- lm(y~x1+x2, prob3)
Plot the residuals from the fitted linear model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\) versus \(x_1\) and \(x_2\).
# gather() stacks x1 and x2 into long form, so the standardized residuals are
# repeated once for each predictor panel
ggplot(gather(prob3, var, x, -y), aes(x=x, y=c(rstandard(model2), rstandard(model2)))) +
  geom_point() + facet_wrap(~var)
Do these plots suggest the need to fit a quadratic model like the one you fitted in Problem 5 of HW 2?
Both plots have a curved pattern with negative residuals at the extremes and positive residuals in the middle. This suggests that a quadratic model might be more appropriate.
Plot the residuals from the quadratic fit versus \(x_1\), and \(x_2\). Plot the standardized residuals versus \(\hat{y}\).
model3 <- lm(y~x1+x2+I(x1^2)+I(x2^2), prob3)
ggplot(gather(prob3, var, x, -y), aes(x=x, y=c(rstandard(model3), rstandard(model3)))) +
geom_point() + facet_wrap(~var)
There is no longer any pattern in the residuals.
Suggest a suitable transformation to remove the non-constant variance exhibited in the plot of residuals versus \(\hat{y}\).
grid.arrange(
  # residuals vs fitted values: the spread increases with the fitted value
  qplot(x=predict(model3), y=model3$residuals) + geom_point(),
  # log of the residuals: negative residuals produce NaNs (hence the warnings below)
  qplot(x=predict(model3), y=log(model3$residuals)) + geom_point(),
  ncol=2
)
## Warning in log(model3$residuals): NaNs produced
## Warning in log(model3$residuals): NaNs produced
## Warning in log(model3$residuals): NaNs produced
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing missing values (geom_point).
There is non-constant variance in this dataset: the spread of the residuals increases as the predicted value increases. The variance heterogeneity can possibly be removed by a log transformation of the response or by weighted least squares, although WLS requires knowing specifically what drives the non-constant variance, since the weights must be chosen to correct for it.
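A minimal sketch of both options, assuming the response is strictly positive (for the log) and that the error standard deviation grows roughly in proportion to the mean (for the weights):

# option 1: log-transform the response and refit the quadratic model
model3_log <- lm(log(y) ~ x1 + x2 + I(x1^2) + I(x2^2), data=prob3)
# option 2: weighted least squares, with weights 1/fitted^2 as one common choice
# when the residual spread is proportional to the fitted value
model3_wls <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2), data=prob3,
                 weights=1/fitted(model3)^2)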
prob4 <- read.csv("prob4.csv", header=TRUE)
Fit the linear regression model \(\log(y) = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \epsilon\) using natural logs.
model4 <- lm(log(Mercury)~log(Alkalin)+log(calcium), prob4)
summary(model4)
##
## Call:
## lm(formula = log(Mercury) ~ log(Alkalin) + log(calcium), data = prob4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78254 -0.27777 -0.01785 0.20442 1.25335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.20502 0.18127 39.748 < 2e-16 ***
## log(Alkalin) -0.50784 0.09941 -5.109 1.15e-05 ***
## log(calcium) 0.13584 0.10158 1.337 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4076 on 35 degrees of freedom
## Multiple R-squared: 0.5999, Adjusted R-squared: 0.577
## F-statistic: 26.24 on 2 and 35 DF, p-value: 1.091e-07
Plot the residuals from this fit against \(\log(x_1)\), \(\log(x_2)\), and the fitted values (for \(\log y\)).
# gather() stacks Mercury, Alkalin, and calcium, so the residuals are repeated once per
# panel; note the Mercury panel plots the residuals against log(Mercury) itself
ggplot(gather(prob4, var, x) %>% filter(var != "pH"), aes(x=log(x), y=c(resid(model4), resid(model4), resid(model4)))) +
  geom_point() + facet_wrap(~var, scales="free_x")
Comment on the shapes. Do any plots suggest that a further transformation is needed? Explain.
Except for the large outlier, the plots don't show any unusual patterns. The variance is stable, so no further transformation is needed.
The variable x3 = pH was omitted in part (a). Plot the residuals versus x3.
ggplot(gather(prob4, var, x) %>% filter(var == "pH"), aes(x=x, y=c(rstandard(model4)))) +
geom_point() + facet_wrap(~var, scales="free_x")
Explain whether the plot indicates that pH should be included in the model. Is this result consistent with Problem 5 below regarding which predictors should be included in the model?
Although there is no unusual pattern in this residual plot, there is no reason to include pH in the model because it contributes little to the fit. This conclusion is consistent with Problem 5 below regarding which predictors should be included in the model.
The confidence interval for \(\beta_3\) and the extra sum of squares F-test from Problem 5 show that pH can be omitted from the model.

### Problem 5
model5 <- lm(log(Mercury)~log(Alkalin)+log(calcium)+pH, data=prob4)
summary(model5)
##
## Call:
## lm(formula = log(Mercury) ~ log(Alkalin) + log(calcium) + pH,
## data = prob4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75244 -0.30191 -0.00783 0.23852 1.22932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.55983 0.48981 15.434 < 2e-16 ***
## log(Alkalin) -0.45880 0.11807 -3.886 0.000449 ***
## log(calcium) 0.14702 0.10315 1.425 0.163185
## pH -0.07998 0.10248 -0.780 0.440527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4099 on 34 degrees of freedom
## Multiple R-squared: 0.6069, Adjusted R-squared: 0.5723
## F-statistic: 17.5 on 3 and 34 DF, p-value: 4.808e-07
anova(model5)
## Analysis of Variance Table
##
## Response: log(Mercury)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(Alkalin) 1 8.4211 8.4211 50.1241 3.541e-08 ***
## log(calcium) 1 0.2971 0.2971 1.7685 0.1924
## pH 1 0.1023 0.1023 0.6091 0.4405
## Residuals 34 5.7122 0.1680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
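Both criteria referenced above can be read off directly: the confidence interval for the pH coefficient, and the extra sum of squares (partial) F-test comparing the fits with and without pH.

confint(model5)          # the interval for pH contains 0
anova(model4, model5)    # extra sum of squares F-test for adding pH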