Homework #1

Jerry Zhang
36-617 Applied Linear Models
Dr. T. Gaugler
Due September 4, 2013

# Set the working directory to the folder holding the CH01PR*.txt data files (this path is machine-specific).
setwd("C:/Users/vloei_000/Dropbox/School/Regression/PS 1")

1.11

The training program increases production output for every pre-training output X from 40 to 100. The slope \( \beta_1 \) = 0.95 cannot be evaluated on its own, without the intercept \( \beta_0 \) = 20: each unit of pre-training output contributes only 0.95 units after training, but the intercept adds a further 20 units. Plotting the production output after training shows that production increases for all X within the range.

X = seq(40, 100, 5)
Y = X * 0.95 + 20

plot(X, Y, type = "l", main = "Production Output After Training", xlab = "Pre-training Production Output", 
    ylab = "Post-training Production Output")

[Plot: Production Output After Training]

In fact, even an employee with a pre-training production output of 40 will have a post-training production output of 58.
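The same conclusion follows algebraically. As a short check using only the model stated above, the gain from training is the difference between post-training and pre-training output:

\[ Y - X = (20 + 0.95X) - X = 20 - 0.05X \]
\[ 20 - 0.05X > 0 \iff X < 400 \]

Every X between 40 and 100 satisfies X < 400, so the gain is positive throughout the range, shrinking from 18 units at X = 40 to 15 units at X = 100.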

1.19

dataCH01PR19 = read.table("CH01PR19.txt")
names(dataCH01PR19) = c("GPA", "ACT")
attach(dataCH01PR19)
lmCH01PR19 = lm(GPA ~ ACT)
summary(lmCH01PR19)

## 
## Call:
## lm(formula = GPA ~ ACT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7400 -0.3383  0.0406  0.4406  1.2274 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1140     0.3209    6.59  1.3e-09 ***
## ACT           0.0388     0.0128    3.04   0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.623 on 118 degrees of freedom
## Multiple R-squared:  0.0726, Adjusted R-squared:  0.0648 
## F-statistic: 9.24 on 1 and 118 DF,  p-value: 0.00292

a. The least squares estimates are \( \hat\beta_0 \) = 2.114 and \( \hat\beta_1 \) = 0.0388, so the estimated regression function is \( \hat{Y} = 2.114 + 0.0388X \).

b. Scatterplot of data and regression line. The regression fits the data only weakly: the R² value is 0.0726, so ACT score explains about 7% of the variation in GPA. Even so, the F-statistic and its associated p-value above (0.0029) show that the slope is significantly different from 0 at \( \alpha \) = 0.05.

plot(ACT, GPA, main = "Predicting GPA from ACT Score", ylab = "GPA", xlab = "ACT")
abline(lmCH01PR19)

[Plot: Predicting GPA from ACT Score]

c. The point estimate of the mean freshman GPA for students with an ACT test score of 30 is 3.2789. The 95% prediction interval is between 2.033 and 4.525.

predict(lmCH01PR19, data.frame(ACT = 30), interval = "predict")

##     fit   lwr   upr
## 1 3.279 2.033 4.525
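As a check, the fitted value can be reproduced by hand from the rounded coefficients in the summary table, which is why the last digits may differ slightly from R's result:

\[ \hat{Y} = 2.1140 + 0.0388 \times 30 = 2.1140 + 1.1640 = 3.2780 \approx 3.28 \]

The lwr and upr columns bound the GPA of a single new student with an ACT score of 30, so the interval is wider than one for the mean GPA at that score would be.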

d. The point estimate of the change in mean response when the ACT score increases by one point is \( \hat\beta_1 \) = 0.0388. The 95% confidence interval for \( \beta_1 \) is between 0.0135 and 0.0641.

confint(lmCH01PR19)

##               2.5 %  97.5 %
## (Intercept) 1.47859 2.74951
## ACT         0.01353 0.06412
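The interval for the ACT slope printed above is a t-based confidence interval, and it can be approximately reproduced from the rounded estimate and standard error in the summary table. The following is a minimal sketch; the variable names (b1, se_b1, df_resid) are illustrative and not part of the original analysis, and the result agrees with confint() only up to rounding of the inputs.

```r
# Values copied from the summary() table above (rounded as printed).
b1 <- 0.0388       # slope estimate for ACT
se_b1 <- 0.0128    # standard error of the slope
df_resid <- 118    # residual degrees of freedom (n - 2)

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- b1 + c(-1, 1) * t_crit * se_b1
round(ci, 4)
```

Changing 0.975 to 0.995 gives the corresponding 99% interval.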

detach(dataCH01PR19)

1.20

dataCH01PR20 = read.table("CH01PR20.txt")
names(dataCH01PR20) = c("Time", "N")
attach(dataCH01PR20)
lmCH01PR20 = lm(Time ~ N)
summary(lmCH01PR20)

## 
## Call:
## lm(formula = Time ~ N)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.772  -3.737   0.333   6.333  15.404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.580      2.804   -0.21     0.84    
## N             15.035      0.483   31.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.91 on 43 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.957 
## F-statistic:  969 on 1 and 43 DF,  p-value: <2e-16

a. The estimated regression function is \( \widehat{Time} = -0.5802 + 15.0352N \), where N is the number of copiers serviced and Time is the total number of minutes spent by the service person.

b. Scatterplot of data and regression line. The regression fits the data well: the R² value is 0.957 and the slope is highly significant. The intercept term is not significant at \( \alpha \) = 0.05, but that does not indicate a poor fit. See part c.

plot(N, Time, main = "Maintenance Time versus Number of Copiers Serviced", ylab = "Time (Min)", 
    xlab = "Number of Copiers Serviced")
abline(lmCH01PR20)

[Plot: Maintenance Time versus Number of Copiers Serviced]

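c. The estimate \( \hat\beta_0 \) is -0.5802 with a corresponding p-value of 0.8371, so we fail to reject the null hypothesis that \( \beta_0 \) = 0. Taken literally, \( \hat\beta_0 \) is the estimated mean service time for a call with zero copiers, and a negative time has no physical meaning, so no interpretation should be drawn from \( \beta_0 \).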

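d. The point estimate of the mean service time when N = 5 copiers are serviced is 74.5961 minutes. The non-significant intercept does not by itself make this estimate unreliable, since the prediction uses the whole fitted line. The 95% prediction interval for a single call with 5 copiers is 56.42 to 92.77 minutes.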

predict(lmCH01PR20, data.frame(N = 5), interval = "predict")

##    fit   lwr   upr
## 1 74.6 56.42 92.77

detach(dataCH01PR20)

1.22

dataCH01PR22 = read.table("CH01PR22.txt")
names(dataCH01PR22) = c("Hardness", "Hour")
attach(dataCH01PR22)
lmCH01PR22 = lm(Hardness ~ Hour)
summary(lmCH01PR22)

## 
## Call:
## lm(formula = Hardness ~ Hour)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.150 -2.219  0.162  2.688  5.575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 168.6000     2.6570    63.5  < 2e-16 ***
## Hour          2.0344     0.0904    22.5  2.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.23 on 14 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.971 
## F-statistic:  507 on 1 and 14 DF,  p-value: 2.16e-12

a. The estimated regression function is \( \hat{Y}_i = 168.6+2.0344·X_i \), where Y is the hardness in Brinell units and X is the elapsed time in hours. The regression fits the data very well, with an R² value of 0.9731. From the F-statistic and its associated p-value above, the slope is significantly different from 0 at \( \alpha \) = 0.05, and the t-test for the intercept in the coefficient table shows the same.

plot(Hour, Hardness, main = "Hardness versus Time", xlab = "Time (Hour)", ylab = "Hardness (Brinell)")
abline(lmCH01PR22)

[Plot: Hardness versus Time]

b. The point estimate of the mean hardness at 40 hours is 249.975 Brinell units, with a 95% prediction interval between 242.4562 and 257.4938.

predict(lmCH01PR22, data.frame(Hour = 40), interval = "predict")

##   fit   lwr   upr
## 1 250 242.5 257.5

c. When time is increased by 1 hour, the point estimate of the change in mean hardness is \( \hat\beta_1 \) = 2.0344 Brinell units. The 95% confidence interval for \( \beta_1 \) is between 1.8405 and 2.2283.

confint(lmCH01PR22)

##              2.5 %  97.5 %
## (Intercept) 162.90 174.299
## Hour          1.84   2.228

detach(dataCH01PR22)

1.27

dataCH01PR27 = read.table("CH01PR27.txt")
names(dataCH01PR27) = c("Muscle", "Age")
attach(dataCH01PR27)
lmCH01PR27 = lm(Muscle ~ Age)
summary(lmCH01PR27)

## 
## Call:
## lm(formula = Muscle ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.137  -6.197  -0.597   6.761  23.473 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.3466     5.5123    28.4   <2e-16 ***
## Age          -1.1900     0.0902   -13.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.17 on 58 degrees of freedom
## Multiple R-squared:  0.75,   Adjusted R-squared:  0.746 
## F-statistic:  174 on 1 and 58 DF,  p-value: <2e-16

a. The estimated regression function is \( \hat{Y}_i = 156.3466-1.19·X_i \), where Y is a measure of muscle mass and X is age in years. The regression fits the data fairly well, with an R² value of 0.7501. From the F-statistic and its associated p-value above, the slope is significantly different from 0 at \( \alpha \) = 0.05. There is a clear negative relationship between muscle mass and age: muscle mass decreases by an estimated 1.19 units per year of age.

plot(Age, Muscle, main = "Muscle Mass versus Age", xlab = "Age (Year)", ylab = "Muscle Mass (unit)")
abline(lmCH01PR27)

[Plot: Muscle Mass versus Age]

b. (1) The point estimate for the difference in mean muscle mass for women differing in age by one year is -1.19, with a 95% confidence interval between -1.3705 and -1.0094.

confint(lmCH01PR27)

##               2.5 %  97.5 %
## (Intercept) 145.313 167.381
## Age          -1.371  -1.009

(2) The point estimate of the mean muscle mass for women aged 60 years is 84.9468 units, with a 95% prediction interval between 68.4507 and 101.443 units.

predict(lmCH01PR27, data.frame(Age = 60), interval = "prediction")

##     fit   lwr   upr
## 1 84.95 68.45 101.4

(3) The value of the 8th residual is 4.4433.

lmCH01PR27$resid[8]

##     8 
## 4.443

(4) A point estimate of \( \sigma^2 \) is the mean squared error, 66.8008.

anova(lmCH01PR27)

## Analysis of Variance Table
## 
## Response: Muscle
##           Df Sum Sq Mean Sq F value Pr(>F)    
## Age        1  11627   11627     174 <2e-16 ***
## Residuals 58   3874      67                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
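The value comes from the Residuals row of the ANOVA table. Using the rounded sum of squares printed above (so the result agrees with 66.8008 only to about three significant figures):

\[ \hat\sigma^2 = MSE = \frac{SSE}{n-2} = \frac{3874}{58} \approx 66.8 \]
\[ \hat\sigma = \sqrt{MSE} \approx 8.17 \]

The square root matches the residual standard error reported by summary().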

detach(dataCH01PR27)

1.29

If \( \beta_0 \) were forced to be 0, so that the model becomes \( Y_i = \beta_1 ·X_i + \epsilon_i \), the regression line would be forced to go through the origin. The X- and Y-intercepts would both be at (0, 0).

1.30

If \( \beta_1 \) were forced to be 0, so that the model becomes \( Y_i = \beta_0 + \epsilon_i \), the regression line would become a horizontal line. X would no longer influence the output of the regression model.
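Both restricted models in 1.29 and 1.30 can be fitted in R through the model formula. The sketch below uses a small made-up data set (the x and y values are purely illustrative and do not come from the textbook), and the object names fit_origin and fit_flat are likewise assumptions.

```r
# Made-up illustrative data, not from the textbook.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# beta_0 forced to 0: "0 +" drops the intercept, so the line passes through the origin.
fit_origin <- lm(y ~ 0 + x)

# beta_1 forced to 0: intercept-only model, a horizontal line at mean(y).
fit_flat <- lm(y ~ 1)

coef(fit_origin)  # a single slope coefficient named "x"
coef(fit_flat)    # a single intercept equal to mean(y)
```

In the intercept-only fit, the least squares estimate of \( \beta_0 \) is the sample mean of Y, which is why the fitted line is flat at that height.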

2.3

Because the p-value of 0.91 is far greater than \( \alpha \) = 0.05, we fail to reject the null hypothesis that the slope is equal to 0. The data provide no evidence of a linear relationship between advertising expenditures and sales, so the student should not use this model to draw conclusions about that relationship.

2.4

a. The 99% confidence interval for \( \beta_1 \) is between 0.0054 and 0.0723. The 99% confidence interval does not include 0. Checking whether the confidence interval includes 0 is equivalent to conducting a two-sided t-test at \( \alpha \) = 0.01 of whether the slope \( \beta_1 \) is statistically different from 0. A non-zero \( \beta_1 \) would indicate a linear relationship between ACT scores and GPA at the end of the freshman year.

confint(lmCH01PR19, level = 0.99)

##                0.5 %  99.5 %
## (Intercept) 1.273903 2.95420
## ACT         0.005386 0.07227

b. \[ H_0: \beta_1=0 \] \[ H_a: \beta_1\neq0 \] \[ t^* = \frac{\hat\beta_1-0}{SE(\hat\beta_1)} \] We first evaluate \( t^* \), then find the corresponding two-sided p-value with n-2 = 118 degrees of freedom, and compare it with \( \alpha \) = 0.01. If the p-value is smaller than \( \alpha \), we reject \( H_0 \); otherwise we fail to reject \( H_0 \). Here the p-value of 0.0029 is smaller than 0.01, so we reject the null hypothesis that the slope \( \beta_1 \) is equal to 0 in favor of \( H_a \) that \( \beta_1\neq \) 0.

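# t statistic for the ACT slope: (estimate - hypothesized value) / standard error
t = (coef(summary(lmCH01PR19))["ACT", "Estimate"] - 0)/coef(summary(lmCH01PR19))["ACT", "Std. Error"]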
t

## [1] 3.04

p.value = 2 * (1 - pt(abs(t), df.residual(lmCH01PR19)))
p.value

## [1] 0.002917
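The decision can be cross-checked without the fitted object, using only the rounded slope estimate and standard error from the summary table. This is a sketch under that assumption; the names b1, se_b1, df_resid and alpha are illustrative, and because the inputs are rounded the p-value matches the one above only approximately.

```r
# Rounded values copied from the summary() table for GPA ~ ACT.
b1 <- 0.0388       # slope estimate
se_b1 <- 0.0128    # standard error of the slope
df_resid <- 118    # residual degrees of freedom (n - 2)
alpha <- 0.01      # significance level used in part (b)

t_star <- (b1 - 0) / se_b1

# Two-sided p-value from the upper tail of the t distribution.
p_value <- 2 * pt(abs(t_star), df_resid, lower.tail = FALSE)

# Equivalent decision rule: compare |t*| with the critical value.
t_crit <- qt(1 - alpha / 2, df_resid)
reject_h0 <- abs(t_star) > t_crit
reject_h0
```

Comparing the p-value with alpha and comparing |t*| with the critical value always lead to the same decision.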

c. The p-value was 0.0029. It was compared with \( \alpha \) to decide whether or not to reject \( H_0 \). Assuming that the null hypothesis is true, the p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one actually observed. In other words, if \( \beta_1 \) were truly 0, there would be only a 0.29% chance of obtaining a test statistic at least this extreme.

2.5

a. The 90% confidence interval for the change in mean service time when the number of copiers serviced increases by one is between 14.2231 and 15.8474 minutes.

confint(lmCH01PR20, level = 0.9)[2, ]

##   5 %  95 % 
## 14.22 15.85

b. \[ H_0: \beta_1=0 \] \[ H_a: \beta_1\neq0 \] \[ t^* = \frac{\hat\beta_1-0}{SE(\hat\beta_1)} \] We first evaluate \( t^* \), then find the corresponding two-sided p-value with n-2 = 43 degrees of freedom, and compare it with \( \alpha \) = 0.1. If the p-value is smaller than \( \alpha \), we reject \( H_0 \); otherwise we fail to reject \( H_0 \). Here the p-value is effectively 0 (reported as < 2e-16 in the summary table), so \( H_0 \) is rejected at \( \alpha \) = 0.1. We reject the null hypothesis that the slope \( \beta_1 \) is equal to 0 in favor of \( H_a \) that \( \beta_1\neq \) 0.

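# t statistic for the slope on N: (estimate - hypothesized value) / standard error
t = (coef(summary(lmCH01PR20))["N", "Estimate"] - 0)/coef(summary(lmCH01PR20))["N", "Std. Error"]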
t

## [1] 31.12

p.value = 2 * (1 - pt(abs(t), df.residual(lmCH01PR20)))
p.value

## [1] 0

c. Results from parts (a) and (b) are consistent. The confidence interval calculated in part (a) for \( \beta_1 \) does not include the value of 0, which is consistent with the result from part (b) rejecting the null hypothesis that \( \beta_1 \) = 0.

d. \[ H_0: \beta_1\leq14 \] \[ H_a: \beta_1>14 \] \[ t^* = \frac{\hat\beta_1-14}{SE(\hat\beta_1)} \]
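A sketch of how this one-sided test could be carried out, assuming a significance level of \( \alpha \) = 0.05 (the level is not stated above and is an assumption here) and using the rounded slope estimate and standard error from the summary table:

\[ t^* = \frac{15.035 - 14}{0.483} \approx 2.14 \]
\[ t(0.95;\ 43) \approx 1.68 \]

Since \( t^* \) exceeds the critical value, \( H_0 \) would be rejected at this level, suggesting that mean service time increases by more than 14 minutes per additional copier.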

e. We determined earlier that \( \beta_0 \) is not significantly different from 0. The estimate \( \hat\beta_0 \) = -0.5802 is negative, which does not make sense for a service time in the first place. No interpretation should be drawn from \( \beta_0 \).