Jerry Zhang
36-617 Applied Linear Models
Dr. T. Gaugler
Due September 4, 2013
setwd("C:/Users/vloei_000/Dropbox/School/Regression/PS 1")
## Error: cannot change working directory
The training program does increase production output for all values of X ranging from 40 to 100. The slope \( \beta_1 \) = 0.95 cannot be interpreted on its own, without considering the intercept \( \beta_0 \) = 20: the fitted output 0.95X + 20 exceeds X whenever X is below 400. By plotting the production output after the training, we see that production is increased for all X within the range.
X = seq(40, 100, 5)
Y = X * 0.95 + 20
plot(X, Y, type = "l", main = "Production Output After Training", xlab = "Pre-training Production Output",
ylab = "Post-training Production Output")
In fact, we see that even an employee with 40 pre-training production output will have a production output of 58.
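As a quick numerical check of the claim above, the following sketch computes the gain Y - X over the same grid of X values. The variable name `gain` and the break-even calculation are illustrative additions, not part of the original solution.

```r
# Same grid and fitted line as in the plot above
X = seq(40, 100, 5)
Y = 0.95 * X + 20

# Gain from training at each pre-training level; algebraically 20 - 0.05 * X
gain = Y - X

all(gain > 0)     # TRUE: output rises everywhere on the grid
range(gain)       # smallest and largest gain on the grid
20/(1 - 0.95)     # break-even X, where 0.95 * X + 20 equals X
```

The gain shrinks as X grows, so the slope below 1 only turns into a loss far outside the stated range.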
dataCH01PR19 = read.table("CH01PR19.txt")
names(dataCH01PR19) = c("GPA", "ACT")
attach(dataCH01PR19)
lmCH01PR19 = lm(GPA ~ ACT)
summary(lmCH01PR19)
##
## Call:
## lm(formula = GPA ~ ACT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7400 -0.3383 0.0406 0.4406 1.2274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1140 0.3209 6.59 1.3e-09 ***
## ACT 0.0388 0.0128 3.04 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.623 on 118 degrees of freedom
## Multiple R-squared: 0.0726, Adjusted R-squared: 0.0648
## F-statistic: 9.24 on 1 and 118 DF, p-value: 0.00292
a. The estimates of \( \beta_0 \) and \( \beta_1 \) are 2.114 and 0.0388, respectively.
b. Scatterplot of data and regression line. The regression line fits the data poorly: the \( R^2 \) value is only 0.0726, so ACT score explains about 7% of the variation in GPA. Even so, the F-statistic and its associated p-value (0.00292) above show that the slope is significantly different from 0 at \( \alpha \) = 0.05.
plot(ACT, GPA, main = "Predicting GPA from ACT Score", ylab = "GPA", xlab = "ACT")
abline(lmCH01PR19)
c. The point estimate of mean freshman GPA for students with an ACT test score of 30 is 3.2789. The 95% prediction interval is between 2.033 and 4.525.
predict(lmCH01PR19, data.frame(ACT = 30), interval = "predict")
## fit lwr upr
## 1 3.279 2.033 4.525
d. The point estimate of the change in mean response when the ACT score is increased by one point is equal to \( \hat\beta_1 \), which has the value 0.0388. The 95% confidence interval for \( \beta_1 \) is between 0.0135 and 0.0641.
confint(lmCH01PR19)
## 2.5 % 97.5 %
## (Intercept) 1.47859 2.74951
## ACT 0.01353 0.06412
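The interval for the slope can be reproduced by hand from the coefficient table. The sketch below uses the rounded estimate and standard error as printed in the summary output, so it should agree with the interval above only to about three decimal places; the variable names are illustrative.

```r
# Values copied from the summary() output (rounded as printed)
b1 = 0.0388        # slope estimate for ACT
se.b1 = 0.0128     # standard error of the slope
df.resid = 118     # residual degrees of freedom, n - 2

# 95% confidence interval: estimate +/- t quantile * standard error
t.crit = qt(0.975, df.resid)
ci = b1 + c(-1, 1) * t.crit * se.b1
ci
```

Small differences from the printed interval come from rounding in the inputs, not from a different method.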
detach(dataCH01PR19)
dataCH01PR20 = read.table("CH01PR20.txt")
names(dataCH01PR20) = c("Time", "N")
attach(dataCH01PR20)
lmCH01PR20 = lm(Time ~ N)
summary(lmCH01PR20)
##
## Call:
## lm(formula = Time ~ N)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.772 -3.737 0.333 6.333 15.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.580 2.804 -0.21 0.84
## N 15.035 0.483 31.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.91 on 43 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.957
## F-statistic: 969 on 1 and 43 DF, p-value: <2e-16
a. The regression function is Time = -0.5802 + 15.0352N + \( \epsilon\ \), where N is the number of copiers serviced and Time is the total number of minutes spent by the service person.
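b. Scatterplot of data and regression line. The regression line fits the data well, with an \( R^2 \) value of 0.957 and a highly significant slope. The intercept term is not significant at \( \alpha \) = 0.05, but that concerns the interpretation of \( \beta_0 \), not the quality of the fit. See part c.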
plot(N, Time, main = "Maintenance Time versus Number of Copiers Serviced", ylab = "Time (Min)",
xlab = "Number of Copiers Serviced")
abline(lmCH01PR20)
c. The \( \beta_0 \) has a value of -0.5802 and a corresponding p-value of 0.8371. We thus fail to reject the null hypothesis that \( \beta_0 \) = 0. It would not be right to draw interpretations from \( \beta_0 \).
d. The point estimate of mean service time when N = 5 copiers are serviced is 74.5961 minutes. The non-significant intercept does not by itself invalidate this estimate, because the fitted line is being evaluated at N = 5 rather than extrapolated to N = 0.
predict(lmCH01PR20, data.frame(N = 5), interval = "predict")
## fit lwr upr
## 1 74.6 56.42 92.77
detach(dataCH01PR20)
dataCH01PR22 = read.table("CH01PR22.txt")
names(dataCH01PR22) = c("Hardness", "Hour")
attach(dataCH01PR22)
lmCH01PR22 = lm(Hardness ~ Hour)
summary(lmCH01PR22)
##
## Call:
## lm(formula = Hardness ~ Hour)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.150 -2.219 0.162 2.688 5.575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.6000 2.6570 63.5 < 2e-16 ***
## Hour 2.0344 0.0904 22.5 2.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.23 on 14 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.971
## F-statistic: 507 on 1 and 14 DF, p-value: 2.16e-12
a. The estimated regression function is \( \hat{Y}_i = 168.6 + 2.0344 \cdot X_i \), where Y is the hardness in Brinell units and X is the elapsed time in hours. The regression fits the data very well, with an \( R^2 \) value of 0.9731. From the F-statistic and its associated p-value above, we see that the slope is significantly different from 0 at \( \alpha \) = 0.05.
plot(Hour, Hardness, main = "Hardness versus Time", xlab = "Time (Hour)", ylab = "Hardness (Brinell)")
abline(lmCH01PR22)
b. The point estimate of the mean hardness at 40 hours is 249.975 Brinell units, with a 95% prediction interval between 242.4562 and 257.4938.
predict(lmCH01PR22, data.frame(Hour = 40), interval = "predict")
## fit lwr upr
## 1 250 242.5 257.5
c. When time is increased by 1 hour, the point estimate of the change in mean hardness is \( \hat\beta_1 \) = 2.0344 Brinell units. The 95% confidence interval for \( \beta_1 \) is between 1.8405 and 2.2283.
confint(lmCH01PR22)
## 2.5 % 97.5 %
## (Intercept) 162.90 174.299
## Hour 1.84 2.228
detach(dataCH01PR22)
dataCH01PR27 = read.table("CH01PR27.txt")
names(dataCH01PR27) = c("Muscle", "Age")
attach(dataCH01PR27)
lmCH01PR27 = lm(Muscle ~ Age)
summary(lmCH01PR27)
##
## Call:
## lm(formula = Muscle ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.137 -6.197 -0.597 6.761 23.473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.3466 5.5123 28.4 <2e-16 ***
## Age -1.1900 0.0902 -13.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.17 on 58 degrees of freedom
## Multiple R-squared: 0.75, Adjusted R-squared: 0.746
## F-statistic: 174 on 1 and 58 DF, p-value: <2e-16
a. The estimated regression function is \( \hat{Y}_i = 156.3466 - 1.19 \cdot X_i \), where Y is a measure of muscle mass and X is age in years. The regression fits the data well, with an \( R^2 \) value of 0.7501. From the F-statistic and its associated p-value above, we see that the slope is significantly different from 0 at \( \alpha \) = 0.05. There is a clear relationship between decreasing muscle mass and age.
plot(Age, Muscle, main = "Muscle Mass versus Age", xlab = "Age (Year)", ylab = "Muscle Mass (unit)")
abline(lmCH01PR27)
b. (1) The point estimate for the difference in mean muscle mass for women differing in age by one year is -1.19, with a 95% confidence interval between -1.3705 and -1.0094.
confint(lmCH01PR27)
## 2.5 % 97.5 %
## (Intercept) 145.313 167.381
## Age -1.371 -1.009
(2) The point estimate of the mean muscle mass for women aged 60 years is 84.9468 units, with a 95% prediction interval between 68.4507 and 101.443 units.
predict(lmCH01PR27, data.frame(Age = 60), interval = "prediction")
## fit lwr upr
## 1 84.95 68.45 101.4
(3) The value of the 8th residual is 4.4433
lmCH01PR27$resid[8]
## 8
## 4.443
(4) A point estimate of \( \sigma^2 \) is the residual mean square (MSE), 66.8008.
anova(lmCH01PR27)
## Analysis of Variance Table
##
## Response: Muscle
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 11627 11627 174 <2e-16 ***
## Residuals 58 3874 67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
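As a worked check using the rounded values printed in the ANOVA table, the estimate of \( \sigma^2 \) is the residual sum of squares divided by its degrees of freedom: \[ \hat\sigma^2 = MSE = \frac{SSE}{n-2} = \frac{3874}{58} \approx 66.8 \] Its square root, about 8.17, matches the residual standard error reported in the regression summary.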
detach(dataCH01PR27)
If \( \beta_0 \) were forced to be 0 so that the model becomes \( Y_i = \beta_1 \cdot X_i + \epsilon_i \), the regression line would be forced to go through the origin. The x- and y-intercepts would both be at (0,0).
If \( \beta_1 \) were forced to be 0 so that the model becomes \( Y_i = \beta_0 + \epsilon_i \), the regression line would become a horizontal line. X would no longer influence the output of the regression model.
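Both restricted models can be fit directly with `lm()`. The sketch below uses simulated data (the values of `x` and `y` are hypothetical and not taken from any problem in this set) purely to illustrate what each restriction does to the fitted line.

```r
# Hypothetical simulated data, for illustration only
set.seed(1)
x = 1:20
y = 3 + 2 * x + rnorm(20)

# beta_0 forced to 0: the fitted line passes through (0, 0)
fit.origin = lm(y ~ 0 + x)

# beta_1 forced to 0: the fitted line is horizontal
fit.flat = lm(y ~ 1)

coef(fit.origin)   # a single slope, equal to sum(x * y) / sum(x^2)
coef(fit.flat)     # a single intercept, equal to mean(y)
```

In the horizontal-line model the least squares estimate of \( \beta_0 \) is simply the sample mean of Y, which is why X drops out of every prediction.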
Because the p-value of 0.91 is greater than \( \alpha \) = 0.05, we fail to reject the null hypothesis that the slope is equal to 0. The model in this case should not be used to infer the relationship between advertising expenditures and sales, and the student should not use it to draw conclusions.
a. The 99% confidence interval for \( \beta_1 \) is between 0.0054 and 0.0723. The 99% confidence interval does not include 0. Checking whether the confidence interval includes 0 is equivalent to conducting a two-sided t-test at \( \alpha \) = 0.01 of whether the slope \( \beta_1 \) is statistically different from 0. A non-zero \( \beta_1 \) would indicate a relationship between ACT scores and end-of-freshman-year GPA.
confint(lmCH01PR19, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 1.273903 2.95420
## ACT 0.005386 0.07227
b. \[ H_0: \beta_1=0 \] \[ H_a: \beta_1\neq0 \] \[ t^* = \frac{\hat\beta_1-0}{SE(\hat\beta_1)} \] We first evaluate t*. Next, we find the corresponding two-sided p-value with n-2 degrees of freedom and compare it with \( \alpha \) = 0.01. If the p-value is smaller than \( \alpha \), we reject \( H_0 \); if it is larger, we fail to reject \( H_0 \). In this case the p-value is 0.00291, so \( H_0 \) is rejected at \( \alpha \) = 0.01. We reject the null hypothesis that the slope \( \beta_1 \) is equal to 0 in favor of the alternative \( H_a \) that \( \beta_1\neq \) 0.
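# t statistic for the slope: estimate divided by its standard error
coefs = summary(lmCH01PR19)$coefficients
t.star = (coefs["ACT", "Estimate"] - 0)/coefs["ACT", "Std. Error"]
t.star
## [1] 3.04
# two-sided p-value on n - 2 residual degrees of freedom
p.value = 2 * pt(abs(t.star), df.residual(lmCH01PR19), lower.tail = FALSE)
p.value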
## [1] 0.002917
c. The p-value was 0.0029. It was compared to \( \alpha \) to decide whether or not to reject \( H_0 \). Assuming that the null hypothesis is true, the p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one actually observed. Put differently, if \( \beta_1 \) were really 0, there would be about a 0.29% chance of obtaining a sample with a t statistic at least this far from 0.
a. The 90% confidence interval for the change in mean service time when the number of copiers serviced increases by one is between 14.2231 and 15.8474 minutes.
confint(lmCH01PR20, level = 0.9)[2, ]
## 5 % 95 %
## 14.22 15.85
b. \[ H_0: \beta_1=0 \] \[ H_a: \beta_1\neq0 \] \[ t^* = \frac{\hat\beta_1-0}{SE(\hat\beta_1)} \] We first evaluate t*. Next, we find the corresponding two-sided p-value with n-2 degrees of freedom and compare it with \( \alpha \) = 0.1. If the p-value is smaller than \( \alpha \), we reject \( H_0 \); if it is larger, we fail to reject \( H_0 \). In this case the p-value is numerically 0 (below 2e-16 according to the regression summary), so \( H_0 \) is rejected at \( \alpha \) = 0.1. We reject the null hypothesis that the slope \( \beta_1 \) is equal to 0 in favor of the alternative \( H_a \) that \( \beta_1\neq \) 0.
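# t statistic for the slope: estimate divided by its standard error
coefs = summary(lmCH01PR20)$coefficients
t.star = (coefs["N", "Estimate"] - 0)/coefs["N", "Std. Error"]
t.star
## [1] 31.12
# two-sided p-value; 1 - pt() rounds to 0 for a statistic this large
p.value = 2 * (1 - pt(t.star, df.residual(lmCH01PR20)))
p.value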
## [1] 0
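The printed p-value of exactly 0 is a floating-point artifact rather than a true zero: `pt()` returns a value so close to 1 that subtracting it from 1 gives 0. A minimal sketch of the difference, using the rounded t statistic and degrees of freedom from the output above (the variable names are illustrative):

```r
# Values as printed in the output above
t.star = 31.12
df.resid = 43

# Subtracting from 1 loses the tail probability entirely
p.naive = 2 * (1 - pt(t.star, df.resid))

# Asking pt() for the upper tail directly keeps the tiny probability
p.stable = 2 * pt(t.star, df.resid, lower.tail = FALSE)

p.naive
p.stable
```

Either way the conclusion is unchanged, since the p-value is far below any conventional \( \alpha \).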
c. Results from parts (a) and (b) are consistent. The confidence interval calculated in part (a) for \( \beta_1 \) does not include the value of 0, which is consistent with the result from part (b) rejecting the null hypothesis that \( \beta_1 \) = 0.
d. \[ H_0: \beta_1 \leq 14 \] \[ H_a: \beta_1 > 14 \] \[ t^* = \frac{\hat\beta_1-14}{SE(\hat\beta_1)} \]
e. We determined earlier that \( \beta_0 \) is not statistically different from 0. The estimate of \( \beta_0 \) is -0.5802, and a negative service time has no physical meaning. No interpretation should be drawn from \( \beta_0 \).
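For the one-sided test set up in part (d), the statistic can be evaluated from the rounded slope estimate and standard error in the regression summary. This is only a sketch: the significance level `alpha = 0.05` is an assumption made for illustration, since none is stated above for this part, and the variable names are illustrative.

```r
# Values copied from the summary() output (rounded as printed)
b1 = 15.0352       # slope estimate for N
se.b1 = 0.483      # standard error of the slope
df.resid = 43      # residual degrees of freedom, n - 2

# One-sided test of H0: beta_1 <= 14 against Ha: beta_1 > 14
t.star = (b1 - 14)/se.b1
t.star             # about 2.14

# Upper-tail p-value, because the alternative is beta_1 > 14
p.value = pt(t.star, df.resid, lower.tail = FALSE)
p.value

alpha = 0.05       # assumed level, not given in the text above
p.value < alpha    # TRUE means H0 is rejected at the assumed level
```

With these inputs the p-value falls between 0.01 and 0.025, so the decision would depend on whether the level chosen is above or below that range.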