Homework #1 36-617

1.11

The regression function relating production output by an employee after taking a training program \( (Y) \) to the production output before the training program \( (X) \) is \( E\{Y\} = 20 + .95X \), where \( X \) ranges from \( 40 \) to \( 100 \). An observer concludes that the training program does not raise production output on the average because \( \beta_1 \) is not greater than \( 1.0 \). Comment.

The observer's argument is that a slope \( \beta_1 \) that is not greater than \( 1.0 \) means that, once \( X \) is large enough, productivity after the program will no longer exceed productivity before the program. However, the model was built for \( X \) between \( 40 \) and \( 100 \), and we can only use it to make predictions within that range, so we never reach the point where \( X \) is large enough for the program to stop improving productivity. The following plot shows that productivity after the program (red line) is always greater than productivity before the program (blue line) for values of \( X \) between \( 40 \) and \( 100 \).

data <- data.frame(emp = 40:100, before = 40:100, after = 20 + 0.95 * (40:100))
plot(data$after ~ data$emp, main = "Productivity before vs after program", type = "l", 
    col = "red", xlab = "", ylab = "Productivity")
abline(a = 0, b = 1, col = "blue")

[Plot: "Productivity before vs after program", productivity after the program (red line) and before the program (blue line)]
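The gap between the two lines can also be checked algebraically. This is a short worked version that uses only the regression function stated in the problem:

\[ E\{Y\} - X = (20 + 0.95X) - X = 20 - 0.05X \]

This difference is positive whenever \( X < 400 \). Within the scope of the model it falls from \( 20 - 0.05(40) = 18 \) at \( X = 40 \) to \( 20 - 0.05(100) = 15 \) at \( X = 100 \), so the mean output after training exceeds the output before training by 15 to 18 units over the whole range.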

1.19

Grade point average. The director of admissions of a small college selected \( 120 \) students at random from the new freshman class in a study to determine whether a student's grade point average (GPA) at the end of the freshman year \( (Y) \) can be predicted from the ACT test score \( (X) \). The results of the study follow. Assume that first-order regression model (1.1) is appropriate.

a. Obtain the least squares estimates of \( \beta_0 \) and \( \beta_1 \), and state the estimated regression function.

data = read.table("CH01PR19.txt", header = FALSE, col.names = c("GPA", "ACT"))
fit <- lm(data$GPA ~ data$ACT)
fit

## 
## Call:
## lm(formula = data$GPA ~ data$ACT)
## 
## Coefficients:
## (Intercept)     data$ACT  
##      2.1140       0.0388

The least squares estimates of \( \beta_0 \) and \( \beta_1 \) are \( b_0 = 2.11405 \) and \( b_1 = 0.03883 \), so the estimated regression function is: \[ \widehat{Y} = 2.11405 + 0.03883X \]
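For reference, the numbers reported by `lm` can be reproduced from the closed-form least squares formulas for simple linear regression. The sketch below is only an illustration: the vectors `x` and `y` are hypothetical toy values, not the CH01PR19 data, and `fit_toy` is a name introduced here.

```r
# Hypothetical toy data (NOT the CH01PR19 data): x plays the role of ACT, y of GPA.
x <- c(21, 14, 28, 22, 25, 30, 18)
y <- c(3.9, 2.6, 3.5, 3.0, 3.6, 3.8, 2.4)

# Closed-form least squares estimates for simple linear regression.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() should reproduce the same two numbers.
fit_toy <- lm(y ~ x)
est <- c(b0 = b0, b1 = b1)
print(est)
print(coef(fit_toy))
```

The two printed vectors should agree up to floating point error, which is a quick sanity check that the intercept and slope are read from the `lm` output in the right order.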

b. Plot the estimated regression function and the data. Does the estimated regression function appear to fit the data well?

plot(data$GPA ~ data$ACT, main = "Estimated vs Data", xlab = "ACT", ylab = "GPA", 
    col = "blue")
abline(fit, col = "red")

[Plot: "Estimated vs Data", GPA against ACT (blue points) with the estimated regression line (red)]

According to the plot, the estimated regression function is the straight line that comes closest to the data in the least squares sense. However, the observations are widely scattered around the line, so the fit is weak and predictions of an individual student's GPA from the ACT score will not be very accurate.

c. Obtain a point estimate of the mean freshman GPA for students with ACT test score \( X = 30 \).

Y = fit$coefficients[[2]] * 30 + fit$coefficients[[1]]

\[ \widehat{Y}_h = 3.27895 \]

d. What is the point estimate of the change in the mean response when the entrance test score increases by one point?

The slope \( b_1 \) of the estimated regression line is the point estimate of the change in the mean response when \( X \) increases by one point: \[ b_1 = 0.03883 \]
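The reason the slope answers this question can be written out in one line. Writing \( b_0 \) and \( b_1 \) for the estimated intercept and slope, the estimated mean responses at \( X + 1 \) and at \( X \) differ by

\[ \left( b_0 + b_1 (X + 1) \right) - \left( b_0 + b_1 X \right) = b_1 \]

so the estimated change per additional ACT point is the same at every level of \( X \).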

1.20

Copier maintenance. The Tri-City Office Equipment Corporation sells an imported copier on a franchise basis and performs preventive maintenance and repair service on this copier. The data below have been collected from \( 45 \) recent calls on users to perform routine preventive maintenance service; for each call, \( X \) is the number of copiers serviced and \( Y \) is the total number of minutes spent by the service person. Assume that first-order regression model (1.1) is appropriate.

a. Obtain the estimated regression function.

data = read.table("CH01PR20.txt", header = FALSE, col.names = c("time", "nocopiers"))
fit <- lm(data$time ~ data$nocopiers)
fit

## 
## Call:
## lm(formula = data$time ~ data$nocopiers)
## 
## Coefficients:
##    (Intercept)  data$nocopiers  
##          -0.58           15.04

\[ \widehat{Y} = -0.5802 + 15.0352X \]

b. Plot the estimated regression function and the data. How well does the estimated regression function fit the data?

plot(data$time ~ data$nocopiers, main = "Estimated vs Data", xlab = "Number of copiers", 
    ylab = "Time", col = "blue")
abline(fit, col = "red")

[Plot: "Estimated vs Data", time against number of copiers (blue points) with the estimated regression line (red)]

According to the plot, the estimated regression function fits the data very well. Almost all levels of \( X \) show the same spread, and at each level of \( X \) the line passes through or near at least one point (which in most cases is close to the mean of \( Y \) at that level).

c. Interpret \( \beta_0 \) in your estimated regression function. Does \( \beta_0 \) provide any relevant information here? Explain.

In general, the intercept gives the mean value of \( Y \) when \( X = 0 \). In this case, however, a service call on which 0 copiers are serviced makes no sense (and the estimate \( -0.5802 \) would be a negative service time), so the intercept provides no relevant information of its own. It only serves to position the regression line.

d. Obtain a point estimate of the mean service time when X = 5 copiers are serviced.

Y = fit$coefficients[[2]] * 5 + fit$coefficients[[1]]

\[ \widehat{Y}_h = 74.5958 \]
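The same kind of point estimate can also be obtained with `predict`. The sketch below uses a hypothetical toy data frame `toy` (not the CH01PR20 data) and a model `fit_toy` fitted with bare column names and a `data` argument; that formula style is an assumption of this sketch, chosen because `predict` matches the columns of `newdata` by name.

```r
# Hypothetical toy data (NOT the CH01PR20 data).
toy <- data.frame(nocopiers = c(1, 2, 3, 4, 5, 6, 7, 8),
                  time = c(14, 31, 44, 62, 73, 91, 104, 122))

# Bare column names plus data = let predict() find the predictor in newdata.
fit_toy <- lm(time ~ nocopiers, data = toy)

# Point estimate of the mean response at X = 5, computed two ways.
by_hand <- coef(fit_toy)[[1]] + coef(fit_toy)[[2]] * 5
by_predict <- unname(predict(fit_toy, newdata = data.frame(nocopiers = 5)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Both values should agree. The hand computation is closer to the textbook formula, while `predict` scales more easily to several values of \( X \) at once.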

1.22

Plastic hardness. Refer to Problems 1.3 and 1.14. Sixteen batches of the plastic were made, and from each batch one test item was molded. Each test item was randomly assigned to one of the four predetermined time levels, and the hardness was measured after the assigned elapsed time. The results are shown below; \( X \) is the elapsed time in hours and \( Y \) is hardness in Brinell units. Assume that first-order regression model (1.1) is appropriate.

a. Obtain the estimated regression function. Plot the estimated regression function and the data. Does a linear regression function appear to give a good fit here?

data = read.table("CH01PR22.txt", header = FALSE, col.names = c("hardness", 
    "time"))
fit <- lm(data$hardness ~ data$time)
fit

## 
## Call:
## lm(formula = data$hardness ~ data$time)
## 
## Coefficients:
## (Intercept)    data$time  
##      168.60         2.03

plot(data$hardness ~ data$time, main = "Estimated vs Data", xlab = "Time(h)", 
    ylab = "Hardness(Brinell)", col = "blue")
abline(fit, col = "red")

[Plot: "Estimated vs Data", hardness (Brinell) against time (h) (blue points) with the estimated regression line (red)]

\[ \widehat{Y} = 168.600000 + 2.034375X \]

According to the plot, the estimated regression function fits the data very well. Almost all levels of \( X \) show the same spread, and at each level of \( X \) the line passes through or near at least one point (which in most cases is close to the mean of \( Y \) at that level).

b. Obtain a point estimate of the mean hardness when X = 40 hours.

Y = fit$coefficients[[2]] * 40 + fit$coefficients[[1]]

\[ \widehat{Y}_h = 249.975 \]

c. Obtain a point estimate of the change in mean hardness when X increases by 1 hour.

\[ b_1 = 2.034375 \]

The slope \( b_1 \) of the estimated regression line is the point estimate of the change in the mean response when \( X \) increases by one hour.

1.27

Muscle mass. A person's muscle mass is expected to decrease with age. To explore this relationship in women, a nutritionist randomly selected \( 15 \) women from each 10-year age group, beginning with age \( 40 \) and ending with age \( 79 \). The results follow; \( X \) is age, and \( Y \) is a measure of muscle mass. Assume that first-order regression model (1.1) is appropriate.

a. Obtain the estimated regression function. Plot the estimated regression function and the data. Does a linear regression function appear to give a good fit here? Does your plot support the anticipation that muscle mass decreases with age?

data = read.table("CH01PR27.txt", header = FALSE, col.names = c("mass", "age"))
fit <- lm(data$mass ~ data$age)
fit

## 
## Call:
## lm(formula = data$mass ~ data$age)
## 
## Coefficients:
## (Intercept)     data$age  
##      156.35        -1.19

plot(data$mass ~ data$age, main = "Estimated vs Data", xlab = "Age", ylab = "Mass", 
    col = "blue")
abline(fit, col = "red")

[Plot: "Estimated vs Data", muscle mass against age (blue points) with the estimated regression line (red)]

\[ \widehat{Y} = 156.35 - 1.19X \]

According to the plot, the estimated regression function fits the data quite well. The spread seems to vary with the level of \( X \) (age), but not considerably; a function other than a straight line might fit somewhat better. The plot supports the anticipation that muscle mass decreases with age.

b. Obtain the following: (1) a point estimate of the difference in the mean muscle mass for women differing in age by one year, (2) a point estimate of the mean muscle mass for women aged \( X = 60 \) years, (3) the value of the residual for the eighth case, (4) a point estimate of \( \sigma^2 \).

(1) \( b_1 = -1.19 \) is the slope of the estimated regression line, so it is the point estimate of the change in the mean response when \( X \) increases by one year.

(2) \( \widehat{Y}_h = 84.95 \)

Y = fit$coefficients[[2]] * 60 + fit$coefficients[[1]]

(3) \( e_8 = 4.4433 \)

residuals(fit)

##        1        2        3        4        5        6        7        8 
##   0.8232  -1.5567  -3.4168  11.3932  -6.7968  11.4433  -8.4168   4.4433 
##        9       10       11       12       13       14       15       16 
##  -7.2268   2.7732   0.6332   6.5832  -3.1768  11.0132  -5.3668  -3.8968 
##       17       18       19       20       21       22       23       24 
##   2.4832   7.2932  -4.1368 -10.5168   2.9132   4.7232  -0.4668   2.7232 
##       25       26       27       28       29       30       31       32 
##   7.9132  -0.9468 -16.1368   8.3432 -10.1368   4.4832  -2.4269  -8.3768 
##       33       34       35       36       37       38       39       40 
##  -8.9468  -1.3768   2.6232  -9.1869 -13.8069   9.0031  -5.9468   9.0031 
##       41       42       43       44       45       46       47       48 
##  -5.9969  -0.2369  -7.7568  13.9531  -5.4269   5.4731  -9.5269  -1.5269 
##       49       50       51       52       53       54       55       56 
##   7.3331  -8.0469  -5.4769   8.0931  23.4731  -0.5269  10.1431  12.9031 
##       57       58       59       60 
## -12.7169  -9.9069  -0.6669   8.0931
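The value read off the `residuals` output is just the observed response minus the fitted value for that case. A minimal sketch of that definition, using hypothetical toy vectors `age` and `mass` (not the CH01PR27 data) and an illustrative model name `fit_toy`:

```r
# Hypothetical toy data (NOT the CH01PR27 data).
age  <- c(43, 41, 47, 56, 58, 62, 65, 71, 74, 78)
mass <- c(106, 106, 97, 92, 87, 80, 82, 73, 70, 65)
fit_toy <- lm(mass ~ age)

# Residual for case i: observed response minus fitted value.
i <- 8
e_by_hand <- mass[i] - fitted(fit_toy)[[i]]
e_from_lm <- residuals(fit_toy)[[i]]

print(c(e_by_hand = e_by_hand, e_from_lm = e_from_lm))
```

Indexing with `[[i]]` picks out a single case without printing the whole residual vector.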

(4) \( MSE = 66.8 \)

a = residuals(fit)
MSE = sum(a^2)/(length(a) - 2)
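As a cross-check on the hand computation, the same quantity can be recovered from values that R already stores for a fitted `lm` object. The sketch below again uses hypothetical toy data rather than the CH01PR27 data; `fit_toy` and the `mse_*` names are introduced only for illustration.

```r
# Hypothetical toy data (NOT the CH01PR27 data).
age  <- c(43, 41, 47, 56, 58, 62, 65, 71, 74, 78)
mass <- c(106, 106, 97, 92, 87, 80, 82, 73, 70, 65)
fit_toy <- lm(mass ~ age)

e <- residuals(fit_toy)
n <- length(e)

# MSE = SSE / (n - 2), computed three equivalent ways.
mse_by_hand       <- sum(e^2) / (n - 2)
mse_from_summary  <- summary(fit_toy)$sigma^2                  # squared residual standard error
mse_from_deviance <- deviance(fit_toy) / df.residual(fit_toy)  # SSE over residual df

print(c(by_hand = mse_by_hand,
        from_summary = mse_from_summary,
        from_deviance = mse_from_deviance))
```

All three numbers should coincide; the divisor is \( n - 2 \) because two parameters, \( \beta_0 \) and \( \beta_1 \), are estimated.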

1.29

Refer to regression model (1.1). Assume that \( X = 0 \) is within the scope of the model. What is the implication for the regression function if \( \beta_0 = 0 \) so that the model is \( Y_i = \beta_1X_i + \varepsilon_i \)? How would the regression function plot on a graph?

It means that the mean value of \( Y \) when \( X = 0 \) is \( 0 \). On a graph, the regression function is a straight line that passes through the origin.
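Written out, and assuming as in model (1.1) that \( E\{\varepsilon_i\} = 0 \), the regression function for this model is

\[ E\{Y\} = \beta_1 X, \qquad \text{so at } X = 0: \quad E\{Y\} = \beta_1 \cdot 0 = 0 \]

which is a line through the point \( (0, 0) \) with slope \( \beta_1 \).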

1.30

Refer to regression model (1.1). What is the implication for the regression function if \( \beta_1 = 0 \) so that the model is \( Y_i = \beta_0 + \varepsilon_i \)? How would the regression function plot on a graph?

It means that the mean of \( Y \) does not depend on \( X \): \( E\{Y\} = \beta_0 \) at every level of \( X \). On a graph, the regression function is a horizontal line at height \( \beta_0 \).