Part 1 Chapter 4, problems 4.16 and 4.17

Refer to the Copier Maintenance Problem – assume that linear regression through the origin is appropriate.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.2
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.2
copier.maint <- read.delim("CopierMaintenance.txt", header = FALSE, sep = " ")
colnames(copier.maint) <-c("NA","Time", "Number")
copier.maint
##    NA Time Number
## 1  NA   20      2
## 2  NA   60      4
## 3  NA   46      3
## 4  NA   41      2
## 5  NA   12      1
## 6  NA  137     10
## 7  NA   68      5
## 8  NA   89      5
## 9  NA    4      1
## 10 NA   32      2
## 11 NA  144      9
## 12 NA  156     10
## 13 NA   93      6
## 14 NA   36      3
## 15 NA   72      4
## 16 NA  100      8
## 17 NA  105      7
## 18 NA  131      8
## 19 NA  127     10
## 20 NA   57      4
## 21 NA   66      5
## 22 NA  101      7
## 23 NA  109      7
## 24 NA   74      5
## 25 NA  134      9
## 26 NA  112      7
## 27 NA   18      2
## 28 NA   73      5
## 29 NA  111      7
## 30 NA   96      6
## 31 NA  123      8
## 32 NA   90      5
## 33 NA   20      2
## 34 NA   28      2
## 35 NA    3      1
## 36 NA   57      4
## 37 NA   86      5
## 38 NA  132      9
## 39 NA  112      7
## 40 NA   27      1
## 41 NA  131      9
## 42 NA   34      2
## 43 NA   27      2
## 44 NA   61      4
## 45 NA   77      5

a.

Obtain the estimated regression function.

wait.time <- copier.maint[,2]
copier.number <- copier.maint[,3]
copier.lm <- lm(wait.time~copier.number-1)
copier.lm
## 
## Call:
## lm(formula = wait.time ~ copier.number - 1)
## 
## Coefficients:
## copier.number  
##         14.95

\({\hat{Y}=14.95X}\)

b.

Estimate \({\beta_1}\) with a 90 percent confidence interval. Interpret your interval estimate.

ci.copier <- confint.lm(copier.lm, level = 0.90)
ci.copier
##                    5 %     95 %
## copier.number 14.56678 15.32767

\({s[b_1]=0.226424, t(0.95;44)=1.68023}\)
\({14.9472\pm1.68023(0.226424)}\)
\({14.567\le\beta_1\le15.328}\)
With 90 percent confidence, we estimate that mean service time increases by between 14.567 and 15.328 minutes for each additional copier serviced.
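As a cross-check, the interval can be rebuilt by hand from the standard error reported by summary() (a minimal sketch using the fitted object above):

b1 <- coef(copier.lm)[1]
s.b1 <- summary(copier.lm)$coefficients[1, 2]   # standard error of b1
t.mult <- qt(0.95, 44)                          # t(0.95; n - 1) for the no-intercept model
b1 + c(-1, 1) * t.mult * s.b1                   # lower and upper 90 percent limits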

c.

Predict the service time on a new call in which six copiers are to be serviced. Use a 90 percent confidence interval.

newdata <- data.frame(copier.number=6)
copier.pred <- predict(copier.lm, newdata, level=0.90, interval = "prediction", se.fit = TRUE)
copier.pred
## $fit
##        fit      lwr      upr
## 1 89.68338 74.69559 104.6712
## 
## $se.fit
## [1] 1.358546
## 
## $df
## [1] 44
## 
## $residual.scale
## [1] 8.816022

\({\hat{Y_h}=89.6834}\)
\({s[pred]=8.92008, 89.6834\pm1.68023(8.92008), 74.696\le Y_{h(new)}\le104.671}\)
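For reference, a short sketch showing how \(s[pred]\) and the limits arise from the components of the predict() output above:

s.pred <- sqrt(copier.pred$residual.scale^2 + copier.pred$se.fit^2)   # combines MSE and the se of the fitted value
t.mult <- qt(0.95, copier.pred$df)                                    # t(0.95; 44)
copier.pred$fit[1, "fit"] + c(-1, 1) * t.mult * s.pred                # lower and upper 90 percent limits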

d.

Plot the fitted regression line and the data. Does the linear regression function through the origin appear to be a good fit here?

plot(copier.number,wait.time, xlab= "Number of Copiers", ylab= "Service Time")
abline(copier.lm)

The linear regression appears to have a good fit with the data.

e.

Obtain the residuals \({e_i}\). Do they sum to zero? Plot the residuals against the fitted values \({\hat{Y_i}}\). What conclusions can be drawn from your plot?

copier.res <- resid(copier.lm)
copier.res
##           1           2           3           4           5           6 
##  -9.8944591   0.2110818   1.1583113  11.1055409  -2.9472296 -12.4722955 
##           7           8           9          10          11          12 
##  -6.7361478  14.2638522 -10.9472296   2.1055409   9.4749340   6.5277045 
##          13          14          15          16          17          18 
##   3.3166227  -8.8416887  12.2110818 -19.5778364   0.3693931  11.4221636 
##          19          20          21          22          23          24 
## -22.4722955  -2.7889182  -8.7361478  -3.6306069   4.3693931  -0.7361478 
##          25          26          27          28          29          30 
##  -0.5250660   7.3693931 -11.8944591  -1.7361478   6.3693931   6.3166227 
##          31          32          33          34          35          36 
##   3.4221636  15.2638522  -9.8944591  -1.8944591 -11.9472296  -2.7889182 
##          37          38          39          40          41          42 
##  11.2638522  -2.5250660   7.3693931  12.0527704  -3.5250660   4.1055409 
##          43          44          45 
##  -2.8944591   1.2110818   2.2638522
sum(copier.res)
## [1] -5.862797

No, they do not sum to zero (the sum is about -5.86).
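While \({\sum{e_i}\ne0}\) here, the least squares normal equation for the no-intercept model forces \({\sum{X_ie_i}=0}\) (see problem 4.23 in Part 6); a quick check:

sum(copier.number * copier.res)   # should be zero up to rounding error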

plot(copier.lm, which=c(1))
abline(0,0)

The residuals are centered near zero around the middle of the fitted values, but as the fitted values move away from that center the residuals no longer balance out. The through-the-origin model therefore fits well near the mean of the data, but it is less representative at the extremes of the range.

f.

Conduct a formal test for lack of fit of linear regression through the origin; use \({\alpha = 0.01}\). State the alternatives, decision rule, and conclusion. What is the P-value of the test?

alpha <- 0.01
n <- length(wait.time)
c <- length(unique(copier.number)) 
F.test.copier <- qf(1-alpha,c-1,n-c)
F.test.copier
## [1] 2.963012
pureErrorAnova(copier.lm)
## Analysis of Variance Table
## 
## Response: wait.time
##               Df Sum Sq Mean Sq   F value Pr(>F)    
## copier.number  1 338704  338704 4237.3465 <2e-16 ***
## Residuals     44   3420      78                     
##  Lack of fit   9    622      69    0.8648 0.5644    
##  Pure Error   35   2798      80                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Alternatives:
\({H_0: E[Y]=\beta_1X}\)
\({H_a: E[Y]\ne\beta_1X}\)
\(F(1-\alpha;c-1,n-c)=F(0.99;9,35)=2.963012\)
Decision rule:
If \({F^*\le2.963012}\), conclude \({H_0}\)
If \({F^*>2.963012}\), conclude \({H_a}\)
Conclusion:
Since \({F^*=0.8648\le2.963012}\), we conclude \({H_0}\): the linear regression function through the origin is appropriate (no lack of fit).

df1 <- c-1
df2 <- n-c
F.star <- 0.8648  # lack-of-fit F statistic from the pureErrorAnova() table above
p.val.copier <- pf(F.star, df1, df2, lower.tail = FALSE)

P-value = 0.5644 (the Pr(>F) reported for the lack-of-fit line in the ANOVA table above). Since the P-value exceeds \({\alpha=0.01}\), this agrees with the conclusion to retain \({H_0}\).

Part 2, Chapter 4, problem 4.1 and 4.3

When joint confidence intervals for \({\beta_0}\) and \({\beta_1}\) are developed by the Bonferroni method with a family confidence coefficient of 90 percent, does this imply that 10 percent of the time the confidence interval for \({\beta_0}\) will be incorrect? That 5 percent of the time the confidence interval for \({\beta_0}\) will be incorrect and 5 percent of the time that for \({\beta_1}\) the confidence interval will be incorrect? Discuss.

A family confidence coefficient of 90 percent means that, over repeated samples, both intervals are simultaneously correct 90 percent of the time; in the remaining 10 percent, at least one of the intervals for \({\beta_0}\) or \({\beta_1}\) is incorrect. The 10 percent need not be split evenly between the two statements: with the Bonferroni method the individual statement levels can be chosen by the user as long as they sum to the family level. For example, the statement level for \({\beta_0}\) might be set at 7 percent and that for \({\beta_1}\) at 3 percent.

Refer to the Copier Maintenance problem.

a.

Will \({b_0}\) and \({b_1}\) tend to err in the same direction or in opposite directions here? Explain.

mean(copier.number)
## [1] 5.111111
sigma.sq <- -mean(copier.number)*var(wait.time,copier.number)  # -Xbar times the sample covariance of Y and X; negative here since Xbar > 0
sigma.sq
## [1] -594.5926

\({b_0}\) and \({b_1}\) tend to err in opposite directions. Since \(\bar{X}=5.111\) is positive, \(\sigma[b_0,b_1]=-\bar{X}\sigma^2[b_1]\) is negative, so the sampling errors of \(b_0\) and \(b_1\) are negatively correlated: when one overestimates its parameter, the other tends to underestimate.
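A sketch of the same sign check using the estimated variance-covariance matrix of the with-intercept fit (fit.with.intercept is an illustrative name; its off-diagonal element estimates \(\sigma[b_0,b_1]\)):

fit.with.intercept <- lm(wait.time ~ copier.number)
vcov(fit.with.intercept)["(Intercept)", "copier.number"]   # negative, consistent with Xbar > 0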

b.

Obtain Bonferroni joint confidence intervals for \({\beta_0}\) and \({\beta_1}\), using a 95 percent family confidence coefficient.

bonf.copier.lm <- lm(wait.time~copier.number)
bonf.copier.lm
## 
## Call:
## lm(formula = wait.time ~ copier.number)
## 
## Coefficients:
##   (Intercept)  copier.number  
##       -0.5802        15.0352
joint.fam <- 2 #number of joint betas we want.
alpha <- 0.05
joint.level <- alpha/joint.fam
bonf.int <- confint(bonf.copier.lm, level = 1-joint.level)
bonf.int
##                  1.25 %   98.75 %
## (Intercept)   -7.092642  5.932329
## copier.number 13.913221 16.157275

\(b_0=-0.5802\)
\(b_1=15.0352\)
\({-7.093\le\beta_0\le5.932}\)
\({13.913\le\beta_1\le16.157}\)
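Equivalently, the limits can be built by hand with the Bonferroni multiple \(B=t(1-\alpha/4;n-2)\) (a minimal sketch; est and se are illustrative names):

B <- qt(1 - 0.05/4, df = 43)                                 # n - 2 = 43 degrees of freedom
est <- coef(bonf.copier.lm)
se <- summary(bonf.copier.lm)$coefficients[, "Std. Error"]
rbind(lower = est - B * se, upper = est + B * se)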

c.

A consultant has suggested that \({\beta_0}\) should be 0 and \({\beta_1}\) should equal 14.0. Do your joint confidence intervals in part (b) support this view?

Yes, a \({\beta_0}\) of 0 and a \({\beta_1}\) of 14 both fall within the joint confidence intervals found in part (b).

Part 3 Chapter 4, problem 4.12 and 4.13

Typographical errors. Shown below are the number of galleys for a manuscript (X) and the total dollar cost of correcting typographical errors (Y) in a random sample of recent orders handled by a firm specializing in technical manuscripts. Since Y involves variable costs only, an analyst wished to determine whether the regression-through-the-origin model (4.10) is appropriate for studying the relation between the two variables.

galleys.x <- c(7, 12, 10, 10, 14, 25, 30, 25, 18, 10, 4, 6)
cost.y <- c(128, 213, 191, 178, 250 , 446, 540, 457, 324, 177, 75, 107)

a.

Fit regression model (4.10) and state the estimated regression function.

typos.lm <- lm(cost.y~galleys.x-1)
typos.lm
## 
## Call:
## lm(formula = cost.y ~ galleys.x - 1)
## 
## Coefficients:
## galleys.x  
##     18.03

\({\hat{Y}=18.03X}\)

b.

Plot the estimated regression function and the data. Does a linear regression function through the origin appear to provide a good fit here? Comment.

plot(galleys.x,cost.y,xlab= "Galleys", ylab="Cost")
abline(typos.lm)

The regression line through the origin appears to have a good fit.

c.

In estimating costs of handling prospective orders, management has used a standard of 17.50 per galley for the cost of correcting typographical errors. Test whether or not this standard should be revised; use \({\alpha=0.02}\). State the alternatives, decision rule, and conclusion.

historical.norm <- data.frame(galleys.x=1)
alpha <- 0.02
typos.int <- predict.lm(typos.lm, newdata = historical.norm , interval = "confidence", level = 1-alpha)
typos.int
##       fit      lwr      upr
## 1 18.0283 17.81226 18.24435

Alternatives: \(H_0: \beta_1=\beta_{10}=17.50\)
\(H_a: \beta_1\ne17.50\)
CI: \(17.81226\le E[Y_{h}]\le18.24435\) (at \(X_h=1\), \(E[Y_h]=\beta_1\))
Decision rule:
If \(\beta_{10}=17.50\) falls within the confidence interval for \(E[Y_h]\), conclude \(H_0\); if it falls outside, conclude \(H_A\).
Conclusion:
Since \({17.50<17.81226}\), the standard falls below the interval; we conclude \({H_A}\), so the standard of $17.50 per galley should be revised.
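An equivalent t test compares \(t^*=(b_1-17.50)/s[b_1]\) with \(t(0.99;11)\) (a sketch; t.star and s.b1 are illustrative names):

b1 <- coef(typos.lm)[1]
s.b1 <- summary(typos.lm)$coefficients[1, 2]    # standard error of b1
t.star <- as.numeric((b1 - 17.50) / s.b1)
c(t.star, qt(0.99, 11))                         # conclude H_A if |t*| exceeds the critical value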

d.

Obtain a prediction interval for the correction cost on a forthcoming job involving 10 galleys. Use a confidence coefficient of 98 percent.

newdata.galleys <- data.frame(galleys.x=10)
typos.pred <- predict(typos.lm, newdata.galleys, level=0.98, interval = "predict", se.fit = TRUE)  
typos.pred
## $fit
##       fit      lwr     upr
## 1 180.283 167.8441 192.722
## 
## $se.fit
## [1] 0.7948375
## 
## $df
## [1] 11
## 
## $residual.scale
## [1] 4.506806

\({\hat{Y_h}=180.283}\)
\({s[pred]=\sqrt{4.506806^2+0.7948375^2}=4.5763}\)
\({180.283\pm2.718(4.5763)}\)
\({167.8441\le Y_{h(new)}\le192.722}\)
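The standard error of prediction and the t multiple can be reconstructed from the components of the predict() output above (a short sketch):

s.pred <- sqrt(typos.pred$residual.scale^2 + typos.pred$se.fit^2)   # about 4.576
qt(0.99, typos.pred$df)                                             # t(0.99; 11), about 2.718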

e.

Obtain the residuals \({e_i}\). Do they sum to zero? Plot the residuals against the fitted values \({\hat{Y_i}}\). What conclusions can be drawn from your plot?

typos.res <- typos.lm$residuals
typos.res
##          1          2          3          4          5          6 
##  1.8018663 -3.3396579 10.7169518 -2.2830482 -2.3962675 -4.7076205 
##          7          8          9         10         11         12 
## -0.8491446  6.2923795 -0.5094868 -3.2830482  2.8867807 -1.1698289
sum(typos.res)
## [1] 3.159876

They do not sum to zero.

plot(typos.lm, which = c(1))
abline(0,0)

The residual plot is quite scattered, with no clear systematic pattern; the spread of the residuals suggests that the fitted values do not track the individual observations especially closely.

f.

Conduct a formal test for lack of fit of linear regression through the origin; use \({\alpha=0.01}\). State the alternatives, decision rule, and conclusion. What is the P-value of the test?

alpha <- 0.01
n.galleys <- length(cost.y)
c.galleys <- length(unique(galleys.x))
df1 <- c.galleys-1
df2 <- n.galleys-c.galleys
f.stat <- qf(1-alpha,c.galleys-1,n.galleys-c.galleys)
pureErrorAnova(typos.lm)
## Analysis of Variance Table
## 
## Response: cost.y
##              Df  Sum Sq Mean Sq    F value    Pr(>F)    
## galleys.x     1 1044939 1044939 17177.0725 9.794e-07 ***
## Residuals    11     223      20                         
##  Lack of fit  8      41       5     0.0841    0.9974    
##  Pure Error   3     182      61                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Alternatives: \(H_0: E[Y]=\beta_1X\)
\(H_a: E[Y]\ne\beta_1X\)
\(F(1-\alpha;c-1,n-c)=F(0.99;8,3)=27.48918\)
Decision rule:
If \({F^*\le27.48918}\), conclude \({H_0}\)
If \({F^*>27.48918}\), conclude \({H_a}\)
Conclusion:
Since \(F^*=0.0841\le27.48918\), we conclude \(H_0\): there is no evidence of lack of fit. The P-value of the test is 0.9974 (the Pr(>F) reported for the lack-of-fit line in the ANOVA table).

Part 4, Chapter 3, problem 3.17

Sales growth. A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data are as follows, where X is the year (coded) and Y is sales in thousands of units:

year.sales <- c(0:9)
sales.thousand <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)
Product.sales <- cbind(year.sales, sales.thousand)  # combine X and Y as a two-column matrix
library(apricom)
## Warning: package 'apricom' was built under R version 3.4.2

a.

Prepare a scatter plot of the data. Does a linear relation appear adequate here?

plot(year.sales,sales.thousand, xlab = "Sales Year", ylab = "Sales (in thousands)")

A linear relation possibly appears adequate here.

b.

Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of Y. Evaluate SSE for \({\lambda =.3, .4, .5, .6, .7}\). What transformation of Y is suggested?

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
sales.lm <- lm(sales.thousand~year.sales)
sales.bc <- boxcox(sales.lm)

#Evaluating SSE
sales.3 <- (1/(.3*(prod(sales.thousand)^(1/length(sales.thousand)))^(.3-1)))*(sales.thousand^.3-1)
SSE.3 <- anova(lm(sales.3~year.sales))
SSE.3$`Sum Sq`[2]
## [1] 1099.709
sales.4 <- (1/(.4*(prod(sales.thousand)^(1/length(sales.thousand)))^(.4-1)))*(sales.thousand^.4-1)
SSE.4 <- anova(lm(sales.4~year.sales))
SSE.4$`Sum Sq`[2]
## [1] 967.9088
sales.5 <- (1/(.5*(prod(sales.thousand)^(1/length(sales.thousand)))^(.5-1)))*(sales.thousand^.5-1)
SSE.5 <- anova(lm(sales.5~year.sales))
SSE.5$`Sum Sq`[2]
## [1] 916.4048
sales.6 <- (1/(.6*(prod(sales.thousand)^(1/length(sales.thousand)))^(.6-1)))*(sales.thousand^.6-1)
SSE.6 <- anova(lm(sales.6~year.sales))
SSE.6$`Sum Sq`[2]
## [1] 942.4498
sales.7 <- (1/(.7*(prod(sales.thousand)^(1/length(sales.thousand)))^(.7-1)))*(sales.thousand^.7-1)
SSE.7 <- anova(lm(sales.7~year.sales))
SSE.7$`Sum Sq`[2]
## [1] 1044.238
SSE.table <- cbind(c(.3,.4,.5,.6,.7),c(SSE.3$`Sum Sq`[2],SSE.4$`Sum Sq`[2],SSE.5$`Sum Sq`[2],SSE.6$`Sum Sq`[2],SSE.7$`Sum Sq`[2]))
SSE.table
##      [,1]      [,2]
## [1,]  0.3 1099.7093
## [2,]  0.4  967.9088
## [3,]  0.5  916.4048
## [4,]  0.6  942.4498
## [5,]  0.7 1044.2384
#Solving for Peak Lambda
peak.lambda <- sales.bc$x[which.max(sales.bc$y)]
peak.lambda
## [1] 0.5050505

The SSE is minimized near \(\lambda=0.5\), and the Box-Cox profile peaks at \(\lambda=0.5050505\), so the square-root transformation of Y (\(\lambda=0.5\)) is suggested.
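The five SSE evaluations above can also be written with a single helper function (a compact sketch of the same standardized Box-Cox computation; sse.for.lambda is an illustrative name):

K2 <- prod(sales.thousand)^(1/length(sales.thousand))   # geometric mean of Y
sse.for.lambda <- function(lambda) {
  K1 <- 1 / (lambda * K2^(lambda - 1))
  W <- K1 * (sales.thousand^lambda - 1)                 # standardized transformed response
  sum(resid(lm(W ~ year.sales))^2)                      # SSE for this lambda
}
sapply(c(.3, .4, .5, .6, .7), sse.for.lambda)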

c.

Use the transformation \({Y'=\sqrt{Y}}\) and obtain the estimated linear regression function for the transformed data.

sales.prime <- sqrt(sales.thousand)
sales.prime.lm <- lm(sales.prime~year.sales) 
sales.prime.lm
## 
## Call:
## lm(formula = sales.prime ~ year.sales)
## 
## Coefficients:
## (Intercept)   year.sales  
##      10.261        1.076

\({\hat{Y'}=10.26093+1.076X}\)

d.

Plot the estimated regression line and the transformed data. Does the regression line appear to be a good fit to the transformed data?

plot(year.sales,sales.prime, xlab = "Sales Year", ylab = "Square root of sales (in thousands)", main = "Transformed Sales Data")
abline(sales.prime.lm)

The linear regression appears to be a good fit for the transformed data.

e.

Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?

sales.res <- sales.prime.lm$residuals
sales.fitted <- sales.prime.lm$fitted.values
plot(sales.prime.lm, which= c(1,2))

The residuals-versus-fitted-values plot shows the residuals scattered fairly evenly around zero with no obvious pattern, indicating that the transformed model fits well. The residuals also sum to zero, as they must for a least squares fit that includes an intercept.

The Q-Q plot helps us determine whether the standardized residuals from the linear model are approximately normally distributed. The points do not lie perfectly on the reference line, but they follow it closely enough that we conclude the residuals are approximately normal.
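As a rough numerical supplement to the Q-Q plot, one could compute the correlation between the residuals and their expected values under normality (a sketch using Blom-type plotting positions; exp.norm is an illustrative name):

n.sales <- length(sales.res)
exp.norm <- qnorm((rank(sales.res) - 0.375) / (n.sales + 0.25))   # expected normal scores (up to scale)
cor(sales.res, exp.norm)                                          # values near 1 support normality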

f.

Express the estimated regression function in the original units.

sales.prime.lm
## 
## Call:
## lm(formula = sales.prime ~ year.sales)
## 
## Coefficients:
## (Intercept)   year.sales  
##      10.261        1.076

\({\hat{Y}=(10.261+1.076X)^2}\)

Part 5 Chapter 3, problem 3.30

Refer to Sales growth problem.

a.

Divide the range of the predictor variable (coded years) into five bands of width 2.0, as follows: Band 1 ranges from \({X=-0.5}\) to \({X=1.5}\); Band 2 ranges from \({X=1.5}\) to \({X=3.5}\); and so on. Determine the median value of X and the median value of Y in each band and develop the band smooth by connecting the five pairs of medians by straight lines on a scatter plot of the data. Does the band smooth suggest that the regression relation is linear? Discuss.

year.sales
##  [1] 0 1 2 3 4 5 6 7 8 9
sales.thousand
##  [1]  98 135 162 178 221 232 283 300 374 395
band.1x <- median(c(year.sales[1],year.sales[2]))
band.2x <- median(c(year.sales[3],year.sales[4]))
band.3x <- median(c(year.sales[5],year.sales[6]))
band.4x <- median(c(year.sales[7],year.sales[8]))
band.5x <- median(c(year.sales[9],year.sales[10]))
band.1y <- median(c(sales.thousand[1],sales.thousand[2]))
band.2y <- median(c(sales.thousand[3],sales.thousand[4]))
band.3y <- median(c(sales.thousand[5],sales.thousand[6]))
band.4y <- median(c(sales.thousand[7],sales.thousand[8]))
band.5y <- median(c(sales.thousand[9],sales.thousand[10]))
band.1 <- cbind(band.1x,band.1y)
band.2 <- cbind(band.2x,band.2y)
band.3 <- cbind(band.3x,band.3y)
band.4 <- cbind(band.4x,band.4y)
band.5 <- cbind(band.5x,band.5y)
bands <- rbind(band.1,band.2,band.3,band.4, band.5)
plot(year.sales,sales.thousand)
lines(bands)

The band smooth suggests a mostly linear relationship; however, over the last few years of data the smooth bends upward, suggesting some curvilinearity.
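The same band medians can be computed more generally with cut() and tapply() (a sketch reproducing the construction above):

band <- cut(year.sales, breaks = seq(-0.5, 9.5, by = 2))    # five bands of width 2.0
band.x <- tapply(year.sales, band, median)
band.y <- tapply(sales.thousand, band, median)
plot(year.sales, sales.thousand, xlab = "Sales Year", ylab = "Sales (in thousands)")
lines(band.x, band.y)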

b.

Create a series of seven overlapping neighborhoods of width 3.0 beginning with \({X=-0.5}\). The first neighborhood will range from \({X=-0.5}\) to \({X=2.5}\); the second neighborhood ranges from \({X=0.5}\) to \({X=3.5}\) and so on. For each of the seven overlapping neighborhoods, fit a linear regression function and obtain the fitted value \({\hat{Y_c}}\) at the center \({X_c}\) of the neighborhood. Develop a simplified version of the lowess smooth by connecting the seven \({(X_c,\hat{Y_c})}\) pairs by straight lines on a scatter plot of the data.

neighbor.1x <- median(c(year.sales[1],year.sales[2], year.sales[3]))
neighbor.2x <- median(c(year.sales[2],year.sales[3], year.sales[4]))
neighbor.3x <- median(c(year.sales[3],year.sales[4], year.sales[5]))
neighbor.4x <- median(c(year.sales[4],year.sales[5], year.sales[6]))
neighbor.5x <- median(c(year.sales[5],year.sales[6], year.sales[7]))
neighbor.6x <- median(c(year.sales[6],year.sales[7], year.sales[8]))
neighbor.7x <- median(c(year.sales[7],year.sales[8], year.sales[9]))
neighbor.1y <- median(c(sales.thousand[1],sales.thousand[2], sales.thousand[3]))
neighbor.2y <- median(c(sales.thousand[2],sales.thousand[3], sales.thousand[4]))
neighbor.3y <- median(c(sales.thousand[3],sales.thousand[4], sales.thousand[5]))
neighbor.4y <- median(c(sales.thousand[4],sales.thousand[5], sales.thousand[6]))
neighbor.5y <- median(c(sales.thousand[5],sales.thousand[6], sales.thousand[7]))
neighbor.6y <- median(c(sales.thousand[6],sales.thousand[7], sales.thousand[8]))
neighbor.7y <- median(c(sales.thousand[7],sales.thousand[8], sales.thousand[9]))
neighbor.lm.1 <- lm(neighbor.1y~neighbor.1x)
neighbor.lm.1
## 
## Call:
## lm(formula = neighbor.1y ~ neighbor.1x)
## 
## Coefficients:
## (Intercept)  neighbor.1x  
##         135           NA
neighbor.lm.1$fitted.values
##   1 
## 135
neighbor.lm.2 <- lm(neighbor.2y~neighbor.2x)
neighbor.lm.2
## 
## Call:
## lm(formula = neighbor.2y ~ neighbor.2x)
## 
## Coefficients:
## (Intercept)  neighbor.2x  
##         162           NA
neighbor.lm.2$fitted.values
##   1 
## 162
neighbor.lm.3 <- lm(neighbor.3y~neighbor.3x)
neighbor.lm.3
## 
## Call:
## lm(formula = neighbor.3y ~ neighbor.3x)
## 
## Coefficients:
## (Intercept)  neighbor.3x  
##         178           NA
neighbor.lm.3$fitted.values
##   1 
## 178
neighbor.lm.4 <- lm(neighbor.4y~neighbor.4x)
neighbor.lm.4
## 
## Call:
## lm(formula = neighbor.4y ~ neighbor.4x)
## 
## Coefficients:
## (Intercept)  neighbor.4x  
##         221           NA
neighbor.lm.4$fitted.values
##   1 
## 221
neighbor.lm.5 <- lm(neighbor.5y~neighbor.5x)
neighbor.lm.5
## 
## Call:
## lm(formula = neighbor.5y ~ neighbor.5x)
## 
## Coefficients:
## (Intercept)  neighbor.5x  
##         232           NA
neighbor.lm.5$fitted.values
##   1 
## 232
neighbor.lm.6 <- lm(neighbor.6y~neighbor.6x)
neighbor.lm.6
## 
## Call:
## lm(formula = neighbor.6y ~ neighbor.6x)
## 
## Coefficients:
## (Intercept)  neighbor.6x  
##         283           NA
neighbor.lm.6$fitted.values
##   1 
## 283
neighbor.lm.7 <- lm(neighbor.7y~neighbor.7x)
neighbor.lm.7
## 
## Call:
## lm(formula = neighbor.7y ~ neighbor.7x)
## 
## Coefficients:
## (Intercept)  neighbor.7x  
##         300           NA
neighbor.lm.7$fitted.values
##   1 
## 300
neighbors<- rbind(neighbor.lm.1$fitted.values,neighbor.lm.2$fitted.values,neighbor.lm.3$fitted.values,neighbor.lm.4$fitted.values, neighbor.lm.5$fitted.values, neighbor.lm.6$fitted.values, neighbor.lm.7$fitted.values)
plot(neighbors, type = "l", xlab = "Neighborhood", ylab = "Neighborhood Fitted Values", main = "Simplified Lowess Smooth Curve")
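The construction above connects neighborhood medians. The problem statement instead asks for a linear regression fit within each width-3.0 neighborhood, with the fitted value taken at the neighborhood center \({X_c}\); a sketch of that variant (assuming centers \(X_c=1,\dots,7\); lowess.simple is an illustrative name):

centers <- 1:7
lowess.simple <- sapply(centers, function(xc) {
  in.nbhd <- abs(year.sales - xc) <= 1.5                     # width-3.0 neighborhood around X_c
  fit <- lm(sales.thousand ~ year.sales, subset = in.nbhd)
  predict(fit, data.frame(year.sales = xc))                  # fitted value at the center
})
plot(year.sales, sales.thousand, xlab = "Sales Year", ylab = "Sales (in thousands)")
lines(centers, lowess.simple)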

c.

Obtain the 95 percent confidence band for the true regression line and plot it on the plot prepared in part (b). Does the simplified lowess smooth fall entirely within the confidence band for the regression line? What does this tell you about the appropriateness of the linear regression function?

sales.confidence.int <- predict(sales.prime.lm, interval = "confidence", level = 0.95, se.fit = TRUE)  # pointwise limits, on the square-root (transformed) scale
confidence.ll <- cbind(c(1:10), sales.confidence.int$fit[,2])
confidence.ul <- cbind(c(1:10), sales.confidence.int$fit[,3])
plot(neighbors, type = "l", xlab = "Neighborhood", ylab = "Neighborhood Fitted Values", main = "Simplified Lowess Smooth Curve")
lines(confidence.ll, lty = "dotted")
lines(confidence.ul, lty = "dotted")

For the most part, the simplified lowess smooth remains within the confidence band; therefore, the linear regression function appears appropriate.
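One common reading of the "95 percent confidence band for the true regression line" is the Working-Hotelling band. A sketch of that band on the original scale, reusing sales.lm from Part 4 (an assumption, since the plot above used pointwise limits from the transformed fit):

pred.band <- predict(sales.lm, se.fit = TRUE)
W <- sqrt(2 * qf(0.95, 2, length(year.sales) - 2))   # Working-Hotelling multiple
plot(year.sales, sales.thousand, xlab = "Sales Year", ylab = "Sales (in thousands)")
abline(sales.lm)
lines(year.sales, pred.band$fit - W * pred.band$se.fit, lty = "dotted")
lines(year.sales, pred.band$fit + W * pred.band$se.fit, lty = "dotted")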

Part 6 Chapter 4, problem 4.21-4.24

When the predictor variable is so coded that \({\bar{X}=0}\) and the normal error regression model (2.1) applies, are \({b_0}\) and \({b_1}\) independent? Are the joint confidence intervals for \({\beta_0}\) and \({\beta_1}\) then independent?

From equation (4.5), when \({\bar{X}=0}\) we have \({\sigma[b_0,b_1]=-\bar{X}\sigma^2[b_1]=0}\), so \({b_0}\) and \({b_1}\) are uncorrelated. Under the normal error model (2.1) the estimators are jointly normal, so zero covariance implies that they are independent. Since the point estimators of \({\beta_0}\) and \({\beta_1}\) are independent, the joint confidence intervals are independent as well.
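A small numerical illustration using the copier data from Part 1 (a sketch; x.centered is an illustrative name): centering X so that \({\bar{X}=0}\) makes the off-diagonal entry of the estimated variance-covariance matrix essentially zero.

x.centered <- copier.number - mean(copier.number)   # coded so that Xbar = 0
vcov(lm(wait.time ~ x.centered))                    # off-diagonal (covariance of b0 and b1) is ~0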

a.

Derive an extension of the Bonferroni inequality (4.2a) for the case of three statements, each with statement confidence coefficient \({1-\alpha}\).

Let \(A_i\) denote the event that the \(i\)th interval is incorrect, so that \(P(A_i)=\alpha\) and \(\bar{A_i}\) is the event that it is correct.
\(P(\bar{A_1}\cap\bar{A_2}\cap\bar{A_3}) = 1-P(A_1 \cup A_2 \cup A_3)\)
\(P(A_1 \cup A_2 \cup A_3) \le P(A_1)+P(A_2)+P(A_3) = 3\alpha\)
\(P(\bar{A_1}\cap\bar{A_2}\cap\bar{A_3}) \geq 1-P(A_1)-P(A_2)-P(A_3)\)
\(P(\bar{A_1}\cap\bar{A_2}\cap\bar{A_3}) \geq 1-3\alpha\)

b.

Show that for the fitted least squares regression line through the origin \({\sum{X_ie_i}=0}\).
\(\sum{X_i e_i}=\sum{X_i(Y_i-b_1X_i)}\)
\({\sum{X_i e_i}=\sum{X_iY_i-b_1X_i^2}}\)
\({\sum{X_i e_i}=\sum{X_iY_i}-b_1\sum{X_i^2}}\)
\({\sum{X_i e_i}=\sum{X_iY_i}-\frac{\sum{X_iY_i}}{\sum{X_i^2}}\sum{X_i^2}}\)
\({\sum{X_i e_i}=\sum{X_iY_i}-\sum{X_iY_i}}\)
Hence, \({\sum{X_i e_i}=0}\)

c.

Show that \({\hat{Y}}\) as defined for linear regression through the origin is an unbiased estimator of \({E[Y]}\).
\(\hat{Y_i}=b_1X_i\), where \(b_1=\frac{\sum{X_iY_i}}{\sum{X_i^2}}\)
\(E[b_1]=\frac{\sum{X_iE[Y_i]}}{\sum{X_i^2}}=\frac{\sum{X_i(\beta_1X_i)}}{\sum{X_i^2}}=\beta_1\frac{\sum{X_i^2}}{\sum{X_i^2}}=\beta_1\)
\(E[\hat{Y_i}]=X_iE[b_1]=\beta_1X_i=E[Y_i]\)
Hence, \(\hat{Y_i}\) is an unbiased estimator of \(E[Y_i]\).

Part 7 Bonus Chapter 4, problem 4.26

Refer to the CDI data set. Consider the regression relation of number of active physicians to total population.

CDI_data <- read.csv("CDI_data.txt", header = FALSE, sep = ",")
colnames(CDI_data)<- c("NA", "Identification Number", "County", "State", "Land Area", "Total Population", "Percent of Population aged 18-34", "Percent of population 65 or older", "Number of active physicians", "Number of hospital beds", "Total serious crimes", "Percent high school graduates", "Percent bachelor's degrees", "Percent below poverty level", "Percent unemployment", "Per capita income", "Total personal income", "Geographic region")
active.docs <- CDI_data[,9]
total.pop <- CDI_data[,6]
doctors.to.pop.lm <- lm(active.docs~total.pop)
doctors.to.pop.lm
## 
## Call:
## lm(formula = active.docs ~ total.pop)
## 
## Coefficients:
## (Intercept)    total.pop  
##  -1.106e+02    2.795e-03
plot(total.pop, active.docs)
abline(doctors.to.pop.lm)

a.

Obtain Bonferroni joint confidence intervals for \({\beta_0}\) and \({\beta_1}\), using a 95 percent family confidence coefficient.

joint.fam <- 2 #number of joint betas we want.
alpha <- 0.05
joint.level <- alpha/joint.fam
bonf.int <- confint(doctors.to.pop.lm, level = 1-joint.level)
bonf.int
##                    1.25 %       98.75 %
## (Intercept) -1.887833e+02 -32.486285498
## total.pop    2.686636e-03   0.002904214

\(b_0=-110.6\)
\(b_1=0.0028\)
Bonferroni joint confidence intervals:
\(-188.783 \le \beta_0 \le -32.486\)
\(0.00269 \le \beta_1 \le 0.00290\)

b.

An investigator has suggested that \({\beta_0}\) should be -100 and \({\beta_1}\) should be 0.0028. Do the joint confidence intervals in part (a) support this view? Discuss.

The investigator’s suggested values (\(\beta_0=-100\) and \(\beta_1=0.0028\)) both fall within the joint confidence intervals found in part (a), so the intervals support this view.

c.

It is desired to estimate the expected number of active physicians for counties with total population of \({X=500}\), 1,000, and 5,000 thousands with family confidence coefficient 0.90. Which procedure, the Working-Hotelling or the Bonferroni, is more efficient here?

Only three estimates are required here, so the Bonferroni multiple \(B=t(1-0.10/6;438)\approx2.13\) is slightly smaller than the Working-Hotelling multiple \(W=\sqrt{2F(0.90;2,438)}\approx2.15\); the Bonferroni procedure is therefore (slightly) more efficient, as the sketch below shows.
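A sketch of the comparison (g = 3 statements, \(\alpha=0.10\), and the n - 2 = 438 degrees of freedom shown in the regression output):

W <- sqrt(2 * qf(0.90, 2, 438))     # Working-Hotelling multiple
B <- qt(1 - 0.10 / (2 * 3), 438)    # Bonferroni multiple for three statements
c(W = W, B = B)                     # B is slightly smaller, so Bonferroni is more efficient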

d.

Obtain the family of interval estimates required in part (c), using the more efficient procedure. Interpret your confidence intervals.

#500
newdata <- data.frame(total.pop=500)
bonf.conf.int.500 <- predict(doctors.to.pop.lm, newdata, level = 1-joint.level, interval = "confidence", se.fit=TRUE)
bonf.conf.int.500
## $fit
##         fit       lwr       upr
## 1 -109.2371 -187.3558 -31.11832
## 
## $se.fit
## [1] 34.7328
## 
## $df
## [1] 438
## 
## $residual.scale
## [1] 610.0848
#1000
newdata <- data.frame(total.pop=1000)
bonf.conf.int.1000 <- predict(doctors.to.pop.lm, newdata, level = 1-joint.level, interval = "confidence", se.fit = TRUE)
bonf.conf.int.1000
## $fit
##         fit       lwr       upr
## 1 -107.8394 -185.9284 -29.75033
## 
## $se.fit
## [1] 34.71958
## 
## $df
## [1] 438
## 
## $residual.scale
## [1] 610.0848
#5000
newdata <- data.frame(total.pop=5000)
bonf.conf.int.5000 <- predict(doctors.to.pop.lm, newdata, level = 1-joint.level, interval = "confidence", se.fit = TRUE)
bonf.conf.int.5000
## $fit
##         fit       lwr       upr
## 1 -96.65765 -174.5099 -18.80543
## 
## $se.fit
## [1] 34.6143
## 
## $df
## [1] 438
## 
## $residual.scale
## [1] 610.0848

\(X_h=500\)
\(-187.3558\le E[Y_h]\le-31.11832\)

\(X_h=1000\)
\(-185.9284\le E[Y_h]\le-29.75033\)

\(X_h=5000\)
\(-174.5099\le E[Y_h]\le-18.80543\)

As the total population increases, the estimated mean number of active physicians increases, while the width of the confidence interval decreases slightly.
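Note that the intervals above were computed with total.pop equal to 500, 1,000, and 5,000 persons and with the 97.5 percent statement level carried over from part (a). If the CDI total-population column is recorded in persons (an assumption), then \(X_h=500\), 1,000, and 5,000 thousands and the 90 percent family coefficient would correspond to a setup like this sketch (not rerun here):

newdata.thousands <- data.frame(total.pop = c(500, 1000, 5000) * 1000)   # populations expressed in persons
statement.level <- 1 - 0.10/3                                            # Bonferroni statement level for g = 3, alpha = 0.10
predict(doctors.to.pop.lm, newdata.thousands, level = statement.level, interval = "confidence", se.fit = TRUE)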