Model Adequacy Checking

Besides previous methods R is capable of doing linear regression quite easily

#Installing Libraries if needed
#install.packages("tidyverse")
# Loading Libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

library(MASS)

For R to do the analysis first we need a dataframe

# Loading the mtcars data set
cars<-mtcars

we then use the lm command. in the form of

a<-lm(b~c,d)

This is the name that we are going to store the linear regression as.
This is the variable that we are trying to model.
This is the set of variables that we want to predict with. We can use (.) to use all from the data set
This is the name of the data set

# Creating the linear regression 
carslm<-lm(mpg~.,cars)

# Looking at the results of the linear regression
summary(carslm)

## 
## Call:
## lm(formula = mpg ~ ., data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

This output give the estimates for \(\textbf{b}\), their associated t values, the adjusted R-squared, and the F-statistic.

Model Adequacy Checking

Before using a model you must always check two aspects of its adequacy.

Verify that none of the assumptions of the least squares regression are violated. They are:

The error \(\varepsilon\) has mean zero
The error \(\varepsilon\) has constant variance
The error \(\varepsilon_i\) are uncorrelated

Examine the model to see if it is an adequate approximation to the true system.

If it is you are fine
If it is not then you may need to use transformations

Residual Analysis

To verify that none of the assumptions of the least squares regression are violated we focus on the residuals. In particular we will look at two plots

The residual versus fit plot.
The normal probability plot of the residuals.

# Testing plots
par(mfrow=c(2,2))
plot(carslm)

The values for the Residual vs Fitted should look like a random pattern. If it does not then we will need to transform are variables.
The values for the normal probability plot should be fairly close to the straight line. If not then we will need to transform our y values.
The other two graphs are dealing with scaled residuals

Scaling Residuals

In general scaled residuals should lie between \(-3<r<3\) thus if anything is outside that range we can inspect it to look for outliers.

Standardized and Studentized Residuals

Scaling residuals is preferred by some analysts, and it does help with looking for outliers. One of the typical forms is :

\(d_i=\frac{e_i}{\hat{\sigma}}\)

This method works if the standard deviations of each residual is about the same, but if they are not then it will not work too well, so instead it is suggested to use the studentized residual.

\(r_i=\frac{e_i}{\sqrt{\hat{\sigma}^2(1-h_{ii})}}\)
Where \(\textbf{H}=\textbf{X}(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\)

The 3rd and 4th graphs from the previous plot command use the studentized residuals.

# Finding H
x<-cars[,2:11]
X<-as.matrix(x)
H<-X%*%solve(t(X)%*%X)%*%t(X)

# Finding the studentized residuals
carsstudres<-studres(carslm)

R Student

We will often use \(MS_E\) for an estimate for \(\hat{\sigma}^2\), but if one residual is very high then that may skew the MS_E to being higher than it should be. One solution is to have a different estimate for \(\hat{\sigma}^2\) for each residual and then use that to standardize each one.

\(S_{(i)}^2 = \frac{(n-p)MS_E-e_i^2/(1-h_{ii})}{n-p-1}\)
\(t_i=\frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}}\)

This gives actual t-values for each residual that we can then test for being unusual. In actuallity the studres command used before uses this calculation.

Bonferroni

A quick not if you are performing simultaneous inferences, then you need to adjust your \(\alpha\) appropriately, otherwise with an \(\alpha=0.05\) 1 of every 20 things you are checking will appear significant by random chance. so instead of comparing all these t-values to \(t_{\alpha/2,n-p-1}\) you will instead use \(t_{\alpha/(2n),n-p-1}\)

Influence Diagnostics

Another item to consider is if we have any influential points in the model. These are points that could skew the model if they are slightly off from the predicted line if they weren’t in the model.

The way to check influence is to create a model without that one point and compare it to the model that has all the points in it. You must then repeat this for every point. Thankfully R can do this for use using the dffits command.

# Set  n and p
p<-dim(X)[2]+1
n<-dim(X)[1]+0

# Finding influential points
cars[abs(as.numeric(dffits(carslm)))>2*sqrt(p/(n-p)),]

##                 mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 230       22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4

Leverage Points

We are also able to use the \(\textbf{H}\) matrix to determine high leverage points. It does this by looking at each diagonal point and determining whether \(h_{ii}>2p/n\). If this is true then the ith point is said to have high leverage.

# find the rows of the dataset that are influential
cars[diag(H)>(2*p/n),0]

## data frame with 0 columns and 0 rows

Box-Cox Tranformation

When we need to transform it is not always easy to determine what transformation to use. Thankfully the Box-Cox method was created that can help determine which transformation is ideal for your y value. What the Box-Cox method does is to create a bunch of different values of \(\lambda\) and check the \(SS_E\) for each model using the transformation \(y^\lambda\) It then graphs this and shows the user which \(\lambda\) to choose.

Using the command below you will get a box-cox plot.
You will then look for where the center dashed line is.
Then choose the closest reasonable value to that line.

It makes more sense to use \(\lambda=.5\) than \(\lambda=.56\) since you still need to be able to explain the model

If you choose \(\lambda=0\) then your transformation is \(\ln{y}\)
Otherwise, your transformation is \(y^\lambda\)

boxcox(carslm,    # lm or aov objects or formulas
       lambda = seq(-2, 2, 1/10), # Vector of values of lambda
       plotit = TRUE,  # Create a plot or not
       eps = 1/50,     # Tolerance for lambda. Defaults to 0.02.
       xlab = expression(lambda), # X-axis title
       ylab = "log-Likelihood",   # Y-axis title
       )

# Creating the transformed linear regression
lncarslm<-lm(log(mpg)~.,cars)

# Looking at the results of the transformed linear regression
summary(lncarslm)

## 
## Call:
## lm(formula = log(mpg) ~ ., data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14569 -0.07886 -0.01752  0.06524  0.25130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.776e+00  8.492e-01   3.268  0.00367 **
## cyl          7.657e-03  4.741e-02   0.161  0.87326   
## disp         4.989e-05  8.102e-04   0.062  0.95149   
## hp          -8.964e-04  9.877e-04  -0.908  0.37439   
## drat         2.220e-02  7.420e-02   0.299  0.76772   
## wt          -1.723e-01  8.595e-02  -2.005  0.05804 . 
## qsec         3.077e-02  3.316e-02   0.928  0.36401   
## vs          -2.874e-03  9.548e-02  -0.030  0.97627   
## am           4.738e-02  9.331e-02   0.508  0.61693   
## gear         5.925e-02  6.775e-02   0.875  0.39170   
## carb        -2.012e-02  3.760e-02  -0.535  0.59826   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1202 on 21 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8369 
## F-statistic: 16.91 on 10 and 21 DF,  p-value: 6.89e-08

# Assessing the model adequacy
par(mfrow=c(2,2))
plot(lncarslm)

On Your Own

Question 1

The following data were collected on the wear of a bearing y, the oil viscosity \(x_1\), and load \(x_2\)

bearings<-read.csv("./ball bearing.csv")

a

Fit a multiple linear regression model to these data points.

bearings <- read.csv("ball bearing.csv")


# run regression
model <- lm(wear ~ viscosity + load, data = bearings)
summary(model)

## 
## Call:
## lm(formula = wear ~ viscosity + load, data = bearings)
## 
## Residuals:
##       1       2       3       4       5       6 
## -24.987  24.307  11.820 -20.460  12.830  -3.511 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 350.99427   74.75307   4.695   0.0183 *
## viscosity    -1.27199    1.16914  -1.088   0.3562  
## load         -0.15390    0.08953  -1.719   0.1841  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.5 on 3 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.7696 
## F-statistic: 9.353 on 2 and 3 DF,  p-value: 0.05138

Show in New Window

Call: lm(formula = wear ~ viscosity + load, data = bearings)

Residuals: 1 2 3 4 5 6 -24.987 24.307 11.820 -20.460 12.830 -3.511

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 350.99427 74.75307 4.695 0.0183 * viscosity -1.27199 1.16914 -1.088 0.3562
load -0.15390 0.08953 -1.719 0.1841
— Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 25.5 on 3 degrees of freedom Multiple R-squared: 0.8618, Adjusted R-squared: 0.7696 F-statistic: 9.353 on 2 and 3 DF, p-value: 0.05138

b

Test for significance of regression

model <- lm(wear ~ viscosity + load, data = bearings)

#summary
summary(model)

## 
## Call:
## lm(formula = wear ~ viscosity + load, data = bearings)
## 
## Residuals:
##       1       2       3       4       5       6 
## -24.987  24.307  11.820 -20.460  12.830  -3.511 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 350.99427   74.75307   4.695   0.0183 *
## viscosity    -1.27199    1.16914  -1.088   0.3562  
## load         -0.15390    0.08953  -1.719   0.1841  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.5 on 3 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.7696 
## F-statistic: 9.353 on 2 and 3 DF,  p-value: 0.05138

#f statistic and p-value 
f_stat <- summary(model)$fstatistic[1]
df1 <- summary(model)$fstatistic[2]
df2 <- summary(model)$fstatistic[3]

p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

cat("F-statistic:", f_stat, "\n")

## F-statistic: 9.353035

cat("p-value:", p_value, "\n")

## p-value: 0.05138189

# alpha = 0.05
alpha <- 0.05

cat("fail to reject H0: the regression is not statistically significant. p value is greater than alpha.")

## fail to reject H0: the regression is not statistically significant. p value is greater than alpha.

Call: lm(formula = wear ~ viscosity + load, data = bearings)

Residuals: 1 2 3 4 5 6 -24.987 24.307 11.820 -20.460 12.830 -3.511

Residual standard error: 25.5 on 3 degrees of freedom Multiple R-squared: 0.8618, Adjusted R-squared: 0.7696 F-statistic: 9.353 on 2 and 3 DF, p-value: 0.05138

F-statistic: 9.353035 p-value: 0.05138189 fail to reject H0: the regression is not statistically significant.

c

Compute t-statistic for each of the regression coefficients and provide an interpretation.

# fit model
model <- lm(wear ~ viscosity + load, data = bearings)

# get summary
model_summary <- summary(model)

# extract coefficients table
coef_table <- model_summary$coefficients
t_values <- coef_table[, "t value"]
t_values

## (Intercept)   viscosity        load 
##    4.695382   -1.087974   -1.719030

#full table (estimate, std error, t, p-value)
coef_table

##                Estimate  Std. Error   t value   Pr(>|t|)
## (Intercept) 350.9942706 74.75307396  4.695382 0.01826948
## viscosity    -1.2719944  1.16914009 -1.087974 0.35620034
## load         -0.1539042  0.08952967 -1.719030 0.18410102

(Intercept) viscosity load 4.695382 -1.087974 -1.719030 Estimate Std. Error t value Pr(>|t|) (Intercept) 350.9942706 74.75307396 4.695382 0.01826948 viscosity -1.2719944 1.16914009 -1.087974 0.35620034 load -0.1539042 0.08952967 -1.719030 0.18410102

d

Now include an interaction term and perform a regression model.

#model with interaction term
model_int <- lm(wear ~ viscosity * load, data = bearings)
summary(model_int)

## 
## Call:
## lm(formula = wear ~ viscosity * load, data = bearings)
## 
## Residuals:
##       1       2       3       4       5       6 
## -13.025  23.105 -10.521  -7.364  14.478  -6.674 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    125.865548 197.957166   0.636    0.590
## viscosity        7.758641   7.514795   1.032    0.410
## load             0.094304   0.220657   0.427    0.711
## viscosity:load  -0.009186   0.007564  -1.214    0.349
## 
## Residual standard error: 23.69 on 2 degrees of freedom
## Multiple R-squared:  0.9205, Adjusted R-squared:  0.8011 
## F-statistic: 7.714 on 3 and 2 DF,  p-value: 0.1169

Call: lm(formula = wear ~ viscosity * load, data = bearings)

Residuals: 1 2 3 4 5 6 -13.025 23.105 -10.521 -7.364 14.478 -6.674

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 125.865548 197.957166 0.636 0.590 viscosity 7.758641 7.514795 1.032 0.410 load 0.094304 0.220657 0.427 0.711 viscosity:load -0.009186 0.007564 -1.214 0.349

Residual standard error: 23.69 on 2 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011 F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169

e

Test for significance of regression.

#modl with interaction
model_int <- lm(wear ~ viscosity * load, data = bearings)
summary(model_int)

## 
## Call:
## lm(formula = wear ~ viscosity * load, data = bearings)
## 
## Residuals:
##       1       2       3       4       5       6 
## -13.025  23.105 -10.521  -7.364  14.478  -6.674 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    125.865548 197.957166   0.636    0.590
## viscosity        7.758641   7.514795   1.032    0.410
## load             0.094304   0.220657   0.427    0.711
## viscosity:load  -0.009186   0.007564  -1.214    0.349
## 
## Residual standard error: 23.69 on 2 degrees of freedom
## Multiple R-squared:  0.9205, Adjusted R-squared:  0.8011 
## F-statistic: 7.714 on 3 and 2 DF,  p-value: 0.1169

#getting the F-statistic and p-value
f_stat <- summary(model_int)$fstatistic[1]
df1 <- summary(model_int)$fstatistic[2]
df2 <- summary(model_int)$fstatistic[3]

p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
cat("F-statistic:", f_stat, "\n")

## F-statistic: 7.714138

cat("p-value:", p_value, "\n")

## p-value: 0.116915

cat("Fail to reject H0: The regression with interaction is not statistically significant. alpha is less than p value.")

## Fail to reject H0: The regression with interaction is not statistically significant. alpha is less than p value.

Call: lm(formula = wear ~ viscosity * load, data = bearings)

Residuals: 1 2 3 4 5 6 -13.025 23.105 -10.521 -7.364 14.478 -6.674

Residual standard error: 23.69 on 2 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011 F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169

F-statistic: 7.714138 p-value: 0.116915 Fail to reject H0: The regression with interaction is not statistically significant.

f

Compute a t-statistics for each of the regression coefficients and provide an interpretation. Specifically, does the model require an interaction term?

#interaction model
model_int <- lm(wear ~ viscosity * load, data = bearings)

#t-stats and p-values
summary(model_int)

## 
## Call:
## lm(formula = wear ~ viscosity * load, data = bearings)
## 
## Residuals:
##       1       2       3       4       5       6 
## -13.025  23.105 -10.521  -7.364  14.478  -6.674 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    125.865548 197.957166   0.636    0.590
## viscosity        7.758641   7.514795   1.032    0.410
## load             0.094304   0.220657   0.427    0.711
## viscosity:load  -0.009186   0.007564  -1.214    0.349
## 
## Residual standard error: 23.69 on 2 degrees of freedom
## Multiple R-squared:  0.9205, Adjusted R-squared:  0.8011 
## F-statistic: 7.714 on 3 and 2 DF,  p-value: 0.1169

#coefficient table
coef_table <- summary(model_int)$coefficients

# view t-statistics
t_values <- coef_table[, "t value"]


t_values

##    (Intercept)      viscosity           load viscosity:load 
##      0.6358221      1.0324488      0.4273783     -1.2144702

coef_table

##                     Estimate   Std. Error    t value  Pr(>|t|)
## (Intercept)    125.865548338 1.979572e+02  0.6358221 0.5899432
## viscosity        7.758641128 7.514795e+00  1.0324488 0.4103613
## load             0.094303967 2.206569e-01  0.4273783 0.7107188
## viscosity:load  -0.009185794 7.563622e-03 -1.2144702 0.3485016

Residual standard error: 23.69 on 2 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011 F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169

(Intercept) viscosity load viscosity:load 0.6358221 1.0324488 0.4273783 -1.2144702 Estimate Std. Error t value Pr(>|t|) (Intercept) 125.865548338 1.979572e+02 0.6358221 0.5899432 viscosity 7.758641128 7.514795e+00 1.0324488 0.4103613 load 0.094303967 2.206569e-01 0.4273783 0.7107188 viscosity:load -0.009185794 7.563622e-03 -1.2144702 0.3485016

g

Use the partial F-test procedure to determine whether the model requires an interaction term. Is this procedure equivalent to the t-test computed in part f? both say the interaction term is not significant, so the model does not need an interaction term.

# reduced model (no interaction)
model_red <- lm(wear ~ viscosity + load, data = bearings)

# full model (with interaction)
model_full <- lm(wear ~ viscosity * load, data = bearings)

# partial F-test
anova(model_red, model_full)

## Analysis of Variance Table
## 
## Model 1: wear ~ viscosity + load
## Model 2: wear ~ viscosity * load
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1      3 1950.4                           
## 2      2 1122.6  1    827.86 1.4749 0.3485

Model 1: wear ~ viscosity + load Model 2: wear ~ viscosity * load Res.Df RSS Df Sum of Sq F Pr(>F) 1 3 1950.4
2 2 1122.6 1 827.86 1.4749 0.3485

h

Compute the estimate for \(y\) for the point \(x_1=25\) and \(x_2=1000\) using both models The predicted wear at𝑥1=25 and x2=1000 using the model without interaction is 165.29, while the prediction using the model with interaction is 184.49.

#newdata point
new_point <- data.frame(viscosity = 25, load = 1000)

#model without interaction
model_red <- lm(wear ~ viscosity + load, data = bearings)

#model with interaction
model_full <- lm(wear ~ viscosity * load, data = bearings)

#predictions
predict(model_red, new_point)

##        1 
## 165.2902

predict(model_full, new_point)

##        1 
## 184.4907

i

Find a 95% confidence interval on the mean response at this point for both models. Which interval is narrower? Does this provide any insight about which model is preferable? The 95% confidence interval for the model without interaction is narrower than the interval for the model with interaction, so it shows that the simpler model provides a more precise estimate of the mean response.

#new data point
new_point <- data.frame(viscosity = 25, load = 1000)

#models
model_red <- lm(wear ~ viscosity + load, data = bearings)
model_full <- lm(wear ~ viscosity * load, data = bearings)

# 95% CI for mean response
ci_red <- predict(model_red, new_point, interval = "confidence", level = 0.95)
ci_full <- predict(model_full, new_point, interval = "confidence", level = 0.95)

# print results
ci_red

##        fit      lwr      upr
## 1 165.2902 128.2709 202.3094

ci_full

##        fit      lwr      upr
## 1 184.4907 102.0899 266.8914

Question 2

An engineer at a semiconductor company wants to model the relationship between the device gain or hFE(\(y\)) and three parameters: emitter RS(\(x_1\)), base-RS(\(x_2\)), and emitter-to-base-RS(\(x_3\)). The data are shown below.

hfe<-read.csv("./hFE.csv")

a

Fit a multiple linear regression model to the data.

# load data
hfe <- read.csv("hFE.csv")

# clean column names (safer)
colnames(hfe) <- c("emitter", "base", "eb", "hfe")

# fit multiple linear regression model
model_hfe <- lm(hfe ~ emitter + base + eb, data = hfe)

# view results
summary(model_hfe)

## 
## Call:
## lm(formula = hfe ~ emitter + base + eb, data = hfe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9867 -2.3942  0.2606  1.8955  7.8847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.7960    51.7695  -0.402    0.694    
## emitter      -3.2664     3.6021  -0.907    0.379    
## base          0.2389     0.2761   0.865    0.400    
## eb           20.3707     1.2631  16.128 6.95e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.158 on 15 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9898 
## F-statistic:   586 on 3 and 15 DF,  p-value: 9.219e-16

The Ftest gives a p-value that is much less than 0.05. Therefore, we reject the null hypothesis and conclude that the regression is statistically significant. At least one predictor is significantly related to hFE. ##### b Predict hFE when \(x_1=14.5\), \(x_2=220\), and \(x_3=5.0\)

# new data point
new_point <- data.frame(emitter = 14.5, base = 220, eb = 5.0)

# prediction
predict(model_hfe, new_point)

##       1 
## 86.2593

c

Test for significance of regression using the analysis of variance with \(\alpha=0.05\). What conclusions can you draw?

#extract F-test from summary
summary(model_hfe)

## 
## Call:
## lm(formula = hfe ~ emitter + base + eb, data = hfe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9867 -2.3942  0.2606  1.8955  7.8847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.7960    51.7695  -0.402    0.694    
## emitter      -3.2664     3.6021  -0.907    0.379    
## base          0.2389     0.2761   0.865    0.400    
## eb           20.3707     1.2631  16.128 6.95e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.158 on 15 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9898 
## F-statistic:   586 on 3 and 15 DF,  p-value: 9.219e-16

d

Estimate \(\sigma^2\) for the model you have fit to the data.

# estimate of sigma^2
sigma2_hat <- summary(model_hfe)$sigma^2
sigma2_hat

## [1] 17.29024

e

Find the standard errors of the regression coefficients.

#coefficient table
coef_table <- summary(model_hfe)$coefficients
coef_table

##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -20.7959544 51.7694814 -0.4017030 6.935678e-01
## emitter      -3.2664295  3.6021456 -0.9068011 3.788490e-01
## base          0.2389327  0.2761206  0.8653199 4.004914e-01
## eb           20.3706583  1.2630959 16.1275624 6.948639e-11

Estimate Std. Error t value Pr(>|t|) (Intercept) -20.7959544 51.7694814 -0.4017030 6.935678e-01 emitter -3.2664295 3.6021456 -0.9068011 3.788490e-01 base 0.2389327 0.2761206 0.8653199 4.004914e-01 eb 20.3706583 1.2630959 16.1275624 6.948639e-11

f

Calculate the t-statistic for each regression coefficient. Using \(\alpha=0.05\), what conclusions can you draw? The t-statistics indicate that only the coefficient for emitter-to-base RS is statistically significant (p < 0.05). The coefficients for emitter RS and base RS are not statistically significant, so hFE is primarily influenced by emitter-to-base RS.

g

Find 99% confidence intervals on the regression coefficients.

# 99% confidence intervals
confint(model_hfe, level = 0.99)

##                    0.5 %     99.5 %
## (Intercept) -173.3457523 131.753843
## emitter      -13.8809182   7.348059
## base          -0.5747156   1.052581
## eb            16.6486773  24.092639

0.5 % 99.5 % (Intercept) -173.3457523 131.753843 emitter -13.8809182 7.348059 base -0.5747156 1.052581 eb 16.6486773 24.092639

h

Find a 99% prediction interval on hFE when \(x_1=14.5\), \(x_2=220\), and \(x_3=5.0\).

new_point <- data.frame(emitter = 14.5, base = 220, eb = 5.0)
# 99% prediction interval
predict(model_hfe, new_point, interval = "prediction", level = 0.99)

##       fit      lwr    upr
## 1 86.2593 71.04857 101.47

fit lwr upr 1 86.2593 71.04857 101.47

i

Find a 99% confidence interval on mean hFE when \(x_1=14.5\), \(x_2=220\), and \(x_3=5.0\)

#99% confidence interval for mean response
new_point <- data.frame(emitter = 14.5, base = 220, eb = 5.0)
predict(model_hfe, new_point, interval = "confidence", level = 0.99)

##       fit      lwr      upr
## 1 86.2593 77.24635 95.27225

fit lwr upr 1 86.2593 77.24635 95.27225

Question 3

Use the hFE data to complete the following:

hfe<-read.csv("./hFE data.csv")

Plot the residuals from this model versus \(\hat{y}\). Comment on the information in this plot.

# fit model
colnames(hfe) <- c("emitter", "base", "eb", "hfe")
model_hfe <- lm(hfe ~ emitter + base + eb, data = hfe)

#residuals vs fitted plot
plot(fitted(model_hfe), resid(model_hfe),
     xlab = "Fitted Values (ŷ)",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

The residuals are scattered around zero without a strong systematic pattern, which suggests that the linear model is generally appropriate

What is the value of \(R^2\) for this model?

summary(model_hfe)$r.squared

## [1] 0.9915393

The coefficient of determination is R2=0.9915, the model explains 99.15% of the variation in hFE.

Refit the model using \(\ln{hFE}\) as the response variable.

# refit model using log(hFE)
model_log <- lm(log(hfe) ~ emitter + base + eb, data = hfe)

# view results
summary(model_log)

## 
## Call:
## lm(formula = log(hfe) ~ emitter + base + eb, data = hfe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17084 -0.01303  0.01756  0.05101  0.09894 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.670571   1.030468   5.503 6.07e-05 ***
## emitter     -0.102152   0.071700  -1.425    0.175    
## base        -0.002780   0.005496  -0.506    0.620    
## eb           0.183030   0.025142   7.280 2.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08277 on 15 degrees of freedom
## Multiple R-squared:  0.963,  Adjusted R-squared:  0.9556 
## F-statistic:   130 on 3 and 15 DF,  p-value: 5.852e-11

Call: lm(formula = log(hfe) ~ emitter + base + eb, data = hfe)

Residuals: Min 1Q Median 3Q Max -0.17084 -0.01303 0.01756 0.05101 0.09894

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.670571 1.030468 5.503 6.07e-05 emitter -0.102152 0.071700 -1.425 0.175
base -0.002780 0.005496 -0.506 0.620
eb 0.183030 0.025142 7.280 2.70e-06 — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08277 on 15 degrees of freedom Multiple R-squared: 0.963, Adjusted R-squared: 0.9556 F-statistic: 130 on 3 and 15 DF, p-value: 5.852e-11

Plot the residuals versus predicted \(\ln{hFE}\) for the model in part d above. Does this give us any information about which of the models is preferable?

# residuals vs fitted for log model
plot(fitted(model_log), resid(model_log),
     xlab = "Fitted Values (ln(hFE))",
     ylab = "Residuals",
     main = "Residuals vs Fitted (Log Model)")
abline(h = 0, col = "red")

The log transformation improved the model assumptions, especially the constant variance assumption.

Plot the residuals from the model in part e, versus the regressor \(x_3\). Comment on this plot.

# residuals vs x3 (eb) for log model
plot(hfe$eb, resid(model_log),
     xlab = "Emitter-to-Base RS (x3)",
     ylab = "Residuals",
     main = "Residuals vs x3 (Log Model)")
abline(h = 0, col = "red")

The residuals appear randomly scattered around zero with no clear pattern when plotted against x3.

Refit the model to \(\ln{hFE}\) using \(x_1\), \(x_2\), and \(1/x_3\) as the regressors. Comment on the effect of this change to the model.

# create transformed variable
hfe$inv_eb <- 1 / hfe$eb

# fit new model
model_inv <- lm(log(hfe) ~ emitter + base + inv_eb, data = hfe)

# view results
summary(model_inv)

## 
## Call:
## lm(formula = log(hfe) ~ emitter + base + inv_eb, data = hfe)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.077879 -0.033676  0.005157  0.028274  0.119432 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.016600   0.707536   7.090 3.68e-06 ***
## emitter     -0.121638   0.044770  -2.717  0.01591 *  
## base         0.010536   0.002816   3.741  0.00197 ** 
## inv_eb      -5.179971   0.449659 -11.520 7.54e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05616 on 15 degrees of freedom
## Multiple R-squared:  0.983,  Adjusted R-squared:  0.9795 
## F-statistic: 288.3 on 3 and 15 DF,  p-value: 1.756e-13

# optional: residual plots
par(mfrow=c(2,2))
plot(model_inv)

The transformation improves the model. All the regression coefficients are now statistically significant, and the residual plots show there could be no clear pattern, indicating better adherence to model assumptions. This suggests that the transformed model better captures the relationship between hFE and the predictors.

Question 4

In an effort to develop a preliminary manpower equation for estimation of man-hours per month expended in surgical services at Naval hospitals, the U.S. Navy collected data on y (man-hours per month) and x (surgical cases) from 15 hospitals. The data is in the navy.csv file. Please do the following:

navy<-read.csv("./Navy.csv")

Fit the model \(y=\beta_0+\beta_1x+\varepsilon\)

navy <- read.csv("Navy.csv")
colnames(navy) <- c("man_hours", "cases")

# fit linear regression
model_navy <- lm(man_hours ~ cases, data = navy)
summary(model_navy)

## 
## Call:
## lm(formula = man_hours ~ cases, data = navy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1697.1 -1003.8  -561.4   513.1  2652.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 937.6068   631.5202   1.485    0.161    
## cases         6.5125     0.5765  11.297  4.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1428 on 13 degrees of freedom
## Multiple R-squared:  0.9076, Adjusted R-squared:  0.9004 
## F-statistic: 127.6 on 1 and 13 DF,  p-value: 4.295e-08

Call: lm(formula = man_hours ~ cases, data = navy)

Residuals: Min 1Q Median 3Q Max -1697.1 -1003.8 -561.4 513.1 2652.9

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 937.6068 631.5202 1.485 0.161
cases 6.5125 0.5765 11.297 4.3e-08 *** — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1428 on 13 degrees of freedom Multiple R-squared: 0.9076, Adjusted R-squared: 0.9004 F-statistic: 127.6 on 1 and 13 DF, p-value: 4.295e-08

Fit the model \(\ln{y}=\beta_0+\beta_1(1/x)+\varepsilon\)

#transformed variable
navy$inv_cases <- 1 / navy$cases

#model
model_log_inv <- lm(log(man_hours) ~ inv_cases, data = navy)
summary(model_log_inv)

## 
## Call:
## lm(formula = log(man_hours) ~ inv_cases, data = navy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30376 -0.12213  0.06109  0.10161  0.19086 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.66534    0.07077  136.57  < 2e-16 ***
## inv_cases   -590.94202   30.03516  -19.68 4.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1589 on 13 degrees of freedom
## Multiple R-squared:  0.9675, Adjusted R-squared:  0.965 
## F-statistic: 387.1 on 1 and 13 DF,  p-value: 4.67e-11

Call: lm(formula = log(man_hours) ~ inv_cases, data = navy)

Residuals: Min 1Q Median 3Q Max -0.30376 -0.12213 0.06109 0.10161 0.19086

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.66534 0.07077 136.57 < 2e-16 inv_cases -590.94202 30.03516 -19.68 4.67e-11 — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1589 on 13 degrees of freedom Multiple R-squared: 0.9675, Adjusted R-squared: 0.965 F-statistic: 387.1 on 1 and 13 DF, p-value: 4.67e-11

Fit the model \(1/y=\beta_0+\beta_1(1/x)+\varepsilon\)

#transformed variables
navy$inv_cases <- 1 / navy$cases
navy$inv_y <- 1 / navy$man_hours

#model
model_inv <- lm(inv_y ~ inv_cases, data = navy)
summary(model_inv)

## 
## Call:
## lm(formula = inv_y ~ inv_cases, data = navy)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.427e-05 -2.721e-05 -2.039e-05  4.375e-05  8.143e-05 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.510e-05  2.029e-05  -2.716   0.0177 *  
## inv_cases    1.743e-01  8.611e-03  20.246 3.26e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.556e-05 on 13 degrees of freedom
## Multiple R-squared:  0.9693, Adjusted R-squared:  0.9669 
## F-statistic: 409.9 on 1 and 13 DF,  p-value: 3.256e-11

Call: lm(formula = inv_y ~ inv_cases, data = navy)

Residuals: Min 1Q Median 3Q Max -7.427e-05 -2.721e-05 -2.039e-05 4.375e-05 8.143e-05

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.510e-05 2.029e-05 -2.716 0.0177 *
inv_cases 1.743e-01 8.611e-03 20.246 3.26e-11 *** — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.556e-05 on 13 degrees of freedom Multiple R-squared: 0.9693, Adjusted R-squared: 0.9669 F-statistic: 409.9 on 1 and 13 DF, p-value: 3.256e-11

Fit the model \(y=\beta_0+\beta_1x+\beta_2x^2+\varepsilon\)

# fit quadratic model
model_quad <- lm(man_hours ~ cases + I(cases^2), data = navy)
summary(model_quad)

## 
## Call:
## lm(formula = man_hours ~ cases + I(cases^2), data = navy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -842.1 -550.3   98.9  392.7  949.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.294e+03  5.242e+02  -4.375 0.000903 ***
## cases        1.511e+01  1.205e+00  12.545 2.94e-08 ***
## I(cases^2)  -3.682e-03  5.038e-04  -7.308 9.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 636.6 on 12 degrees of freedom
## Multiple R-squared:  0.983,  Adjusted R-squared:  0.9802 
## F-statistic: 347.7 on 2 and 12 DF,  p-value: 2.382e-11

Call: lm(formula = man_hours ~ cases + I(cases^2), data = navy)

Residuals: Min 1Q Median 3Q Max -842.1 -550.3 98.9 392.7 949.0

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.294e+03 5.242e+02 -4.375 0.000903 cases 1.511e+01 1.205e+00 12.545 2.94e-08 I(cases^2) -3.682e-03 5.038e-04 -7.308 9.38e-06 *** — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 636.6 on 12 degrees of freedom Multiple R-squared: 0.983, Adjusted R-squared: 0.9802 F-statistic: 347.7 on 2 and 12 DF, p-value: 2.382e-11

Comment o the adequacy of each model The linear model (a) provides a reasonable fit but does not capture the curvature in the data. Models (b) and (c) improve the fit through transformations, indicating a nonlinear relationship. Over all though the quadratic model (d) provides the best fit, with the highest R^2lowest residual error, and all significant predictors. So in conclusion the quadratic model is the most adequate for describing the relationship between surgical cases and man-hours.

Question 5

Consider the following data set in which nitrogen dioxide concentrations in parts per million are collected for 26 days in September 1984 at a monitoring location in San Francisco Bay area. Please do the following.

nitrogen<-read.csv("./nitrogen.csv")

Fit a standard multiple regression model to these data and comment on the model adequacy.

nitrogen <- read.csv("nitrogen.csv")
colnames(nitrogen) <- c("wind", "temp", "insolation", "no")

#multiple regression
model_nitrogen <- lm(no ~ wind + temp + insolation, data = nitrogen)

# summary
summary(model_nitrogen)

## 
## Call:
## lm(formula = no ~ wind + temp + insolation, data = nitrogen)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3052 -1.1710 -0.4990  0.9823  3.4033 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.784916   7.979647   0.474   0.6402  
## wind        -0.527410   0.224904  -2.345   0.0289 *
## temp         0.124991   0.075173   1.663   0.1112  
## insolation  -0.005259   0.006637  -0.792   0.4370  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.844 on 21 degrees of freedom
## Multiple R-squared:  0.6533, Adjusted R-squared:  0.6037 
## F-statistic: 13.19 on 3 and 21 DF,  p-value: 4.628e-05

# residual diagnostics
par(mfrow=c(2,2))
plot(model_nitrogen)

Call: lm(formula = no ~ wind + temp + insolation, data = nitrogen)

Residuals: Min 1Q Median 3Q Max -2.3052 -1.1710 -0.4990 0.9823 3.4033

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.784916 7.979647 0.474 0.6402
wind -0.527410 0.224904 -2.345 0.0289 * temp 0.124991 0.075173 1.663 0.1112
insolation -0.005259 0.006637 -0.792 0.4370
— Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.844 on 21 degrees of freedom Multiple R-squared: 0.6533, Adjusted R-squared: 0.6037 F-statistic: 13.19 on 3 and 21 DF, p-value: 4.628e-05

Consider a transformation of the response. Use the Box-Cox procedure to determine an appropriate transformation.

library(MASS)

# BoxCox plot
boxcox(model_nitrogen,
       lambda = seq(-2, 2, 0.1),
       main = "BoxCox Transformation")

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'main' will be disregarded

Does a model with transformed response seem more reasonable that the model in part a? Explain why or why not. Yes, a model with a transformed response appears more reasonable. The BoxCox plot indicates that λ is close to 0, soa log transformation of the response variable. This transformation can improve the model by stabilizing variance and making the relationship between predictors and response more linear.

Model Adequacy Checking

Sean McMannamy

2022-10-27

Model Adequacy Checking

Residual Analysis

Scaling Residuals

Standardized and Studentized Residuals

R Student

Bonferroni

Influence Diagnostics

Leverage Points

Box-Cox Tranformation

On Your Own

Question 1

a

b

c

d

e

f

g

h

i

Question 2

a

c

d

e

f

g

h

i

Question 3

Question 4

Question 5