Besides previous methods R is capable of doing linear regression quite easily
#Installing Libraries if needed
#install.packages("tidyverse")
# Loading Libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
library(MASS)
For R to do the analysis first we need a dataframe
# Loading the mtcars data set
cars<-mtcars
we then use the lm command. in the form of
# Creating the linear regression
carslm<-lm(mpg~.,cars)
# Looking at the results of the linear regression
summary(carslm)
##
## Call:
## lm(formula = mpg ~ ., data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
This output give the estimates for \(\textbf{b}\), their associated t values, the adjusted R-squared, and the F-statistic.
Before using a model you must always check two aspects of its adequacy.
To verify that none of the assumptions of the least squares regression are violated we focus on the residuals. In particular we will look at two plots
# Testing plots
par(mfrow=c(2,2))
plot(carslm)
In general scaled residuals should lie between \(-3<r<3\) thus if anything is outside that range we can inspect it to look for outliers.
Scaling residuals is preferred by some analysts, and it does help with looking for outliers. One of the typical forms is :
This method works if the standard deviations of each residual is about the same, but if they are not then it will not work too well, so instead it is suggested to use the studentized residual.
The 3rd and 4th graphs from the previous plot command use the studentized residuals.
# Finding H
x<-cars[,2:11]
X<-as.matrix(x)
H<-X%*%solve(t(X)%*%X)%*%t(X)
# Finding the studentized residuals
carsstudres<-studres(carslm)
We will often use \(MS_E\) for an estimate for \(\hat{\sigma}^2\), but if one residual is very high then that may skew the MS_E to being higher than it should be. One solution is to have a different estimate for \(\hat{\sigma}^2\) for each residual and then use that to standardize each one.
This gives actual t-values for each residual that we can then test for being unusual. In actuallity the studres command used before uses this calculation.
A quick not if you are performing simultaneous inferences, then you need to adjust your \(\alpha\) appropriately, otherwise with an \(\alpha=0.05\) 1 of every 20 things you are checking will appear significant by random chance. so instead of comparing all these t-values to \(t_{\alpha/2,n-p-1}\) you will instead use \(t_{\alpha/(2n),n-p-1}\)
Another item to consider is if we have any influential points in the model. These are points that could skew the model if they are slightly off from the predicted line if they weren’t in the model.
The way to check influence is to create a model without that one point and compare it to the model that has all the points in it. You must then repeat this for every point. Thankfully R can do this for use using the dffits command.
# Set n and p
p<-dim(X)[2]+1
n<-dim(X)[1]+0
# Finding influential points
cars[abs(as.numeric(dffits(carslm)))>2*sqrt(p/(n-p)),]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4
We are also able to use the \(\textbf{H}\) matrix to determine high leverage points. It does this by looking at each diagonal point and determining whether \(h_{ii}>2p/n\). If this is true then the ith point is said to have high leverage.
# find the rows of the dataset that are influential
cars[diag(H)>(2*p/n),0]
## data frame with 0 columns and 0 rows
When we need to transform it is not always easy to determine what transformation to use. Thankfully the Box-Cox method was created that can help determine which transformation is ideal for your y value. What the Box-Cox method does is to create a bunch of different values of \(\lambda\) and check the \(SS_E\) for each model using the transformation \(y^\lambda\) It then graphs this and shows the user which \(\lambda\) to choose.
boxcox(carslm, # lm or aov objects or formulas
lambda = seq(-2, 2, 1/10), # Vector of values of lambda
plotit = TRUE, # Create a plot or not
eps = 1/50, # Tolerance for lambda. Defaults to 0.02.
xlab = expression(lambda), # X-axis title
ylab = "log-Likelihood", # Y-axis title
)
# Creating the transformed linear regression
lncarslm<-lm(log(mpg)~.,cars)
# Looking at the results of the transformed linear regression
summary(lncarslm)
##
## Call:
## lm(formula = log(mpg) ~ ., data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14569 -0.07886 -0.01752 0.06524 0.25130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.776e+00 8.492e-01 3.268 0.00367 **
## cyl 7.657e-03 4.741e-02 0.161 0.87326
## disp 4.989e-05 8.102e-04 0.062 0.95149
## hp -8.964e-04 9.877e-04 -0.908 0.37439
## drat 2.220e-02 7.420e-02 0.299 0.76772
## wt -1.723e-01 8.595e-02 -2.005 0.05804 .
## qsec 3.077e-02 3.316e-02 0.928 0.36401
## vs -2.874e-03 9.548e-02 -0.030 0.97627
## am 4.738e-02 9.331e-02 0.508 0.61693
## gear 5.925e-02 6.775e-02 0.875 0.39170
## carb -2.012e-02 3.760e-02 -0.535 0.59826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1202 on 21 degrees of freedom
## Multiple R-squared: 0.8895, Adjusted R-squared: 0.8369
## F-statistic: 16.91 on 10 and 21 DF, p-value: 6.89e-08
# Assessing the model adequacy
par(mfrow=c(2,2))
plot(lncarslm)
The following data were collected on the wear of a bearing y, the oil viscosity \(x_1\), and load \(x_2\)
bearings<-read.csv("./ball bearing.csv")
Fit a multiple linear regression model to these data points.
bearings <- read.csv("ball bearing.csv")
# run regression
model <- lm(wear ~ viscosity + load, data = bearings)
summary(model)
##
## Call:
## lm(formula = wear ~ viscosity + load, data = bearings)
##
## Residuals:
## 1 2 3 4 5 6
## -24.987 24.307 11.820 -20.460 12.830 -3.511
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350.99427 74.75307 4.695 0.0183 *
## viscosity -1.27199 1.16914 -1.088 0.3562
## load -0.15390 0.08953 -1.719 0.1841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.5 on 3 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.7696
## F-statistic: 9.353 on 2 and 3 DF, p-value: 0.05138
Show in New Window
Call: lm(formula = wear ~ viscosity + load, data = bearings)
Residuals: 1 2 3 4 5 6 -24.987 24.307 11.820 -20.460 12.830 -3.511
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 350.99427 74.75307 4.695 0.0183 * viscosity -1.27199 1.16914
-1.088 0.3562
load -0.15390 0.08953 -1.719 0.1841
— Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05
‘.’ 0.1 ‘ ’ 1
Residual standard error: 25.5 on 3 degrees of freedom Multiple R-squared: 0.8618, Adjusted R-squared: 0.7696 F-statistic: 9.353 on 2 and 3 DF, p-value: 0.05138
Test for significance of regression
model <- lm(wear ~ viscosity + load, data = bearings)
#summary
summary(model)
##
## Call:
## lm(formula = wear ~ viscosity + load, data = bearings)
##
## Residuals:
## 1 2 3 4 5 6
## -24.987 24.307 11.820 -20.460 12.830 -3.511
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350.99427 74.75307 4.695 0.0183 *
## viscosity -1.27199 1.16914 -1.088 0.3562
## load -0.15390 0.08953 -1.719 0.1841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.5 on 3 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.7696
## F-statistic: 9.353 on 2 and 3 DF, p-value: 0.05138
#f statistic and p-value
f_stat <- summary(model)$fstatistic[1]
df1 <- summary(model)$fstatistic[2]
df2 <- summary(model)$fstatistic[3]
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
cat("F-statistic:", f_stat, "\n")
## F-statistic: 9.353035
cat("p-value:", p_value, "\n")
## p-value: 0.05138189
# alpha = 0.05
alpha <- 0.05
cat("fail to reject H0: the regression is not statistically significant. p value is greater than alpha.")
## fail to reject H0: the regression is not statistically significant. p value is greater than alpha.
Call: lm(formula = wear ~ viscosity + load, data = bearings)
Residuals: 1 2 3 4 5 6 -24.987 24.307 11.820 -20.460 12.830 -3.511
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 350.99427 74.75307 4.695 0.0183 * viscosity -1.27199 1.16914
-1.088 0.3562
load -0.15390 0.08953 -1.719 0.1841
— Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05
‘.’ 0.1 ‘ ’ 1
Residual standard error: 25.5 on 3 degrees of freedom Multiple R-squared: 0.8618, Adjusted R-squared: 0.7696 F-statistic: 9.353 on 2 and 3 DF, p-value: 0.05138
F-statistic: 9.353035 p-value: 0.05138189 fail to reject H0: the regression is not statistically significant.
Compute t-statistic for each of the regression coefficients and provide an interpretation.
# fit model
model <- lm(wear ~ viscosity + load, data = bearings)
# get summary
model_summary <- summary(model)
# extract coefficients table
coef_table <- model_summary$coefficients
t_values <- coef_table[, "t value"]
t_values
## (Intercept) viscosity load
## 4.695382 -1.087974 -1.719030
#full table (estimate, std error, t, p-value)
coef_table
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350.9942706 74.75307396 4.695382 0.01826948
## viscosity -1.2719944 1.16914009 -1.087974 0.35620034
## load -0.1539042 0.08952967 -1.719030 0.18410102
(Intercept) viscosity load 4.695382 -1.087974 -1.719030 Estimate Std. Error t value Pr(>|t|) (Intercept) 350.9942706 74.75307396 4.695382 0.01826948 viscosity -1.2719944 1.16914009 -1.087974 0.35620034 load -0.1539042 0.08952967 -1.719030 0.18410102
Now include an interaction term and perform a regression model.
#model with interaction term
model_int <- lm(wear ~ viscosity * load, data = bearings)
summary(model_int)
##
## Call:
## lm(formula = wear ~ viscosity * load, data = bearings)
##
## Residuals:
## 1 2 3 4 5 6
## -13.025 23.105 -10.521 -7.364 14.478 -6.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.865548 197.957166 0.636 0.590
## viscosity 7.758641 7.514795 1.032 0.410
## load 0.094304 0.220657 0.427 0.711
## viscosity:load -0.009186 0.007564 -1.214 0.349
##
## Residual standard error: 23.69 on 2 degrees of freedom
## Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011
## F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169
Call: lm(formula = wear ~ viscosity * load, data = bearings)
Residuals: 1 2 3 4 5 6 -13.025 23.105 -10.521 -7.364 14.478 -6.674
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 125.865548 197.957166 0.636 0.590 viscosity 7.758641 7.514795 1.032 0.410 load 0.094304 0.220657 0.427 0.711 viscosity:load -0.009186 0.007564 -1.214 0.349
Residual standard error: 23.69 on 2 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011 F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169
Test for significance of regression.
#modl with interaction
model_int <- lm(wear ~ viscosity * load, data = bearings)
summary(model_int)
##
## Call:
## lm(formula = wear ~ viscosity * load, data = bearings)
##
## Residuals:
## 1 2 3 4 5 6
## -13.025 23.105 -10.521 -7.364 14.478 -6.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.865548 197.957166 0.636 0.590
## viscosity 7.758641 7.514795 1.032 0.410
## load 0.094304 0.220657 0.427 0.711
## viscosity:load -0.009186 0.007564 -1.214 0.349
##
## Residual standard error: 23.69 on 2 degrees of freedom
## Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011
## F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169
#getting the F-statistic and p-value
f_stat <- summary(model_int)$fstatistic[1]
df1 <- summary(model_int)$fstatistic[2]
df2 <- summary(model_int)$fstatistic[3]
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
cat("F-statistic:", f_stat, "\n")
## F-statistic: 7.714138
cat("p-value:", p_value, "\n")
## p-value: 0.116915
cat("Fail to reject H0: The regression with interaction is not statistically significant. alpha is less than p value.")
## Fail to reject H0: The regression with interaction is not statistically significant. alpha is less than p value.
Call: lm(formula = wear ~ viscosity * load, data = bearings)
Residuals: 1 2 3 4 5 6 -13.025 23.105 -10.521 -7.364 14.478 -6.674
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 125.865548 197.957166 0.636 0.590 viscosity 7.758641 7.514795 1.032 0.410 load 0.094304 0.220657 0.427 0.711 viscosity:load -0.009186 0.007564 -1.214 0.349
Residual standard error: 23.69 on 2 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011 F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169
F-statistic: 7.714138 p-value: 0.116915 Fail to reject H0: The regression with interaction is not statistically significant.
Compute a t-statistics for each of the regression coefficients and provide an interpretation. Specifically, does the model require an interaction term?
#interaction model
model_int <- lm(wear ~ viscosity * load, data = bearings)
#t-stats and p-values
summary(model_int)
##
## Call:
## lm(formula = wear ~ viscosity * load, data = bearings)
##
## Residuals:
## 1 2 3 4 5 6
## -13.025 23.105 -10.521 -7.364 14.478 -6.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.865548 197.957166 0.636 0.590
## viscosity 7.758641 7.514795 1.032 0.410
## load 0.094304 0.220657 0.427 0.711
## viscosity:load -0.009186 0.007564 -1.214 0.349
##
## Residual standard error: 23.69 on 2 degrees of freedom
## Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011
## F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169
#coefficient table
coef_table <- summary(model_int)$coefficients
# view t-statistics
t_values <- coef_table[, "t value"]
t_values
## (Intercept) viscosity load viscosity:load
## 0.6358221 1.0324488 0.4273783 -1.2144702
coef_table
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.865548338 1.979572e+02 0.6358221 0.5899432
## viscosity 7.758641128 7.514795e+00 1.0324488 0.4103613
## load 0.094303967 2.206569e-01 0.4273783 0.7107188
## viscosity:load -0.009185794 7.563622e-03 -1.2144702 0.3485016
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 125.865548 197.957166 0.636 0.590 viscosity 7.758641 7.514795 1.032 0.410 load 0.094304 0.220657 0.427 0.711 viscosity:load -0.009186 0.007564 -1.214 0.349
Residual standard error: 23.69 on 2 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.8011 F-statistic: 7.714 on 3 and 2 DF, p-value: 0.1169
(Intercept) viscosity load viscosity:load 0.6358221 1.0324488 0.4273783 -1.2144702 Estimate Std. Error t value Pr(>|t|) (Intercept) 125.865548338 1.979572e+02 0.6358221 0.5899432 viscosity 7.758641128 7.514795e+00 1.0324488 0.4103613 load 0.094303967 2.206569e-01 0.4273783 0.7107188 viscosity:load -0.009185794 7.563622e-03 -1.2144702 0.3485016
Use the partial F-test procedure to determine whether the model requires an interaction term. Is this procedure equivalent to the t-test computed in part f? both say the interaction term is not significant, so the model does not need an interaction term.
# reduced model (no interaction)
model_red <- lm(wear ~ viscosity + load, data = bearings)
# full model (with interaction)
model_full <- lm(wear ~ viscosity * load, data = bearings)
# partial F-test
anova(model_red, model_full)
## Analysis of Variance Table
##
## Model 1: wear ~ viscosity + load
## Model 2: wear ~ viscosity * load
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3 1950.4
## 2 2 1122.6 1 827.86 1.4749 0.3485
Model 1: wear ~ viscosity + load Model 2: wear ~ viscosity * load
Res.Df RSS Df Sum of Sq F Pr(>F) 1 3 1950.4
2 2 1122.6 1 827.86 1.4749 0.3485
Compute the estimate for \(y\) for the point \(x_1=25\) and \(x_2=1000\) using both models The predicted wear at𝑥1=25 and x2=1000 using the model without interaction is 165.29, while the prediction using the model with interaction is 184.49.
#newdata point
new_point <- data.frame(viscosity = 25, load = 1000)
#model without interaction
model_red <- lm(wear ~ viscosity + load, data = bearings)
#model with interaction
model_full <- lm(wear ~ viscosity * load, data = bearings)
#predictions
predict(model_red, new_point)
## 1
## 165.2902
predict(model_full, new_point)
## 1
## 184.4907
Find a 95% confidence interval on the mean response at this point for both models. Which interval is narrower? Does this provide any insight about which model is preferable? The 95% confidence interval for the model without interaction is narrower than the interval for the model with interaction, so it shows that the simpler model provides a more precise estimate of the mean response.
#new data point
new_point <- data.frame(viscosity = 25, load = 1000)
#models
model_red <- lm(wear ~ viscosity + load, data = bearings)
model_full <- lm(wear ~ viscosity * load, data = bearings)
# 95% CI for mean response
ci_red <- predict(model_red, new_point, interval = "confidence", level = 0.95)
ci_full <- predict(model_full, new_point, interval = "confidence", level = 0.95)
# print results
ci_red
## fit lwr upr
## 1 165.2902 128.2709 202.3094
ci_full
## fit lwr upr
## 1 184.4907 102.0899 266.8914
An engineer at a semiconductor company wants to model the relationship between the device gain or hFE(\(y\)) and three parameters: emitter RS(\(x_1\)), base-RS(\(x_2\)), and emitter-to-base-RS(\(x_3\)). The data are shown below.
hfe<-read.csv("./hFE.csv")
Fit a multiple linear regression model to the data.
# load data
hfe <- read.csv("hFE.csv")
# clean column names (safer)
colnames(hfe) <- c("emitter", "base", "eb", "hfe")
# fit multiple linear regression model
model_hfe <- lm(hfe ~ emitter + base + eb, data = hfe)
# view results
summary(model_hfe)
##
## Call:
## lm(formula = hfe ~ emitter + base + eb, data = hfe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9867 -2.3942 0.2606 1.8955 7.8847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.7960 51.7695 -0.402 0.694
## emitter -3.2664 3.6021 -0.907 0.379
## base 0.2389 0.2761 0.865 0.400
## eb 20.3707 1.2631 16.128 6.95e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.158 on 15 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9898
## F-statistic: 586 on 3 and 15 DF, p-value: 9.219e-16
The Ftest gives a p-value that is much less than 0.05. Therefore, we reject the null hypothesis and conclude that the regression is statistically significant. At least one predictor is significantly related to hFE. ##### b Predict hFE when \(x_1=14.5\), \(x_2=220\), and \(x_3=5.0\)
# new data point
new_point <- data.frame(emitter = 14.5, base = 220, eb = 5.0)
# prediction
predict(model_hfe, new_point)
## 1
## 86.2593
Test for significance of regression using the analysis of variance with \(\alpha=0.05\). What conclusions can you draw?
#extract F-test from summary
summary(model_hfe)
##
## Call:
## lm(formula = hfe ~ emitter + base + eb, data = hfe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9867 -2.3942 0.2606 1.8955 7.8847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.7960 51.7695 -0.402 0.694
## emitter -3.2664 3.6021 -0.907 0.379
## base 0.2389 0.2761 0.865 0.400
## eb 20.3707 1.2631 16.128 6.95e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.158 on 15 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9898
## F-statistic: 586 on 3 and 15 DF, p-value: 9.219e-16
Estimate \(\sigma^2\) for the model you have fit to the data.
# estimate of sigma^2
sigma2_hat <- summary(model_hfe)$sigma^2
sigma2_hat
## [1] 17.29024
Find the standard errors of the regression coefficients.
#coefficient table
coef_table <- summary(model_hfe)$coefficients
coef_table
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.7959544 51.7694814 -0.4017030 6.935678e-01
## emitter -3.2664295 3.6021456 -0.9068011 3.788490e-01
## base 0.2389327 0.2761206 0.8653199 4.004914e-01
## eb 20.3706583 1.2630959 16.1275624 6.948639e-11
Estimate Std. Error t value Pr(>|t|) (Intercept) -20.7959544 51.7694814 -0.4017030 6.935678e-01 emitter -3.2664295 3.6021456 -0.9068011 3.788490e-01 base 0.2389327 0.2761206 0.8653199 4.004914e-01 eb 20.3706583 1.2630959 16.1275624 6.948639e-11
Calculate the t-statistic for each regression coefficient. Using \(\alpha=0.05\), what conclusions can you draw? The t-statistics indicate that only the coefficient for emitter-to-base RS is statistically significant (p < 0.05). The coefficients for emitter RS and base RS are not statistically significant, so hFE is primarily influenced by emitter-to-base RS.
Find 99% confidence intervals on the regression coefficients.
# 99% confidence intervals
confint(model_hfe, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) -173.3457523 131.753843
## emitter -13.8809182 7.348059
## base -0.5747156 1.052581
## eb 16.6486773 24.092639
0.5 % 99.5 % (Intercept) -173.3457523 131.753843 emitter -13.8809182 7.348059 base -0.5747156 1.052581 eb 16.6486773 24.092639
Find a 99% prediction interval on hFE when \(x_1=14.5\), \(x_2=220\), and \(x_3=5.0\).
new_point <- data.frame(emitter = 14.5, base = 220, eb = 5.0)
# 99% prediction interval
predict(model_hfe, new_point, interval = "prediction", level = 0.99)
## fit lwr upr
## 1 86.2593 71.04857 101.47
fit lwr upr 1 86.2593 71.04857 101.47
Find a 99% confidence interval on mean hFE when \(x_1=14.5\), \(x_2=220\), and \(x_3=5.0\)
#99% confidence interval for mean response
new_point <- data.frame(emitter = 14.5, base = 220, eb = 5.0)
predict(model_hfe, new_point, interval = "confidence", level = 0.99)
## fit lwr upr
## 1 86.2593 77.24635 95.27225
fit lwr upr 1 86.2593 77.24635 95.27225
hfe<-read.csv("./hFE data.csv")
# fit model
colnames(hfe) <- c("emitter", "base", "eb", "hfe")
model_hfe <- lm(hfe ~ emitter + base + eb, data = hfe)
#residuals vs fitted plot
plot(fitted(model_hfe), resid(model_hfe),
xlab = "Fitted Values (ŷ)",
ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")
The residuals are scattered around zero without a strong systematic
pattern, which suggests that the linear model is generally
appropriate
summary(model_hfe)$r.squared
## [1] 0.9915393
The coefficient of determination is R2=0.9915, the model explains 99.15% of the variation in hFE.
# refit model using log(hFE)
model_log <- lm(log(hfe) ~ emitter + base + eb, data = hfe)
# view results
summary(model_log)
##
## Call:
## lm(formula = log(hfe) ~ emitter + base + eb, data = hfe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17084 -0.01303 0.01756 0.05101 0.09894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.670571 1.030468 5.503 6.07e-05 ***
## emitter -0.102152 0.071700 -1.425 0.175
## base -0.002780 0.005496 -0.506 0.620
## eb 0.183030 0.025142 7.280 2.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08277 on 15 degrees of freedom
## Multiple R-squared: 0.963, Adjusted R-squared: 0.9556
## F-statistic: 130 on 3 and 15 DF, p-value: 5.852e-11
Call: lm(formula = log(hfe) ~ emitter + base + eb, data = hfe)
Residuals: Min 1Q Median 3Q Max -0.17084 -0.01303 0.01756 0.05101 0.09894
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.670571 1.030468 5.503 6.07e-05 emitter
-0.102152 0.071700 -1.425 0.175
base -0.002780 0.005496 -0.506 0.620
eb 0.183030 0.025142 7.280 2.70e-06 — Signif.
codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08277 on 15 degrees of freedom Multiple R-squared: 0.963, Adjusted R-squared: 0.9556 F-statistic: 130 on 3 and 15 DF, p-value: 5.852e-11
# residuals vs fitted for log model
plot(fitted(model_log), resid(model_log),
xlab = "Fitted Values (ln(hFE))",
ylab = "Residuals",
main = "Residuals vs Fitted (Log Model)")
abline(h = 0, col = "red")
The log transformation improved the model assumptions, especially the
constant variance assumption.
# residuals vs x3 (eb) for log model
plot(hfe$eb, resid(model_log),
xlab = "Emitter-to-Base RS (x3)",
ylab = "Residuals",
main = "Residuals vs x3 (Log Model)")
abline(h = 0, col = "red")
The residuals appear randomly scattered around zero with no clear
pattern when plotted against x3.
# create transformed variable
hfe$inv_eb <- 1 / hfe$eb
# fit new model
model_inv <- lm(log(hfe) ~ emitter + base + inv_eb, data = hfe)
# view results
summary(model_inv)
##
## Call:
## lm(formula = log(hfe) ~ emitter + base + inv_eb, data = hfe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.077879 -0.033676 0.005157 0.028274 0.119432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.016600 0.707536 7.090 3.68e-06 ***
## emitter -0.121638 0.044770 -2.717 0.01591 *
## base 0.010536 0.002816 3.741 0.00197 **
## inv_eb -5.179971 0.449659 -11.520 7.54e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05616 on 15 degrees of freedom
## Multiple R-squared: 0.983, Adjusted R-squared: 0.9795
## F-statistic: 288.3 on 3 and 15 DF, p-value: 1.756e-13
# optional: residual plots
par(mfrow=c(2,2))
plot(model_inv)
The transformation improves the model. All the regression coefficients
are now statistically significant, and the residual plots show there
could be no clear pattern, indicating better adherence to model
assumptions. This suggests that the transformed model better captures
the relationship between hFE and the predictors.
navy<-read.csv("./Navy.csv")
navy <- read.csv("Navy.csv")
colnames(navy) <- c("man_hours", "cases")
# fit linear regression
model_navy <- lm(man_hours ~ cases, data = navy)
summary(model_navy)
##
## Call:
## lm(formula = man_hours ~ cases, data = navy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1697.1 -1003.8 -561.4 513.1 2652.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 937.6068 631.5202 1.485 0.161
## cases 6.5125 0.5765 11.297 4.3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1428 on 13 degrees of freedom
## Multiple R-squared: 0.9076, Adjusted R-squared: 0.9004
## F-statistic: 127.6 on 1 and 13 DF, p-value: 4.295e-08
Call: lm(formula = man_hours ~ cases, data = navy)
Residuals: Min 1Q Median 3Q Max -1697.1 -1003.8 -561.4 513.1 2652.9
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 937.6068 631.5202 1.485 0.161
cases 6.5125 0.5765 11.297 4.3e-08 *** — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1428 on 13 degrees of freedom Multiple R-squared: 0.9076, Adjusted R-squared: 0.9004 F-statistic: 127.6 on 1 and 13 DF, p-value: 4.295e-08
#transformed variable
navy$inv_cases <- 1 / navy$cases
#model
model_log_inv <- lm(log(man_hours) ~ inv_cases, data = navy)
summary(model_log_inv)
##
## Call:
## lm(formula = log(man_hours) ~ inv_cases, data = navy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30376 -0.12213 0.06109 0.10161 0.19086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.66534 0.07077 136.57 < 2e-16 ***
## inv_cases -590.94202 30.03516 -19.68 4.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1589 on 13 degrees of freedom
## Multiple R-squared: 0.9675, Adjusted R-squared: 0.965
## F-statistic: 387.1 on 1 and 13 DF, p-value: 4.67e-11
Call: lm(formula = log(man_hours) ~ inv_cases, data = navy)
Residuals: Min 1Q Median 3Q Max -0.30376 -0.12213 0.06109 0.10161 0.19086
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.66534 0.07077 136.57 < 2e-16 inv_cases
-590.94202 30.03516 -19.68 4.67e-11 — Signif.
codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1589 on 13 degrees of freedom Multiple R-squared: 0.9675, Adjusted R-squared: 0.965 F-statistic: 387.1 on 1 and 13 DF, p-value: 4.67e-11
#transformed variables
navy$inv_cases <- 1 / navy$cases
navy$inv_y <- 1 / navy$man_hours
#model
model_inv <- lm(inv_y ~ inv_cases, data = navy)
summary(model_inv)
##
## Call:
## lm(formula = inv_y ~ inv_cases, data = navy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.427e-05 -2.721e-05 -2.039e-05 4.375e-05 8.143e-05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.510e-05 2.029e-05 -2.716 0.0177 *
## inv_cases 1.743e-01 8.611e-03 20.246 3.26e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.556e-05 on 13 degrees of freedom
## Multiple R-squared: 0.9693, Adjusted R-squared: 0.9669
## F-statistic: 409.9 on 1 and 13 DF, p-value: 3.256e-11
Call: lm(formula = inv_y ~ inv_cases, data = navy)
Residuals: Min 1Q Median 3Q Max -7.427e-05 -2.721e-05 -2.039e-05 4.375e-05 8.143e-05
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.510e-05 2.029e-05 -2.716 0.0177 *
inv_cases 1.743e-01 8.611e-03 20.246 3.26e-11 *** — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.556e-05 on 13 degrees of freedom Multiple R-squared: 0.9693, Adjusted R-squared: 0.9669 F-statistic: 409.9 on 1 and 13 DF, p-value: 3.256e-11
# fit quadratic model
model_quad <- lm(man_hours ~ cases + I(cases^2), data = navy)
summary(model_quad)
##
## Call:
## lm(formula = man_hours ~ cases + I(cases^2), data = navy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -842.1 -550.3 98.9 392.7 949.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.294e+03 5.242e+02 -4.375 0.000903 ***
## cases 1.511e+01 1.205e+00 12.545 2.94e-08 ***
## I(cases^2) -3.682e-03 5.038e-04 -7.308 9.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 636.6 on 12 degrees of freedom
## Multiple R-squared: 0.983, Adjusted R-squared: 0.9802
## F-statistic: 347.7 on 2 and 12 DF, p-value: 2.382e-11
Call: lm(formula = man_hours ~ cases + I(cases^2), data = navy)
Residuals: Min 1Q Median 3Q Max -842.1 -550.3 98.9 392.7 949.0
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.294e+03 5.242e+02 -4.375 0.000903 cases
1.511e+01 1.205e+00 12.545 2.94e-08 I(cases^2) -3.682e-03
5.038e-04 -7.308 9.38e-06 *** — Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 636.6 on 12 degrees of freedom Multiple R-squared: 0.983, Adjusted R-squared: 0.9802 F-statistic: 347.7 on 2 and 12 DF, p-value: 2.382e-11
nitrogen<-read.csv("./nitrogen.csv")
nitrogen <- read.csv("nitrogen.csv")
colnames(nitrogen) <- c("wind", "temp", "insolation", "no")
#multiple regression
model_nitrogen <- lm(no ~ wind + temp + insolation, data = nitrogen)
# summary
summary(model_nitrogen)
##
## Call:
## lm(formula = no ~ wind + temp + insolation, data = nitrogen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3052 -1.1710 -0.4990 0.9823 3.4033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.784916 7.979647 0.474 0.6402
## wind -0.527410 0.224904 -2.345 0.0289 *
## temp 0.124991 0.075173 1.663 0.1112
## insolation -0.005259 0.006637 -0.792 0.4370
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.844 on 21 degrees of freedom
## Multiple R-squared: 0.6533, Adjusted R-squared: 0.6037
## F-statistic: 13.19 on 3 and 21 DF, p-value: 4.628e-05
# residual diagnostics
par(mfrow=c(2,2))
plot(model_nitrogen)
Call: lm(formula = no ~ wind + temp + insolation, data = nitrogen)
Residuals: Min 1Q Median 3Q Max -2.3052 -1.1710 -0.4990 0.9823 3.4033
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.784916 7.979647 0.474 0.6402
wind -0.527410 0.224904 -2.345 0.0289 * temp 0.124991 0.075173 1.663
0.1112
insolation -0.005259 0.006637 -0.792 0.4370
— Signif. codes:
0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.844 on 21 degrees of freedom Multiple R-squared: 0.6533, Adjusted R-squared: 0.6037 F-statistic: 13.19 on 3 and 21 DF, p-value: 4.628e-05
library(MASS)
# BoxCox plot
boxcox(model_nitrogen,
lambda = seq(-2, 2, 0.1),
main = "BoxCox Transformation")
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'main' will be disregarded