Simple linear regression refers to the fitting of a straight line (with a slope and intercept) that partially determines a random variable \(Y\) from a deterministic variable \(X\).
The formulas for the least squares estimates of the slope (\(\beta_1\)) and intercept (\(\beta_0\)) for the regression of a response/dependent/observed variable \(Y\) on a predictor/independent/controlled variable \(X\) may be derived using calculus (taking partial derivatives); no knowledge of the distribution of the response \(Y\) (or equivalently the error term \(\epsilon\)) is necessary. As such, a least squares line may always be fit to any given data (a hand-computation sketch follows the formulas below).
\(Y=\beta_0+\beta_1x+\epsilon\) - model equation
\(\hat{\beta_1}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}=\frac{\sum x_iy_i-\frac{\sum x_i\sum y_i}{n}}{\sum x_i^2-\frac{(\sum x_i)^2}{n}}\) - Least Squares Estimate of the Slope
\(\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}\) - Least Squares Estimate of the Intercept
\(SST=\sum(y_i-\bar{y})^2\) - Sum of Squares Total with \(n-1\) d.f.
\(SSE=\sum(y_i-\hat{y})^2\) - Sum of Squares Error with \(n-2\) d.f.
\(SSR=\sum(\hat{y}_i-\bar{y})^2\) - Sum of Squares Regression with \(1\) d.f.
\(SST=SSR+SSE\) - Partition of SST into SSR and SSE
\(R^2=\frac{SSR}{SST}\) - Coefficient of Determination, \(0\leq R^2 \leq 1\)
\(\hat{\sigma^2}=s^2=\frac{SSE}{n-2}=MSE\) - Estimation of population variance by \(MSE\)
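As a quick numerical check of these formulas, the estimates and sums of squares may be computed directly and compared against lm(); a minimal sketch, using the same small illustrative data set as the examples later in this section:
#Hand computation of the least squares estimates and sums of squares
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
n<-length(x)
b1<-sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2) #slope
b0<-mean(y)-b1*mean(x) #intercept
yhat<-b0+b1*x
SST<-sum((y-mean(y))^2) #n-1 d.f.
SSE<-sum((y-yhat)^2) #n-2 d.f.
SSR<-sum((yhat-mean(y))^2) #1 d.f.
c(b0=b0,b1=b1,R2=SSR/SST,MSE=SSE/(n-2))
coef(lm(y~x)) #b0 and b1 should match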
In the case that the response/error is Normally distributed, i.e. \(Y\sim N(\mu,\sigma^2) \equiv \epsilon\sim N(0,\sigma^2)\), one may perform inference on the quality of the fit and generate confidence/prediction intervals. It follows that the least squares fitted line is the same as the line that would be found using Maximum Likelihood Estimation, i.e. the fitted line is an estimate of the expected value of \(Y\) at \(X\) given by \(\hat{E[Y|X]}=\hat{\beta_0}+\hat{\beta_1}x\). Moreover, it follows that \(\hat{\beta_1}\sim N\left(\beta_1,\frac{\sigma^2}{\sum (x_i-\bar{x})^2}\right)\), \(\hat{E[Y|X_j]}\sim N\left(\beta_0+\beta_1x_j,\sigma^2\left[\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]\right)\), and \(\hat{Y}_{new}|X_j\sim N\left(\beta_0+\beta_1x_j,\sigma^2\left[1+\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]\right)\), from which one may perform hypothesis testing on the significance of the slope/regression and, if the null is rejected, generate confidence intervals on the mean response or prediction intervals on new observations at a point \(x=x_j\) (a hand-computation sketch follows the interval formulas below).
Reject Ho if \(\left|\frac{\hat{\beta_1}}{\sqrt{\frac{MSE}{\sum (x_i-\bar{x})^2 }}}\right|\geq t_{1-\alpha/2,n-2}\) - Hypothesis test on significance of slope/regression
Reject Ho if \(\frac{MSR}{MSE}>f_{1-\alpha,1,n-2}\) - Hypothesis test on significance of regression/slope
\(\hat{\beta_0} \pm t_{1-\alpha/2,n-2}\left(\sqrt{MSE\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum (x_i-\bar{x})^2}\right]}\right)\) - Confidence Interval on the Intercept
\(\hat{\beta_1} \pm t_{1-\alpha/2,n-2}\left(\sqrt{\frac{MSE}{\sum (x_i-\bar{x})^2}}\right)\) - Confidence Interval on the slope
\(\left(\hat{\beta_0}+\hat{\beta_1} x_j\right) \pm t_{1-\alpha/2,n-2}\left(\sqrt{MSE\left[\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]}\right)\) - Confidence Interval on the expected response at the point \(x_j\)
\(\left(\hat{\beta_0}+\hat{\beta_1} x_j\right) \pm t_{1-\alpha/2,n-2}\left(\sqrt{MSE\left[1+\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]}\right)\) - Prediction Interval on the response at the point \(x_j\)
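These test statistics and intervals may likewise be verified by hand against summary(), confint(), and predict(); a minimal sketch, again using the illustrative data from the examples below:
#Hand computation of the slope t statistic and intervals
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
n<-length(x)
model<-lm(y~x)
MSE<-sum(resid(model)^2)/(n-2)
Sxx<-sum((x-mean(x))^2)
b1<-coef(model)[2]
b1/sqrt(MSE/Sxx) #t statistic for the slope, compare to summary(model)
b1+c(-1,1)*qt(1-0.05/2,n-2)*sqrt(MSE/Sxx) #95% CI on the slope, compare to confint(model)
xj<-2.5
se.mean<-sqrt(MSE*(1/n+(xj-mean(x))^2/Sxx))
(coef(model)[1]+b1*xj)+c(-1,1)*qt(1-0.05/2,n-2)*se.mean #95% CI on the mean response at xj
#compare to predict(model,data.frame(x=2.5),interval="confidence")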
lm() - linear model fitted using least squares
model<-lm(y~x) - Regression of y on x assigned to the R object/variable model
plot(x,y) - Scatterplot of data
plot(x,y); abline(model) - Add fitted regression line to the scatterplot
summary(model) - Estimates of the regression parameters and key statistics for the fitted model
#Fit an SLR model for the regression of Y on X
x<-c(1,2,3)
y<-c(4,2,3)
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 4.0 -0.5
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
plot(x,y)
abline(model)
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5 6 7
## 0.7500 -0.8182 0.6136 -0.8864 0.4773 -0.4091 0.2727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6818 0.5678 6.484 0.0013 **
## x -0.4318 0.1184 -3.647 0.0148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7568 on 5 degrees of freedom
## Multiple R-squared: 0.7268, Adjusted R-squared: 0.6722
## F-statistic: 13.3 on 1 and 5 DF, p-value: 0.01479
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
xnew<-seq(min(x),max(x),0.01)
conf<-predict(model,data.frame(x=xnew),interval="confidence")
pred<-predict(model,data.frame(x=xnew),interval="prediction")
plot(x,y)
abline(model)
lines(xnew,conf[,2])
lines(xnew,conf[,3])
lines(xnew,pred[,2])
lines(xnew,pred[,3])
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
xnew<-c(2.5,3.5)
predict(model,data.frame(x=xnew),interval="confidence")
## fit lwr upr
## 1 2.602273 1.713089 3.491456
## 2 2.170455 1.409583 2.931326
predict(model,data.frame(x=xnew),interval="prediction")
## fit lwr upr
## 1 2.602273 0.46330888 4.741237
## 2 2.170455 0.08156848 4.259341
Multiple regression refers to the fitting of a linear function of multiple deterministic variables in a matrix \(X\) that partially determines a single random variable \(Y\).
The mathematics of multiple regression is an extension of that of simple linear regression using matrix algebra. There is a single response variable \(Y\) that is regressed on multiple predictor variables \(X_i\), \(i \in \{1,\ldots,k\}\), which are combined into a mathematical function that is linear in the regression parameters \(\beta_i\), \(i \in \{0,1,\ldots,k\}\). Assuming \(n\) observations and letting \(p=k+1\), it follows that \(y\) is an \((n \times 1)\) vector of the response, \(X\) is an \((n \times p)\) matrix of predictor variables whose first column is a vector of ones, \(\beta\) is a \((p \times 1)\) vector of regression parameters, and \(\epsilon\) is an \((n \times 1)\) vector of the error terms (a matrix-algebra sketch follows the formulas below).
\(y= X \beta +\epsilon\) - model equation
\(\hat{\beta}= (X^TX)^{-1}X^Ty\) - Least Squares Estimates of the Regression Parameters
\(H=X(X^TX)^{-1}X^T\) - Hat Matrix
\(\hat{y}=Hy\) - Vector of Fitted Values
\(SSE=y^Ty-\hat{\beta}^TX^Ty\) - Sum of Squares Error
\(MSE=\frac{SSE}{(n-p)}\) - Mean Square Error
\(SSR=\hat{\beta}^TX^Ty-n(\bar{y})^2\) - Sum of Squares Regression
\(MSR=\frac{SSR}{p-1}\) - Mean Square Regression
\(SST=y^Ty-n(\bar{y})^2\) - Sum of Squares Total
\(R^2_{adj}=1-\frac{SSE/(n-p)}{SST/(n-1)}\) - Adjusted R-Square
Reject Ho if \(\frac{MSR}{MSE}>f_{1-\alpha,p-1,n-p}\) - Hypothesis test on significance of full regression model
Reject Ho if \(\frac{\frac{SSR_{full}-SSR_{reduced}}{r}}{MSE_{full}}>f_{1-\alpha,r,n-p}\) - Partial F-test on the significance of the regression parameters omitted in a reduced model, where \(r\) = number of parameters dropped
\(\sigma^2(X^TX)^{-1}=\sigma^2C\) - Covariance matrix of \(\hat{\beta}\): \(Var(\hat{\beta_j})=\sigma^2C_{jj}\) and \(Cov(\hat{\beta_i},\hat{\beta_j})=\sigma^2C_{ij}\)
Reject Ho if \(\left|\frac{\hat{\beta_j}}{\sqrt{(MSE)C_{jj}}}\right|\geq t_{1-\alpha/2,n-p}\) - Hypothesis test on significance of regression parameter \(\beta_j\)
\(\hat{\beta_j}\pm t_{1-\alpha/2,n-p}\sqrt{(MSE)C_{jj}}\) - Confidence Interval on \(\beta_j\)
\(\hat{y}_0\pm t_{1-\alpha/2,n-p}\sqrt{(MSE)x_0^T(X^TX)^{-1}x_0}\) - Confidence Interval on the Mean Response \(\hat{y}_0=x_0^T\hat{\beta}\) at the point \(x_0\)
\(\hat{y}_0\pm t_{1-\alpha/2,n-p}\sqrt{(MSE)(1+x_0^T(X^TX)^{-1}x_0)}\) - Prediction Interval on a New Observation \(\hat{y}_0=x_0^T\hat{\beta}\) at the point \(x_0\)
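The matrix formulas above may be verified directly in R; a minimal sketch with a small made-up two-predictor data set (the values of x1, x2, and y below are illustrative only, not taken from the examples that follow):
#Matrix-algebra least squares for a model with two predictors
x1<-c(1,2,3,4,5,6)
x2<-c(2,1,4,3,6,5)
y<-c(3,4,6,7,10,11)
X<-cbind(1,x1,x2) #n x p design matrix with a leading column of ones
beta.hat<-solve(t(X)%*%X)%*%t(X)%*%y #least squares estimates
H<-X%*%solve(t(X)%*%X)%*%t(X) #hat matrix
yhat<-H%*%y #fitted values
n<-length(y); p<-ncol(X)
SSE<-sum((y-yhat)^2); MSE<-SSE/(n-p)
cbind(beta.hat,coef(lm(y~x1+x2))) #the two columns should agree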
lm(y~x1+x2+x1:x2+I(x1^2)+I(x2^2)) - Least Squares Fit of Second Order Model with Two-Factor Interactions
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkCarb.csv")
head(dat)
## PercentCarbonation Temp Pressure
## 1 2.60 31.0 21
## 2 2.40 31.0 21
## 3 17.32 31.5 24
## 4 15.60 31.5 24
## 5 16.12 31.5 24
## 6 5.36 30.5 22
lm(PercentCarbonation~Temp+Pressure+Temp:Pressure+I(Temp^2)+I(Pressure^2),data=dat)
##
## Call:
## lm(formula = PercentCarbonation ~ Temp + Pressure + Temp:Pressure +
## I(Temp^2) + I(Pressure^2), data = dat)
##
## Coefficients:
## (Intercept) Temp Pressure I(Temp^2) I(Pressure^2)
## 3025.319 -194.273 -6.051 3.626 1.154
## Temp:Pressure
## -1.332
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkCarb.csv")
model<-lm(PercentCarbonation~Temp+Pressure+Temp:Pressure+I(Temp^2)+I(Pressure^2),data=dat)
summary(model)
##
## Call:
## lm(formula = PercentCarbonation ~ Temp + Pressure + Temp:Pressure +
## I(Temp^2) + I(Pressure^2), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75548 -0.22238 -0.00342 0.26024 0.96452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3025.3186 2045.7464 1.479 0.1897
## Temp -194.2729 132.0643 -1.471 0.1917
## Pressure -6.0507 20.6063 -0.294 0.7789
## I(Temp^2) 3.6259 2.2098 1.641 0.1519
## I(Pressure^2) 1.1542 0.3237 3.565 0.0118 *
## Temp:Pressure -1.3317 0.8962 -1.486 0.1878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6194 on 6 degrees of freedom
## Multiple R-squared: 0.9933, Adjusted R-squared: 0.9877
## F-statistic: 177.2 on 5 and 6 DF, p-value: 1.983e-06
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkCarb.csv")
Reduced<-lm(PercentCarbonation~Temp+Pressure,data=dat)
Full<-lm(PercentCarbonation~Temp+Pressure+Temp:Pressure+I(Temp^2)+I(Pressure^2),data=dat)
anova(Reduced,Full)
## Analysis of Variance Table
##
## Model 1: PercentCarbonation ~ Temp + Pressure
## Model 2: PercentCarbonation ~ Temp + Pressure + Temp:Pressure + I(Temp^2) +
## I(Pressure^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 9 8.1242
## 2 6 2.3022 3 5.822 5.0578 0.04415 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
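The partial F statistic reported by anova(Reduced,Full) above may also be computed by hand from the residual sums of squares, following the partial F-test formula given earlier; a minimal sketch reusing the Reduced and Full objects fitted above:
#Hand computation of the partial F statistic
SSE.red<-sum(resid(Reduced)^2) #residual SS of the reduced model
SSE.full<-sum(resid(Full)^2) #residual SS of the full model
r<-df.residual(Reduced)-df.residual(Full) #number of parameters dropped
Fstat<-((SSE.red-SSE.full)/r)/(SSE.full/df.residual(Full))
c(F=Fstat,p=1-pf(Fstat,r,df.residual(Full))) #compare to the anova() table above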
The assumptions of Normality and constant variance (homoscedasticity) in the response/error term, i.e. \(Y\sim N(\mu,\sigma^2) \equiv \epsilon\sim N(0,\sigma^2)\), must be verified for the inferential findings to be valid. Statistical tests of these assumptions often lack sufficient power to reject the null (under which the assumptions hold) due to limited sample sizes, and hence residual plots are often used to make eyeball judgments on their validity.
\(e_i=y_i-\hat{y_i}\) - Residual
\(e_i^*=\frac{e_i}{\sqrt{MSE}}\) - Standardized Residual
\(e=(I-H)y\) - Vector of Residuals in matrix form; \(r_i=\frac{e_i}{\sqrt{MSE(1-h_{ii})}}\) - Studentized Residual, where \(h_{ii}\) is the \(i\)th diagonal element of \(H\) (see the sketch after this list)
Test for Constant Variance - Plot of Residuals vs. Fitted Values
Test for Normality - Normal Probability Plot of Residuals
Test for Outliers - Plot of Standardized/Studentized Residuals versus Fitted Values
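A minimal sketch of how these residual variants relate to R's built-in functions, using the illustrative data from earlier examples; note that rstandard() divides by \(\sqrt{MSE(1-h_{ii})}\), i.e. the studentized residual above, while rstudent() uses a leave-one-out estimate of \(\sigma^2\):
#Residual variants for a fitted model
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
e<-resid(model) #raw residuals e_i
MSE<-sum(e^2)/df.residual(model)
h<-hatvalues(model) #diagonal of the hat matrix
e/sqrt(MSE) #standardized residuals
e/sqrt(MSE*(1-h)) #studentized residuals, same as rstandard(model)
rstudent(model) #externally (leave-one-out) studentized residuals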
\(h(y)\propto \int [g(y)]^{-1/2}\, dy\) - Calculating VST transformation \(h(y)\) if \(Var(Y)=g(\mu)\) is known (a worked example follows this list)
\(h(y)=\sqrt{y}\) - VST if \(Y\sim Poisson(\lambda)\)
\(h(y) = \sin^{-1}(\sqrt{y})\) - VST if \(Y\sim Bin(n,p)\), with \(y\) a sample proportion
\(h(y) = \ln(y)\) - VST if \(Var(Y) \propto [E(Y)]^2\)
\(h(y) = 1/\sqrt{y}\) - VST if \(Var(Y) \propto [E(Y)]^3\)
\(h(y) = 1/y\) - VST if \(Var(Y) \propto [E(Y)]^4\)
\(h(y)=y^{\lambda^*}\) - VST power transformation of Box-Cox, where \(\lambda^*\) is the maximum likelihood estimate of \(\lambda\) in the fitted model \(y^{(\lambda)}=f(x)+\epsilon\) for \(y^{(\lambda)}\) defined in (5.1) of text.
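As a worked example of the integral formula above: if \(Y\sim Poisson(\lambda)\) then \(Var(Y)=g(\mu)=\mu\), so \(h(y)\propto \int y^{-1/2}\,dy = 2\sqrt{y} \propto \sqrt{y}\), recovering the square-root transformation listed above; similarly, \(Var(Y)\propto [E(Y)]^2\) gives \(h(y)\propto \int y^{-1}\,dy=\ln(y)\).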
plot(model) - Array of model adequacy plots for the fitted model
plot(model,1) - Plot of residuals versus fitted values
plot(model,2) - Normal Probability Plot of residuals
predict(model,data.frame(x=xnew),interval="confidence") - confidence interval at the points specified in xnew
predict(model,data.frame(x=xnew),interval="prediction") - prediction interval at the points specified in xnew
boxcox(y~x) - plot of likelihood function using Box-Cox (in package MASS)
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
plot(model,1) #Residuals versus Fitted (Constant Variance)
plot(model,2) #Normal Probability Plot
x<-seq(4,17,1)
y<-c(13.0,16.1,14.5,17.8,22.0,27.4,26.8,34.2,65.6,49.2,66.2,81.2,87.4,114.5)
#Response is known to be Poisson Distributed, use sqrt transformation
y<-sqrt(y)
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9031 -0.3071 -0.1079 0.5342 0.9378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67186 0.47069 1.427 0.179
## x 0.54081 0.04185 12.923 2.11e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6312 on 12 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.9274
## F-statistic: 167 on 1 and 12 DF, p-value: 2.11e-08
library(MASS)
x<-c(7,8,9,10,11,12)
y<-c(49,31,28,17,16,11)
boxcox(y~x) #choose lambda=0, i.e. the ln(y) transform, since it is the value closest to the MLE and 1 is not in the CI
y<-log(y) #the command log() in R is base e by default, and hence ln() mathematically
boxcox(y~x) #note that 1 is now in CI
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5 6
## 0.070620 -0.102847 0.079737 -0.134889 0.088853 -0.001474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.81176 0.25279 22.99 2.12e-05 ***
## x -0.28437 0.02619 -10.86 0.000408 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1096 on 4 degrees of freedom
## Multiple R-squared: 0.9672, Adjusted R-squared: 0.959
## F-statistic: 117.9 on 1 and 4 DF, p-value: 0.0004083
Observations within a dataset may be labeled by the extent to which they are an outlier in their response, have leverage in their predictors, and influence the fit of the model.
Those observations that have influence need to be carefully examined to determine whether they are mistakes (the result of measurement error) or extreme observations (possible but unlikely). A determination will then need to be made as to how to handle these points (leave, remove, or impute).
plot(model,3) - plot of the square root of the absolute value of the standardized residual by the fitted value
plot(model,4) - plot of Cook’s distance by observation number
plot(model,5) - plot of the standardized residual versus leverage, with contour lines of Cook's distance overlaid
plot(model,6) - plot of Cook’s distance by leverage
residfcn(model) - User-defined function (defined in the code below) that calculates the square root of the absolute value of the studentized residuals for outlier detection; built-in alternatives are sketched after this list
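Before the custom functions below, note that base R also provides built-in accessors for the same quantities; a minimal sketch using the same delivery-time data set loaded in the examples that follow:
#Built-in leverage and influence measures (alternative to the custom functions below)
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkDeliveryTime.csv")
model<-lm(DeliveryTime.min.~NumCases+Distance.ft.,data=dat)
sort(hatvalues(model),decreasing=TRUE)[1:5] #five largest leverage values
sort(cooks.distance(model),decreasing=TRUE)[1:5] #five largest Cook's distances
summary(influence.measures(model)) #observations flagged as potentially influential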
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkDeliveryTime.csv")
model<-lm(DeliveryTime.min.~NumCases+Distance.ft.,data=dat)
plot(model,3) #plot of the Sqrt of the Abs of the St. Residual by Fitted Value
plot(model,4) #plot of Cook's Distance by Obs. Number
plot(model,5) #plot of St. Residual by Leverage with Cook's Dist. C.I. Overlay
plot(model,6) #plot of Cooks's Dist. by Leverage
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkDeliveryTime.csv")
model<-lm(DeliveryTime.min.~NumCases+Distance.ft.,data=dat)
residfcn<-function(model){
dat<-as.data.frame(rstudent(model)) #studentized residuals for each observation
pt_num<-order(-dat[,1]) #observation numbers, sorted by descending residual
st_resid<-sqrt(abs(dat[order(-dat[,1]),])) #sqrt of the absolute studentized residuals, in the same order
print("Outliers - Sqrt of Absolute Value of Studentized Residuals")
cbind(pt_num,st_resid)}
residfcn(model)#Sqrt of the Abs of the St. Residual for each observation
## [1] "Outliers - Sqrt of Absolute Value of Studentized Residuals"
## pt_num st_resid
## [1,] 9 2.0762418
## [2,] 4 1.2802988
## [3,] 18 1.0579842
## [4,] 10 0.8982070
## [5,] 11 0.8425788
## [6,] 19 0.7548604
## [7,] 8 0.5994913
## [8,] 2 0.5979445
## [9,] 14 0.5780807
## [10,] 13 0.5643308
## [11,] 7 0.5144392
## [12,] 15 0.4535011
## [13,] 17 0.3673200
## [14,] 3 0.1253865
## [15,] 25 0.2568332
## [16,] 6 0.2978880
## [17,] 5 0.3722431
## [18,] 12 0.4347120
## [19,] 16 0.4667180
## [20,] 21 0.9343912
## [21,] 23 1.2175661
## [22,] 22 1.2205018
## [23,] 24 1.2418595
## [24,] 1 1.3021631
## [25,] 20 1.4130381
hatfcn<-function(model){
dat<-as.data.frame(hatvalues(model)) #leverage (hat) values for each observation
pt_num<-order(-dat[,1]) #observation numbers, sorted by descending leverage
hatval<-dat[order(-dat[,1]),] #hat values in the same order
print("Leverage - Hat Values")
cbind(pt_num,hatval)}
hatfcn(model)#leverage of each observation
## [1] "Leverage - Hat Values"
## pt_num hatval
## [1,] 9 0.49829216
## [2,] 22 0.39157522
## [3,] 10 0.19629595
## [4,] 16 0.16594043
## [5,] 21 0.16527689
## [6,] 24 0.12060826
## [7,] 12 0.11365570
## [8,] 1 0.10180178
## [9,] 20 0.10168486
## [10,] 3 0.09873476
## [11,] 19 0.09644857
## [12,] 18 0.09626046
## [13,] 11 0.08613260
## [14,] 4 0.08537479
## [15,] 7 0.08179867
## [16,] 14 0.07824332
## [17,] 5 0.07501050
## [18,] 2 0.07070164
## [19,] 25 0.06664345
## [20,] 8 0.06372559
## [21,] 13 0.06112463
## [22,] 17 0.05943202
## [23,] 6 0.04286693
## [24,] 23 0.04126005
## [25,] 15 0.04111077
cooksfcn<-function(model){
dat<-as.data.frame(cooks.distance(model)) #Cook's distance for each observation
pt_num<-order(-dat[,1]) #observation numbers, sorted by descending influence
cooksd<-dat[order(-dat[,1]),] #Cook's distances in the same order
print("Influence - Cooks Distance")
cbind(pt_num,cooksd)}
cooksfcn(model)#Influence of each observation
## [1] "Influence - Cooks Distance"
## pt_num cooksd
## [1,] 9 3.419318e+00
## [2,] 22 4.510455e-01
## [3,] 20 1.324449e-01
## [4,] 24 1.023224e-01
## [5,] 1 1.000921e-01
## [6,] 4 7.764718e-02
## [7,] 10 5.384516e-02
## [8,] 21 5.086063e-02
## [9,] 18 4.397807e-02
## [10,] 23 2.989892e-02
## [11,] 11 1.619975e-02
## [12,] 19 1.191868e-02
## [13,] 2 3.375704e-03
## [14,] 14 3.292786e-03
## [15,] 16 3.289086e-03
## [16,] 8 3.051135e-03
## [17,] 13 2.294737e-03
## [18,] 7 2.171604e-03
## [19,] 12 1.596392e-03
## [20,] 15 6.319880e-04
## [21,] 5 5.432217e-04
## [22,] 17 4.013419e-04
## [23,] 6 1.231067e-04
## [24,] 25 1.084694e-04
## [25,] 3 9.455785e-06