Simple linear regression refers to the fitting of a straight line (with a slope and intercept) that partially determines a random variable \(Y\) from a deterministic variable \(X\).
The formulas for the least squares estimates of the slope (\(\beta_1\)) and intercept (\(\beta_0\)) for the regression of a response/dependent/observed variable \(Y\) on a predictor/independent/controlled variable \(X\) may be derived using calculus (taking partial derivatives); no knowledge of the distribution of the response \(Y\) (or equivalently the error term \(\epsilon\)) is necessary. As such, a least squares line may always be fit to any given data (a hand-computation sketch follows the formulas below).
\(Y=\beta_0+\beta_1x+\epsilon\) - model equation
\(\hat{\beta_1}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}=\frac{\sum x_iy_i-\frac{\sum x_i\sum y_i}{n}}{\sum x_i^2-\frac{(\sum x_i)^2}{n}}\) - Least Squares Estimate of the Slope
\(\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}\) - Least Squares Estimate of the Intercept
\(SST=\sum(y_i-\bar{y})^2\) - Sum of Squares Total with \(n-1\) d.f.
\(SSE=\sum(y_i-\hat{y})^2\) - Sum of Squares Error with \(n-2\) d.f.
\(SSR=\sum(\hat{y}_i-\bar{y})^2\) - Sum of Squares Regression with \(1\) d.f.
\(SST=SSR+SSE\) - Partition of SST into SSR and SSE
\(R^2=\frac{SSR}{SST}\) - Coefficient of Determination, \(0\leq R^2 \leq 1\)
\(\hat{\sigma^2}=s^2=\frac{SSE}{n-2}=MSE\) - Estimation of population variance by \(MSE\)
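As a quick numerical check of these formulas, the estimates and sums of squares may be computed directly and compared against lm(); a minimal sketch, using the same small illustrative data set as the examples later in this section:
#Hand computation of the least squares estimates and sums of squares
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
n<-length(x)
b1<-sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2) #slope
b0<-mean(y)-b1*mean(x) #intercept
yhat<-b0+b1*x
SST<-sum((y-mean(y))^2) #n-1 d.f.
SSE<-sum((y-yhat)^2) #n-2 d.f.
SSR<-sum((yhat-mean(y))^2) #1 d.f.
c(b0=b0,b1=b1,R2=SSR/SST,MSE=SSE/(n-2))
coef(lm(y~x)) #b0 and b1 should match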
In the case that the response/error is Normally distributed, i.e. \(Y\sim N(\mu,\sigma^2) \equiv \epsilon\sim N(0,\sigma^2)\), one may perform inference on the quality of the fit and generate confidence/prediction intervals. It follows that the least squares fitted line is the same as the line that would be found using Maximum Likelihood Estimation, i.e. the fitted line is an estimate of the expected value of \(Y\) at \(X\) given by \(\hat{E[Y|X]}=\hat{\beta_0}+\hat{\beta_1}x\). Moreover, it follows that \(\hat{\beta_1}\sim N\left(\beta_1,\frac{\sigma^2}{\sum (x_i-\bar{x})^2}\right)\), \(\hat{E[Y|X_j]}\sim N\left(\beta_0+\beta_1x_j,\sigma^2\left[\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]\right)\), and \(\hat{Y}_{new}|X_j\sim N\left(\beta_0+\beta_1x_j,\sigma^2\left[1+\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]\right)\), from which one may perform hypothesis testing on the significance of the slope/regression and, if the null is rejected, generate confidence intervals on the mean response or prediction intervals on new observations at a point \(x=x_j\) (a hand-computation sketch follows the interval formulas below).
Reject Ho if \(\left|\frac{\hat{\beta_1}}{\sqrt{\frac{MSE}{\sum (x_i-\bar{x})^2 }}}\right|\geq t_{1-\alpha/2,n-2}\) - Hypothesis test on significance of slope/regression
Reject Ho if \(\frac{MSR}{MSE}>f_{1-\alpha,1,n-2}\) - Hypothesis test on significance of regression/slope
\(\hat{\beta_0} \pm t_{1-\alpha/2,n-2}\left(\sqrt{MSE\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum (x_i-\bar{x})^2}\right]}\right)\) - Confidence Interval on the Intercept
\(\hat{\beta_1} \pm t_{1-\alpha/2,n-2}\left(\sqrt{\frac{MSE}{\sum (x_i-\bar{x})^2}}\right)\) - Confidence Interval on the slope
\(\left(\hat{\beta_0}+\hat{\beta_1} x_j\right) \pm t_{1-\alpha/2,n-2}\left(\sqrt{MSE\left[\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]}\right)\) - Confidence Interval on the expected response at the point \(x_j\)
\(\left(\hat{\beta_0}+\hat{\beta_1} x_j\right) \pm t_{1-\alpha/2,n-2}\left(\sqrt{MSE\left[1+\frac{1}{n}+\frac{(x_j-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right]}\right)\) - Prediction Interval on the response at the point \(x_j\)
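These test statistics and intervals may likewise be verified by hand against summary(), confint(), and predict(); a minimal sketch, again using the illustrative data from the examples below:
#Hand computation of the slope t statistic and intervals
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
n<-length(x)
model<-lm(y~x)
MSE<-sum(resid(model)^2)/(n-2)
Sxx<-sum((x-mean(x))^2)
b1<-coef(model)[2]
b1/sqrt(MSE/Sxx) #t statistic for the slope, compare to summary(model)
b1+c(-1,1)*qt(1-0.05/2,n-2)*sqrt(MSE/Sxx) #95% CI on the slope, compare to confint(model)
xj<-2.5
se.mean<-sqrt(MSE*(1/n+(xj-mean(x))^2/Sxx))
(coef(model)[1]+b1*xj)+c(-1,1)*qt(1-0.05/2,n-2)*se.mean #95% CI on the mean response at xj
#compare to predict(model,data.frame(x=2.5),interval="confidence")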
lm() - linear model fitted using least squares
model<-lm(y~x) - Regression of y on x assigned to the R object/variable model
plot(x,y) - Scatterplot of data
plot(x,y); abline(model) - Add fitted regression line to the scatterplot
summary(model) - Estimates of the regression parameters and key statistics for the fitted model
#Fit an SLR model for the regression of Y on X
x<-c(1,2,3)
y<-c(4,2,3)
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 4.0 -0.5
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
plot(x,y)
abline(model)
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5 6 7
## 0.7500 -0.8182 0.6136 -0.8864 0.4773 -0.4091 0.2727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6818 0.5678 6.484 0.0013 **
## x -0.4318 0.1184 -3.647 0.0148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7568 on 5 degrees of freedom
## Multiple R-squared: 0.7268, Adjusted R-squared: 0.6722
## F-statistic: 13.3 on 1 and 5 DF, p-value: 0.01479
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
xnew<-seq(min(x),max(x),0.01)
conf<-predict(model,data.frame(x=xnew),interval="confidence")
pred<-predict(model,data.frame(x=xnew),interval="prediction")
plot(x,y)
abline(model)
lines(xnew,conf[,2])
lines(xnew,conf[,3])
lines(xnew,pred[,2])
lines(xnew,pred[,3])
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
xnew<-c(2.5,3.5)
predict(model,data.frame(x=xnew),interval="confidence")
## fit lwr upr
## 1 2.602273 1.713089 3.491456
## 2 2.170455 1.409583 2.931326
predict(model,data.frame(x=xnew),interval="prediction")
## fit lwr upr
## 1 2.602273 0.46330888 4.741237
## 2 2.170455 0.08156848 4.259341
Multiple regression refers to the fitting of a linear function of multiple deterministic variables in a matrix \(X\) that partially determines a single random variable \(Y\).
The mathematics of multiple regression is an extension of that of simple linear regression using matrix algebra. There is a single response variable \(Y\) that is regressed on multiple predictor variables \(X_i\), \(i \in \{1,\ldots,k\}\), which are combined into a mathematical function that is linear in the regression parameters \(\beta_i\), \(i \in \{0,1,\ldots,k\}\). Assuming \(n\) observations and letting \(p=k+1\), it follows that \(y\) is an \((n \times 1)\) vector of the response, \(X\) is an \((n \times p)\) matrix of predictor variables whose first column is a vector of ones, \(\beta\) is a \((p \times 1)\) vector of regression parameters, and \(\epsilon\) is an \((n \times 1)\) vector of the error terms (a matrix-algebra sketch follows the formulas below).
\(y= X \beta +\epsilon\) - model equation
\(\hat{\beta}= (X^TX)^{-1}X^Ty\) - Least Squares Estimates of the Regression Parameters
\(H=X(X^TX)^{-1}X^T\) - Hat Matrix
\(\hat{y}=Hy\) - Vector of Fitted Values
\(SSE=y^Ty-\hat{\beta}^TX^Ty\) - Sum of Squares Error
\(MSE=\frac{SSE}{(n-p)}\) - Mean Square Error
\(SSR=\hat{\beta}^TX^Ty-n(\bar{y})^2\) - Sum of Squares Regression
\(MSR=\frac{SSR}{p-1}\) - Mean Square Regression
\(SST=y^Ty-n(\bar{y})^2\) - Sum of Squares Total
\(R^2_{adj}=1-\frac{SSE/(n-p)}{SST/(n-1)}\) - Adjusted R-Square
Reject Ho if \(\frac{MSR}{MSE}>f_{1-\alpha,p-1,n-p}\) - Hypothesis test on significance of full regression model
Reject Ho if \(\frac{\frac{SSR_{full}-SSR_{reduced}}{r}}{MSE_{full}}>f_{1-\alpha,r,n-p}\) - Partial F-test on the significance of the regression parameters omitted in a reduced model, where \(r\) = number of parameters dropped
\(\sigma^2(X^TX)^{-1}=\sigma^2C\) - Covariance matrix of \(\hat{\beta}\): \(Var(\hat{\beta_j})=\sigma^2C_{jj}\) and \(Cov(\hat{\beta_i},\hat{\beta_j})=\sigma^2C_{ij}\)
Reject Ho if \(\left|\frac{\hat{\beta_j}}{\sqrt{(MSE)C_{jj}}}\right|\geq t_{1-\alpha/2,n-p}\) - Hypothesis test on significance of regression parameter \(\beta_j\)
\(\hat{\beta_j}\pm t_{1-\alpha/2,n-p}\sqrt{(MSE)C_{jj}}\) - Confidence Interval on \(\beta_j\)
\(\hat{y}_0\pm t_{1-\alpha/2,n-p}\sqrt{(MSE)x_0^T(X^TX)^{-1}x_0}\) - Confidence Interval on the Mean Response \(\hat{y}_0=x_0^T\hat{\beta}\) at the point \(x_0\)
\(\hat{y}_0\pm t_{1-\alpha/2,n-p}\sqrt{(MSE)(1+x_0^T(X^TX)^{-1}x_0)}\) - Prediction Interval on a New Observation \(\hat{y}_0=x_0^T\hat{\beta}\) at the point \(x_0\)
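The matrix formulas above may be verified directly in R; a minimal sketch with a small made-up two-predictor data set (the values of x1, x2, and y below are illustrative only, not taken from the examples that follow):
#Matrix-algebra least squares for a model with two predictors
x1<-c(1,2,3,4,5,6)
x2<-c(2,1,4,3,6,5)
y<-c(3,4,6,7,10,11)
X<-cbind(1,x1,x2) #n x p design matrix with a leading column of ones
beta.hat<-solve(t(X)%*%X)%*%t(X)%*%y #least squares estimates
H<-X%*%solve(t(X)%*%X)%*%t(X) #hat matrix
yhat<-H%*%y #fitted values
n<-length(y); p<-ncol(X)
SSE<-sum((y-yhat)^2); MSE<-SSE/(n-p)
cbind(beta.hat,coef(lm(y~x1+x2))) #the two columns should agree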
lm(y~x1+x2+x1:x2+I(x1^2)+I(x2^2)) - Least Squares Fit of Second Order Model with Two-Factor Interactions
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkCarb.csv")
head(dat)
## PercentCarbonation Temp Pressure
## 1 2.60 31.0 21
## 2 2.40 31.0 21
## 3 17.32 31.5 24
## 4 15.60 31.5 24
## 5 16.12 31.5 24
## 6 5.36 30.5 22
lm(PercentCarbonation~Temp+Pressure+Temp:Pressure+I(Temp^2)+I(Pressure^2),data=dat)
##
## Call:
## lm(formula = PercentCarbonation ~ Temp + Pressure + Temp:Pressure +
## I(Temp^2) + I(Pressure^2), data = dat)
##
## Coefficients:
## (Intercept) Temp Pressure I(Temp^2) I(Pressure^2)
## 3025.319 -194.273 -6.051 3.626 1.154
## Temp:Pressure
## -1.332
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkCarb.csv")
model<-lm(PercentCarbonation~Temp+Pressure+Temp:Pressure+I(Temp^2)+I(Pressure^2),data=dat)
summary(model)
##
## Call:
## lm(formula = PercentCarbonation ~ Temp + Pressure + Temp:Pressure +
## I(Temp^2) + I(Pressure^2), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75548 -0.22238 -0.00342 0.26024 0.96452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3025.3186 2045.7464 1.479 0.1897
## Temp -194.2729 132.0643 -1.471 0.1917
## Pressure -6.0507 20.6063 -0.294 0.7789
## I(Temp^2) 3.6259 2.2098 1.641 0.1519
## I(Pressure^2) 1.1542 0.3237 3.565 0.0118 *
## Temp:Pressure -1.3317 0.8962 -1.486 0.1878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6194 on 6 degrees of freedom
## Multiple R-squared: 0.9933, Adjusted R-squared: 0.9877
## F-statistic: 177.2 on 5 and 6 DF, p-value: 1.983e-06
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkCarb.csv")
Reduced<-lm(PercentCarbonation~Temp+Pressure,data=dat)
Full<-lm(PercentCarbonation~Temp+Pressure+Temp:Pressure+I(Temp^2)+I(Pressure^2),data=dat)
anova(Reduced,Full)
## Analysis of Variance Table
##
## Model 1: PercentCarbonation ~ Temp + Pressure
## Model 2: PercentCarbonation ~ Temp + Pressure + Temp:Pressure + I(Temp^2) +
## I(Pressure^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 9 8.1242
## 2 6 2.3022 3 5.822 5.0578 0.04415 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
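The partial F statistic reported by anova(Reduced,Full) above may also be computed by hand from the residual sums of squares, following the partial F-test formula given earlier; a minimal sketch reusing the Reduced and Full objects fitted above:
#Hand computation of the partial F statistic
SSE.red<-sum(resid(Reduced)^2) #residual SS of the reduced model
SSE.full<-sum(resid(Full)^2) #residual SS of the full model
r<-df.residual(Reduced)-df.residual(Full) #number of parameters dropped
Fstat<-((SSE.red-SSE.full)/r)/(SSE.full/df.residual(Full))
c(F=Fstat,p=1-pf(Fstat,r,df.residual(Full))) #compare to the anova() table above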
The assumptions of Normality and constant variance (homoscedasticity) in the response/error term, i.e. \(Y\sim N(\mu,\sigma^2) \equiv \epsilon\sim N(0,\sigma^2)\), must be verified for the inferential findings to be valid. Statistical tests of these assumptions often lack sufficient power to reject the null (under which the assumptions hold) due to limited sample sizes, and hence residual plots are often used to make eyeball judgments on their validity.
\(e_i=y_i-\hat{y_i}\) - Residual
\(e_i^*=\frac{e_i}{\sqrt{MSE}}\) - Standardized Residual
\(e=(I-H)y\) - Vector of Residuals in matrix form; \(r_i=\frac{e_i}{\sqrt{MSE(1-h_{ii})}}\) - Studentized Residual, where \(h_{ii}\) is the \(i\)th diagonal element of \(H\) (see the sketch after this list)
Test for Constant Variance - Plot of Residuals vs. Fitted Values
Test for Normality - Normal Probability Plot of Residuals
Test for Outliers - Plot of Standardized/Studentized Residuals versus Fitted Values
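A minimal sketch of how these residual variants relate to R's built-in functions, using the illustrative data from earlier examples; note that rstandard() divides by \(\sqrt{MSE(1-h_{ii})}\), i.e. the studentized residual above, while rstudent() uses a leave-one-out estimate of \(\sigma^2\):
#Residual variants for a fitted model
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
e<-resid(model) #raw residuals e_i
MSE<-sum(e^2)/df.residual(model)
h<-hatvalues(model) #diagonal of the hat matrix
e/sqrt(MSE) #standardized residuals
e/sqrt(MSE*(1-h)) #studentized residuals, same as rstandard(model)
rstudent(model) #externally (leave-one-out) studentized residuals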
\(h(y)\propto \int [g(y)]^{-1/2}\, dy\) - Calculating VST transformation \(h(y)\) if \(Var(Y)=g(\mu)\) is known (a worked example follows this list)
\(h(y)=\sqrt{y}\) - VST if \(Y\sim Poisson(\lambda)\)
\(h(y) = \sin^{-1}(\sqrt{y})\) - VST if \(Y\sim Bin(n,p)\), with \(y\) a sample proportion
\(h(y) = \ln(y)\) - VST if \(Var(Y) \propto [E(Y)]^2\)
\(h(y) = 1/\sqrt{y}\) - VST if \(Var(Y) \propto [E(Y)]^3\)
\(h(y) = 1/y\) - VST if \(Var(Y) \propto [E(Y)]^4\)
\(h(y)=y^{\lambda^*}\) - VST power transformation of Box-Cox, where \(\lambda^*\) is the maximum likelihood estimate of \(\lambda\) in the fitted model \(y^{(\lambda)}=f(x)+\epsilon\) for \(y^{(\lambda)}\) defined in (5.1) of text.
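As a worked example of the integral formula above: if \(Y\sim Poisson(\lambda)\) then \(Var(Y)=g(\mu)=\mu\), so \(h(y)\propto \int y^{-1/2}\,dy = 2\sqrt{y} \propto \sqrt{y}\), recovering the square-root transformation listed above; similarly, \(Var(Y)\propto [E(Y)]^2\) gives \(h(y)\propto \int y^{-1}\,dy=\ln(y)\).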
plot(model) - Array of model adequacy plots for the fitted model
plot(model,1) - Plot of residuals versus fitted values
plot(model,2) - Normal Probability Plot of residuals
predict(model,data.frame(x=xnew),interval="confidence") - confidence interval at the points specified in xnew
predict(model,data.frame(x=xnew),interval="prediction") - prediction interval at the points specified in xnew
boxcox(y~x) - plot of likelihood function using Box-Cox (in package MASS)
x<-c(1,2,3,3,5,7,8)
y<-c(4,2,3,1.5,2,0.25,0.5)
model<-lm(y~x)
plot(model,1) #Residuals versus Fitted (Constant Variance)
plot(model,2) #Normal Probability Plot
x<-seq(4,17,1)
y<-c(13.0,16.1,14.5,17.8,22.0,27.4,26.8,34.2,65.6,49.2,66.2,81.2,87.4,114.5)
#Response is known to be Poisson Distributed, use sqrt transformation
y<-sqrt(y)
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9031 -0.3071 -0.1079 0.5342 0.9378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67186 0.47069 1.427 0.179
## x 0.54081 0.04185 12.923 2.11e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6312 on 12 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.9274
## F-statistic: 167 on 1 and 12 DF, p-value: 2.11e-08
library(MASS)
x<-c(7,8,9,10,11,12)
y<-c(49,31,28,17,16,11)
boxcox(y~x) #choose lambda=0, i.e. the ln(y) transform, since it is the value closest to the MLE and 1 is not in the CI
y<-log(y) #the command log() in R is base e by default, and hence ln() mathematically
boxcox(y~x) #note that 1 is now in CI
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4 5 6
## 0.070620 -0.102847 0.079737 -0.134889 0.088853 -0.001474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.81176 0.25279 22.99 2.12e-05 ***
## x -0.28437 0.02619 -10.86 0.000408 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1096 on 4 degrees of freedom
## Multiple R-squared: 0.9672, Adjusted R-squared: 0.959
## F-statistic: 117.9 on 1 and 4 DF, p-value: 0.0004083
Observations within a dataset may be labeled by the extent to which they are an outlier in their response, have leverage in their predictors, and influence the fit of the model.
Those observations that have influence need to be carefully examined to determine whether they are mistakes (the result of measurement error) or extreme observations (possible but unlikely). A determination will then need to be made as to how to handle these points (leave, remove, or impute).
plot(model,3) - plot of the square root of the absolute value of the standardized residual by the fitted value
plot(model,4) - plot of Cook’s distance by observation number
plot(model,5) - plot of the standardized residual versus leverage, with contour lines of Cook's distance overlaid
plot(model,6) - plot of Cook’s distance by leverage
residfcn(model) - User-defined function (defined in the code below) that calculates the square root of the absolute value of the studentized residuals for outlier detection; built-in alternatives are sketched after this list
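Before the custom functions below, note that base R also provides built-in accessors for the same quantities; a minimal sketch using the same delivery-time data set loaded in the examples that follow:
#Built-in leverage and influence measures (alternative to the custom functions below)
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkDeliveryTime.csv")
model<-lm(DeliveryTime.min.~NumCases+Distance.ft.,data=dat)
sort(hatvalues(model),decreasing=TRUE)[1:5] #five largest leverage values
sort(cooks.distance(model),decreasing=TRUE)[1:5] #five largest Cook's distances
summary(influence.measures(model)) #observations flagged as potentially influential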
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkDeliveryTime.csv")
model<-lm(DeliveryTime.min.~NumCases+Distance.ft.,data=dat)
plot(model,3) #plot of the Sqrt of the Abs of the St. Residual by Fitted Value
plot(model,4) #plot of Cook's Distance by Obs. Number
plot(model,5) #plot of St. Residual by Leverage with Cook's Dist. C.I. Overlay
plot(model,6) #plot of Cooks's Dist. by Leverage
dat<-read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/main/SoftDrinkDeliveryTime.csv")
model<-lm(DeliveryTime.min.~NumCases+Distance.ft.,data=dat)
residfcn<-function(model){
dat<-as.data.frame(rstudent(model)) #studentized residuals for each observation
pt_num<-order(-dat[,1]) #observation numbers, sorted by descending residual
st_resid<-sqrt(abs(dat[order(-dat[,1]),])) #sqrt of the absolute studentized residuals, in the same order
print("Outliers - Sqrt of Absolute Value of Studentized Residuals")
cbind(pt_num,st_resid)}
residfcn(model)#Sqrt of the Abs of the St. Residual for each observation
## [1] "Outliers - Sqrt of Absolute Value of Studentized Residuals"
## pt_num st_resid
## [1,] 9 2.0762418
## [2,] 4 1.2802988
## [3,] 18 1.0579842
## [4,] 10 0.8982070
## [5,] 11 0.8425788
## [6,] 19 0.7548604
## [7,] 8 0.5994913
## [8,] 2 0.5979445
## [9,] 14 0.5780807
## [10,] 13 0.5643308
## [11,] 7 0.5144392
## [12,] 15 0.4535011
## [13,] 17 0.3673200
## [14,] 3 0.1253865
## [15,] 25 0.2568332
## [16,] 6 0.2978880
## [17,] 5 0.3722431
## [18,] 12 0.4347120
## [19,] 16 0.4667180
## [20,] 21 0.9343912
## [21,] 23 1.2175661
## [22,] 22 1.2205018
## [23,] 24 1.2418595
## [24,] 1 1.3021631
## [25,] 20 1.4130381
hatfcn<-function(model){
dat<-as.data.frame(hatvalues(model)) #leverage (hat) values for each observation
pt_num<-order(-dat[,1]) #observation numbers, sorted by descending leverage
hatval<-dat[order(-dat[,1]),] #hat values in the same order
print("Leverage - Hat Values")
cbind(pt_num,hatval)}
hatfcn(model)#leverage of each observation
## [1] "Leverage - Hat Values"
## pt_num hatval
## [1,] 9 0.49829216
## [2,] 22 0.39157522
## [3,] 10 0.19629595
## [4,] 16 0.16594043
## [5,] 21 0.16527689
## [6,] 24 0.12060826
## [7,] 12 0.11365570
## [8,] 1 0.10180178
## [9,] 20 0.10168486
## [10,] 3 0.09873476
## [11,] 19 0.09644857
## [12,] 18 0.09626046
## [13,] 11 0.08613260
## [14,] 4 0.08537479
## [15,] 7 0.08179867
## [16,] 14 0.07824332
## [17,] 5 0.07501050
## [18,] 2 0.07070164
## [19,] 25 0.06664345
## [20,] 8 0.06372559
## [21,] 13 0.06112463
## [22,] 17 0.05943202
## [23,] 6 0.04286693
## [24,] 23 0.04126005
## [25,] 15 0.04111077
cooksfcn<-function(model){
dat<-as.data.frame(cooks.distance(model)) #Cook's distance for each observation
pt_num<-order(-dat[,1]) #observation numbers, sorted by descending influence
cooksd<-dat[order(-dat[,1]),] #Cook's distances in the same order
print("Influence - Cooks Distance")
cbind(pt_num,cooksd)}
cooksfcn(model)#Influence of each observation
## [1] "Influence - Cooks Distance"
## pt_num cooksd
## [1,] 9 3.419318e+00
## [2,] 22 4.510455e-01
## [3,] 20 1.324449e-01
## [4,] 24 1.023224e-01
## [5,] 1 1.000921e-01
## [6,] 4 7.764718e-02
## [7,] 10 5.384516e-02
## [8,] 21 5.086063e-02
## [9,] 18 4.397807e-02
## [10,] 23 2.989892e-02
## [11,] 11 1.619975e-02
## [12,] 19 1.191868e-02
## [13,] 2 3.375704e-03
## [14,] 14 3.292786e-03
## [15,] 16 3.289086e-03
## [16,] 8 3.051135e-03
## [17,] 13 2.294737e-03
## [18,] 7 2.171604e-03
## [19,] 12 1.596392e-03
## [20,] 15 6.319880e-04
## [21,] 5 5.432217e-04
## [22,] 17 4.013419e-04
## [23,] 6 1.231067e-04
## [24,] 25 1.084694e-04
## [25,] 3 9.455785e-06