library(qcc)
## Warning: package 'qcc' was built under R version 4.3.3
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
library(stats)
library(MASS)

OK, lecture 9 in Tim Matis's series on statistical analysis/DoE.

There's some error in our model; that's the epsilon. Linear regression is linear in the regression coefficients, not necessarily in the predictors, so the model itself can be curved. Do we ever do truly nonlinear regression? Yes, but it's specialized, complicated, and rare. The simplest function we can write is y = f(x) + epsilon. For a linear function we have y = mx + b, usually written y = beta0 + beta1 x + epsilon: intercept, slope, and error. What if we have other predictors? Add beta2, beta3, etc. with arguments x2, x3, and we might need interaction terms again. As long as the betas appear only to the first power, it is a linear regression function. What's not linear? Models where the betas show up as, say, beta0^2 or exp(beta1).
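A quick sketch in R of what "linear in the betas" means, on made-up data (the numbers and the nls starting values here are arbitrary, purely for illustration):

# Simulated data: a curved relationship that is still a *linear* regression,
# because the betas enter only to the first power
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3*x - 0.2*x^2 + rnorm(30)
lm(y ~ x + I(x^2))      # linear in beta0, beta1, beta2 even though it's quadratic in x

# Genuinely nonlinear in the parameters, e.g. y = a*exp(b*x) + error,
# needs nonlinear least squares (nls) and starting guesses for a and b
y2 <- 5*exp(0.2*x) + rnorm(30)
nls(y2 ~ a*exp(b*x), start = list(a = 4, b = 0.1))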

Machine learning does not have a model; linear regression always has a model, a function, a theory behind it. Machine learning is really a black box. It is really good at classification, but regression does much better than machine learning on continuous responses.

Step 1 - write the model

Step 2 - fit the model, i.e., estimate the parameters, the betas

Two ways to do this:

Least squares is just calculus; it doesn't assume normality in the data or in the errors. Maximum likelihood requires that we know the distribution of the response. Maximum likelihood is not hard to understand, but the math is nasty.

Least squares:

Sum the squares of the deviations between the observations and the model, take partial derivatives with respect to each of the betas, and set them equal to zero. With k + 1 betas that gives k + 1 equations in k + 1 unknowns (the normal equations); the betas that solve them minimize the sum of squares. Simple linear regression is just the one-predictor case of the same recipe.

So, make a data table:

y1   1   x11   x12   ...   x1k
y2   1   x21   x22   ...   x2k
...
yn   1   xn1   xn2   ...   xnk

Each row is one observation: the response, a 1 for the intercept, then that observation's k predictor values.

The solution is betahat = (X'X)^-1 X'y, where X is the matrix of the 1s-and-predictors columns above and y is the column of responses.
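A small sketch of that formula in R, on simulated data (all the names and numbers here are made up): build the design matrix, apply the normal-equations solution, and compare to lm().

# Least squares "by hand" via the normal equations, betahat = (X'X)^-1 X'y
set.seed(2)
n  <- 20
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n, sd = 0.5)

X <- cbind(1, x1, x2)                     # column of 1s for the intercept, then the predictors
betahat <- solve(t(X) %*% X, t(X) %*% y)  # solves (X'X) betahat = X'y
betahat
coef(lm(y ~ x1 + x2))                     # lm() gives the same estimates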

We always get a line; then we have to figure out how good it is.

Now, what about maximum likelihood?

For maximum likelihood we need to know the distribution of the response. Assuming the observations are independent, the joint probability distribution is the product of all the individual probability distributions; that product is the likelihood. Take partial derivatives (of its log, in practice) with respect to the betas, set them equal to zero, and solve the simultaneous equations.
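A minimal numerical sketch of that, assuming normal errors and simulated data (for normal errors the maximum likelihood betas agree with least squares):

# Maximum likelihood for y = b0 + b1*x + e, e ~ Normal(0, sigma^2)
set.seed(3)
x <- runif(40, 0, 5)
y <- 1.5 + 0.8*x + rnorm(40, sd = 0.7)

negloglik <- function(par) {
  b0 <- par[1]; b1 <- par[2]; sigma <- exp(par[3])     # exp() keeps sigma positive
  -sum(dnorm(y, mean = b0 + b1*x, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0, 0), negloglik)   # minimize the negative log-likelihood
fit$par[1:2]                          # ML estimates of b0 and b1
coef(lm(y ~ x))                       # least squares gives essentially the same betas

OK, now an example with real data: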

For this example, wire length and die height are predictors of pull strength.

dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/WireBondSemiConductorMfg.csv")
head(dat)
##   Pull.Strength..Y. Wire.Length..X1. Die.Height..X2.
## 1              9.95                2              50
## 2             24.45                8             110
## 3             31.75               11             120
## 4             35.00               10             550
## 5             25.02                8             295
## 6             16.86                4             200
colnames(dat)<-c("Strength","Length", "Height")
head(dat)
##   Strength Length Height
## 1     9.95      2     50
## 2    24.45      8    110
## 3    31.75     11    120
## 4    35.00     10    550
## 5    25.02      8    295
## 6    16.86      4    200
plot(dat$Length, dat$Strength,
     main="Wire Bond: Pull Strength vs. Wire Length",
     xlab="Length",
     ylab="Strength")

plot(dat$Height, dat$Strength,
     main="Wire Bond: Pull Strength vs. Die Height",
     xlab="Height",
     ylab="Strength")

model1<-lm(Strength~Length, data=dat)
plot(dat$Length, dat$Strength,
     main="Wire Bond: Pull Strength vs. Wire Length",
     xlab="Length",
     ylab="Strength")
abline(model1)

model1$coefficients
## (Intercept)      Length 
##    5.114516    2.902704
summary(model1)
## 
## Call:
## lm(formula = Strength ~ Length, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8889 -1.3199  0.3547  2.0030  5.8314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.114      1.146   4.464 0.000177 ***
## Length         2.903      0.117  24.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.093 on 23 degrees of freedom
## Multiple R-squared:  0.964,  Adjusted R-squared:  0.9624 
## F-statistic: 615.1 on 1 and 23 DF,  p-value: < 2.2e-16
model2<-lm(Strength~Height,data=dat)
plot(dat$Height, dat$Strength,
     main="Wire Bond: Pull Strength vs. Die Height",
     xlab="Height",
     ylab="Strength")
abline(model2)

model2$coefficients
## (Intercept)      Height 
## 14.56780276  0.04360079
summary(model2)
## 
## Call:
## lm(formula = Strength ~ Height, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.774  -7.235  -1.856   5.582  28.272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 14.56780    6.03259   2.415   0.0241 *
## Height       0.04360    0.01605   2.717   0.0123 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.18 on 23 degrees of freedom
## Multiple R-squared:  0.2429, Adjusted R-squared:   0.21 
## F-statistic:  7.38 on 1 and 23 DF,  p-value: 0.01231

Now let’s try multiple regression:

model3<-lm(Strength~Length+Height+Length:Height,data=dat)
model3$coefficients
##   (Intercept)        Length        Height Length:Height 
##   6.602976132   2.124315904   0.001064586   0.001481100
summary(model3)
## 
## Call:
## lm(formula = Strength ~ Length + Height + Length:Height, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8113 -1.1029 -0.0058  0.5374  5.1170 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.6029761  1.4158295   4.664 0.000133 ***
## Length        2.1243159  0.1791705  11.856 9.09e-11 ***
## Height        0.0010646  0.0037394   0.285 0.778665    
## Length:Height 0.0014811  0.0003901   3.796 0.001056 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.803 on 21 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9872 
## F-statistic: 618.8 on 3 and 21 DF,  p-value: < 2.2e-16
plot(model3)

Insert Tim's lm in R here

Now consider a different data set. This would be a Bernoulli experiment: the response is pass/fail, 0/1, etc.

newdat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/XFab_SECOM_LogisticRegression.csv")
nrow(newdat)
## [1] 1567
ncol(newdat)
## [1] 592
str(newdat)
## 'data.frame':    1567 obs. of  592 variables:
##  $ Class     : chr  "pass" "pass" "fail" "pass" ...
##  $ Time      : chr  "7/19/08 11:55" "7/19/08 12:32" "7/19/08 13:17" "7/19/08 14:43" ...
##  $ Feature1  : num  3031 3096 2933 2989 3032 ...
##  $ Feature2  : num  2564 2465 2560 2480 2503 ...
##  $ Feature3  : num  2188 2230 2186 2199 2233 ...
##  $ Feature4  : num  1411 1464 1698 910 1327 ...
##  $ Feature5  : num  1.36 0.829 1.51 1.32 1.533 ...
##  $ Feature6  : int  100 100 100 100 100 100 100 100 100 100 ...
##  $ Feature7  : num  97.6 102.3 95.5 104.2 100.4 ...
##  $ Feature8  : num  0.124 0.125 0.124 0.122 0.123 ...
##  $ Feature9  : num  1.5 1.5 1.44 1.49 1.5 ...
##  $ Feature10 : num  0.0162 -0.0005 0.0041 -0.0124 -0.0031 0.0167 -0.027 0.0157 0.0111 0.0159 ...
##  $ Feature11 : num  -0.0034 -0.0148 0.0013 -0.0033 -0.0072 0.0055 0.0105 0.0007 -0.0066 0.0049 ...
##  $ Feature12 : num  0.946 0.963 0.962 0.963 0.957 ...
##  $ Feature13 : num  202 201 202 202 202 ...
##  $ Feature14 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Feature15 : num  7.96 10.15 9.52 9.61 10.57 ...
##  $ Feature16 : num  415 415 417 422 421 ...
##  $ Feature17 : num  10.04 9.26 9.31 9.69 10.34 ...
##  $ Feature18 : num  0.968 0.97 0.967 0.969 0.974 ...
##  $ Feature19 : num  192 191 193 192 192 ...
##  $ Feature20 : num  12.5 12.5 12.5 12.5 12.5 ...
##  $ Feature21 : num  1.4 1.38 1.41 1.4 1.39 ...
##  $ Feature22 : num  -5419 -5442 -5448 -5468 -5476 ...
##  $ Feature23 : num  2916 2604 2702 2648 2635 ...
##  $ Feature24 : num  -4044 -3499 -4047 -4515 -3988 ...
##  $ Feature25 : num  751 -1640 -1916 -1657 117 ...
##  $ Feature26 : num  0.895 1.297 1.312 1.314 1.289 ...
##  $ Feature27 : num  1.77 2.01 2.03 2 1.99 ...
##  $ Feature28 : num  3.05 7.39 7.58 7.31 7.27 ...
##  $ Feature29 : num  64.2 68.4 67.1 62.9 62.8 ...
##  $ Feature30 : num  2.02 2.27 2.33 2.64 3.16 ...
##  $ Feature31 : num  0.163 0.21 0.173 0.207 0.27 ...
##  $ Feature32 : num  3.52 3.42 3.6 3.38 3.27 ...
##  $ Feature33 : num  83.4 84.9 84.8 84.9 86.3 ...
##  $ Feature34 : num  9.51 9.8 8.66 8.68 8.77 ...
##  $ Feature35 : num  50.6 50.7 50.2 50.5 50.2 ...
##  $ Feature36 : num  64.3 64.3 64.1 64.1 64.2 ...
##  $ Feature37 : num  49.4 49.3 49.8 49.5 49.8 ...
##  $ Feature38 : num  66.3 64.9 65.8 65.2 66.2 ...
##  $ Feature39 : num  87 87.5 84.7 86.7 86.1 ...
##  $ Feature40 : num  118 118 119 117 121 ...
##  $ Feature41 : num  61.3 78.2 14.4 76.9 76.4 ...
##  $ Feature42 : num  4.51 2.77 5.43 1.28 2.21 ...
##  $ Feature43 : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ Feature44 : num  353 352 364 363 353 ...
##  $ Feature45 : num  10.18 10.04 9.88 9.93 10.41 ...
##  $ Feature46 : num  130 133 132 132 176 ...
##  $ Feature47 : num  723 725 735 734 790 ...
##  $ Feature48 : num  1.31 1.29 1.3 1.3 1.03 ...
##  $ Feature49 : num  141 146 141 143 138 ...
##  $ Feature50 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Feature51 : num  624 631 637 637 668 ...
##  $ Feature52 : num  218 205 186 190 234 ...
##  $ Feature53 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Feature54 : num  4.59 4.59 4.49 4.49 4.62 ...
##  $ Feature55 : num  4.84 4.84 4.75 4.75 4.89 ...
##  $ Feature56 : int  2834 2853 2936 2936 2865 2865 2853 2865 2865 2865 ...
##  $ Feature57 : num  0.932 0.932 0.914 0.914 0.93 ...
##  $ Feature58 : num  0.948 0.948 0.945 0.945 0.945 ...
##  $ Feature59 : num  4.71 4.68 4.59 4.59 4.64 ...
##  $ Feature60 : num  -1.726 0.807 23.825 24.379 -12.294 ...
##  $ Feature61 : num  351 352 365 361 355 ...
##  $ Feature62 : num  10.62 10.31 10.17 10.21 9.79 ...
##  $ Feature63 : num  109 114 116 116 144 ...
##  $ Feature64 : num  16.1 10.9 11.3 13.6 22 ...
##  $ Feature65 : num  21.7 19.2 16.2 15.6 32.3 ...
##  $ Feature66 : num  29.5 27.6 24.3 23.5 44.1 ...
##  $ Feature67 : num  694 697 711 710 746 ...
##  $ Feature68 : num  0.923 1.16 0.869 0.976 0.926 ...
##  $ Feature69 : num  149 154 146 148 147 ...
##  $ Feature70 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Feature71 : num  608 620 626 625 646 ...
##  $ Feature72 : num  84.1 82.3 84.8 70.2 65.8 ...
##  $ Feature73 : num  NA NA 141 160 NA ...
##  $ Feature74 : num  NA NA 485 465 NA ...
##  $ Feature75 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Feature76 : num  0.0126 -0.0039 -0.0078 -0.0555 -0.0534 0.0198 0.0227 -0.0187 -0.0088 -0.0117 ...
##  $ Feature77 : num  -0.0206 -0.0198 -0.0326 -0.0461 0.0183 -0.055 -0.0132 -0.0594 -0.0523 -0.0506 ...
##  $ Feature78 : num  0.0141 0.0004 -0.0052 -0.04 -0.0167 -0.001 -0.0321 0.0249 -0.0629 -0.0085 ...
##  $ Feature79 : num  -0.0307 -0.044 0.0213 0.04 -0.0449 -0.0074 -0.0389 0.0041 0.0067 0.0065 ...
##  $ Feature80 : num  -0.0083 -0.0358 -0.0054 0.0676 0.0034 -0.0026 -0.0154 0.0002 -0.0047 0.0102 ...
##  $ Feature81 : num  -0.0026 -0.012 -0.1134 -0.1051 -0.0178 ...
##  $ Feature82 : num  -0.0567 -0.0377 -0.0182 0.0028 -0.0123 -0.0111 -0.0281 0.0004 -0.0251 -0.0375 ...
##  $ Feature83 : num  -0.0044 0.0017 0.0287 0.0277 -0.0048 0.021 0.0288 0.0086 0.0306 0.0008 ...
##  $ Feature84 : num  7.22 6.8 7.1 7.59 7.5 ...
##  $ Feature85 : num  0.132 0.136 0.136 0.13 0.134 ...
##  $ Feature86 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Feature87 : num  2.39 2.38 2.45 2.4 2.45 ...
##  $ Feature88 : num  0.969 0.989 0.988 0.99 0.99 ...
##  $ Feature89 : num  1748 1932 1686 1752 1828 ...
##  $ Feature90 : num  0.184 0.187 0.15 0.196 0.183 ...
##  $ Feature91 : num  8672 8407 9317 8206 9014 ...
##  $ Feature92 : num  -0.3274 0.1455 0.0553 0.0697 0.0448 ...
##  $ Feature93 : num  -0.0055 -0.0015 0.0006 -0.0003 -0.0077 0.0009 0.0013 -0.0028 0.0005 -0.0025 ...
##  $ Feature94 : num  -0.0001 0 -0.0013 -0.0021 -0.0001 -0.0004 -0.0028 -0.0018 -0.0033 0.0004 ...
##  $ Feature95 : num  1e-04 -5e-04 0e+00 -1e-04 -1e-04 -2e-04 0e+00 0e+00 0e+00 0e+00 ...
##  $ Feature96 : num  3e-04 1e-04 2e-04 2e-04 -1e-04 1e-04 1e-04 2e-04 2e-04 2e-04 ...
##  $ Feature97 : num  -0.2786 0.5854 -0.1343 0.0411 0.2189 ...
##   [list output truncated]

590 predictors? And it's a pass/fail condition. How do we handle this? Machine learning, maybe. The thing is, the response here is Bernoulli (binomial), so there is a regression model for it too. In R,

newdat$Class<-as.factor(newdat$Class)
str(newdat)
## 'data.frame':    1567 obs. of  592 variables:
##  $ Class     : Factor w/ 2 levels "fail","pass": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Time      : chr  "7/19/08 11:55" "7/19/08 12:32" "7/19/08 13:17" "7/19/08 14:43" ...
##  $ Feature1  : num  3031 3096 2933 2989 3032 ...
##  $ Feature2  : num  2564 2465 2560 2480 2503 ...
##  $ Feature3  : num  2188 2230 2186 2199 2233 ...
##  $ Feature4  : num  1411 1464 1698 910 1327 ...
##  $ Feature5  : num  1.36 0.829 1.51 1.32 1.533 ...
##  $ Feature6  : int  100 100 100 100 100 100 100 100 100 100 ...
##  $ Feature7  : num  97.6 102.3 95.5 104.2 100.4 ...
##  $ Feature8  : num  0.124 0.125 0.124 0.122 0.123 ...
##  $ Feature9  : num  1.5 1.5 1.44 1.49 1.5 ...
##  $ Feature10 : num  0.0162 -0.0005 0.0041 -0.0124 -0.0031 0.0167 -0.027 0.0157 0.0111 0.0159 ...
##  $ Feature11 : num  -0.0034 -0.0148 0.0013 -0.0033 -0.0072 0.0055 0.0105 0.0007 -0.0066 0.0049 ...
##  $ Feature12 : num  0.946 0.963 0.962 0.963 0.957 ...
##  $ Feature13 : num  202 201 202 202 202 ...
##  $ Feature14 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Feature15 : num  7.96 10.15 9.52 9.61 10.57 ...
##  $ Feature16 : num  415 415 417 422 421 ...
##  $ Feature17 : num  10.04 9.26 9.31 9.69 10.34 ...
##  $ Feature18 : num  0.968 0.97 0.967 0.969 0.974 ...
##  $ Feature19 : num  192 191 193 192 192 ...
##  $ Feature20 : num  12.5 12.5 12.5 12.5 12.5 ...
##  $ Feature21 : num  1.4 1.38 1.41 1.4 1.39 ...
##  $ Feature22 : num  -5419 -5442 -5448 -5468 -5476 ...
##  $ Feature23 : num  2916 2604 2702 2648 2635 ...
##  $ Feature24 : num  -4044 -3499 -4047 -4515 -3988 ...
##  $ Feature25 : num  751 -1640 -1916 -1657 117 ...
##  $ Feature26 : num  0.895 1.297 1.312 1.314 1.289 ...
##  $ Feature27 : num  1.77 2.01 2.03 2 1.99 ...
##  $ Feature28 : num  3.05 7.39 7.58 7.31 7.27 ...
##  $ Feature29 : num  64.2 68.4 67.1 62.9 62.8 ...
##  $ Feature30 : num  2.02 2.27 2.33 2.64 3.16 ...
##  $ Feature31 : num  0.163 0.21 0.173 0.207 0.27 ...
##  $ Feature32 : num  3.52 3.42 3.6 3.38 3.27 ...
##  $ Feature33 : num  83.4 84.9 84.8 84.9 86.3 ...
##  $ Feature34 : num  9.51 9.8 8.66 8.68 8.77 ...
##  $ Feature35 : num  50.6 50.7 50.2 50.5 50.2 ...
##  $ Feature36 : num  64.3 64.3 64.1 64.1 64.2 ...
##  $ Feature37 : num  49.4 49.3 49.8 49.5 49.8 ...
##  $ Feature38 : num  66.3 64.9 65.8 65.2 66.2 ...
##  $ Feature39 : num  87 87.5 84.7 86.7 86.1 ...
##  $ Feature40 : num  118 118 119 117 121 ...
##  $ Feature41 : num  61.3 78.2 14.4 76.9 76.4 ...
##  $ Feature42 : num  4.51 2.77 5.43 1.28 2.21 ...
##  $ Feature43 : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ Feature44 : num  353 352 364 363 353 ...
##  $ Feature45 : num  10.18 10.04 9.88 9.93 10.41 ...
##  $ Feature46 : num  130 133 132 132 176 ...
##  $ Feature47 : num  723 725 735 734 790 ...
##  $ Feature48 : num  1.31 1.29 1.3 1.3 1.03 ...
##  $ Feature49 : num  141 146 141 143 138 ...
##  $ Feature50 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Feature51 : num  624 631 637 637 668 ...
##  $ Feature52 : num  218 205 186 190 234 ...
##  $ Feature53 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Feature54 : num  4.59 4.59 4.49 4.49 4.62 ...
##  $ Feature55 : num  4.84 4.84 4.75 4.75 4.89 ...
##  $ Feature56 : int  2834 2853 2936 2936 2865 2865 2853 2865 2865 2865 ...
##  $ Feature57 : num  0.932 0.932 0.914 0.914 0.93 ...
##  $ Feature58 : num  0.948 0.948 0.945 0.945 0.945 ...
##  $ Feature59 : num  4.71 4.68 4.59 4.59 4.64 ...
##  $ Feature60 : num  -1.726 0.807 23.825 24.379 -12.294 ...
##  $ Feature61 : num  351 352 365 361 355 ...
##  $ Feature62 : num  10.62 10.31 10.17 10.21 9.79 ...
##  $ Feature63 : num  109 114 116 116 144 ...
##  $ Feature64 : num  16.1 10.9 11.3 13.6 22 ...
##  $ Feature65 : num  21.7 19.2 16.2 15.6 32.3 ...
##  $ Feature66 : num  29.5 27.6 24.3 23.5 44.1 ...
##  $ Feature67 : num  694 697 711 710 746 ...
##  $ Feature68 : num  0.923 1.16 0.869 0.976 0.926 ...
##  $ Feature69 : num  149 154 146 148 147 ...
##  $ Feature70 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Feature71 : num  608 620 626 625 646 ...
##  $ Feature72 : num  84.1 82.3 84.8 70.2 65.8 ...
##  $ Feature73 : num  NA NA 141 160 NA ...
##  $ Feature74 : num  NA NA 485 465 NA ...
##  $ Feature75 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Feature76 : num  0.0126 -0.0039 -0.0078 -0.0555 -0.0534 0.0198 0.0227 -0.0187 -0.0088 -0.0117 ...
##  $ Feature77 : num  -0.0206 -0.0198 -0.0326 -0.0461 0.0183 -0.055 -0.0132 -0.0594 -0.0523 -0.0506 ...
##  $ Feature78 : num  0.0141 0.0004 -0.0052 -0.04 -0.0167 -0.001 -0.0321 0.0249 -0.0629 -0.0085 ...
##  $ Feature79 : num  -0.0307 -0.044 0.0213 0.04 -0.0449 -0.0074 -0.0389 0.0041 0.0067 0.0065 ...
##  $ Feature80 : num  -0.0083 -0.0358 -0.0054 0.0676 0.0034 -0.0026 -0.0154 0.0002 -0.0047 0.0102 ...
##  $ Feature81 : num  -0.0026 -0.012 -0.1134 -0.1051 -0.0178 ...
##  $ Feature82 : num  -0.0567 -0.0377 -0.0182 0.0028 -0.0123 -0.0111 -0.0281 0.0004 -0.0251 -0.0375 ...
##  $ Feature83 : num  -0.0044 0.0017 0.0287 0.0277 -0.0048 0.021 0.0288 0.0086 0.0306 0.0008 ...
##  $ Feature84 : num  7.22 6.8 7.1 7.59 7.5 ...
##  $ Feature85 : num  0.132 0.136 0.136 0.13 0.134 ...
##  $ Feature86 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Feature87 : num  2.39 2.38 2.45 2.4 2.45 ...
##  $ Feature88 : num  0.969 0.989 0.988 0.99 0.99 ...
##  $ Feature89 : num  1748 1932 1686 1752 1828 ...
##  $ Feature90 : num  0.184 0.187 0.15 0.196 0.183 ...
##  $ Feature91 : num  8672 8407 9317 8206 9014 ...
##  $ Feature92 : num  -0.3274 0.1455 0.0553 0.0697 0.0448 ...
##  $ Feature93 : num  -0.0055 -0.0015 0.0006 -0.0003 -0.0077 0.0009 0.0013 -0.0028 0.0005 -0.0025 ...
##  $ Feature94 : num  -0.0001 0 -0.0013 -0.0021 -0.0001 -0.0004 -0.0028 -0.0018 -0.0033 0.0004 ...
##  $ Feature95 : num  1e-04 -5e-04 0e+00 -1e-04 -1e-04 -2e-04 0e+00 0e+00 0e+00 0e+00 ...
##  $ Feature96 : num  3e-04 1e-04 2e-04 2e-04 -1e-04 1e-04 1e-04 2e-04 2e-04 2e-04 ...
##  $ Feature97 : num  -0.2786 0.5854 -0.1343 0.0411 0.2189 ...
##   [list output truncated]
contrasts(as.factor(newdat$Class))
##      pass
## fail    0
## pass    1

We'll use a generalized linear model.

model<-glm(Class~Feature1+Feature2+Feature3,family=binomial, data=newdat)
model$coefficients
##   (Intercept)      Feature1      Feature2      Feature3 
## -2.4963994165  0.0013849361  0.0001795779  0.0002330647
summary(model)
## 
## Call:
## glm(formula = Class ~ Feature1 + Feature2 + Feature3, family = binomial, 
##     data = newdat)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4963994  9.5500232  -0.261    0.794
## Feature1     0.0013849  0.0014292   0.969    0.333
## Feature2     0.0001796  0.0013108   0.137    0.891
## Feature3     0.0002331  0.0034634   0.067    0.946
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 756.14  on 1539  degrees of freedom
## Residual deviance: 755.18  on 1536  degrees of freedom
##   (27 observations deleted due to missingness)
## AIC: 763.18
## 
## Number of Fisher Scoring iterations: 5

With family = binomial this is logistic regression: the model is for the log-odds of passing, not for the 0/1 response directly.
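As a quick check on what the fit means (a sketch, using the model object above): predict() with type = "response" converts the fitted log-odds into probabilities of "pass", the level coded 1 in the contrasts.

phat <- predict(model, type = "response")   # predicted probability of "pass" for each used row
head(phat)

eta <- predict(model, type = "link")        # the linear predictor (log-odds)
head(1/(1 + exp(-eta)))                     # inverse logit by hand gives the same probabilities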

Back to the wire bond models and the residuals. We would really like to assume constant variance; this is very important. A residual is the difference between the observed value and the model's prediction; the epsilons are this difference, and we are assuming they are normal with mean zero and variance sigma squared. The third step in our regression is to assess the adequacy of the model by examining the residuals: take a normal probability plot of the residuals, and plot the residuals against the predicted values. The normal probability plot should be a straight line, and the plot of residuals versus predicted values should not show any structure. Why these plots? It goes back to the assumptions of the linear model.

plot(model3)

The first plot is the residuals versus fitted values; there is essentially no structure. The second is the normal Q-Q plot of the residuals, which is essentially a straight line. The scale-location plot shows whether the residual spread changes with the fitted values and flags outliers. The last plot, residuals versus leverage, indicates what effect outliers have on the fit.

Let’s compare:

#check for non-constant variance
par(mfcol=c(1,3))
plot(model1,1)
plot(model2,1)
plot(model3,1)

#check for normality
par(mfcol=c(1,3))
plot(model1,2)
plot(model2,2)
plot(model3,2)

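#check scale-location (spread of residuals vs. fitted values, flags outliers)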
par(mfcol=c(1,3))
plot(model1,3)
plot(model2,3)
plot(model3,3)

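#check residuals vs. leverage (influential points)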
par(mfcol=c(1,3))
plot(model1,5)
plot(model2,5)
plot(model3,5)

What if we’re not happy with these results? Try some transformations.
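For instance, if the residual spread grew with the fitted values, a log transform of the response is a common first try (just a sketch, with model3log an illustrative name; the wire bond fits above don't obviously need it):

model3log <- lm(log(Strength) ~ Length + Height + Length:Height, data = dat)
plot(model3log, 1)   # re-check residuals vs. fitted after the transform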

Outliers near the ends of the predictor range, or beyond it, will pull the slope harder than outliers near the center of the range.
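That pull is leverage, and R reports it directly; for model3, say:

hatvalues(model3)        # leverage: how extreme each observation is in predictor space
cooks.distance(model3)   # Cook's distance: combined effect of leverage and residual size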

We like to use anova to assess how well the model fits the data.

The total sum of squares, SStotal, is the sum over all observations of (yi - ybar)^2.

The regression sum of squares, SSregression, is the sum of (ypred_i - ybar)^2.

The error sum of squares, SSE, is the sum of (yi - ypred_i)^2.

R2 is the ratio SSregression/SStotal. If the regression explains the data exactly, SSE disappears and R2 is 1.

What's the p value? Remember the ANOVA table? The column labeled F is MSregression/MSE, so a really big F value says the error is small compared to the regression mean square, which in turn gives a really small p value.
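Here is a sketch of those pieces computed by hand for model1, to compare against summary(model1) and anova(model1):

yobs <- dat$Strength
yhat <- fitted(model1)

SST <- sum((yobs - mean(yobs))^2)   # total sum of squares
SSR <- sum((yhat - mean(yobs))^2)   # regression sum of squares
SSE <- sum((yobs - yhat)^2)         # error (residual) sum of squares

SSR/SST                                 # R-squared, matches summary(model1)
(SSR/1) / (SSE/df.residual(model1))     # F = MSregression/MSE, matches anova(model1)
anova(model1)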

R2 can be misleading: if you put in enough adjustable parameters you can make R2 as close to 1 as you like. Adjusted R2 will show whether the factors really have an effect.

Look at summaries of the fits.

summary(model1)
## 
## Call:
## lm(formula = Strength ~ Length, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8889 -1.3199  0.3547  2.0030  5.8314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.114      1.146   4.464 0.000177 ***
## Length         2.903      0.117  24.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.093 on 23 degrees of freedom
## Multiple R-squared:  0.964,  Adjusted R-squared:  0.9624 
## F-statistic: 615.1 on 1 and 23 DF,  p-value: < 2.2e-16

P value is tiny and R2 is very good.

summary(model2)
## 
## Call:
## lm(formula = Strength ~ Height, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.774  -7.235  -1.856   5.582  28.272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 14.56780    6.03259   2.415   0.0241 *
## Height       0.04360    0.01605   2.717   0.0123 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.18 on 23 degrees of freedom
## Multiple R-squared:  0.2429, Adjusted R-squared:   0.21 
## F-statistic:  7.38 on 1 and 23 DF,  p-value: 0.01231

The p value says Height is significant, but R2 is lousy: the slope is distinguishable from zero (the p value), while there is a lot of spread about the fitted line (the low R2).

summary(model3)
## 
## Call:
## lm(formula = Strength ~ Length + Height + Length:Height, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8113 -1.1029 -0.0058  0.5374  5.1170 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.6029761  1.4158295   4.664 0.000133 ***
## Length        2.1243159  0.1791705  11.856 9.09e-11 ***
## Height        0.0010646  0.0037394   0.285 0.778665    
## Length:Height 0.0014811  0.0003901   3.796 0.001056 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.803 on 21 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9872 
## F-statistic: 618.8 on 3 and 21 DF,  p-value: < 2.2e-16

Hey, this is brilliant: tiny p value, large R2. But we should expect R2 to go up with every additional fit parameter.

Looking at model3, we can't throw out Height, because the interaction involving it is significant even though Height by itself is not very important. In practice this kind of analysis gets repeated many times as terms are added and dropped.

Let’s look at confidence and prediction intervals:

predict(model3, newdata=data.frame(Length=3, Height=300),
        interval="confidence")
##        fit      lwr      upr
## 1 14.62829 13.54091 15.71567
predict(model3, newdata=data.frame(Length=3, Height=300),
        interval="prediction")
##        fit      lwr      upr
## 1 14.62829 10.72339 18.53319
newdf<-data.frame(
  Length=c(3,5),
  Height=c(300,200))
data.frame(newdf,predict(model3,
                         newdata=newdf,
                         interval="confidence"))
##   Length Height      fit      lwr      upr
## 1      3    300 14.62829 13.54091 15.71567
## 2      5    200 18.91857 17.91739 19.91975

There's a significant difference between the two settings: the confidence intervals for the mean response don't overlap.