library(qcc)
## Warning: package 'qcc' was built under R version 4.3.3
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
library(stats)
library(MASS)
OK, lecture 9 in the Tim Matis series on statistical analysis/DoE.
There’s some error in our model; that’s the epsilon. Linear regression is linear in the parameters (the betas), not necessarily in the predictors, so the model can be nonlinear in x. Do we ever do true nonlinear regression? Yes, but it’s very specialized, complicated, and rare. The simplest function we can write is y = f(x) + epsilon. For a linear function we have y = mx + b, usually written y = beta0 + beta1*x + epsilon: intercept, slope, and error. What if we have other predictors? Add beta2, beta3, etc. with arguments x2, x3, and we might need interaction terms again. As long as the betas appear only to the first power, it’s a linear regression function. What’s not linear? Things like beta0^2 or exp(beta1).
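A quick illustration with simulated data (my toy example, not from the lecture): a quadratic in x is still linear regression, because the model is linear in the betas.
# nonlinear in x, linear in the coefficients -- lm() handles it fine
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(30)
coef(lm(y ~ x + I(x^2)))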
Step 1 - write the model
Step 2 - fit the model, i.e., estimate the parameters, the betas
Two ways to do this:
Least squares is just calculus; it doesn’t assume normality in the data or in the errors. Maximum likelihood requires that we know the distribution of the response. ML isn’t hard to understand, but the math gets nasty.
Least squares:
Sum the squares of the deviations and find the betas that make that sum a minimum: take partial derivatives with respect to all the betas and set them to zero. With k predictors that gives k+1 equations in k+1 unknowns, the normal equations. For simple linear regression there’s a closed-form solution.
So, make a data table: the responses in one column, and a design matrix X whose first column is all 1s (for the intercept) followed by the predictor columns:

y1   1   x11   x12   ...   x1k
y2   1   x21   x22   ...   x2k
...
yn   1   xn1   xn2   ...   xnk

In matrix form, the least-squares solution is betahat = (X'X)^-1 X'y.
Least squares always gives you a line; then you figure out how good it is.
Now, what about maximum likelihood?
For ML we need to know the distribution. Assuming the observations are independent, the joint likelihood is the product of the individual probability distributions. Take partial derivatives of the log-likelihood with respect to each parameter, set them equal to zero, and solve the simultaneous equations.
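A minimal ML sketch on simulated data (my toy example): maximize the normal log-likelihood numerically instead of solving the equations by hand.
# y = b0 + b1*x + e with e ~ N(0, sigma^2); par = (b0, b1, log sigma)
set.seed(2)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2)
negloglik <- function(par) {
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
optim(c(0, 0, 0), negloglik)$par  # b0-hat, b1-hat, log(sigma)-hat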
OK, an example: in this wire-bond data set, wire length and die height are predictors of pull strength.
dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/WireBondSemiConductorMfg.csv")
head(dat)
## Pull.Strength..Y. Wire.Length..X1. Die.Height..X2.
## 1 9.95 2 50
## 2 24.45 8 110
## 3 31.75 11 120
## 4 35.00 10 550
## 5 25.02 8 295
## 6 16.86 4 200
colnames(dat)<-c("Strength","Length", "Height")
head(dat)
## Strength Length Height
## 1 9.95 2 50
## 2 24.45 8 110
## 3 31.75 11 120
## 4 35.00 10 550
## 5 25.02 8 295
## 6 16.86 4 200
plot(dat$Length, dat$Strength,
main="Wire Bond: Pull Strength vs. Wire Length",
xlab="Length",
ylab="Strength")
plot(dat$Height, dat$Strength,
main="Wire Bond: Pull Strength vs. Die Height",
xlab="Height",
ylab="Strength")
model1<-lm(Strength~Length, data=dat)
plot(dat$Length, dat$Strength,
main="Wire Bond: Pull Strength vs. Wire Length",
xlab="Length",
ylab="Strength")
abline(model1)
model1$coefficients
## (Intercept) Length
## 5.114516 2.902704
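As a check, the matrix formula reproduces these numbers (a sketch with base R’s model.matrix and solve):
X <- model.matrix(~ Length, data = dat)      # column of 1s plus Length
solve(t(X) %*% X) %*% t(X) %*% dat$Strength  # betahat = (X'X)^-1 X'y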
summary(model1)
##
## Call:
## lm(formula = Strength ~ Length, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8889 -1.3199 0.3547 2.0030 5.8314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.114 1.146 4.464 0.000177 ***
## Length 2.903 0.117 24.801 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.093 on 23 degrees of freedom
## Multiple R-squared: 0.964, Adjusted R-squared: 0.9624
## F-statistic: 615.1 on 1 and 23 DF, p-value: < 2.2e-16
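And the closed-form slope and intercept for simple linear regression, written with cov and var (Sxy/Sxx = cov(x, y)/var(x)):
b1 <- cov(dat$Length, dat$Strength) / var(dat$Length)  # Sxy/Sxx
b0 <- mean(dat$Strength) - b1 * mean(dat$Length)       # ybar - b1*xbar
c(b0, b1)  # should match model1$coefficients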
model2<-lm(Strength~Height,data=dat)
plot(dat$Height, dat$Strength,
main="Wire Bond: Pull Strength vs. Die Height",
xlab="Height",
ylab="Strength")
abline(model2)
model2$coefficients
## (Intercept) Height
## 14.56780276 0.04360079
summary(model2)
##
## Call:
## lm(formula = Strength ~ Height, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.774 -7.235 -1.856 5.582 28.272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.56780 6.03259 2.415 0.0241 *
## Height 0.04360 0.01605 2.717 0.0123 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.18 on 23 degrees of freedom
## Multiple R-squared: 0.2429, Adjusted R-squared: 0.21
## F-statistic: 7.38 on 1 and 23 DF, p-value: 0.01231
Now let’s try multiple regression:
model3<-lm(Strength~Length+Height+Length:Height,data=dat)
model3$coefficients
## (Intercept) Length Height Length:Height
## 6.602976132 2.124315904 0.001064586 0.001481100
summary(model3)
##
## Call:
## lm(formula = Strength ~ Length + Height + Length:Height, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8113 -1.1029 -0.0058 0.5374 5.1170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6029761 1.4158295 4.664 0.000133 ***
## Length 2.1243159 0.1791705 11.856 9.09e-11 ***
## Height 0.0010646 0.0037394 0.285 0.778665
## Length:Height 0.0014811 0.0003901 3.796 0.001056 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.803 on 21 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9872
## F-statistic: 618.8 on 3 and 21 DF, p-value: < 2.2e-16
plot(model3)
Insert Tim’s lm-in-R demo here.
Now consider a different data set. This would be a Bernoulli experiment, where the response is pass/fail, 0/1, etc.
newdat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/XFab_SECOM_LogisticRegression.csv")
nrow(newdat)
## [1] 1567
ncol(newdat)
## [1] 592
str(newdat)
## 'data.frame': 1567 obs. of 592 variables:
## $ Class : chr "pass" "pass" "fail" "pass" ...
## $ Time : chr "7/19/08 11:55" "7/19/08 12:32" "7/19/08 13:17" "7/19/08 14:43" ...
## $ Feature1 : num 3031 3096 2933 2989 3032 ...
## $ Feature2 : num 2564 2465 2560 2480 2503 ...
## $ Feature3 : num 2188 2230 2186 2199 2233 ...
## $ Feature4 : num 1411 1464 1698 910 1327 ...
## $ Feature5 : num 1.36 0.829 1.51 1.32 1.533 ...
## $ Feature6 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ Feature7 : num 97.6 102.3 95.5 104.2 100.4 ...
## $ Feature8 : num 0.124 0.125 0.124 0.122 0.123 ...
## $ Feature9 : num 1.5 1.5 1.44 1.49 1.5 ...
## $ Feature10 : num 0.0162 -0.0005 0.0041 -0.0124 -0.0031 0.0167 -0.027 0.0157 0.0111 0.0159 ...
## $ Feature11 : num -0.0034 -0.0148 0.0013 -0.0033 -0.0072 0.0055 0.0105 0.0007 -0.0066 0.0049 ...
## $ Feature12 : num 0.946 0.963 0.962 0.963 0.957 ...
## $ Feature13 : num 202 201 202 202 202 ...
## $ Feature14 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Feature15 : num 7.96 10.15 9.52 9.61 10.57 ...
## $ Feature16 : num 415 415 417 422 421 ...
## $ Feature17 : num 10.04 9.26 9.31 9.69 10.34 ...
## $ Feature18 : num 0.968 0.97 0.967 0.969 0.974 ...
## $ Feature19 : num 192 191 193 192 192 ...
## $ Feature20 : num 12.5 12.5 12.5 12.5 12.5 ...
## $ Feature21 : num 1.4 1.38 1.41 1.4 1.39 ...
## $ Feature22 : num -5419 -5442 -5448 -5468 -5476 ...
## $ Feature23 : num 2916 2604 2702 2648 2635 ...
## $ Feature24 : num -4044 -3499 -4047 -4515 -3988 ...
## $ Feature25 : num 751 -1640 -1916 -1657 117 ...
## $ Feature26 : num 0.895 1.297 1.312 1.314 1.289 ...
## $ Feature27 : num 1.77 2.01 2.03 2 1.99 ...
## $ Feature28 : num 3.05 7.39 7.58 7.31 7.27 ...
## $ Feature29 : num 64.2 68.4 67.1 62.9 62.8 ...
## $ Feature30 : num 2.02 2.27 2.33 2.64 3.16 ...
## $ Feature31 : num 0.163 0.21 0.173 0.207 0.27 ...
## $ Feature32 : num 3.52 3.42 3.6 3.38 3.27 ...
## $ Feature33 : num 83.4 84.9 84.8 84.9 86.3 ...
## $ Feature34 : num 9.51 9.8 8.66 8.68 8.77 ...
## $ Feature35 : num 50.6 50.7 50.2 50.5 50.2 ...
## $ Feature36 : num 64.3 64.3 64.1 64.1 64.2 ...
## $ Feature37 : num 49.4 49.3 49.8 49.5 49.8 ...
## $ Feature38 : num 66.3 64.9 65.8 65.2 66.2 ...
## $ Feature39 : num 87 87.5 84.7 86.7 86.1 ...
## $ Feature40 : num 118 118 119 117 121 ...
## $ Feature41 : num 61.3 78.2 14.4 76.9 76.4 ...
## $ Feature42 : num 4.51 2.77 5.43 1.28 2.21 ...
## $ Feature43 : int 70 70 70 70 70 70 70 70 70 70 ...
## $ Feature44 : num 353 352 364 363 353 ...
## $ Feature45 : num 10.18 10.04 9.88 9.93 10.41 ...
## $ Feature46 : num 130 133 132 132 176 ...
## $ Feature47 : num 723 725 735 734 790 ...
## $ Feature48 : num 1.31 1.29 1.3 1.3 1.03 ...
## $ Feature49 : num 141 146 141 143 138 ...
## $ Feature50 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Feature51 : num 624 631 637 637 668 ...
## $ Feature52 : num 218 205 186 190 234 ...
## $ Feature53 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Feature54 : num 4.59 4.59 4.49 4.49 4.62 ...
## $ Feature55 : num 4.84 4.84 4.75 4.75 4.89 ...
## $ Feature56 : int 2834 2853 2936 2936 2865 2865 2853 2865 2865 2865 ...
## $ Feature57 : num 0.932 0.932 0.914 0.914 0.93 ...
## $ Feature58 : num 0.948 0.948 0.945 0.945 0.945 ...
## $ Feature59 : num 4.71 4.68 4.59 4.59 4.64 ...
## $ Feature60 : num -1.726 0.807 23.825 24.379 -12.294 ...
## $ Feature61 : num 351 352 365 361 355 ...
## $ Feature62 : num 10.62 10.31 10.17 10.21 9.79 ...
## $ Feature63 : num 109 114 116 116 144 ...
## $ Feature64 : num 16.1 10.9 11.3 13.6 22 ...
## $ Feature65 : num 21.7 19.2 16.2 15.6 32.3 ...
## $ Feature66 : num 29.5 27.6 24.3 23.5 44.1 ...
## $ Feature67 : num 694 697 711 710 746 ...
## $ Feature68 : num 0.923 1.16 0.869 0.976 0.926 ...
## $ Feature69 : num 149 154 146 148 147 ...
## $ Feature70 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Feature71 : num 608 620 626 625 646 ...
## $ Feature72 : num 84.1 82.3 84.8 70.2 65.8 ...
## $ Feature73 : num NA NA 141 160 NA ...
## $ Feature74 : num NA NA 485 465 NA ...
## $ Feature75 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Feature76 : num 0.0126 -0.0039 -0.0078 -0.0555 -0.0534 0.0198 0.0227 -0.0187 -0.0088 -0.0117 ...
## $ Feature77 : num -0.0206 -0.0198 -0.0326 -0.0461 0.0183 -0.055 -0.0132 -0.0594 -0.0523 -0.0506 ...
## $ Feature78 : num 0.0141 0.0004 -0.0052 -0.04 -0.0167 -0.001 -0.0321 0.0249 -0.0629 -0.0085 ...
## $ Feature79 : num -0.0307 -0.044 0.0213 0.04 -0.0449 -0.0074 -0.0389 0.0041 0.0067 0.0065 ...
## $ Feature80 : num -0.0083 -0.0358 -0.0054 0.0676 0.0034 -0.0026 -0.0154 0.0002 -0.0047 0.0102 ...
## $ Feature81 : num -0.0026 -0.012 -0.1134 -0.1051 -0.0178 ...
## $ Feature82 : num -0.0567 -0.0377 -0.0182 0.0028 -0.0123 -0.0111 -0.0281 0.0004 -0.0251 -0.0375 ...
## $ Feature83 : num -0.0044 0.0017 0.0287 0.0277 -0.0048 0.021 0.0288 0.0086 0.0306 0.0008 ...
## $ Feature84 : num 7.22 6.8 7.1 7.59 7.5 ...
## $ Feature85 : num 0.132 0.136 0.136 0.13 0.134 ...
## $ Feature86 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Feature87 : num 2.39 2.38 2.45 2.4 2.45 ...
## $ Feature88 : num 0.969 0.989 0.988 0.99 0.99 ...
## $ Feature89 : num 1748 1932 1686 1752 1828 ...
## $ Feature90 : num 0.184 0.187 0.15 0.196 0.183 ...
## $ Feature91 : num 8672 8407 9317 8206 9014 ...
## $ Feature92 : num -0.3274 0.1455 0.0553 0.0697 0.0448 ...
## $ Feature93 : num -0.0055 -0.0015 0.0006 -0.0003 -0.0077 0.0009 0.0013 -0.0028 0.0005 -0.0025 ...
## $ Feature94 : num -0.0001 0 -0.0013 -0.0021 -0.0001 -0.0004 -0.0028 -0.0018 -0.0033 0.0004 ...
## $ Feature95 : num 1e-04 -5e-04 0e+00 -1e-04 -1e-04 -2e-04 0e+00 0e+00 0e+00 0e+00 ...
## $ Feature96 : num 3e-04 1e-04 2e-04 2e-04 -1e-04 1e-04 1e-04 2e-04 2e-04 2e-04 ...
## $ Feature97 : num -0.2786 0.5854 -0.1343 0.0411 0.2189 ...
## [list output truncated]
590 predictors? And a pass/fail response. How do we handle this? Machine learning, maybe. The thing is, the response is Bernoulli, i.e. binomial. In R,
newdat$Class<-as.factor(newdat$Class)
str(newdat)
## 'data.frame': 1567 obs. of 592 variables:
## $ Class : Factor w/ 2 levels "fail","pass": 2 2 1 2 2 2 2 2 2 2 ...
## $ Time : chr "7/19/08 11:55" "7/19/08 12:32" "7/19/08 13:17" "7/19/08 14:43" ...
## $ Feature1 : num 3031 3096 2933 2989 3032 ...
## $ Feature2 : num 2564 2465 2560 2480 2503 ...
## $ Feature3 : num 2188 2230 2186 2199 2233 ...
## $ Feature4 : num 1411 1464 1698 910 1327 ...
## $ Feature5 : num 1.36 0.829 1.51 1.32 1.533 ...
## $ Feature6 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ Feature7 : num 97.6 102.3 95.5 104.2 100.4 ...
## $ Feature8 : num 0.124 0.125 0.124 0.122 0.123 ...
## $ Feature9 : num 1.5 1.5 1.44 1.49 1.5 ...
## $ Feature10 : num 0.0162 -0.0005 0.0041 -0.0124 -0.0031 0.0167 -0.027 0.0157 0.0111 0.0159 ...
## $ Feature11 : num -0.0034 -0.0148 0.0013 -0.0033 -0.0072 0.0055 0.0105 0.0007 -0.0066 0.0049 ...
## $ Feature12 : num 0.946 0.963 0.962 0.963 0.957 ...
## $ Feature13 : num 202 201 202 202 202 ...
## $ Feature14 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Feature15 : num 7.96 10.15 9.52 9.61 10.57 ...
## $ Feature16 : num 415 415 417 422 421 ...
## $ Feature17 : num 10.04 9.26 9.31 9.69 10.34 ...
## $ Feature18 : num 0.968 0.97 0.967 0.969 0.974 ...
## $ Feature19 : num 192 191 193 192 192 ...
## $ Feature20 : num 12.5 12.5 12.5 12.5 12.5 ...
## $ Feature21 : num 1.4 1.38 1.41 1.4 1.39 ...
## $ Feature22 : num -5419 -5442 -5448 -5468 -5476 ...
## $ Feature23 : num 2916 2604 2702 2648 2635 ...
## $ Feature24 : num -4044 -3499 -4047 -4515 -3988 ...
## $ Feature25 : num 751 -1640 -1916 -1657 117 ...
## $ Feature26 : num 0.895 1.297 1.312 1.314 1.289 ...
## $ Feature27 : num 1.77 2.01 2.03 2 1.99 ...
## $ Feature28 : num 3.05 7.39 7.58 7.31 7.27 ...
## $ Feature29 : num 64.2 68.4 67.1 62.9 62.8 ...
## $ Feature30 : num 2.02 2.27 2.33 2.64 3.16 ...
## $ Feature31 : num 0.163 0.21 0.173 0.207 0.27 ...
## $ Feature32 : num 3.52 3.42 3.6 3.38 3.27 ...
## $ Feature33 : num 83.4 84.9 84.8 84.9 86.3 ...
## $ Feature34 : num 9.51 9.8 8.66 8.68 8.77 ...
## $ Feature35 : num 50.6 50.7 50.2 50.5 50.2 ...
## $ Feature36 : num 64.3 64.3 64.1 64.1 64.2 ...
## $ Feature37 : num 49.4 49.3 49.8 49.5 49.8 ...
## $ Feature38 : num 66.3 64.9 65.8 65.2 66.2 ...
## $ Feature39 : num 87 87.5 84.7 86.7 86.1 ...
## $ Feature40 : num 118 118 119 117 121 ...
## $ Feature41 : num 61.3 78.2 14.4 76.9 76.4 ...
## $ Feature42 : num 4.51 2.77 5.43 1.28 2.21 ...
## $ Feature43 : int 70 70 70 70 70 70 70 70 70 70 ...
## $ Feature44 : num 353 352 364 363 353 ...
## $ Feature45 : num 10.18 10.04 9.88 9.93 10.41 ...
## $ Feature46 : num 130 133 132 132 176 ...
## $ Feature47 : num 723 725 735 734 790 ...
## $ Feature48 : num 1.31 1.29 1.3 1.3 1.03 ...
## $ Feature49 : num 141 146 141 143 138 ...
## $ Feature50 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Feature51 : num 624 631 637 637 668 ...
## $ Feature52 : num 218 205 186 190 234 ...
## $ Feature53 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Feature54 : num 4.59 4.59 4.49 4.49 4.62 ...
## $ Feature55 : num 4.84 4.84 4.75 4.75 4.89 ...
## $ Feature56 : int 2834 2853 2936 2936 2865 2865 2853 2865 2865 2865 ...
## $ Feature57 : num 0.932 0.932 0.914 0.914 0.93 ...
## $ Feature58 : num 0.948 0.948 0.945 0.945 0.945 ...
## $ Feature59 : num 4.71 4.68 4.59 4.59 4.64 ...
## $ Feature60 : num -1.726 0.807 23.825 24.379 -12.294 ...
## $ Feature61 : num 351 352 365 361 355 ...
## $ Feature62 : num 10.62 10.31 10.17 10.21 9.79 ...
## $ Feature63 : num 109 114 116 116 144 ...
## $ Feature64 : num 16.1 10.9 11.3 13.6 22 ...
## $ Feature65 : num 21.7 19.2 16.2 15.6 32.3 ...
## $ Feature66 : num 29.5 27.6 24.3 23.5 44.1 ...
## $ Feature67 : num 694 697 711 710 746 ...
## $ Feature68 : num 0.923 1.16 0.869 0.976 0.926 ...
## $ Feature69 : num 149 154 146 148 147 ...
## $ Feature70 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Feature71 : num 608 620 626 625 646 ...
## $ Feature72 : num 84.1 82.3 84.8 70.2 65.8 ...
## $ Feature73 : num NA NA 141 160 NA ...
## $ Feature74 : num NA NA 485 465 NA ...
## $ Feature75 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Feature76 : num 0.0126 -0.0039 -0.0078 -0.0555 -0.0534 0.0198 0.0227 -0.0187 -0.0088 -0.0117 ...
## $ Feature77 : num -0.0206 -0.0198 -0.0326 -0.0461 0.0183 -0.055 -0.0132 -0.0594 -0.0523 -0.0506 ...
## $ Feature78 : num 0.0141 0.0004 -0.0052 -0.04 -0.0167 -0.001 -0.0321 0.0249 -0.0629 -0.0085 ...
## $ Feature79 : num -0.0307 -0.044 0.0213 0.04 -0.0449 -0.0074 -0.0389 0.0041 0.0067 0.0065 ...
## $ Feature80 : num -0.0083 -0.0358 -0.0054 0.0676 0.0034 -0.0026 -0.0154 0.0002 -0.0047 0.0102 ...
## $ Feature81 : num -0.0026 -0.012 -0.1134 -0.1051 -0.0178 ...
## $ Feature82 : num -0.0567 -0.0377 -0.0182 0.0028 -0.0123 -0.0111 -0.0281 0.0004 -0.0251 -0.0375 ...
## $ Feature83 : num -0.0044 0.0017 0.0287 0.0277 -0.0048 0.021 0.0288 0.0086 0.0306 0.0008 ...
## $ Feature84 : num 7.22 6.8 7.1 7.59 7.5 ...
## $ Feature85 : num 0.132 0.136 0.136 0.13 0.134 ...
## $ Feature86 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Feature87 : num 2.39 2.38 2.45 2.4 2.45 ...
## $ Feature88 : num 0.969 0.989 0.988 0.99 0.99 ...
## $ Feature89 : num 1748 1932 1686 1752 1828 ...
## $ Feature90 : num 0.184 0.187 0.15 0.196 0.183 ...
## $ Feature91 : num 8672 8407 9317 8206 9014 ...
## $ Feature92 : num -0.3274 0.1455 0.0553 0.0697 0.0448 ...
## $ Feature93 : num -0.0055 -0.0015 0.0006 -0.0003 -0.0077 0.0009 0.0013 -0.0028 0.0005 -0.0025 ...
## $ Feature94 : num -0.0001 0 -0.0013 -0.0021 -0.0001 -0.0004 -0.0028 -0.0018 -0.0033 0.0004 ...
## $ Feature95 : num 1e-04 -5e-04 0e+00 -1e-04 -1e-04 -2e-04 0e+00 0e+00 0e+00 0e+00 ...
## $ Feature96 : num 3e-04 1e-04 2e-04 2e-04 -1e-04 1e-04 1e-04 2e-04 2e-04 2e-04 ...
## $ Feature97 : num -0.2786 0.5854 -0.1343 0.0411 0.2189 ...
## [list output truncated]
contrasts(newdat$Class)
## pass
## fail 0
## pass 1
We’ll use a generalized linear model. The contrasts above show fail = 0 and pass = 1, so the glm models the probability of a pass.
model<-glm(Class~Feature1+Feature2+Feature3,family=binomial, data=newdat)
model$coefficients
## (Intercept) Feature1 Feature2 Feature3
## -2.4963994165 0.0013849361 0.0001795779 0.0002330647
summary(model)
##
## Call:
## glm(formula = Class ~ Feature1 + Feature2 + Feature3, family = binomial,
## data = newdat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4963994 9.5500232 -0.261 0.794
## Feature1 0.0013849 0.0014292 0.969 0.333
## Feature2 0.0001796 0.0013108 0.137 0.891
## Feature3 0.0002331 0.0034634 0.067 0.946
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 756.14 on 1539 degrees of freedom
## Residual deviance: 755.18 on 1536 degrees of freedom
## (27 observations deleted due to missingness)
## AIC: 763.18
##
## Number of Fisher Scoring iterations: 5
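Two quick follow-ups, sketched with base R (predict with type = "response" returns fitted probabilities for a binomial glm):
head(predict(model, type = "response"))  # fitted P(pass), not log-odds
# count rows with missing values among the three features used;
# should agree with the deletions reported in the summary
sum(!complete.cases(newdat[, c("Feature1", "Feature2", "Feature3")]))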
What we just fit is logistic regression: with family = binomial, the glm models the log-odds of passing, log(p/(1-p)) = beta0 + beta1*x1 + …, rather than the response itself.
Back to the linear model and step 3: assess the adequacy of the model by examining the residuals. A residual is the difference between the observed value and the model prediction; the residuals estimate the epsilons, which we assume are normal with mean zero and variance sigma^2, and we would really like that variance to be constant. So take a normal probability plot of the residuals, and plot the residuals against the predicted values. The normal probability plot should be a straight line; the plot of residuals versus predicted values should not show any structure. Why these plots? Going back to the linear model:
plot(model3)
The first plot is residuals versus fitted values; there is essentially no structure. The second is the normal Q-Q plot of the residuals, which is essentially a straight line. The scale-location plot tells something about outliers and non-constant variance. The last plot, residuals versus leverage, indicates what effect outliers have on the fit.
Let’s compare:
#check for non-constant variance
par(mfcol=c(1,3))
plot(model1,1)
plot(model2,1)
plot(model3,1)
#check for normality
par(mfcol=c(1,3))
plot(model1,2)
plot(model2,2)
plot(model3,2)
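#check scale-location (outliers)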
par(mfcol=c(1,3))
plot(model1,3)
plot(model2,3)
plot(model3,3)
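#check residuals vs. leverage (effect of outliers on the fit)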
par(mfcol=c(1,3))
plot(model1,5)
plot(model2,5)
plot(model3,5)
What if we’re not happy with these results? Try some transformations.
Outliers near or beyond the edge of the predictor’s range have a larger effect on the slope than outliers near the center of the range.
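That leverage can be quantified with hat values; a quick sketch on the interaction model fitted above:
# the largest hat values flag the points farthest from the center of the x's
head(sort(hatvalues(model3), decreasing = TRUE), 3)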
The sum of squares total is SST = sum over i of (yi - ybar)^2.
The sum of squares regression is SSR = sum over i of (yhat_i - ybar)^2.
The sum of squares error is SSE = sum over i of (yi - yhat_i)^2.
R2 is the ratio SSR/SST (equivalently, 1 - SSE/SST). If the regression explains the data exactly, SSE disappears and R2 is 1.
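We can check this by hand for model1 (a quick sketch):
SST <- sum((dat$Strength - mean(dat$Strength))^2)  # total sum of squares
SSE <- sum(residuals(model1)^2)                    # error sum of squares
1 - SSE/SST                                        # should match Multiple R-squared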
What’s the p-value? Remember the ANOVA table? The column labeled F is MSregression/MSE, so a really big F, and therefore a tiny p-value, says the regression mean square is large compared to the error.
R2 can be misleading: put in enough adjustable parameters and you can make R2 as close to 1 as you like. Adjusted R2 penalizes the parameter count (adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1) for p predictors) and shows whether the factors really have an effect.
Look at summaries of the fits.
summary(model1)
##
## Call:
## lm(formula = Strength ~ Length, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8889 -1.3199 0.3547 2.0030 5.8314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.114 1.146 4.464 0.000177 ***
## Length 2.903 0.117 24.801 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.093 on 23 degrees of freedom
## Multiple R-squared: 0.964, Adjusted R-squared: 0.9624
## F-statistic: 615.1 on 1 and 23 DF, p-value: < 2.2e-16
The p-value is tiny and R2 is very good.
summary(model2)
##
## Call:
## lm(formula = Strength ~ Height, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.774 -7.235 -1.856 5.582 28.272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.56780 6.03259 2.415 0.0241 *
## Height 0.04360 0.01605 2.717 0.0123 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.18 on 23 degrees of freedom
## Multiple R-squared: 0.2429, Adjusted R-squared: 0.21
## F-statistic: 7.38 on 1 and 23 DF, p-value: 0.01231
The p-value says the slope is significantly different from zero, but R2 is lousy. The fitted line is nearly horizontal (that’s the small slope the p-value is testing), and there’s a lot of spread about it (that’s the low R2).
summary(model3)
##
## Call:
## lm(formula = Strength ~ Length + Height + Length:Height, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8113 -1.1029 -0.0058 0.5374 5.1170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6029761 1.4158295 4.664 0.000133 ***
## Length 2.1243159 0.1791705 11.856 9.09e-11 ***
## Height 0.0010646 0.0037394 0.285 0.778665
## Length:Height 0.0014811 0.0003901 3.796 0.001056 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.803 on 21 degrees of freedom
## Multiple R-squared: 0.9888, Adjusted R-squared: 0.9872
## F-statistic: 618.8 on 3 and 21 DF, p-value: < 2.2e-16
Hey, this is brilliant: tiny p-value, large R2. But we should expect R2 to go up with each additional fit parameter.
Looking at model3, we can’t throw out Height even though Height by itself is not significant, because the Length:Height interaction is. In practice this fit-and-assess cycle gets repeated many times.
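One way to formalize that judgment is a partial F-test between the Length-only model and the full model (a sketch with base R’s anova):
anova(model1, model3)  # do the Height terms add explanatory power?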
Let’s look at confidence and prediction intervals. The confidence interval is for the mean response at the new settings; the prediction interval covers a single new observation, so it’s wider:
predict(model3, newdata=data.frame(Length=3, Height=300),
interval="confidence")
## fit lwr upr
## 1 14.62829 13.54091 15.71567
predict(model3, newdata=data.frame(Length=3, Height=300),
interval="prediction")
## fit lwr upr
## 1 14.62829 10.72339 18.53319
newdf<-data.frame(
Length=c(3,5),
Height=c(300,200))
data.frame(newdf,predict(model3,
newdata=newdf,
interval="confidence"))
## Length Height fit lwr upr
## 1 3 300 14.62829 13.54091 15.71567
## 2 5 200 18.91857 17.91739 19.91975
The two confidence intervals don’t overlap, so the predicted mean strengths at these two settings differ significantly.