Logistic & Linear models
library(car);
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.2
library(MASS);
## Warning: package 'MASS' was built under R version 4.1.2
Let’s take a look at the dataset we’ll be working with.
typeof(cement)
## [1] "list"
View(cement)
?cement
Let’s build a linear model for the MASS:cement data set
lmres<-lm(y~x1+x2+x3+x4, data=MASS::cement);
summary(lmres);
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = MASS::cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## x1 1.5511 0.7448 2.083 0.0708 .
## x2 0.5102 0.7238 0.705 0.5009
## x3 0.1019 0.7547 0.135 0.8959
## x4 -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
We can see that none of the variables are well determined. This is because of the p values are larger than 0.05. Let’s remove one variable, x3 and see what happens.
lmres<-lm(y~x1+x2+x4, data=MASS::cement);
summary(lmres);
##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = MASS::cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
Let’s also remove x4
lmres<-lm(y~x1+x2, data=MASS::cement);
summary(lmres);
##
## Call:
## lm(formula = y ~ x1 + x2, data = MASS::cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## x1 1.46831 0.12130 12.11 2.69e-07 ***
## x2 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
As we see, now the remaining coefficients are well determined. This says that the p-values of predictors depend on the presence of other variables. This is why we can’t just remove all variables whose p-value is larger than 0.05. StepAIC automatically selects variables for us.
lmres<-lm(y~x1+x2+x3+x4, data=MASS::cement);
step<-stepAIC(lmres, direction='both');
## Start: AIC=26.94
## y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 24.974
## - x4 1 0.2470 48.111 25.011
## - x2 1 2.9725 50.836 25.728
## <none> 47.864 26.944
## - x1 1 25.9509 73.815 30.576
##
## Step: AIC=24.97
## y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## + x3 1 0.11 47.86 26.944
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
step$anova;
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## y ~ x1 + x2 + x3 + x4
##
## Final Model:
## y ~ x1 + x2 + x4
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 8 47.86364 26.94429
## 2 - x3 1 0.10909 9 47.97273 24.97388
As we can see, the results from stepAIC are different from what we did. The final model from stepAIC looks problematic:
lmres<-lm(y~x1+x2+x4, data=MASS::cement);
summary(lmres);
##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = MASS::cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
x2, and x4 are not well determined.
I still like y~x1+x2 better.
library(MASS);
data("Cars93");
Cars93;
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## 1 Acura Integra Small 12.9 15.9 18.8 25
## 2 Acura Legend Midsize 29.2 33.9 38.7 18
## 3 Audi 90 Compact 25.9 29.1 32.3 20
## 4 Audi 100 Midsize 30.8 37.7 44.6 19
## 5 BMW 535i Midsize 23.7 30.0 36.2 22
## 6 Buick Century Midsize 14.2 15.7 17.3 22
## 7 Buick LeSabre Large 19.9 20.8 21.7 19
## 8 Buick Roadmaster Large 22.6 23.7 24.9 16
## 9 Buick Riviera Midsize 26.3 26.3 26.3 19
## 10 Cadillac DeVille Large 33.0 34.7 36.3 16
## 11 Cadillac Seville Midsize 37.5 40.1 42.7 16
## 12 Chevrolet Cavalier Compact 8.5 13.4 18.3 25
## 13 Chevrolet Corsica Compact 11.4 11.4 11.4 25
## 14 Chevrolet Camaro Sporty 13.4 15.1 16.8 19
## 15 Chevrolet Lumina Midsize 13.4 15.9 18.4 21
## 16 Chevrolet Lumina_APV Van 14.7 16.3 18.0 18
## 17 Chevrolet Astro Van 14.7 16.6 18.6 15
## 18 Chevrolet Caprice Large 18.0 18.8 19.6 17
## 19 Chevrolet Corvette Sporty 34.6 38.0 41.5 17
## 20 Chrylser Concorde Large 18.4 18.4 18.4 20
## 21 Chrysler LeBaron Compact 14.5 15.8 17.1 23
## 22 Chrysler Imperial Large 29.5 29.5 29.5 20
## 23 Dodge Colt Small 7.9 9.2 10.6 29
## 24 Dodge Shadow Small 8.4 11.3 14.2 23
## 25 Dodge Spirit Compact 11.9 13.3 14.7 22
## 26 Dodge Caravan Van 13.6 19.0 24.4 17
## 27 Dodge Dynasty Midsize 14.8 15.6 16.4 21
## 28 Dodge Stealth Sporty 18.5 25.8 33.1 18
## 29 Eagle Summit Small 7.9 12.2 16.5 29
## 30 Eagle Vision Large 17.5 19.3 21.2 20
## 31 Ford Festiva Small 6.9 7.4 7.9 31
## 32 Ford Escort Small 8.4 10.1 11.9 23
## 33 Ford Tempo Compact 10.4 11.3 12.2 22
## 34 Ford Mustang Sporty 10.8 15.9 21.0 22
## 35 Ford Probe Sporty 12.8 14.0 15.2 24
## 36 Ford Aerostar Van 14.5 19.9 25.3 15
## 37 Ford Taurus Midsize 15.6 20.2 24.8 21
## 38 Ford Crown_Victoria Large 20.1 20.9 21.7 18
## 39 Geo Metro Small 6.7 8.4 10.0 46
## 40 Geo Storm Sporty 11.5 12.5 13.5 30
## 41 Honda Prelude Sporty 17.0 19.8 22.7 24
## 42 Honda Civic Small 8.4 12.1 15.8 42
## 43 Honda Accord Compact 13.8 17.5 21.2 24
## 44 Hyundai Excel Small 6.8 8.0 9.2 29
## 45 Hyundai Elantra Small 9.0 10.0 11.0 22
## 46 Hyundai Scoupe Sporty 9.1 10.0 11.0 26
## 47 Hyundai Sonata Midsize 12.4 13.9 15.3 20
## 48 Infiniti Q45 Midsize 45.4 47.9 50.4 17
## 49 Lexus ES300 Midsize 27.5 28.0 28.4 18
## 50 Lexus SC300 Midsize 34.7 35.2 35.6 18
## 51 Lincoln Continental Midsize 33.3 34.3 35.3 17
## 52 Lincoln Town_Car Large 34.4 36.1 37.8 18
## 53 Mazda 323 Small 7.4 8.3 9.1 29
## 54 Mazda Protege Small 10.9 11.6 12.3 28
## 55 Mazda 626 Compact 14.3 16.5 18.7 26
## 56 Mazda MPV Van 16.6 19.1 21.7 18
## 57 Mazda RX-7 Sporty 32.5 32.5 32.5 17
## 58 Mercedes-Benz 190E Compact 29.0 31.9 34.9 20
## 59 Mercedes-Benz 300E Midsize 43.8 61.9 80.0 19
## 60 Mercury Capri Sporty 13.3 14.1 15.0 23
## 61 Mercury Cougar Midsize 14.9 14.9 14.9 19
## 62 Mitsubishi Mirage Small 7.7 10.3 12.9 29
## 63 Mitsubishi Diamante Midsize 22.4 26.1 29.9 18
## 64 Nissan Sentra Small 8.7 11.8 14.9 29
## 65 Nissan Altima Compact 13.0 15.7 18.3 24
## 66 Nissan Quest Van 16.7 19.1 21.5 17
## 67 Nissan Maxima Midsize 21.0 21.5 22.0 21
## 68 Oldsmobile Achieva Compact 13.0 13.5 14.0 24
## 69 Oldsmobile Cutlass_Ciera Midsize 14.2 16.3 18.4 23
## 70 Oldsmobile Silhouette Van 19.5 19.5 19.5 18
## 71 Oldsmobile Eighty-Eight Large 19.5 20.7 21.9 19
## 72 Plymouth Laser Sporty 11.4 14.4 17.4 23
## 73 Pontiac LeMans Small 8.2 9.0 9.9 31
## 74 Pontiac Sunbird Compact 9.4 11.1 12.8 23
## 75 Pontiac Firebird Sporty 14.0 17.7 21.4 19
## 76 Pontiac Grand_Prix Midsize 15.4 18.5 21.6 19
## 77 Pontiac Bonneville Large 19.4 24.4 29.4 19
## 78 Saab 900 Compact 20.3 28.7 37.1 20
## 79 Saturn SL Small 9.2 11.1 12.9 28
## 80 Subaru Justy Small 7.3 8.4 9.5 33
## 81 Subaru Loyale Small 10.5 10.9 11.3 25
## 82 Subaru Legacy Compact 16.3 19.5 22.7 23
## 83 Suzuki Swift Small 7.3 8.6 10.0 39
## 84 Toyota Tercel Small 7.8 9.8 11.8 32
## 85 Toyota Celica Sporty 14.2 18.4 22.6 25
## 86 Toyota Camry Midsize 15.2 18.2 21.2 22
## 87 Toyota Previa Van 18.9 22.7 26.6 18
## 88 Volkswagen Fox Small 8.7 9.1 9.5 25
## 89 Volkswagen Eurovan Van 16.6 19.7 22.7 17
## 90 Volkswagen Passat Compact 17.6 20.0 22.4 21
## 91 Volkswagen Corrado Sporty 22.9 23.3 23.7 18
## 92 Volvo 240 Compact 21.8 22.7 23.5 21
## 93 Volvo 850 Midsize 24.8 26.7 28.5 20
## MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower
## 1 31 None Front 4 1.8 140
## 2 25 Driver & Passenger Front 6 3.2 200
## 3 26 Driver only Front 6 2.8 172
## 4 26 Driver & Passenger Front 6 2.8 172
## 5 30 Driver only Rear 4 3.5 208
## 6 31 Driver only Front 4 2.2 110
## 7 28 Driver only Front 6 3.8 170
## 8 25 Driver only Rear 6 5.7 180
## 9 27 Driver only Front 6 3.8 170
## 10 25 Driver only Front 8 4.9 200
## 11 25 Driver & Passenger Front 8 4.6 295
## 12 36 None Front 4 2.2 110
## 13 34 Driver only Front 4 2.2 110
## 14 28 Driver & Passenger Rear 6 3.4 160
## 15 29 None Front 4 2.2 110
## 16 23 None Front 6 3.8 170
## 17 20 None 4WD 6 4.3 165
## 18 26 Driver only Rear 8 5.0 170
## 19 25 Driver only Rear 8 5.7 300
## 20 28 Driver & Passenger Front 6 3.3 153
## 21 28 Driver & Passenger Front 4 3.0 141
## 22 26 Driver only Front 6 3.3 147
## 23 33 None Front 4 1.5 92
## 24 29 Driver only Front 4 2.2 93
## 25 27 Driver only Front 4 2.5 100
## 26 21 Driver only 4WD 6 3.0 142
## 27 27 Driver only Front 4 2.5 100
## 28 24 Driver only 4WD 6 3.0 300
## 29 33 None Front 4 1.5 92
## 30 28 Driver & Passenger Front 6 3.5 214
## 31 33 None Front 4 1.3 63
## 32 30 None Front 4 1.8 127
## 33 27 None Front 4 2.3 96
## 34 29 Driver only Rear 4 2.3 105
## 35 30 Driver only Front 4 2.0 115
## 36 20 Driver only 4WD 6 3.0 145
## 37 30 Driver only Front 6 3.0 140
## 38 26 Driver only Rear 8 4.6 190
## 39 50 None Front 3 1.0 55
## 40 36 Driver only Front 4 1.6 90
## 41 31 Driver & Passenger Front 4 2.3 160
## 42 46 Driver only Front 4 1.5 102
## 43 31 Driver & Passenger Front 4 2.2 140
## 44 33 None Front 4 1.5 81
## 45 29 None Front 4 1.8 124
## 46 34 None Front 4 1.5 92
## 47 27 None Front 4 2.0 128
## 48 22 Driver only Rear 8 4.5 278
## 49 24 Driver only Front 6 3.0 185
## 50 23 Driver & Passenger Rear 6 3.0 225
## 51 26 Driver & Passenger Front 6 3.8 160
## 52 26 Driver & Passenger Rear 8 4.6 210
## 53 37 None Front 4 1.6 82
## 54 36 None Front 4 1.8 103
## 55 34 Driver only Front 4 2.5 164
## 56 24 None 4WD 6 3.0 155
## 57 25 Driver only Rear rotary 1.3 255
## 58 29 Driver only Rear 4 2.3 130
## 59 25 Driver & Passenger Rear 6 3.2 217
## 60 26 Driver only Front 4 1.6 100
## 61 26 None Rear 6 3.8 140
## 62 33 None Front 4 1.5 92
## 63 24 Driver only Front 6 3.0 202
## 64 33 Driver only Front 4 1.6 110
## 65 30 Driver only Front 4 2.4 150
## 66 23 None Front 6 3.0 151
## 67 26 Driver only Front 6 3.0 160
## 68 31 None Front 4 2.3 155
## 69 31 Driver only Front 4 2.2 110
## 70 23 None Front 6 3.8 170
## 71 28 Driver only Front 6 3.8 170
## 72 30 None 4WD 4 1.8 92
## 73 41 None Front 4 1.6 74
## 74 31 None Front 4 2.0 110
## 75 28 Driver & Passenger Rear 6 3.4 160
## 76 27 None Front 6 3.4 200
## 77 28 Driver & Passenger Front 6 3.8 170
## 78 26 Driver only Front 4 2.1 140
## 79 38 Driver only Front 4 1.9 85
## 80 37 None 4WD 3 1.2 73
## 81 30 None 4WD 4 1.8 90
## 82 30 Driver only 4WD 4 2.2 130
## 83 43 None Front 3 1.3 70
## 84 37 Driver only Front 4 1.5 82
## 85 32 Driver only Front 4 2.2 135
## 86 29 Driver only Front 4 2.2 130
## 87 22 Driver only 4WD 4 2.4 138
## 88 33 None Front 4 1.8 81
## 89 21 None Front 5 2.5 109
## 90 30 None Front 4 2.0 134
## 91 25 None Front 6 2.8 178
## 92 28 Driver only Rear 4 2.3 114
## 93 28 Driver & Passenger Front 5 2.4 168
## RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
## 1 6300 2890 Yes 13.2 5 177
## 2 5500 2335 Yes 18.0 5 195
## 3 5500 2280 Yes 16.9 5 180
## 4 5500 2535 Yes 21.1 6 193
## 5 5700 2545 Yes 21.1 4 186
## 6 5200 2565 No 16.4 6 189
## 7 4800 1570 No 18.0 6 200
## 8 4000 1320 No 23.0 6 216
## 9 4800 1690 No 18.8 5 198
## 10 4100 1510 No 18.0 6 206
## 11 6000 1985 No 20.0 5 204
## 12 5200 2380 Yes 15.2 5 182
## 13 5200 2665 Yes 15.6 5 184
## 14 4600 1805 Yes 15.5 4 193
## 15 5200 2595 No 16.5 6 198
## 16 4800 1690 No 20.0 7 178
## 17 4000 1790 No 27.0 8 194
## 18 4200 1350 No 23.0 6 214
## 19 5000 1450 Yes 20.0 2 179
## 20 5300 1990 No 18.0 6 203
## 21 5000 2090 No 16.0 6 183
## 22 4800 1785 No 16.0 6 203
## 23 6000 3285 Yes 13.2 5 174
## 24 4800 2595 Yes 14.0 5 172
## 25 4800 2535 Yes 16.0 6 181
## 26 5000 1970 No 20.0 7 175
## 27 4800 2465 No 16.0 6 192
## 28 6000 2120 Yes 19.8 4 180
## 29 6000 2505 Yes 13.2 5 174
## 30 5800 1980 No 18.0 6 202
## 31 5000 3150 Yes 10.0 4 141
## 32 6500 2410 Yes 13.2 5 171
## 33 4200 2805 Yes 15.9 5 177
## 34 4600 2285 Yes 15.4 4 180
## 35 5500 2340 Yes 15.5 4 179
## 36 4800 2080 Yes 21.0 7 176
## 37 4800 1885 No 16.0 5 192
## 38 4200 1415 No 20.0 6 212
## 39 5700 3755 Yes 10.6 4 151
## 40 5400 3250 Yes 12.4 4 164
## 41 5800 2855 Yes 15.9 4 175
## 42 5900 2650 Yes 11.9 4 173
## 43 5600 2610 Yes 17.0 4 185
## 44 5500 2710 Yes 11.9 5 168
## 45 6000 2745 Yes 13.7 5 172
## 46 5550 2540 Yes 11.9 4 166
## 47 6000 2335 Yes 17.2 5 184
## 48 6000 1955 No 22.5 5 200
## 49 5200 2325 Yes 18.5 5 188
## 50 6000 2510 Yes 20.6 4 191
## 51 4400 1835 No 18.4 6 205
## 52 4600 1840 No 20.0 6 219
## 53 5000 2370 Yes 13.2 4 164
## 54 5500 2220 Yes 14.5 5 172
## 55 5600 2505 Yes 15.5 5 184
## 56 5000 2240 No 19.6 7 190
## 57 6500 2325 Yes 20.0 2 169
## 58 5100 2425 Yes 14.5 5 175
## 59 5500 2220 No 18.5 5 187
## 60 5750 2475 Yes 11.1 4 166
## 61 3800 1730 No 18.0 5 199
## 62 6000 2505 Yes 13.2 5 172
## 63 6000 2210 No 19.0 5 190
## 64 6000 2435 Yes 13.2 5 170
## 65 5600 2130 Yes 15.9 5 181
## 66 4800 2065 No 20.0 7 190
## 67 5200 2045 No 18.5 5 188
## 68 6000 2380 No 15.2 5 188
## 69 5200 2565 No 16.5 5 190
## 70 4800 1690 No 20.0 7 194
## 71 4800 1570 No 18.0 6 201
## 72 5000 2360 Yes 15.9 4 173
## 73 5600 3130 Yes 13.2 4 177
## 74 5200 2665 Yes 15.2 5 181
## 75 4600 1805 Yes 15.5 4 196
## 76 5000 1890 Yes 16.5 5 195
## 77 4800 1565 No 18.0 6 177
## 78 6000 2910 Yes 18.0 5 184
## 79 5000 2145 Yes 12.8 5 176
## 80 5600 2875 Yes 9.2 4 146
## 81 5200 3375 Yes 15.9 5 175
## 82 5600 2330 Yes 15.9 5 179
## 83 6000 3360 Yes 10.6 4 161
## 84 5200 3505 Yes 11.9 5 162
## 85 5400 2405 Yes 15.9 4 174
## 86 5400 2340 Yes 18.5 5 188
## 87 5000 2515 Yes 19.8 7 187
## 88 5500 2550 Yes 12.4 4 163
## 89 4500 2915 Yes 21.1 7 187
## 90 5800 2685 Yes 18.5 5 180
## 91 5800 2385 Yes 18.5 4 159
## 92 5400 2215 Yes 15.8 5 190
## 93 6200 2310 Yes 19.3 5 184
## Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin
## 1 102 68 37 26.5 11 2705 non-USA
## 2 115 71 38 30.0 15 3560 non-USA
## 3 102 67 37 28.0 14 3375 non-USA
## 4 106 70 37 31.0 17 3405 non-USA
## 5 109 69 39 27.0 13 3640 non-USA
## 6 105 69 41 28.0 16 2880 USA
## 7 111 74 42 30.5 17 3470 USA
## 8 116 78 45 30.5 21 4105 USA
## 9 108 73 41 26.5 14 3495 USA
## 10 114 73 43 35.0 18 3620 USA
## 11 111 74 44 31.0 14 3935 USA
## 12 101 66 38 25.0 13 2490 USA
## 13 103 68 39 26.0 14 2785 USA
## 14 101 74 43 25.0 13 3240 USA
## 15 108 71 40 28.5 16 3195 USA
## 16 110 74 44 30.5 NA 3715 USA
## 17 111 78 42 33.5 NA 4025 USA
## 18 116 77 42 29.5 20 3910 USA
## 19 96 74 43 NA NA 3380 USA
## 20 113 74 40 31.0 15 3515 USA
## 21 104 68 41 30.5 14 3085 USA
## 22 110 69 44 36.0 17 3570 USA
## 23 98 66 32 26.5 11 2270 USA
## 24 97 67 38 26.5 13 2670 USA
## 25 104 68 39 30.5 14 2970 USA
## 26 112 72 42 26.5 NA 3705 USA
## 27 105 69 42 30.5 16 3080 USA
## 28 97 72 40 20.0 11 3805 USA
## 29 98 66 36 26.5 11 2295 USA
## 30 113 74 40 30.0 15 3490 USA
## 31 90 63 33 26.0 12 1845 USA
## 32 98 67 36 28.0 12 2530 USA
## 33 100 68 39 27.5 13 2690 USA
## 34 101 68 40 24.0 12 2850 USA
## 35 103 70 38 23.0 18 2710 USA
## 36 119 72 45 30.0 NA 3735 USA
## 37 106 71 40 27.5 18 3325 USA
## 38 114 78 43 30.0 21 3950 USA
## 39 93 63 34 27.5 10 1695 non-USA
## 40 97 67 37 24.5 11 2475 non-USA
## 41 100 70 39 23.5 8 2865 non-USA
## 42 103 67 36 28.0 12 2350 non-USA
## 43 107 67 41 28.0 14 3040 non-USA
## 44 94 63 35 26.0 11 2345 non-USA
## 45 98 66 36 28.0 12 2620 non-USA
## 46 94 64 34 23.5 9 2285 non-USA
## 47 104 69 41 31.0 14 2885 non-USA
## 48 113 72 42 29.0 15 4000 non-USA
## 49 103 70 40 27.5 14 3510 non-USA
## 50 106 71 39 25.0 9 3515 non-USA
## 51 109 73 42 30.0 19 3695 USA
## 52 117 77 45 31.5 22 4055 USA
## 53 97 66 34 27.0 16 2325 non-USA
## 54 98 66 36 26.5 13 2440 non-USA
## 55 103 69 40 29.5 14 2970 non-USA
## 56 110 72 39 27.5 NA 3735 non-USA
## 57 96 69 37 NA NA 2895 non-USA
## 58 105 67 34 26.0 12 2920 non-USA
## 59 110 69 37 27.0 15 3525 non-USA
## 60 95 65 36 19.0 6 2450 USA
## 61 113 73 38 28.0 15 3610 USA
## 62 98 67 36 26.0 11 2295 non-USA
## 63 107 70 43 27.5 14 3730 non-USA
## 64 96 66 33 26.0 12 2545 non-USA
## 65 103 67 40 28.5 14 3050 non-USA
## 66 112 74 41 27.0 NA 4100 non-USA
## 67 104 69 41 28.5 14 3200 non-USA
## 68 103 67 39 28.0 14 2910 USA
## 69 105 70 42 28.0 16 2890 USA
## 70 110 74 44 30.5 NA 3715 USA
## 71 111 74 42 31.5 17 3470 USA
## 72 97 67 39 24.5 8 2640 USA
## 73 99 66 35 25.5 17 2350 USA
## 74 101 66 39 25.0 13 2575 USA
## 75 101 75 43 25.0 13 3240 USA
## 76 108 72 41 28.5 16 3450 USA
## 77 111 74 43 30.5 18 3495 USA
## 78 99 67 37 26.5 14 2775 non-USA
## 79 102 68 40 26.5 12 2495 USA
## 80 90 60 32 23.5 10 2045 non-USA
## 81 97 65 35 27.5 15 2490 non-USA
## 82 102 67 37 27.0 14 3085 non-USA
## 83 93 63 34 27.5 10 1965 non-USA
## 84 94 65 36 24.0 11 2055 non-USA
## 85 99 69 39 23.0 13 2950 non-USA
## 86 103 70 38 28.5 15 3030 non-USA
## 87 113 71 41 35.0 NA 3785 non-USA
## 88 93 63 34 26.0 10 2240 non-USA
## 89 115 72 38 34.0 NA 3960 non-USA
## 90 103 67 35 31.5 14 2985 non-USA
## 91 97 66 36 26.0 15 2810 non-USA
## 92 104 67 37 29.5 14 2985 non-USA
## 93 105 69 38 30.0 15 3245 non-USA
## Make
## 1 Acura Integra
## 2 Acura Legend
## 3 Audi 90
## 4 Audi 100
## 5 BMW 535i
## 6 Buick Century
## 7 Buick LeSabre
## 8 Buick Roadmaster
## 9 Buick Riviera
## 10 Cadillac DeVille
## 11 Cadillac Seville
## 12 Chevrolet Cavalier
## 13 Chevrolet Corsica
## 14 Chevrolet Camaro
## 15 Chevrolet Lumina
## 16 Chevrolet Lumina_APV
## 17 Chevrolet Astro
## 18 Chevrolet Caprice
## 19 Chevrolet Corvette
## 20 Chrylser Concorde
## 21 Chrysler LeBaron
## 22 Chrysler Imperial
## 23 Dodge Colt
## 24 Dodge Shadow
## 25 Dodge Spirit
## 26 Dodge Caravan
## 27 Dodge Dynasty
## 28 Dodge Stealth
## 29 Eagle Summit
## 30 Eagle Vision
## 31 Ford Festiva
## 32 Ford Escort
## 33 Ford Tempo
## 34 Ford Mustang
## 35 Ford Probe
## 36 Ford Aerostar
## 37 Ford Taurus
## 38 Ford Crown_Victoria
## 39 Geo Metro
## 40 Geo Storm
## 41 Honda Prelude
## 42 Honda Civic
## 43 Honda Accord
## 44 Hyundai Excel
## 45 Hyundai Elantra
## 46 Hyundai Scoupe
## 47 Hyundai Sonata
## 48 Infiniti Q45
## 49 Lexus ES300
## 50 Lexus SC300
## 51 Lincoln Continental
## 52 Lincoln Town_Car
## 53 Mazda 323
## 54 Mazda Protege
## 55 Mazda 626
## 56 Mazda MPV
## 57 Mazda RX-7
## 58 Mercedes-Benz 190E
## 59 Mercedes-Benz 300E
## 60 Mercury Capri
## 61 Mercury Cougar
## 62 Mitsubishi Mirage
## 63 Mitsubishi Diamante
## 64 Nissan Sentra
## 65 Nissan Altima
## 66 Nissan Quest
## 67 Nissan Maxima
## 68 Oldsmobile Achieva
## 69 Oldsmobile Cutlass_Ciera
## 70 Oldsmobile Silhouette
## 71 Oldsmobile Eighty-Eight
## 72 Plymouth Laser
## 73 Pontiac LeMans
## 74 Pontiac Sunbird
## 75 Pontiac Firebird
## 76 Pontiac Grand_Prix
## 77 Pontiac Bonneville
## 78 Saab 900
## 79 Saturn SL
## 80 Subaru Justy
## 81 Subaru Loyale
## 82 Subaru Legacy
## 83 Suzuki Swift
## 84 Toyota Tercel
## 85 Toyota Celica
## 86 Toyota Camry
## 87 Toyota Previa
## 88 Volkswagen Fox
## 89 Volkswagen Eurovan
## 90 Volkswagen Passat
## 91 Volkswagen Corrado
## 92 Volvo 240
## 93 Volvo 850
Here is an example for logistic regressions. Some of the steps are similar. In this case, the data set is already built and the glm function can build a model directly if there are two classes. There are some warnings, which are cased by some combination of your predictors we observed only successes of failures. That is a quite common phenomenon for logistic regression in small samples or if you have large parameter spaces.
Make sure to put in the link=logit and family=binomial! For the prediction, the model decides what is 1 and 0 based on alphabetical order so in this case, 1 is USA, 0 is non-USA
glmres<-glm(Origin~Min.Price+Max.Price+MPG.city+Type+MPG.highway+AirBags+Length+Wheelbase+Width+Weight+Turn.circle+Luggage.room + Rear.seat.room+Fuel.tank.capacity+Passengers+Man.trans.avail+Horsepower+Cylinders
+EngineSize+DriveTrain+RPM,data=Cars93, family=binomial(link=logit));
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glmres);
##
## Call:
## glm(formula = Origin ~ Min.Price + Max.Price + MPG.city + Type +
## MPG.highway + AirBags + Length + Wheelbase + Width + Weight +
## Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity +
## Passengers + Man.trans.avail + Horsepower + Cylinders + EngineSize +
## DriveTrain + RPM, family = binomial(link = logit), data = Cars93)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.593e-05 -2.100e-08 -2.100e-08 2.100e-08 8.239e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.158e+02 1.311e+06 0.000 1.000
## Min.Price 5.881e+01 1.620e+04 0.004 0.997
## Max.Price -1.721e+01 8.233e+03 -0.002 0.998
## MPG.city 5.212e-01 3.380e+04 0.000 1.000
## TypeLarge -9.275e+01 4.497e+05 0.000 1.000
## TypeMidsize -4.235e+01 2.879e+05 0.000 1.000
## TypeSmall 9.118e+01 1.835e+05 0.000 1.000
## TypeSporty 7.397e+01 2.312e+05 0.000 1.000
## MPG.highway 6.467e+01 5.724e+04 0.001 0.999
## AirBagsDriver only -5.501e+01 2.025e+05 0.000 1.000
## AirBagsNone 1.563e+02 1.515e+05 0.001 0.999
## Length -2.885e+01 1.828e+04 -0.002 0.999
## Wheelbase -1.380e+00 3.317e+04 0.000 1.000
## Width -6.534e+01 1.763e+04 -0.004 0.997
## Weight 8.224e-01 5.387e+02 0.002 0.999
## Turn.circle 2.008e+01 1.033e+04 0.002 0.998
## Luggage.room -3.425e+01 1.581e+04 -0.002 0.998
## Rear.seat.room 4.925e+00 9.499e+03 0.001 1.000
## Fuel.tank.capacity 1.534e+02 6.997e+04 0.002 0.998
## Passengers 3.109e+02 2.177e+05 0.001 0.999
## Man.trans.availYes 1.066e+02 1.826e+05 0.001 1.000
## Horsepower -3.041e-01 1.488e+03 0.000 1.000
## Cylinders4 1.382e+02 1.936e+05 0.001 0.999
## Cylinders5 -7.448e+02 6.183e+05 -0.001 0.999
## Cylinders6 1.578e+02 4.012e+05 0.000 1.000
## Cylinders8 4.875e+01 3.313e+05 0.000 1.000
## EngineSize -4.153e+02 2.473e+05 -0.002 0.999
## DriveTrainFront 2.837e+02 1.058e+05 0.003 0.998
## DriveTrainRear 9.040e+02 4.176e+05 0.002 0.998
## RPM 7.218e-02 1.087e+02 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.1363e+02 on 81 degrees of freedom
## Residual deviance: 5.1655e-08 on 52 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 60
##
## Number of Fisher Scoring iterations: 25
step<-stepAIC(glmres,direction='both');
## Start: AIC=60
## Origin ~ Min.Price + Max.Price + MPG.city + Type + MPG.highway +
## AirBags + Length + Wheelbase + Width + Weight + Turn.circle +
## Luggage.room + Rear.seat.room + Fuel.tank.capacity + Passengers +
## Man.trans.avail + Horsepower + Cylinders + EngineSize + DriveTrain +
## RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Cylinders 4 0.00 52.00
## - Type 4 0.00 52.00
## - AirBags 2 0.00 56.00
## - Wheelbase 1 0.00 58.00
## - MPG.city 1 0.00 58.00
## - Horsepower 1 0.00 58.00
## - Rear.seat.room 1 0.00 58.00
## - Man.trans.avail 1 0.00 58.00
## - RPM 1 0.00 58.00
## - Length 1 0.00 58.00
## - MPG.highway 1 0.00 58.00
## - EngineSize 1 0.00 58.00
## - Width 1 0.00 58.00
## <none> 0.00 60.00
## - DriveTrain 2 17.57 73.57
## - Min.Price 1 16.86 74.86
## - Luggage.room 1 576.70 634.70
## - Weight 1 648.79 706.79
## - Turn.circle 1 648.79 706.79
## - Passengers 1 648.79 706.79
## - Max.Price 1 792.96 850.96
## - Fuel.tank.capacity 1 792.96 850.96
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=52
## Origin ~ Min.Price + Max.Price + MPG.city + Type + MPG.highway +
## AirBags + Length + Wheelbase + Width + Weight + Turn.circle +
## Luggage.room + Rear.seat.room + Fuel.tank.capacity + Passengers +
## Man.trans.avail + Horsepower + EngineSize + DriveTrain +
## RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Type 4 0.00 44.00
## - AirBags 2 0.00 48.00
## - Wheelbase 1 0.00 50.00
## - Horsepower 1 0.00 50.00
## - RPM 1 0.00 50.00
## - Rear.seat.room 1 0.00 50.00
## - Man.trans.avail 1 0.00 50.00
## - MPG.city 1 0.00 50.00
## - Length 1 0.00 50.00
## - Width 1 0.00 50.00
## - MPG.highway 1 0.00 50.00
## - Weight 1 0.00 50.00
## - EngineSize 1 0.00 50.00
## - Fuel.tank.capacity 1 0.00 50.00
## <none> 0.00 52.00
## + Cylinders 4 0.00 60.00
## - DriveTrain 2 17.73 65.73
## - Min.Price 1 17.00 67.00
## - Turn.circle 1 576.70 626.70
## - Max.Price 1 720.87 770.87
## - Luggage.room 1 720.87 770.87
## - Passengers 1 720.87 770.87
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=44
## Origin ~ Min.Price + Max.Price + MPG.city + MPG.highway + AirBags +
## Length + Wheelbase + Width + Weight + Turn.circle + Luggage.room +
## Rear.seat.room + Fuel.tank.capacity + Passengers + Man.trans.avail +
## Horsepower + EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - AirBags 2 0.00 40.00
## - MPG.city 1 0.00 42.00
## - Wheelbase 1 0.00 42.00
## - Horsepower 1 0.00 42.00
## - Rear.seat.room 1 0.00 42.00
## - Man.trans.avail 1 0.00 42.00
## - RPM 1 0.00 42.00
## - Passengers 1 0.00 42.00
## - Weight 1 0.00 42.00
## <none> 0.00 44.00
## + Type 4 0.00 52.00
## + Cylinders 4 0.00 52.00
## - DriveTrain 2 19.46 59.46
## - MPG.highway 1 18.24 60.24
## - EngineSize 1 23.00 65.00
## - Min.Price 1 24.58 66.58
## - Fuel.tank.capacity 1 41.64 83.64
## - Max.Price 1 648.79 690.79
## - Turn.circle 1 648.79 690.79
## - Width 1 792.96 834.96
## - Length 1 865.05 907.05
## - Luggage.room 1 937.13 979.13
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=40
## Origin ~ Min.Price + Max.Price + MPG.city + MPG.highway + Length +
## Wheelbase + Width + Weight + Turn.circle + Luggage.room +
## Rear.seat.room + Fuel.tank.capacity + Passengers + Man.trans.avail +
## Horsepower + EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - MPG.city 1 0.00 38.00
## - Horsepower 1 0.00 38.00
## - Wheelbase 1 0.00 38.00
## - Passengers 1 0.00 38.00
## - Man.trans.avail 1 0.00 38.00
## - Weight 1 0.00 38.00
## - RPM 1 0.00 38.00
## - Luggage.room 1 0.00 38.00
## <none> 0.00 40.00
## + AirBags 2 0.00 44.00
## + Type 4 0.00 48.00
## + Cylinders 4 0.00 48.00
## - DriveTrain 2 20.17 56.17
## - Width 1 18.78 56.78
## - MPG.highway 1 20.93 58.93
## - Length 1 25.83 63.83
## - EngineSize 1 25.85 63.85
## - Min.Price 1 29.63 67.63
## - Fuel.tank.capacity 1 43.14 81.14
## - Rear.seat.room 1 576.70 614.70
## - Max.Price 1 648.79 686.79
## - Turn.circle 1 648.79 686.79
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=38
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Wheelbase +
## Width + Weight + Turn.circle + Luggage.room + Rear.seat.room +
## Fuel.tank.capacity + Passengers + Man.trans.avail + Horsepower +
## EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Horsepower 1 0.00 36.00
## - Man.trans.avail 1 0.00 36.00
## - Passengers 1 0.00 36.00
## - Rear.seat.room 1 0.00 36.00
## <none> 0.00 38.00
## + MPG.city 1 0.00 40.00
## + AirBags 2 0.00 42.00
## + Type 4 0.00 46.00
## - RPM 1 12.66 48.66
## - Luggage.room 1 13.62 49.62
## - DriveTrain 2 20.48 54.48
## - Width 1 21.18 57.18
## - Length 1 26.42 62.42
## - EngineSize 1 27.12 63.12
## - Min.Price 1 30.10 66.10
## - MPG.highway 1 33.46 69.46
## - Fuel.tank.capacity 1 43.14 79.14
## - Weight 1 432.52 468.52
## - Wheelbase 1 576.70 612.70
## - Max.Price 1 720.87 756.87
## - Turn.circle 1 792.96 828.96
## + Cylinders 4 792.96 838.96
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=36
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Wheelbase +
## Width + Weight + Turn.circle + Luggage.room + Rear.seat.room +
## Fuel.tank.capacity + Passengers + Man.trans.avail + EngineSize +
## DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Wheelbase 1 0.000 34.000
## - Man.trans.avail 1 0.000 34.000
## - Rear.seat.room 1 0.000 34.000
## <none> 0.000 36.000
## + Horsepower 1 0.000 38.000
## + MPG.city 1 0.000 38.000
## + AirBags 2 0.000 40.000
## + Type 4 0.000 44.000
## + Cylinders 4 0.000 44.000
## - Turn.circle 1 12.012 46.012
## - Max.Price 1 14.248 48.248
## - Luggage.room 1 14.310 48.310
## - RPM 1 18.532 52.532
## - DriveTrain 2 21.664 53.664
## - Passengers 1 20.892 54.892
## - Weight 1 22.410 56.410
## - Width 1 26.542 60.542
## - Length 1 29.571 63.571
## - Min.Price 1 30.102 64.102
## - EngineSize 1 32.639 66.639
## - MPG.highway 1 34.916 68.916
## - Fuel.tank.capacity 1 43.167 77.167
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=34
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Width +
## Weight + Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity +
## Passengers + Man.trans.avail + EngineSize + DriveTrain +
## RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Man.trans.avail 1 0.00 32.00
## <none> 0.00 34.00
## + Wheelbase 1 0.00 36.00
## + AirBags 2 0.00 38.00
## + Type 4 0.00 42.00
## + Cylinders 4 0.00 42.00
## - Turn.circle 1 12.14 44.14
## - Max.Price 1 14.26 46.26
## - Luggage.room 1 16.30 48.30
## - Rear.seat.room 1 17.11 49.11
## - RPM 1 19.66 51.66
## - Passengers 1 20.97 52.97
## - Weight 1 22.42 54.42
## - DriveTrain 2 24.55 54.55
## - Width 1 29.44 61.44
## - Min.Price 1 30.52 62.52
## - EngineSize 1 34.61 66.61
## - Length 1 35.19 67.19
## - MPG.highway 1 35.40 67.40
## - Fuel.tank.capacity 1 43.23 75.23
## + Horsepower 1 576.70 612.70
## + MPG.city 1 720.87 756.87
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=32
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Width +
## Weight + Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity +
## Passengers + EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 0.000 32.000
## + Man.trans.avail 1 0.000 34.000
## + MPG.city 1 0.000 34.000
## + Wheelbase 1 0.000 34.000
## + Horsepower 1 0.000 34.000
## + AirBags 2 0.000 36.000
## + Type 4 0.000 40.000
## + Cylinders 4 0.000 40.000
## - Turn.circle 1 13.111 43.111
## - Max.Price 1 15.657 45.657
## - RPM 1 22.785 52.785
## - Rear.seat.room 1 23.245 53.245
## - DriveTrain 2 25.495 53.495
## - Passengers 1 24.097 54.097
## - Luggage.room 1 25.286 55.286
## - Width 1 30.549 60.549
## - Min.Price 1 30.572 60.572
## - Weight 1 31.250 61.250
## - Length 1 37.809 67.809
## - MPG.highway 1 38.322 68.322
## - EngineSize 1 40.832 70.832
## - Fuel.tank.capacity 1 45.981 75.981
step$anova;
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Origin ~ Min.Price + Max.Price + MPG.city + Type + MPG.highway +
## AirBags + Length + Wheelbase + Width + Weight + Turn.circle +
## Luggage.room + Rear.seat.room + Fuel.tank.capacity + Passengers +
## Man.trans.avail + Horsepower + Cylinders + EngineSize + DriveTrain +
## RPM
##
## Final Model:
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Width +
## Weight + Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity +
## Passengers + EngineSize + DriveTrain + RPM
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 52 5.165456e-08 60
## 2 - Cylinders 4 1.557887e-09 56 5.009667e-08 52
## 3 - Type 4 2.439268e-08 60 7.448935e-08 44
## 4 - AirBags 2 1.474398e-07 62 2.219291e-07 40
## 5 - MPG.city 1 1.712590e-10 63 2.221004e-07 38
## 6 - Horsepower 1 2.358373e-08 64 2.456841e-07 36
## 7 - Wheelbase 1 1.398773e-08 65 2.596719e-07 34
## 8 - Man.trans.avail 1 9.799847e-07 66 1.239657e-06 32
Variable selection is an important step in modeling building. There are many procedures of variable selection in machine learning and here we just are just getting a start. Stepwise procedure to select variables often does not give good results. Neither are stepwise procedures well justified by theory. I just want you to understand the process of variable selection, and there are many other options, which we wil learn later.
Now, I will select out some variables one at a time. This part is some what of an art. This is by no means optimal, but it sort of shows you how we select variables.
glmres<-glm(Origin ~ Min.Price + Length + Width + Weight + Fuel.tank.capacity + EngineSize + DriveTrain ,
data=Cars93, family=binomial(link=logit));
summary(glmres);
##
## Call:
## glm(formula = Origin ~ Min.Price + Length + Width + Weight +
## Fuel.tank.capacity + EngineSize + DriveTrain, family = binomial(link = logit),
## data = Cars93)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.21471 -0.32840 -0.00019 0.50030 2.31225
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 54.025850 18.264187 2.958 0.00310 **
## Min.Price 0.414948 0.139692 2.970 0.00297 **
## Length -0.128320 0.055196 -2.325 0.02008 *
## Width -0.819210 0.338757 -2.418 0.01559 *
## Weight 0.006425 0.002850 2.254 0.02419 *
## Fuel.tank.capacity 1.005976 0.438570 2.294 0.02180 *
## EngineSize -7.954927 2.534468 -3.139 0.00170 **
## DriveTrainFront 3.188878 1.443631 2.209 0.02718 *
## DriveTrainRear 2.743798 2.518951 1.089 0.27604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.829 on 92 degrees of freedom
## Residual deviance: 54.842 on 84 degrees of freedom
## AIC: 72.842
##
## Number of Fisher Scoring iterations: 8
# interpreting coefficient of Length
exp(-0.128320)
## [1] 0.8795719
# interpreting cofficient of Width
exp(-0.819210)
## [1] 0.4407797
# now trying fuel tank
exp(1.005976)
## [1] 2.734575
Because both Length and Width are < 1, larger length and width means it’s more likely to be a US car.
Since fuel tank > 1, bigger fuel tank means it’s more likely to be non-US car
So if the expondent < 1, it is more likely to be the category which is alphabetically later (US) If the exponsent > 1, it is morelikely to be the category which is first alphabetically (non-US)
US cars are larger in Length and Width, and Engine Size, and more likely 4WD. US cars also have smaller fuel tank and lower min price.
View(Cars93)
Convert the categorical variables as factors.
as.factor(Cars93$origin)
## factor(0)
## Levels:
View(Cars93)
# looking at factors in categorical variables
levels(Cars93$DriveTrain)
## [1] "4WD" "Front" "Rear"
levels(Cars93$AirBags)
## [1] "Driver & Passenger" "Driver only" "None"
Run vif to determine if there are problems with multicollinearity.
vif(glmres)
## GVIF Df GVIF^(1/(2*Df))
## Min.Price 5.598913 1 2.366202
## Length 3.510262 1 1.873569
## Width 7.141993 1 2.672451
## Weight 19.650156 1 4.432850
## Fuel.tank.capacity 12.070956 1 3.474328
## EngineSize 20.175956 1 4.491765
## DriveTrain 2.764046 2 1.289396
We should remove any variables with VIF greater then 10. So we should remove Weight, Fuel tank capacity, and Engine Size.
Often to assess model accuracy, ROC AUC is used. The higher the ROC AUC, the better. ROC AUC at 0.5 when model has no precitive power, while at ROC AUC near 1 the model is very accurate. Typical good models are in the range of 0.8 to 0.9.
library(pROC);
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
rocobj<-roc(Cars93$Origin, glmres$fitted.values);
## Setting levels: control = USA, case = non-USA
## Setting direction: controls < cases
auc(rocobj);
## Area under the curve: 0.9458
coords(rocobj, "best", "threshold")$threshold
## [1] 0.5815575
Ideally, we need to segment the data and use train, validation and test data sets and ROC AUC should be assessed on test data set. Here, I just show how ROC AUC is done. When we are more familiar with data subsetting, we will come back on this again.
plot.roc(Cars93$Origin, glmres$fitted.values,
main="Confidence interval of a threshold", percent=TRUE,
ci=TRUE, of="thresholds", # compute AUC (of threshold)
thresholds="best", # select the (best) threshold
print.thres="best") # also highlight this threshold on the plot
## Setting levels: control = USA, case = non-USA
## Setting direction: controls < cases
what am i looking at on this graph? the specificity and the sensitivity are both high
but really just look at the auc value if it is 1 it’s perfect if it’s 0.5 you still need to work on the model
rocobj<-roc(Cars93$Origin, glmres$fitted.values);
## Setting levels: control = USA, case = non-USA
## Setting direction: controls < cases
rocobj
##
## Call:
## roc.default(response = Cars93$Origin, predictor = glmres$fitted.values)
##
## Data: glmres$fitted.values in 48 controls (Cars93$Origin USA) < 45 cases (Cars93$Origin non-USA).
## Area under the curve: 0.9458
We are comparing means between two groups: we do this 3 ways, the horrible manual way, the anova way, the Tukey’s way.
Analyzing a small table on two groups.
First read the data. (For convenience, I copied all data used in this homework assignment into an Excel file.)
Avoid this if possible– see next two problems for examples of getting the same answer using aov and Tukey’s
Calculate the group means and global mean, calculate MSA and MSW, and F. Use qf(0.95, df1=?, df2=?) to get the critical value for F. Is the difference significant?
Calculating df1 and df2: df1: # of citys/ groups - 1 df2: # of rows - # of citys/ groups
First read the data. (For convenience, I copied all data used in this homework assignment into an Excel file.)
* Note: you only need the Excel file for the horrible manual way, aov and Tukey’s will work without it * If you want the horrible manual way to work, put the Excel file HW1_Data.xlsx in the same folder as this R markdown file.
Calculate the group means
library(readxl)
df <- read_excel("HW1_Data.xlsx")
df
## # A tibble: 6 × 2
## Group Value
## <dbl> <dbl>
## 1 1 1.3
## 2 1 2.2
## 3 2 3.7
## 4 2 2.8
## 5 2 3.2
## 6 2 2.7
aggregate(df$Value, list(df$Group), FUN=mean)
## Group.1 x
## 1 1 1.75
## 2 2 3.10
add to our dataframe
df$GroupMean = c(1.75, 1.75, 3.10, 3.10, 3.10, 3.10)
df
## # A tibble: 6 × 3
## Group Value GroupMean
## <dbl> <dbl> <dbl>
## 1 1 1.3 1.75
## 2 1 2.2 1.75
## 3 2 3.7 3.1
## 4 2 2.8 3.1
## 5 2 3.2 3.1
## 6 2 2.7 3.1
Now calculate global mean
global_mean = mean(df$Value)
global_mean
## [1] 2.65
Now start calculating SSW
# first the diff between each value and its group mean
df$diff = df$Value - df$GroupMean
df
## # A tibble: 6 × 4
## Group Value GroupMean diff
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.3 1.75 -0.45
## 2 1 2.2 1.75 0.45
## 3 2 3.7 3.1 0.6
## 4 2 2.8 3.1 -0.300
## 5 2 3.2 3.1 0.100
## 6 2 2.7 3.1 -0.4
# now square each diff
df$diffsquared = df$diff^2
df
## # A tibble: 6 × 5
## Group Value GroupMean diff diffsquared
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.3 1.75 -0.45 0.202
## 2 1 2.2 1.75 0.45 0.203
## 3 2 3.7 3.1 0.6 0.36
## 4 2 2.8 3.1 -0.300 0.0900
## 5 2 3.2 3.1 0.100 0.0100
## 6 2 2.7 3.1 -0.4 0.16
finally calculate SSW
SSW = sum(df$diffsquared)
SSW
## [1] 1.025
now start calculating SSA
# grab the values we need into a new df and add the global mean
for_ssa = df[ , c('Group', 'Value')]
for_ssa$Mean = global_mean
for_ssa
## # A tibble: 6 × 3
## Group Value Mean
## <dbl> <dbl> <dbl>
## 1 1 1.3 2.65
## 2 1 2.2 2.65
## 3 2 3.7 2.65
## 4 2 2.8 2.65
## 5 2 3.2 2.65
## 6 2 2.7 2.65
# calculate each diff and square it
for_ssa$diff_squared = (for_ssa$Value - for_ssa$Mean)^2
for_ssa
## # A tibble: 6 × 4
## Group Value Mean diff_squared
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.3 2.65 1.82
## 2 1 2.2 2.65 0.202
## 3 2 3.7 2.65 1.10
## 4 2 2.8 2.65 0.0225
## 5 2 3.2 2.65 0.303
## 6 2 2.7 2.65 0.00250
# now calculate SSA
SSA = sum(for_ssa$diff_squared)
SSA
## [1] 3.455
Now calculate MSA and MSW
# MSA
n = 6
group1_n = 2
MSA = SSA/(n - group1_n)
MSA
## [1] 0.86375
# MSW
MSW = SSW/(2-1)
MSW
## [1] 1.025
Now calculate F
F = MSA/MSW
F
## [1] 0.8426829
Use qf(0.95, df1=?, df2=?) to get the critical value for F. Is the difference significant?
df1=2-1; #Degree of freedom for Mean Square Among = of groups - 1
df2=6-2; #Degree of freedom for Mean Square Within = of rows - # of groups
qf(0.95, df1, df2)
## [1] 7.708647
So our f value is less then the critical value of 7.7, which means this is signficant.
The F test is significant!
dataq2 = as.data.frame(list(c(1.3,2.2,3.7,2.8,3.2,2.7), c("Group 1","Group 1","Group 2","Group 2","Group 2","Group 2")));
colnames(dataq2)=c("Value", "Group");
dataq2
## Value Group
## 1 1.3 Group 1
## 2 2.2 Group 1
## 3 3.7 Group 2
## 4 2.8 Group 2
## 5 3.2 Group 2
## 6 2.7 Group 2
print("column names are: ")
## [1] "column names are: "
colnames(dataq2);
## [1] "Value" "Group"
print("column 1 is")
## [1] "column 1 is"
dataq2[,1];
## [1] 1.3 2.2 3.7 2.8 3.2 2.7
print("column 2 is")
## [1] "column 2 is"
dataq2[,2];
## [1] "Group 1" "Group 1" "Group 2" "Group 2" "Group 2" "Group 2"
model<-aov(Value~Group, dataq2);
summary(model);
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 2.430 2.4300 9.483 0.037 *
## Residuals 4 1.025 0.2562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less then 5%, this is significant! Same answer as above but with much less work!
Again,avoid the horrible manual way if at all possible!!
TukeyHSD(model, conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Value ~ Group, data = dataq2)
##
## $Group
## diff lwr upr p adj
## Group 2-Group 1 1.35 0.1328235 2.567176 0.0369525
Interpretation: Because the pvalue is less the 5%, this is significant! We have statistically significant data to say that the groups are different.
datamatrix = matrix(c(470,3500,430,3300,150,1000),nrow=3, ncol=2, byrow=TRUE, dimnames=list(c('email v1', 'email v2', 'control'), c("clicked", "not clicked")))
datamatrix
## clicked not clicked
## email v1 470 3500
## email v2 430 3300
## control 150 1000
chisq.test(datamatrix)
##
## Pearson's Chi-squared test
##
## data: datamatrix
## X-squared = 1.9347, df = 2, p-value = 0.3801
# computing the new p-value with the bonferoni adjustment
# we need the bonferoni test because we have more then 2 categories
bon = choose(3,2) # 3 because we have 3 rows, 2 because it's binary
bon
## [1] 3
new_significance = 0.05/bon
new_significance
## [1] 0.01666667
Since the p-value of 0.3801 is higer then our new significance of 0.016666 this is not significant We cannot prove that there is a statistically significant differcen between the categories.
library(dplyr);
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv"
df <- read.csv(PATH) %>% dplyr::select(-X) %>% mutate(poison = factor(poison, ordered = TRUE));
glimpse(df);
## Rows: 48
## Columns: 3
## $ time <dbl> 0.31, 0.45, 0.46, 0.43, 0.36, 0.29, 0.40, 0.23, 0.22, 0.21, 0.1…
## $ poison <ord> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, …
## $ treat <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B"…
library(car); # Levene's test with one independent variable
leveneTest(time ~ poison, data = df); # p-value > 0.05 means variances are not significantly different
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 4.1964 0.02133 *
## 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(df[which(df$poison==1),]$time);
##
## Shapiro-Wilk normality test
##
## data: df[which(df$poison == 1), ]$time
## W = 0.93235, p-value = 0.2656
shapiro.test(df[which(df$poison==2),]$time);
##
## Shapiro-Wilk normality test
##
## data: df[which(df$poison == 2), ]$time
## W = 0.85254, p-value = 0.01485
shapiro.test(df[which(df$poison==3),]$time);
##
## Shapiro-Wilk normality test
##
## data: df[which(df$poison == 3), ]$time
## W = 0.93596, p-value = 0.3024
# p-value > 0.05 means distribution is not significantly different from a normal distribution
Conclusion: for this sample, homogeneity of variance is not satified and not all distributions are normal.
mydata<-matrix(c(50,950,40,960),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c(“treat”,“control”),c(“clicked”,“nonlicked”)))
Do a Chi Squared Test (with and without Yate’s Continuity Correction) and Fisher exact test on it. In addition, use simulate.p.value option with these tests. List out the results.
Is it significant?
If it is not significant, pick one test that you think is appropriate. Use that test to determine how much better the Treatment cell has to be (more clicks and fewer non clicks) without increasing the overall sample size, to get a significant test result? Write program in R to show the p-value as the Treatment sample improves. Make a plot of p-values vs sample clicks.
If the click through rates for the Treatment and Control cells maintain the same, what is the minimum sample size in order to get a significant test? Write program in R to show the p-value as the sample size changes. Make a plot of p-values vs sample size.
mydata<-matrix(c(50,950,40,960),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c("treat","control"),c("clicked","nonlicked")));
chisq.test(mydata);
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata
## X-squared = 0.94241, df = 1, p-value = 0.3317
mydata
## clicked nonlicked
## treat 50 950
## control 40 960
Conclusion: Because the p-value is over 5%, the difference between groups is not significant.
The chi square test is not significant.
How high the treatment cell conversion rate has to be in order to get a significant test using the same sample size?
addclick<-1:100; xvalue<-seq(1:100); yvalue<-rep(1,100); #We have to initiate the variables to hold the x-values and y-values;
print("This is for test a higer conversion rate:");
## [1] "This is for test a higer conversion rate:"
istop=1;
for (i in addclick) {
mydata<-matrix(c(50+i,950-i,40,960),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c("treat","control"),c("clicked","nonclicked")));
cst<-chisq.test(mydata);
print(mydata); print(cst$p.value);
istop=i;
xvalue[i]<-50+i;
yvalue[i]<-cst$p.value;
if (cst$p.value < 0.05) {break;}
}
## clicked nonclicked
## treat 51 949
## control 40 960
## [1] 0.2832804
## clicked nonclicked
## treat 52 948
## control 40 960
## [1] 0.2403339
## clicked nonclicked
## treat 53 947
## control 40 960
## [1] 0.2025491
## clicked nonclicked
## treat 54 946
## control 40 960
## [1] 0.1695927
## clicked nonclicked
## treat 55 945
## control 40 960
## [1] 0.1410887
## clicked nonclicked
## treat 56 944
## control 40 960
## [1] 0.116636
## clicked nonclicked
## treat 57 943
## control 40 960
## [1] 0.09582445
## clicked nonclicked
## treat 58 942
## control 40 960
## [1] 0.07824783
## clicked nonclicked
## treat 59 941
## control 40 960
## [1] 0.06351396
## clicked nonclicked
## treat 60 940
## control 40 960
## [1] 0.05125258
## clicked nonclicked
## treat 61 939
## control 40 960
## [1] 0.04112072
plot(xvalue[1:istop], yvalue[1:istop], xlab = "Number of Clicks Needed");
If the treatment conversion rates stay the same, how larger the samples have to be to make a significant test.
print("This is for test sample size:");
## [1] "This is for test sample size:"
addclick<-1:100; xvalue<-seq(1:100); yvalue<-rep(1,100);
istop=1;
for (i in addclick) {
mydata<-matrix(c(round(50*(1+i/10.0)),round(950*(1+i/10.0)),round(40*(1+i/10.0)), round(960*(1+i/10.0))),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c("treat","control"),c("clicked","nonclicked")));
cst<-chisq.test(mydata);
print(mydata); print(cst$p.value);
istop=i;
xvalue[i]<-round(1000*(1+i/10.0));
yvalue[i]<-cst$p.value
if (cst$p.value < 0.05) {break;}
};
## clicked nonclicked
## treat 55 1045
## control 44 1056
## [1] 0.3037409
## clicked nonclicked
## treat 60 1140
## control 48 1152
## [1] 0.2787523
## clicked nonclicked
## treat 65 1235
## control 52 1248
## [1] 0.2562759
## clicked nonclicked
## treat 70 1330
## control 56 1344
## [1] 0.2359764
## clicked nonclicked
## treat 75 1425
## control 60 1440
## [1] 0.2175796
## clicked nonclicked
## treat 80 1520
## control 64 1536
## [1] 0.2008579
## clicked nonclicked
## treat 85 1615
## control 68 1632
## [1] 0.1856199
## clicked nonclicked
## treat 90 1710
## control 72 1728
## [1] 0.171703
## clicked nonclicked
## treat 95 1805
## control 76 1824
## [1] 0.1589678
## clicked nonclicked
## treat 100 1900
## control 80 1920
## [1] 0.1472935
## clicked nonclicked
## treat 105 1995
## control 84 2016
## [1] 0.1365752
## clicked nonclicked
## treat 110 2090
## control 88 2112
## [1] 0.1267208
## clicked nonclicked
## treat 115 2185
## control 92 2208
## [1] 0.1176493
## clicked nonclicked
## treat 120 2280
## control 96 2304
## [1] 0.1092888
## clicked nonclicked
## treat 125 2375
## control 100 2400
## [1] 0.1015756
## clicked nonclicked
## treat 130 2470
## control 104 2496
## [1] 0.09445288
## clicked nonclicked
## treat 135 2565
## control 108 2592
## [1] 0.08786954
## clicked nonclicked
## treat 140 2660
## control 112 2688
## [1] 0.08177985
## clicked nonclicked
## treat 145 2755
## control 116 2784
## [1] 0.07614258
## clicked nonclicked
## treat 150 2850
## control 120 2880
## [1] 0.07092049
## clicked nonclicked
## treat 155 2945
## control 124 2976
## [1] 0.06607989
## clicked nonclicked
## treat 160 3040
## control 128 3072
## [1] 0.06159021
## clicked nonclicked
## treat 165 3135
## control 132 3168
## [1] 0.05742367
## clicked nonclicked
## treat 170 3230
## control 136 3264
## [1] 0.05355497
## clicked nonclicked
## treat 175 3325
## control 140 3360
## [1] 0.04996104
plot(xvalue[1:istop], yvalue[1:istop],xlab = "Sample Size Needed");
Conclusion: we need treatment cell to have 61 clicks out of 1000 to get a singificant test; if we don’t change the conversion rates, we need a sample of 3500 out of which 175 will be clicks and 3325 non clicks to get a significant test.