Part 1: questions from the sample midterm

Creating a model, choosing predictor variables, fine-tuning

Logistic & Linear models

First, a linear model: creating it, choosing predictor variables, fine-tuning

library(car);
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.2
library(MASS);
## Warning: package 'MASS' was built under R version 4.1.2

Let’s take a look at the dataset we’ll be working with.

typeof(cement)
## [1] "list"
View(cement)
?cement

Let’s build a linear model for the MASS:cement data set

lmres<-lm(y~x1+x2+x3+x4, data=MASS::cement);
summary(lmres);
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = MASS::cement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

We can see that none of the variables are well determined. This is because of the p values are larger than 0.05. Let’s remove one variable, x3 and see what happens.

lmres<-lm(y~x1+x2+x4, data=MASS::cement);
summary(lmres);
## 
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = MASS::cement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0919 -1.8016  0.2562  1.2818  3.8982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.6483    14.1424   5.066 0.000675 ***
## x1            1.4519     0.1170  12.410 5.78e-07 ***
## x2            0.4161     0.1856   2.242 0.051687 .  
## x4           -0.2365     0.1733  -1.365 0.205395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

Let’s also remove x4

lmres<-lm(y~x1+x2, data=MASS::cement);
summary(lmres);
## 
## Call:
## lm(formula = y ~ x1 + x2, data = MASS::cement)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.893 -1.574 -1.302  1.363  4.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
## x1           1.46831    0.12130   12.11 2.69e-07 ***
## x2           0.66225    0.04585   14.44 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9744 
## F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09

As we see, now the remaining coefficients are well determined. This says that the p-values of predictors depend on the presence of other variables. This is why we can’t just remove all variables whose p-value is larger than 0.05. StepAIC automatically selects variables for us.

lmres<-lm(y~x1+x2+x3+x4, data=MASS::cement);
step<-stepAIC(lmres, direction='both');
## Start:  AIC=26.94
## y ~ x1 + x2 + x3 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1    0.1091 47.973 24.974
## - x4    1    0.2470 48.111 25.011
## - x2    1    2.9725 50.836 25.728
## <none>              47.864 26.944
## - x1    1   25.9509 73.815 30.576
## 
## Step:  AIC=24.97
## y ~ x1 + x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## <none>               47.97 24.974
## - x4    1      9.93  57.90 25.420
## + x3    1      0.11  47.86 26.944
## - x2    1     26.79  74.76 28.742
## - x1    1    820.91 868.88 60.629
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## y ~ x1 + x2 + x3 + x4
## 
## Final Model:
## y ~ x1 + x2 + x4
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                          8   47.86364 26.94429
## 2 - x3  1  0.10909         9   47.97273 24.97388

As we can see, the results from stepAIC are different from what we did. The final model from stepAIC looks problematic:

lmres<-lm(y~x1+x2+x4, data=MASS::cement);
summary(lmres);
## 
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = MASS::cement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0919 -1.8016  0.2562  1.2818  3.8982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.6483    14.1424   5.066 0.000675 ***
## x1            1.4519     0.1170  12.410 5.78e-07 ***
## x2            0.4161     0.1856   2.242 0.051687 .  
## x4           -0.2365     0.1733  -1.365 0.205395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08

x2, and x4 are not well determined.

I still like y~x1+x2 better.

Logistical model: creating it, choosing predictor variables, fine-tuning it

library(MASS);
data("Cars93");
Cars93;
##     Manufacturer          Model    Type Min.Price Price Max.Price MPG.city
## 1          Acura        Integra   Small      12.9  15.9      18.8       25
## 2          Acura         Legend Midsize      29.2  33.9      38.7       18
## 3           Audi             90 Compact      25.9  29.1      32.3       20
## 4           Audi            100 Midsize      30.8  37.7      44.6       19
## 5            BMW           535i Midsize      23.7  30.0      36.2       22
## 6          Buick        Century Midsize      14.2  15.7      17.3       22
## 7          Buick        LeSabre   Large      19.9  20.8      21.7       19
## 8          Buick     Roadmaster   Large      22.6  23.7      24.9       16
## 9          Buick        Riviera Midsize      26.3  26.3      26.3       19
## 10      Cadillac        DeVille   Large      33.0  34.7      36.3       16
## 11      Cadillac        Seville Midsize      37.5  40.1      42.7       16
## 12     Chevrolet       Cavalier Compact       8.5  13.4      18.3       25
## 13     Chevrolet        Corsica Compact      11.4  11.4      11.4       25
## 14     Chevrolet         Camaro  Sporty      13.4  15.1      16.8       19
## 15     Chevrolet         Lumina Midsize      13.4  15.9      18.4       21
## 16     Chevrolet     Lumina_APV     Van      14.7  16.3      18.0       18
## 17     Chevrolet          Astro     Van      14.7  16.6      18.6       15
## 18     Chevrolet        Caprice   Large      18.0  18.8      19.6       17
## 19     Chevrolet       Corvette  Sporty      34.6  38.0      41.5       17
## 20      Chrylser       Concorde   Large      18.4  18.4      18.4       20
## 21      Chrysler        LeBaron Compact      14.5  15.8      17.1       23
## 22      Chrysler       Imperial   Large      29.5  29.5      29.5       20
## 23         Dodge           Colt   Small       7.9   9.2      10.6       29
## 24         Dodge         Shadow   Small       8.4  11.3      14.2       23
## 25         Dodge         Spirit Compact      11.9  13.3      14.7       22
## 26         Dodge        Caravan     Van      13.6  19.0      24.4       17
## 27         Dodge        Dynasty Midsize      14.8  15.6      16.4       21
## 28         Dodge        Stealth  Sporty      18.5  25.8      33.1       18
## 29         Eagle         Summit   Small       7.9  12.2      16.5       29
## 30         Eagle         Vision   Large      17.5  19.3      21.2       20
## 31          Ford        Festiva   Small       6.9   7.4       7.9       31
## 32          Ford         Escort   Small       8.4  10.1      11.9       23
## 33          Ford          Tempo Compact      10.4  11.3      12.2       22
## 34          Ford        Mustang  Sporty      10.8  15.9      21.0       22
## 35          Ford          Probe  Sporty      12.8  14.0      15.2       24
## 36          Ford       Aerostar     Van      14.5  19.9      25.3       15
## 37          Ford         Taurus Midsize      15.6  20.2      24.8       21
## 38          Ford Crown_Victoria   Large      20.1  20.9      21.7       18
## 39           Geo          Metro   Small       6.7   8.4      10.0       46
## 40           Geo          Storm  Sporty      11.5  12.5      13.5       30
## 41         Honda        Prelude  Sporty      17.0  19.8      22.7       24
## 42         Honda          Civic   Small       8.4  12.1      15.8       42
## 43         Honda         Accord Compact      13.8  17.5      21.2       24
## 44       Hyundai          Excel   Small       6.8   8.0       9.2       29
## 45       Hyundai        Elantra   Small       9.0  10.0      11.0       22
## 46       Hyundai         Scoupe  Sporty       9.1  10.0      11.0       26
## 47       Hyundai         Sonata Midsize      12.4  13.9      15.3       20
## 48      Infiniti            Q45 Midsize      45.4  47.9      50.4       17
## 49         Lexus          ES300 Midsize      27.5  28.0      28.4       18
## 50         Lexus          SC300 Midsize      34.7  35.2      35.6       18
## 51       Lincoln    Continental Midsize      33.3  34.3      35.3       17
## 52       Lincoln       Town_Car   Large      34.4  36.1      37.8       18
## 53         Mazda            323   Small       7.4   8.3       9.1       29
## 54         Mazda        Protege   Small      10.9  11.6      12.3       28
## 55         Mazda            626 Compact      14.3  16.5      18.7       26
## 56         Mazda            MPV     Van      16.6  19.1      21.7       18
## 57         Mazda           RX-7  Sporty      32.5  32.5      32.5       17
## 58 Mercedes-Benz           190E Compact      29.0  31.9      34.9       20
## 59 Mercedes-Benz           300E Midsize      43.8  61.9      80.0       19
## 60       Mercury          Capri  Sporty      13.3  14.1      15.0       23
## 61       Mercury         Cougar Midsize      14.9  14.9      14.9       19
## 62    Mitsubishi         Mirage   Small       7.7  10.3      12.9       29
## 63    Mitsubishi       Diamante Midsize      22.4  26.1      29.9       18
## 64        Nissan         Sentra   Small       8.7  11.8      14.9       29
## 65        Nissan         Altima Compact      13.0  15.7      18.3       24
## 66        Nissan          Quest     Van      16.7  19.1      21.5       17
## 67        Nissan         Maxima Midsize      21.0  21.5      22.0       21
## 68    Oldsmobile        Achieva Compact      13.0  13.5      14.0       24
## 69    Oldsmobile  Cutlass_Ciera Midsize      14.2  16.3      18.4       23
## 70    Oldsmobile     Silhouette     Van      19.5  19.5      19.5       18
## 71    Oldsmobile   Eighty-Eight   Large      19.5  20.7      21.9       19
## 72      Plymouth          Laser  Sporty      11.4  14.4      17.4       23
## 73       Pontiac         LeMans   Small       8.2   9.0       9.9       31
## 74       Pontiac        Sunbird Compact       9.4  11.1      12.8       23
## 75       Pontiac       Firebird  Sporty      14.0  17.7      21.4       19
## 76       Pontiac     Grand_Prix Midsize      15.4  18.5      21.6       19
## 77       Pontiac     Bonneville   Large      19.4  24.4      29.4       19
## 78          Saab            900 Compact      20.3  28.7      37.1       20
## 79        Saturn             SL   Small       9.2  11.1      12.9       28
## 80        Subaru          Justy   Small       7.3   8.4       9.5       33
## 81        Subaru         Loyale   Small      10.5  10.9      11.3       25
## 82        Subaru         Legacy Compact      16.3  19.5      22.7       23
## 83        Suzuki          Swift   Small       7.3   8.6      10.0       39
## 84        Toyota         Tercel   Small       7.8   9.8      11.8       32
## 85        Toyota         Celica  Sporty      14.2  18.4      22.6       25
## 86        Toyota          Camry Midsize      15.2  18.2      21.2       22
## 87        Toyota         Previa     Van      18.9  22.7      26.6       18
## 88    Volkswagen            Fox   Small       8.7   9.1       9.5       25
## 89    Volkswagen        Eurovan     Van      16.6  19.7      22.7       17
## 90    Volkswagen         Passat Compact      17.6  20.0      22.4       21
## 91    Volkswagen        Corrado  Sporty      22.9  23.3      23.7       18
## 92         Volvo            240 Compact      21.8  22.7      23.5       21
## 93         Volvo            850 Midsize      24.8  26.7      28.5       20
##    MPG.highway            AirBags DriveTrain Cylinders EngineSize Horsepower
## 1           31               None      Front         4        1.8        140
## 2           25 Driver & Passenger      Front         6        3.2        200
## 3           26        Driver only      Front         6        2.8        172
## 4           26 Driver & Passenger      Front         6        2.8        172
## 5           30        Driver only       Rear         4        3.5        208
## 6           31        Driver only      Front         4        2.2        110
## 7           28        Driver only      Front         6        3.8        170
## 8           25        Driver only       Rear         6        5.7        180
## 9           27        Driver only      Front         6        3.8        170
## 10          25        Driver only      Front         8        4.9        200
## 11          25 Driver & Passenger      Front         8        4.6        295
## 12          36               None      Front         4        2.2        110
## 13          34        Driver only      Front         4        2.2        110
## 14          28 Driver & Passenger       Rear         6        3.4        160
## 15          29               None      Front         4        2.2        110
## 16          23               None      Front         6        3.8        170
## 17          20               None        4WD         6        4.3        165
## 18          26        Driver only       Rear         8        5.0        170
## 19          25        Driver only       Rear         8        5.7        300
## 20          28 Driver & Passenger      Front         6        3.3        153
## 21          28 Driver & Passenger      Front         4        3.0        141
## 22          26        Driver only      Front         6        3.3        147
## 23          33               None      Front         4        1.5         92
## 24          29        Driver only      Front         4        2.2         93
## 25          27        Driver only      Front         4        2.5        100
## 26          21        Driver only        4WD         6        3.0        142
## 27          27        Driver only      Front         4        2.5        100
## 28          24        Driver only        4WD         6        3.0        300
## 29          33               None      Front         4        1.5         92
## 30          28 Driver & Passenger      Front         6        3.5        214
## 31          33               None      Front         4        1.3         63
## 32          30               None      Front         4        1.8        127
## 33          27               None      Front         4        2.3         96
## 34          29        Driver only       Rear         4        2.3        105
## 35          30        Driver only      Front         4        2.0        115
## 36          20        Driver only        4WD         6        3.0        145
## 37          30        Driver only      Front         6        3.0        140
## 38          26        Driver only       Rear         8        4.6        190
## 39          50               None      Front         3        1.0         55
## 40          36        Driver only      Front         4        1.6         90
## 41          31 Driver & Passenger      Front         4        2.3        160
## 42          46        Driver only      Front         4        1.5        102
## 43          31 Driver & Passenger      Front         4        2.2        140
## 44          33               None      Front         4        1.5         81
## 45          29               None      Front         4        1.8        124
## 46          34               None      Front         4        1.5         92
## 47          27               None      Front         4        2.0        128
## 48          22        Driver only       Rear         8        4.5        278
## 49          24        Driver only      Front         6        3.0        185
## 50          23 Driver & Passenger       Rear         6        3.0        225
## 51          26 Driver & Passenger      Front         6        3.8        160
## 52          26 Driver & Passenger       Rear         8        4.6        210
## 53          37               None      Front         4        1.6         82
## 54          36               None      Front         4        1.8        103
## 55          34        Driver only      Front         4        2.5        164
## 56          24               None        4WD         6        3.0        155
## 57          25        Driver only       Rear    rotary        1.3        255
## 58          29        Driver only       Rear         4        2.3        130
## 59          25 Driver & Passenger       Rear         6        3.2        217
## 60          26        Driver only      Front         4        1.6        100
## 61          26               None       Rear         6        3.8        140
## 62          33               None      Front         4        1.5         92
## 63          24        Driver only      Front         6        3.0        202
## 64          33        Driver only      Front         4        1.6        110
## 65          30        Driver only      Front         4        2.4        150
## 66          23               None      Front         6        3.0        151
## 67          26        Driver only      Front         6        3.0        160
## 68          31               None      Front         4        2.3        155
## 69          31        Driver only      Front         4        2.2        110
## 70          23               None      Front         6        3.8        170
## 71          28        Driver only      Front         6        3.8        170
## 72          30               None        4WD         4        1.8         92
## 73          41               None      Front         4        1.6         74
## 74          31               None      Front         4        2.0        110
## 75          28 Driver & Passenger       Rear         6        3.4        160
## 76          27               None      Front         6        3.4        200
## 77          28 Driver & Passenger      Front         6        3.8        170
## 78          26        Driver only      Front         4        2.1        140
## 79          38        Driver only      Front         4        1.9         85
## 80          37               None        4WD         3        1.2         73
## 81          30               None        4WD         4        1.8         90
## 82          30        Driver only        4WD         4        2.2        130
## 83          43               None      Front         3        1.3         70
## 84          37        Driver only      Front         4        1.5         82
## 85          32        Driver only      Front         4        2.2        135
## 86          29        Driver only      Front         4        2.2        130
## 87          22        Driver only        4WD         4        2.4        138
## 88          33               None      Front         4        1.8         81
## 89          21               None      Front         5        2.5        109
## 90          30               None      Front         4        2.0        134
## 91          25               None      Front         6        2.8        178
## 92          28        Driver only       Rear         4        2.3        114
## 93          28 Driver & Passenger      Front         5        2.4        168
##     RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
## 1  6300         2890             Yes               13.2          5    177
## 2  5500         2335             Yes               18.0          5    195
## 3  5500         2280             Yes               16.9          5    180
## 4  5500         2535             Yes               21.1          6    193
## 5  5700         2545             Yes               21.1          4    186
## 6  5200         2565              No               16.4          6    189
## 7  4800         1570              No               18.0          6    200
## 8  4000         1320              No               23.0          6    216
## 9  4800         1690              No               18.8          5    198
## 10 4100         1510              No               18.0          6    206
## 11 6000         1985              No               20.0          5    204
## 12 5200         2380             Yes               15.2          5    182
## 13 5200         2665             Yes               15.6          5    184
## 14 4600         1805             Yes               15.5          4    193
## 15 5200         2595              No               16.5          6    198
## 16 4800         1690              No               20.0          7    178
## 17 4000         1790              No               27.0          8    194
## 18 4200         1350              No               23.0          6    214
## 19 5000         1450             Yes               20.0          2    179
## 20 5300         1990              No               18.0          6    203
## 21 5000         2090              No               16.0          6    183
## 22 4800         1785              No               16.0          6    203
## 23 6000         3285             Yes               13.2          5    174
## 24 4800         2595             Yes               14.0          5    172
## 25 4800         2535             Yes               16.0          6    181
## 26 5000         1970              No               20.0          7    175
## 27 4800         2465              No               16.0          6    192
## 28 6000         2120             Yes               19.8          4    180
## 29 6000         2505             Yes               13.2          5    174
## 30 5800         1980              No               18.0          6    202
## 31 5000         3150             Yes               10.0          4    141
## 32 6500         2410             Yes               13.2          5    171
## 33 4200         2805             Yes               15.9          5    177
## 34 4600         2285             Yes               15.4          4    180
## 35 5500         2340             Yes               15.5          4    179
## 36 4800         2080             Yes               21.0          7    176
## 37 4800         1885              No               16.0          5    192
## 38 4200         1415              No               20.0          6    212
## 39 5700         3755             Yes               10.6          4    151
## 40 5400         3250             Yes               12.4          4    164
## 41 5800         2855             Yes               15.9          4    175
## 42 5900         2650             Yes               11.9          4    173
## 43 5600         2610             Yes               17.0          4    185
## 44 5500         2710             Yes               11.9          5    168
## 45 6000         2745             Yes               13.7          5    172
## 46 5550         2540             Yes               11.9          4    166
## 47 6000         2335             Yes               17.2          5    184
## 48 6000         1955              No               22.5          5    200
## 49 5200         2325             Yes               18.5          5    188
## 50 6000         2510             Yes               20.6          4    191
## 51 4400         1835              No               18.4          6    205
## 52 4600         1840              No               20.0          6    219
## 53 5000         2370             Yes               13.2          4    164
## 54 5500         2220             Yes               14.5          5    172
## 55 5600         2505             Yes               15.5          5    184
## 56 5000         2240              No               19.6          7    190
## 57 6500         2325             Yes               20.0          2    169
## 58 5100         2425             Yes               14.5          5    175
## 59 5500         2220              No               18.5          5    187
## 60 5750         2475             Yes               11.1          4    166
## 61 3800         1730              No               18.0          5    199
## 62 6000         2505             Yes               13.2          5    172
## 63 6000         2210              No               19.0          5    190
## 64 6000         2435             Yes               13.2          5    170
## 65 5600         2130             Yes               15.9          5    181
## 66 4800         2065              No               20.0          7    190
## 67 5200         2045              No               18.5          5    188
## 68 6000         2380              No               15.2          5    188
## 69 5200         2565              No               16.5          5    190
## 70 4800         1690              No               20.0          7    194
## 71 4800         1570              No               18.0          6    201
## 72 5000         2360             Yes               15.9          4    173
## 73 5600         3130             Yes               13.2          4    177
## 74 5200         2665             Yes               15.2          5    181
## 75 4600         1805             Yes               15.5          4    196
## 76 5000         1890             Yes               16.5          5    195
## 77 4800         1565              No               18.0          6    177
## 78 6000         2910             Yes               18.0          5    184
## 79 5000         2145             Yes               12.8          5    176
## 80 5600         2875             Yes                9.2          4    146
## 81 5200         3375             Yes               15.9          5    175
## 82 5600         2330             Yes               15.9          5    179
## 83 6000         3360             Yes               10.6          4    161
## 84 5200         3505             Yes               11.9          5    162
## 85 5400         2405             Yes               15.9          4    174
## 86 5400         2340             Yes               18.5          5    188
## 87 5000         2515             Yes               19.8          7    187
## 88 5500         2550             Yes               12.4          4    163
## 89 4500         2915             Yes               21.1          7    187
## 90 5800         2685             Yes               18.5          5    180
## 91 5800         2385             Yes               18.5          4    159
## 92 5400         2215             Yes               15.8          5    190
## 93 6200         2310             Yes               19.3          5    184
##    Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
## 1        102    68          37           26.5           11   2705 non-USA
## 2        115    71          38           30.0           15   3560 non-USA
## 3        102    67          37           28.0           14   3375 non-USA
## 4        106    70          37           31.0           17   3405 non-USA
## 5        109    69          39           27.0           13   3640 non-USA
## 6        105    69          41           28.0           16   2880     USA
## 7        111    74          42           30.5           17   3470     USA
## 8        116    78          45           30.5           21   4105     USA
## 9        108    73          41           26.5           14   3495     USA
## 10       114    73          43           35.0           18   3620     USA
## 11       111    74          44           31.0           14   3935     USA
## 12       101    66          38           25.0           13   2490     USA
## 13       103    68          39           26.0           14   2785     USA
## 14       101    74          43           25.0           13   3240     USA
## 15       108    71          40           28.5           16   3195     USA
## 16       110    74          44           30.5           NA   3715     USA
## 17       111    78          42           33.5           NA   4025     USA
## 18       116    77          42           29.5           20   3910     USA
## 19        96    74          43             NA           NA   3380     USA
## 20       113    74          40           31.0           15   3515     USA
## 21       104    68          41           30.5           14   3085     USA
## 22       110    69          44           36.0           17   3570     USA
## 23        98    66          32           26.5           11   2270     USA
## 24        97    67          38           26.5           13   2670     USA
## 25       104    68          39           30.5           14   2970     USA
## 26       112    72          42           26.5           NA   3705     USA
## 27       105    69          42           30.5           16   3080     USA
## 28        97    72          40           20.0           11   3805     USA
## 29        98    66          36           26.5           11   2295     USA
## 30       113    74          40           30.0           15   3490     USA
## 31        90    63          33           26.0           12   1845     USA
## 32        98    67          36           28.0           12   2530     USA
## 33       100    68          39           27.5           13   2690     USA
## 34       101    68          40           24.0           12   2850     USA
## 35       103    70          38           23.0           18   2710     USA
## 36       119    72          45           30.0           NA   3735     USA
## 37       106    71          40           27.5           18   3325     USA
## 38       114    78          43           30.0           21   3950     USA
## 39        93    63          34           27.5           10   1695 non-USA
## 40        97    67          37           24.5           11   2475 non-USA
## 41       100    70          39           23.5            8   2865 non-USA
## 42       103    67          36           28.0           12   2350 non-USA
## 43       107    67          41           28.0           14   3040 non-USA
## 44        94    63          35           26.0           11   2345 non-USA
## 45        98    66          36           28.0           12   2620 non-USA
## 46        94    64          34           23.5            9   2285 non-USA
## 47       104    69          41           31.0           14   2885 non-USA
## 48       113    72          42           29.0           15   4000 non-USA
## 49       103    70          40           27.5           14   3510 non-USA
## 50       106    71          39           25.0            9   3515 non-USA
## 51       109    73          42           30.0           19   3695     USA
## 52       117    77          45           31.5           22   4055     USA
## 53        97    66          34           27.0           16   2325 non-USA
## 54        98    66          36           26.5           13   2440 non-USA
## 55       103    69          40           29.5           14   2970 non-USA
## 56       110    72          39           27.5           NA   3735 non-USA
## 57        96    69          37             NA           NA   2895 non-USA
## 58       105    67          34           26.0           12   2920 non-USA
## 59       110    69          37           27.0           15   3525 non-USA
## 60        95    65          36           19.0            6   2450     USA
## 61       113    73          38           28.0           15   3610     USA
## 62        98    67          36           26.0           11   2295 non-USA
## 63       107    70          43           27.5           14   3730 non-USA
## 64        96    66          33           26.0           12   2545 non-USA
## 65       103    67          40           28.5           14   3050 non-USA
## 66       112    74          41           27.0           NA   4100 non-USA
## 67       104    69          41           28.5           14   3200 non-USA
## 68       103    67          39           28.0           14   2910     USA
## 69       105    70          42           28.0           16   2890     USA
## 70       110    74          44           30.5           NA   3715     USA
## 71       111    74          42           31.5           17   3470     USA
## 72        97    67          39           24.5            8   2640     USA
## 73        99    66          35           25.5           17   2350     USA
## 74       101    66          39           25.0           13   2575     USA
## 75       101    75          43           25.0           13   3240     USA
## 76       108    72          41           28.5           16   3450     USA
## 77       111    74          43           30.5           18   3495     USA
## 78        99    67          37           26.5           14   2775 non-USA
## 79       102    68          40           26.5           12   2495     USA
## 80        90    60          32           23.5           10   2045 non-USA
## 81        97    65          35           27.5           15   2490 non-USA
## 82       102    67          37           27.0           14   3085 non-USA
## 83        93    63          34           27.5           10   1965 non-USA
## 84        94    65          36           24.0           11   2055 non-USA
## 85        99    69          39           23.0           13   2950 non-USA
## 86       103    70          38           28.5           15   3030 non-USA
## 87       113    71          41           35.0           NA   3785 non-USA
## 88        93    63          34           26.0           10   2240 non-USA
## 89       115    72          38           34.0           NA   3960 non-USA
## 90       103    67          35           31.5           14   2985 non-USA
## 91        97    66          36           26.0           15   2810 non-USA
## 92       104    67          37           29.5           14   2985 non-USA
## 93       105    69          38           30.0           15   3245 non-USA
##                        Make
## 1             Acura Integra
## 2              Acura Legend
## 3                   Audi 90
## 4                  Audi 100
## 5                  BMW 535i
## 6             Buick Century
## 7             Buick LeSabre
## 8          Buick Roadmaster
## 9             Buick Riviera
## 10         Cadillac DeVille
## 11         Cadillac Seville
## 12       Chevrolet Cavalier
## 13        Chevrolet Corsica
## 14         Chevrolet Camaro
## 15         Chevrolet Lumina
## 16     Chevrolet Lumina_APV
## 17          Chevrolet Astro
## 18        Chevrolet Caprice
## 19       Chevrolet Corvette
## 20        Chrylser Concorde
## 21         Chrysler LeBaron
## 22        Chrysler Imperial
## 23               Dodge Colt
## 24             Dodge Shadow
## 25             Dodge Spirit
## 26            Dodge Caravan
## 27            Dodge Dynasty
## 28            Dodge Stealth
## 29             Eagle Summit
## 30             Eagle Vision
## 31             Ford Festiva
## 32              Ford Escort
## 33               Ford Tempo
## 34             Ford Mustang
## 35               Ford Probe
## 36            Ford Aerostar
## 37              Ford Taurus
## 38      Ford Crown_Victoria
## 39                Geo Metro
## 40                Geo Storm
## 41            Honda Prelude
## 42              Honda Civic
## 43             Honda Accord
## 44            Hyundai Excel
## 45          Hyundai Elantra
## 46           Hyundai Scoupe
## 47           Hyundai Sonata
## 48             Infiniti Q45
## 49              Lexus ES300
## 50              Lexus SC300
## 51      Lincoln Continental
## 52         Lincoln Town_Car
## 53                Mazda 323
## 54            Mazda Protege
## 55                Mazda 626
## 56                Mazda MPV
## 57               Mazda RX-7
## 58       Mercedes-Benz 190E
## 59       Mercedes-Benz 300E
## 60            Mercury Capri
## 61           Mercury Cougar
## 62        Mitsubishi Mirage
## 63      Mitsubishi Diamante
## 64            Nissan Sentra
## 65            Nissan Altima
## 66             Nissan Quest
## 67            Nissan Maxima
## 68       Oldsmobile Achieva
## 69 Oldsmobile Cutlass_Ciera
## 70    Oldsmobile Silhouette
## 71  Oldsmobile Eighty-Eight
## 72           Plymouth Laser
## 73           Pontiac LeMans
## 74          Pontiac Sunbird
## 75         Pontiac Firebird
## 76       Pontiac Grand_Prix
## 77       Pontiac Bonneville
## 78                 Saab 900
## 79                Saturn SL
## 80             Subaru Justy
## 81            Subaru Loyale
## 82            Subaru Legacy
## 83             Suzuki Swift
## 84            Toyota Tercel
## 85            Toyota Celica
## 86             Toyota Camry
## 87            Toyota Previa
## 88           Volkswagen Fox
## 89       Volkswagen Eurovan
## 90        Volkswagen Passat
## 91       Volkswagen Corrado
## 92                Volvo 240
## 93                Volvo 850

Here is an example for logistic regressions. Some of the steps are similar. In this case, the data set is already built and the glm function can build a model directly if there are two classes. There are some warnings, which are cased by some combination of your predictors we observed only successes of failures. That is a quite common phenomenon for logistic regression in small samples or if you have large parameter spaces.

Make sure to put in the link=logit and family=binomial! For the prediction, the model decides what is 1 and 0 based on alphabetical order so in this case, 1 is USA, 0 is non-USA

glmres<-glm(Origin~Min.Price+Max.Price+MPG.city+Type+MPG.highway+AirBags+Length+Wheelbase+Width+Weight+Turn.circle+Luggage.room + Rear.seat.room+Fuel.tank.capacity+Passengers+Man.trans.avail+Horsepower+Cylinders
            +EngineSize+DriveTrain+RPM,data=Cars93, family=binomial(link=logit));
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glmres);
## 
## Call:
## glm(formula = Origin ~ Min.Price + Max.Price + MPG.city + Type + 
##     MPG.highway + AirBags + Length + Wheelbase + Width + Weight + 
##     Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity + 
##     Passengers + Man.trans.avail + Horsepower + Cylinders + EngineSize + 
##     DriveTrain + RPM, family = binomial(link = logit), data = Cars93)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -6.593e-05  -2.100e-08  -2.100e-08   2.100e-08   8.239e-05  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)         6.158e+02  1.311e+06   0.000    1.000
## Min.Price           5.881e+01  1.620e+04   0.004    0.997
## Max.Price          -1.721e+01  8.233e+03  -0.002    0.998
## MPG.city            5.212e-01  3.380e+04   0.000    1.000
## TypeLarge          -9.275e+01  4.497e+05   0.000    1.000
## TypeMidsize        -4.235e+01  2.879e+05   0.000    1.000
## TypeSmall           9.118e+01  1.835e+05   0.000    1.000
## TypeSporty          7.397e+01  2.312e+05   0.000    1.000
## MPG.highway         6.467e+01  5.724e+04   0.001    0.999
## AirBagsDriver only -5.501e+01  2.025e+05   0.000    1.000
## AirBagsNone         1.563e+02  1.515e+05   0.001    0.999
## Length             -2.885e+01  1.828e+04  -0.002    0.999
## Wheelbase          -1.380e+00  3.317e+04   0.000    1.000
## Width              -6.534e+01  1.763e+04  -0.004    0.997
## Weight              8.224e-01  5.387e+02   0.002    0.999
## Turn.circle         2.008e+01  1.033e+04   0.002    0.998
## Luggage.room       -3.425e+01  1.581e+04  -0.002    0.998
## Rear.seat.room      4.925e+00  9.499e+03   0.001    1.000
## Fuel.tank.capacity  1.534e+02  6.997e+04   0.002    0.998
## Passengers          3.109e+02  2.177e+05   0.001    0.999
## Man.trans.availYes  1.066e+02  1.826e+05   0.001    1.000
## Horsepower         -3.041e-01  1.488e+03   0.000    1.000
## Cylinders4          1.382e+02  1.936e+05   0.001    0.999
## Cylinders5         -7.448e+02  6.183e+05  -0.001    0.999
## Cylinders6          1.578e+02  4.012e+05   0.000    1.000
## Cylinders8          4.875e+01  3.313e+05   0.000    1.000
## EngineSize         -4.153e+02  2.473e+05  -0.002    0.999
## DriveTrainFront     2.837e+02  1.058e+05   0.003    0.998
## DriveTrainRear      9.040e+02  4.176e+05   0.002    0.998
## RPM                 7.218e-02  1.087e+02   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.1363e+02  on 81  degrees of freedom
## Residual deviance: 5.1655e-08  on 52  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 60
## 
## Number of Fisher Scoring iterations: 25
step<-stepAIC(glmres,direction='both');
## Start:  AIC=60
## Origin ~ Min.Price + Max.Price + MPG.city + Type + MPG.highway + 
##     AirBags + Length + Wheelbase + Width + Weight + Turn.circle + 
##     Luggage.room + Rear.seat.room + Fuel.tank.capacity + Passengers + 
##     Man.trans.avail + Horsepower + Cylinders + EngineSize + DriveTrain + 
##     RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - Cylinders           4     0.00  52.00
## - Type                4     0.00  52.00
## - AirBags             2     0.00  56.00
## - Wheelbase           1     0.00  58.00
## - MPG.city            1     0.00  58.00
## - Horsepower          1     0.00  58.00
## - Rear.seat.room      1     0.00  58.00
## - Man.trans.avail     1     0.00  58.00
## - RPM                 1     0.00  58.00
## - Length              1     0.00  58.00
## - MPG.highway         1     0.00  58.00
## - EngineSize          1     0.00  58.00
## - Width               1     0.00  58.00
## <none>                      0.00  60.00
## - DriveTrain          2    17.57  73.57
## - Min.Price           1    16.86  74.86
## - Luggage.room        1   576.70 634.70
## - Weight              1   648.79 706.79
## - Turn.circle         1   648.79 706.79
## - Passengers          1   648.79 706.79
## - Max.Price           1   792.96 850.96
## - Fuel.tank.capacity  1   792.96 850.96
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=52
## Origin ~ Min.Price + Max.Price + MPG.city + Type + MPG.highway + 
##     AirBags + Length + Wheelbase + Width + Weight + Turn.circle + 
##     Luggage.room + Rear.seat.room + Fuel.tank.capacity + Passengers + 
##     Man.trans.avail + Horsepower + EngineSize + DriveTrain + 
##     RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - Type                4     0.00  44.00
## - AirBags             2     0.00  48.00
## - Wheelbase           1     0.00  50.00
## - Horsepower          1     0.00  50.00
## - RPM                 1     0.00  50.00
## - Rear.seat.room      1     0.00  50.00
## - Man.trans.avail     1     0.00  50.00
## - MPG.city            1     0.00  50.00
## - Length              1     0.00  50.00
## - Width               1     0.00  50.00
## - MPG.highway         1     0.00  50.00
## - Weight              1     0.00  50.00
## - EngineSize          1     0.00  50.00
## - Fuel.tank.capacity  1     0.00  50.00
## <none>                      0.00  52.00
## + Cylinders           4     0.00  60.00
## - DriveTrain          2    17.73  65.73
## - Min.Price           1    17.00  67.00
## - Turn.circle         1   576.70 626.70
## - Max.Price           1   720.87 770.87
## - Luggage.room        1   720.87 770.87
## - Passengers          1   720.87 770.87
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=44
## Origin ~ Min.Price + Max.Price + MPG.city + MPG.highway + AirBags + 
##     Length + Wheelbase + Width + Weight + Turn.circle + Luggage.room + 
##     Rear.seat.room + Fuel.tank.capacity + Passengers + Man.trans.avail + 
##     Horsepower + EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - AirBags             2     0.00  40.00
## - MPG.city            1     0.00  42.00
## - Wheelbase           1     0.00  42.00
## - Horsepower          1     0.00  42.00
## - Rear.seat.room      1     0.00  42.00
## - Man.trans.avail     1     0.00  42.00
## - RPM                 1     0.00  42.00
## - Passengers          1     0.00  42.00
## - Weight              1     0.00  42.00
## <none>                      0.00  44.00
## + Type                4     0.00  52.00
## + Cylinders           4     0.00  52.00
## - DriveTrain          2    19.46  59.46
## - MPG.highway         1    18.24  60.24
## - EngineSize          1    23.00  65.00
## - Min.Price           1    24.58  66.58
## - Fuel.tank.capacity  1    41.64  83.64
## - Max.Price           1   648.79 690.79
## - Turn.circle         1   648.79 690.79
## - Width               1   792.96 834.96
## - Length              1   865.05 907.05
## - Luggage.room        1   937.13 979.13
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=40
## Origin ~ Min.Price + Max.Price + MPG.city + MPG.highway + Length + 
##     Wheelbase + Width + Weight + Turn.circle + Luggage.room + 
##     Rear.seat.room + Fuel.tank.capacity + Passengers + Man.trans.avail + 
##     Horsepower + EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - MPG.city            1     0.00  38.00
## - Horsepower          1     0.00  38.00
## - Wheelbase           1     0.00  38.00
## - Passengers          1     0.00  38.00
## - Man.trans.avail     1     0.00  38.00
## - Weight              1     0.00  38.00
## - RPM                 1     0.00  38.00
## - Luggage.room        1     0.00  38.00
## <none>                      0.00  40.00
## + AirBags             2     0.00  44.00
## + Type                4     0.00  48.00
## + Cylinders           4     0.00  48.00
## - DriveTrain          2    20.17  56.17
## - Width               1    18.78  56.78
## - MPG.highway         1    20.93  58.93
## - Length              1    25.83  63.83
## - EngineSize          1    25.85  63.85
## - Min.Price           1    29.63  67.63
## - Fuel.tank.capacity  1    43.14  81.14
## - Rear.seat.room      1   576.70 614.70
## - Max.Price           1   648.79 686.79
## - Turn.circle         1   648.79 686.79
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=38
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Wheelbase + 
##     Width + Weight + Turn.circle + Luggage.room + Rear.seat.room + 
##     Fuel.tank.capacity + Passengers + Man.trans.avail + Horsepower + 
##     EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - Horsepower          1     0.00  36.00
## - Man.trans.avail     1     0.00  36.00
## - Passengers          1     0.00  36.00
## - Rear.seat.room      1     0.00  36.00
## <none>                      0.00  38.00
## + MPG.city            1     0.00  40.00
## + AirBags             2     0.00  42.00
## + Type                4     0.00  46.00
## - RPM                 1    12.66  48.66
## - Luggage.room        1    13.62  49.62
## - DriveTrain          2    20.48  54.48
## - Width               1    21.18  57.18
## - Length              1    26.42  62.42
## - EngineSize          1    27.12  63.12
## - Min.Price           1    30.10  66.10
## - MPG.highway         1    33.46  69.46
## - Fuel.tank.capacity  1    43.14  79.14
## - Weight              1   432.52 468.52
## - Wheelbase           1   576.70 612.70
## - Max.Price           1   720.87 756.87
## - Turn.circle         1   792.96 828.96
## + Cylinders           4   792.96 838.96
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=36
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Wheelbase + 
##     Width + Weight + Turn.circle + Luggage.room + Rear.seat.room + 
##     Fuel.tank.capacity + Passengers + Man.trans.avail + EngineSize + 
##     DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - Wheelbase           1    0.000 34.000
## - Man.trans.avail     1    0.000 34.000
## - Rear.seat.room      1    0.000 34.000
## <none>                     0.000 36.000
## + Horsepower          1    0.000 38.000
## + MPG.city            1    0.000 38.000
## + AirBags             2    0.000 40.000
## + Type                4    0.000 44.000
## + Cylinders           4    0.000 44.000
## - Turn.circle         1   12.012 46.012
## - Max.Price           1   14.248 48.248
## - Luggage.room        1   14.310 48.310
## - RPM                 1   18.532 52.532
## - DriveTrain          2   21.664 53.664
## - Passengers          1   20.892 54.892
## - Weight              1   22.410 56.410
## - Width               1   26.542 60.542
## - Length              1   29.571 63.571
## - Min.Price           1   30.102 64.102
## - EngineSize          1   32.639 66.639
## - MPG.highway         1   34.916 68.916
## - Fuel.tank.capacity  1   43.167 77.167
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=34
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Width + 
##     Weight + Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity + 
##     Passengers + Man.trans.avail + EngineSize + DriveTrain + 
##     RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## - Man.trans.avail     1     0.00  32.00
## <none>                      0.00  34.00
## + Wheelbase           1     0.00  36.00
## + AirBags             2     0.00  38.00
## + Type                4     0.00  42.00
## + Cylinders           4     0.00  42.00
## - Turn.circle         1    12.14  44.14
## - Max.Price           1    14.26  46.26
## - Luggage.room        1    16.30  48.30
## - Rear.seat.room      1    17.11  49.11
## - RPM                 1    19.66  51.66
## - Passengers          1    20.97  52.97
## - Weight              1    22.42  54.42
## - DriveTrain          2    24.55  54.55
## - Width               1    29.44  61.44
## - Min.Price           1    30.52  62.52
## - EngineSize          1    34.61  66.61
## - Length              1    35.19  67.19
## - MPG.highway         1    35.40  67.40
## - Fuel.tank.capacity  1    43.23  75.23
## + Horsepower          1   576.70 612.70
## + MPG.city            1   720.87 756.87
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=32
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Width + 
##     Weight + Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity + 
##     Passengers + EngineSize + DriveTrain + RPM
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                      Df Deviance    AIC
## <none>                     0.000 32.000
## + Man.trans.avail     1    0.000 34.000
## + MPG.city            1    0.000 34.000
## + Wheelbase           1    0.000 34.000
## + Horsepower          1    0.000 34.000
## + AirBags             2    0.000 36.000
## + Type                4    0.000 40.000
## + Cylinders           4    0.000 40.000
## - Turn.circle         1   13.111 43.111
## - Max.Price           1   15.657 45.657
## - RPM                 1   22.785 52.785
## - Rear.seat.room      1   23.245 53.245
## - DriveTrain          2   25.495 53.495
## - Passengers          1   24.097 54.097
## - Luggage.room        1   25.286 55.286
## - Width               1   30.549 60.549
## - Min.Price           1   30.572 60.572
## - Weight              1   31.250 61.250
## - Length              1   37.809 67.809
## - MPG.highway         1   38.322 68.322
## - EngineSize          1   40.832 70.832
## - Fuel.tank.capacity  1   45.981 75.981
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Origin ~ Min.Price + Max.Price + MPG.city + Type + MPG.highway + 
##     AirBags + Length + Wheelbase + Width + Weight + Turn.circle + 
##     Luggage.room + Rear.seat.room + Fuel.tank.capacity + Passengers + 
##     Man.trans.avail + Horsepower + Cylinders + EngineSize + DriveTrain + 
##     RPM
## 
## Final Model:
## Origin ~ Min.Price + Max.Price + MPG.highway + Length + Width + 
##     Weight + Turn.circle + Luggage.room + Rear.seat.room + Fuel.tank.capacity + 
##     Passengers + EngineSize + DriveTrain + RPM
## 
## 
##                Step Df     Deviance Resid. Df   Resid. Dev AIC
## 1                                          52 5.165456e-08  60
## 2       - Cylinders  4 1.557887e-09        56 5.009667e-08  52
## 3            - Type  4 2.439268e-08        60 7.448935e-08  44
## 4         - AirBags  2 1.474398e-07        62 2.219291e-07  40
## 5        - MPG.city  1 1.712590e-10        63 2.221004e-07  38
## 6      - Horsepower  1 2.358373e-08        64 2.456841e-07  36
## 7       - Wheelbase  1 1.398773e-08        65 2.596719e-07  34
## 8 - Man.trans.avail  1 9.799847e-07        66 1.239657e-06  32

Variable selection is an important step in modeling building. There are many procedures of variable selection in machine learning and here we just are just getting a start. Stepwise procedure to select variables often does not give good results. Neither are stepwise procedures well justified by theory. I just want you to understand the process of variable selection, and there are many other options, which we wil learn later.

Now, I will select out some variables one at a time. This part is some what of an art. This is by no means optimal, but it sort of shows you how we select variables.

glmres<-glm(Origin ~ Min.Price + Length + Width + Weight +  Fuel.tank.capacity + EngineSize + DriveTrain , 
            data=Cars93, family=binomial(link=logit));
summary(glmres);
## 
## Call:
## glm(formula = Origin ~ Min.Price + Length + Width + Weight + 
##     Fuel.tank.capacity + EngineSize + DriveTrain, family = binomial(link = logit), 
##     data = Cars93)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.21471  -0.32840  -0.00019   0.50030   2.31225  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        54.025850  18.264187   2.958  0.00310 **
## Min.Price           0.414948   0.139692   2.970  0.00297 **
## Length             -0.128320   0.055196  -2.325  0.02008 * 
## Width              -0.819210   0.338757  -2.418  0.01559 * 
## Weight              0.006425   0.002850   2.254  0.02419 * 
## Fuel.tank.capacity  1.005976   0.438570   2.294  0.02180 * 
## EngineSize         -7.954927   2.534468  -3.139  0.00170 **
## DriveTrainFront     3.188878   1.443631   2.209  0.02718 * 
## DriveTrainRear      2.743798   2.518951   1.089  0.27604   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.829  on 92  degrees of freedom
## Residual deviance:  54.842  on 84  degrees of freedom
## AIC: 72.842
## 
## Number of Fisher Scoring iterations: 8
# interpreting coefficient of Length
exp(-0.128320)
## [1] 0.8795719
# interpreting cofficient of Width
exp(-0.819210)
## [1] 0.4407797
# now trying fuel tank
exp(1.005976)
## [1] 2.734575

Because both Length and Width are < 1, larger length and width means it’s more likely to be a US car.

Since fuel tank > 1, bigger fuel tank means it’s more likely to be non-US car

So if the expondent < 1, it is more likely to be the category which is alphabetically later (US) If the exponsent > 1, it is morelikely to be the category which is first alphabetically (non-US)

US cars are larger in Length and Width, and Engine Size, and more likely 4WD. US cars also have smaller fuel tank and lower min price.

View(Cars93)

Convert the categorical variables as factors.

as.factor(Cars93$origin)
## factor(0)
## Levels:
View(Cars93)
# looking at factors in categorical variables
levels(Cars93$DriveTrain)
## [1] "4WD"   "Front" "Rear"
levels(Cars93$AirBags)
## [1] "Driver & Passenger" "Driver only"        "None"

Run vif to determine if there are problems with multicollinearity.

vif(glmres)
##                         GVIF Df GVIF^(1/(2*Df))
## Min.Price           5.598913  1        2.366202
## Length              3.510262  1        1.873569
## Width               7.141993  1        2.672451
## Weight             19.650156  1        4.432850
## Fuel.tank.capacity 12.070956  1        3.474328
## EngineSize         20.175956  1        4.491765
## DriveTrain          2.764046  2        1.289396

We should remove any variables with VIF greater then 10. So we should remove Weight, Fuel tank capacity, and Engine Size.

Often to assess model accuracy, ROC AUC is used. The higher the ROC AUC, the better. ROC AUC at 0.5 when model has no precitive power, while at ROC AUC near 1 the model is very accurate. Typical good models are in the range of 0.8 to 0.9.

library(pROC);
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
rocobj<-roc(Cars93$Origin, glmres$fitted.values);
## Setting levels: control = USA, case = non-USA
## Setting direction: controls < cases
auc(rocobj);
## Area under the curve: 0.9458
coords(rocobj, "best", "threshold")$threshold
## [1] 0.5815575

Ideally, we need to segment the data and use train, validation and test data sets and ROC AUC should be assessed on test data set. Here, I just show how ROC AUC is done. When we are more familiar with data subsetting, we will come back on this again.

plot.roc(Cars93$Origin, glmres$fitted.values,
main="Confidence interval of a threshold", percent=TRUE,
ci=TRUE, of="thresholds", # compute AUC (of threshold)
thresholds="best", # select the (best) threshold
print.thres="best") # also highlight this threshold on the plot
## Setting levels: control = USA, case = non-USA
## Setting direction: controls < cases

what am i looking at on this graph? the specificity and the sensitivity are both high

but really just look at the auc value if it is 1 it’s perfect if it’s 0.5 you still need to work on the model

rocobj<-roc(Cars93$Origin, glmres$fitted.values);
## Setting levels: control = USA, case = non-USA
## Setting direction: controls < cases
rocobj
## 
## Call:
## roc.default(response = Cars93$Origin, predictor = glmres$fitted.values)
## 
## Data: glmres$fitted.values in 48 controls (Cars93$Origin USA) < 45 cases (Cars93$Origin non-USA).
## Area under the curve: 0.9458

ANOVA with 2 Groups

We are comparing means between two groups: we do this 3 ways, the horrible manual way, the anova way, the Tukey’s way.

Analyzing a small table on two groups.

First read the data. (For convenience, I copied all data used in this homework assignment into an Excel file.)

Comparing means between two groups using the horrible manual way

Avoid this if possible– see next two problems for examples of getting the same answer using aov and Tukey’s

Calculate the group means and global mean, calculate MSA and MSW, and F. Use qf(0.95, df1=?, df2=?) to get the critical value for F. Is the difference significant?

Calculating df1 and df2: df1: # of citys/ groups - 1 df2: # of rows - # of citys/ groups

Question 2. Analyzing a small table on two groups.

First read the data. (For convenience, I copied all data used in this homework assignment into an Excel file.)
* Note: you only need the Excel file for the horrible manual way, aov and Tukey’s will work without it * If you want the horrible manual way to work, put the Excel file HW1_Data.xlsx in the same folder as this R markdown file.

Calculate the group means

library(readxl)
df <- read_excel("HW1_Data.xlsx")
df
## # A tibble: 6 × 2
##   Group Value
##   <dbl> <dbl>
## 1     1   1.3
## 2     1   2.2
## 3     2   3.7
## 4     2   2.8
## 5     2   3.2
## 6     2   2.7
aggregate(df$Value, list(df$Group), FUN=mean)
##   Group.1    x
## 1       1 1.75
## 2       2 3.10

add to our dataframe

df$GroupMean = c(1.75, 1.75, 3.10, 3.10, 3.10, 3.10)
df
## # A tibble: 6 × 3
##   Group Value GroupMean
##   <dbl> <dbl>     <dbl>
## 1     1   1.3      1.75
## 2     1   2.2      1.75
## 3     2   3.7      3.1 
## 4     2   2.8      3.1 
## 5     2   3.2      3.1 
## 6     2   2.7      3.1

Now calculate global mean

global_mean = mean(df$Value)
global_mean
## [1] 2.65

Now start calculating SSW

# first the diff between each value and its group mean
df$diff = df$Value - df$GroupMean
df
## # A tibble: 6 × 4
##   Group Value GroupMean   diff
##   <dbl> <dbl>     <dbl>  <dbl>
## 1     1   1.3      1.75 -0.45 
## 2     1   2.2      1.75  0.45 
## 3     2   3.7      3.1   0.6  
## 4     2   2.8      3.1  -0.300
## 5     2   3.2      3.1   0.100
## 6     2   2.7      3.1  -0.4
# now square each diff
df$diffsquared = df$diff^2
df
## # A tibble: 6 × 5
##   Group Value GroupMean   diff diffsquared
##   <dbl> <dbl>     <dbl>  <dbl>       <dbl>
## 1     1   1.3      1.75 -0.45       0.202 
## 2     1   2.2      1.75  0.45       0.203 
## 3     2   3.7      3.1   0.6        0.36  
## 4     2   2.8      3.1  -0.300      0.0900
## 5     2   3.2      3.1   0.100      0.0100
## 6     2   2.7      3.1  -0.4        0.16

finally calculate SSW

SSW = sum(df$diffsquared)
SSW
## [1] 1.025

now start calculating SSA

# grab the values we need into a new df and add the global mean
for_ssa = df[ , c('Group', 'Value')]
for_ssa$Mean = global_mean
for_ssa
## # A tibble: 6 × 3
##   Group Value  Mean
##   <dbl> <dbl> <dbl>
## 1     1   1.3  2.65
## 2     1   2.2  2.65
## 3     2   3.7  2.65
## 4     2   2.8  2.65
## 5     2   3.2  2.65
## 6     2   2.7  2.65
# calculate each diff and square it
for_ssa$diff_squared = (for_ssa$Value - for_ssa$Mean)^2
for_ssa
## # A tibble: 6 × 4
##   Group Value  Mean diff_squared
##   <dbl> <dbl> <dbl>        <dbl>
## 1     1   1.3  2.65      1.82   
## 2     1   2.2  2.65      0.202  
## 3     2   3.7  2.65      1.10   
## 4     2   2.8  2.65      0.0225 
## 5     2   3.2  2.65      0.303  
## 6     2   2.7  2.65      0.00250
# now calculate SSA
SSA = sum(for_ssa$diff_squared)
SSA
## [1] 3.455

Now calculate MSA and MSW

# MSA
n = 6
group1_n = 2

MSA = SSA/(n - group1_n)
MSA
## [1] 0.86375
# MSW

MSW = SSW/(2-1)
MSW
## [1] 1.025

Now calculate F

F = MSA/MSW
F
## [1] 0.8426829

Use qf(0.95, df1=?, df2=?) to get the critical value for F. Is the difference significant?

df1=2-1; #Degree of freedom for Mean Square Among =  of groups - 1
df2=6-2; #Degree of freedom for Mean Square Within =  of rows - # of groups
qf(0.95, df1, df2)
## [1] 7.708647

So our f value is less then the critical value of 7.7, which means this is signficant.

The F test is significant!

Now that we’ve done the anova manually, here’s how you do it automatically with AOV. Do this instead unless she forces you to do the horrible manual way!!!

dataq2 = as.data.frame(list(c(1.3,2.2,3.7,2.8,3.2,2.7), c("Group 1","Group 1","Group 2","Group 2","Group 2","Group 2")));
colnames(dataq2)=c("Value", "Group");

dataq2
##   Value   Group
## 1   1.3 Group 1
## 2   2.2 Group 1
## 3   3.7 Group 2
## 4   2.8 Group 2
## 5   3.2 Group 2
## 6   2.7 Group 2
print("column names are: ")
## [1] "column names are: "
colnames(dataq2);
## [1] "Value" "Group"
print("column 1 is")
## [1] "column 1 is"
dataq2[,1];
## [1] 1.3 2.2 3.7 2.8 3.2 2.7
print("column 2 is")
## [1] "column 2 is"
dataq2[,2];
## [1] "Group 1" "Group 1" "Group 2" "Group 2" "Group 2" "Group 2"
model<-aov(Value~Group, dataq2);
summary(model);
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Group        1  2.430  2.4300   9.483  0.037 *
## Residuals    4  1.025  0.2562                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is less then 5%, this is significant! Same answer as above but with much less work!

Doing Tukey’s on the same data set:

Again,avoid the horrible manual way if at all possible!!

TukeyHSD(model, conf.level = .95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Value ~ Group, data = dataq2)
## 
## $Group
##                 diff       lwr      upr     p adj
## Group 2-Group 1 1.35 0.1328235 2.567176 0.0369525

Interpretation: Because the pvalue is less the 5%, this is significant! We have statistically significant data to say that the groups are different.

Problem: chi square with 3 groups

datamatrix = matrix(c(470,3500,430,3300,150,1000),nrow=3, ncol=2, byrow=TRUE, dimnames=list(c('email v1', 'email v2', 'control'), c("clicked", "not clicked")))

datamatrix
##          clicked not clicked
## email v1     470        3500
## email v2     430        3300
## control      150        1000
chisq.test(datamatrix)
## 
##  Pearson's Chi-squared test
## 
## data:  datamatrix
## X-squared = 1.9347, df = 2, p-value = 0.3801
# computing the new p-value with the bonferoni adjustment
# we need the bonferoni test because we have more then 2 categories
bon = choose(3,2) # 3 because we have 3 rows, 2 because it's binary
bon
## [1] 3
new_significance = 0.05/bon
new_significance
## [1] 0.01666667

Since the p-value of 0.3801 is higer then our new significance of 0.016666 this is not significant We cannot prove that there is a statistically significant differcen between the categories.

Part 2: warning, the questions below are less well edited and more confusing, you’re probably better off with the questions in part 1

3. ANOVA The following code is Poison data set we saw in class:

library(dplyr); 
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv" 
df <- read.csv(PATH) %>% dplyr::select(-X) %>% mutate(poison = factor(poison, ordered = TRUE));
glimpse(df);
## Rows: 48
## Columns: 3
## $ time   <dbl> 0.31, 0.45, 0.46, 0.43, 0.36, 0.29, 0.40, 0.23, 0.22, 0.21, 0.1…
## $ poison <ord> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, …
## $ treat  <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B"…
library(car); # Levene's test with one independent variable 
leveneTest(time ~ poison, data = df);  # p-value > 0.05 means variances are not significantly different
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  2  4.1964 0.02133 *
##       45                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(df[which(df$poison==1),]$time); 
## 
##  Shapiro-Wilk normality test
## 
## data:  df[which(df$poison == 1), ]$time
## W = 0.93235, p-value = 0.2656
shapiro.test(df[which(df$poison==2),]$time); 
## 
##  Shapiro-Wilk normality test
## 
## data:  df[which(df$poison == 2), ]$time
## W = 0.85254, p-value = 0.01485
shapiro.test(df[which(df$poison==3),]$time); 
## 
##  Shapiro-Wilk normality test
## 
## data:  df[which(df$poison == 3), ]$time
## W = 0.93596, p-value = 0.3024
# p-value > 0.05 means distribution is not significantly different from a normal distribution

Conclusion: for this sample, homogeneity of variance is not satified and not all distributions are normal.

Use the R code below to generate a matrix.

mydata<-matrix(c(50,950,40,960),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c(“treat”,“control”),c(“clicked”,“nonlicked”)))

Do a Chi Squared Test (with and without Yate’s Continuity Correction) and Fisher exact test on it. In addition, use simulate.p.value option with these tests. List out the results.

Is it significant?

If it is not significant, pick one test that you think is appropriate. Use that test to determine how much better the Treatment cell has to be (more clicks and fewer non clicks) without increasing the overall sample size, to get a significant test result? Write program in R to show the p-value as the Treatment sample improves. Make a plot of p-values vs sample clicks.

If the click through rates for the Treatment and Control cells maintain the same, what is the minimum sample size in order to get a significant test? Write program in R to show the p-value as the sample size changes. Make a plot of p-values vs sample size.

mydata<-matrix(c(50,950,40,960),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c("treat","control"),c("clicked","nonlicked")));

chisq.test(mydata);
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata
## X-squared = 0.94241, df = 1, p-value = 0.3317
mydata
##         clicked nonlicked
## treat        50       950
## control      40       960

Conclusion: Because the p-value is over 5%, the difference between groups is not significant.

The chi square test is not significant.

How high the treatment cell conversion rate has to be in order to get a significant test using the same sample size?

addclick<-1:100; xvalue<-seq(1:100); yvalue<-rep(1,100); #We have to initiate the variables to hold the x-values and y-values; 
print("This is for test a higer conversion rate:");
## [1] "This is for test a higer conversion rate:"
istop=1;
for (i in addclick) {
          mydata<-matrix(c(50+i,950-i,40,960),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c("treat","control"),c("clicked","nonclicked"))); 
          cst<-chisq.test(mydata);         
          print(mydata); print(cst$p.value);  
          istop=i;
          xvalue[i]<-50+i;         
          yvalue[i]<-cst$p.value; 
          if (cst$p.value < 0.05) {break;}
}
##         clicked nonclicked
## treat        51        949
## control      40        960
## [1] 0.2832804
##         clicked nonclicked
## treat        52        948
## control      40        960
## [1] 0.2403339
##         clicked nonclicked
## treat        53        947
## control      40        960
## [1] 0.2025491
##         clicked nonclicked
## treat        54        946
## control      40        960
## [1] 0.1695927
##         clicked nonclicked
## treat        55        945
## control      40        960
## [1] 0.1410887
##         clicked nonclicked
## treat        56        944
## control      40        960
## [1] 0.116636
##         clicked nonclicked
## treat        57        943
## control      40        960
## [1] 0.09582445
##         clicked nonclicked
## treat        58        942
## control      40        960
## [1] 0.07824783
##         clicked nonclicked
## treat        59        941
## control      40        960
## [1] 0.06351396
##         clicked nonclicked
## treat        60        940
## control      40        960
## [1] 0.05125258
##         clicked nonclicked
## treat        61        939
## control      40        960
## [1] 0.04112072
plot(xvalue[1:istop], yvalue[1:istop], xlab = "Number of Clicks Needed");

If the treatment conversion rates stay the same, how larger the samples have to be to make a significant test.

print("This is for test sample size:");
## [1] "This is for test sample size:"
addclick<-1:100; xvalue<-seq(1:100); yvalue<-rep(1,100);
istop=1;
for (i in addclick) {
          mydata<-matrix(c(round(50*(1+i/10.0)),round(950*(1+i/10.0)),round(40*(1+i/10.0)),                    round(960*(1+i/10.0))),nrow=2,ncol=2,byrow=TRUE, dimnames = list(c("treat","control"),c("clicked","nonclicked")));
          cst<-chisq.test(mydata);
          print(mydata); print(cst$p.value); 
          istop=i;
          xvalue[i]<-round(1000*(1+i/10.0));
          yvalue[i]<-cst$p.value
          if (cst$p.value < 0.05) {break;}
};
##         clicked nonclicked
## treat        55       1045
## control      44       1056
## [1] 0.3037409
##         clicked nonclicked
## treat        60       1140
## control      48       1152
## [1] 0.2787523
##         clicked nonclicked
## treat        65       1235
## control      52       1248
## [1] 0.2562759
##         clicked nonclicked
## treat        70       1330
## control      56       1344
## [1] 0.2359764
##         clicked nonclicked
## treat        75       1425
## control      60       1440
## [1] 0.2175796
##         clicked nonclicked
## treat        80       1520
## control      64       1536
## [1] 0.2008579
##         clicked nonclicked
## treat        85       1615
## control      68       1632
## [1] 0.1856199
##         clicked nonclicked
## treat        90       1710
## control      72       1728
## [1] 0.171703
##         clicked nonclicked
## treat        95       1805
## control      76       1824
## [1] 0.1589678
##         clicked nonclicked
## treat       100       1900
## control      80       1920
## [1] 0.1472935
##         clicked nonclicked
## treat       105       1995
## control      84       2016
## [1] 0.1365752
##         clicked nonclicked
## treat       110       2090
## control      88       2112
## [1] 0.1267208
##         clicked nonclicked
## treat       115       2185
## control      92       2208
## [1] 0.1176493
##         clicked nonclicked
## treat       120       2280
## control      96       2304
## [1] 0.1092888
##         clicked nonclicked
## treat       125       2375
## control     100       2400
## [1] 0.1015756
##         clicked nonclicked
## treat       130       2470
## control     104       2496
## [1] 0.09445288
##         clicked nonclicked
## treat       135       2565
## control     108       2592
## [1] 0.08786954
##         clicked nonclicked
## treat       140       2660
## control     112       2688
## [1] 0.08177985
##         clicked nonclicked
## treat       145       2755
## control     116       2784
## [1] 0.07614258
##         clicked nonclicked
## treat       150       2850
## control     120       2880
## [1] 0.07092049
##         clicked nonclicked
## treat       155       2945
## control     124       2976
## [1] 0.06607989
##         clicked nonclicked
## treat       160       3040
## control     128       3072
## [1] 0.06159021
##         clicked nonclicked
## treat       165       3135
## control     132       3168
## [1] 0.05742367
##         clicked nonclicked
## treat       170       3230
## control     136       3264
## [1] 0.05355497
##         clicked nonclicked
## treat       175       3325
## control     140       3360
## [1] 0.04996104
plot(xvalue[1:istop], yvalue[1:istop],xlab = "Sample Size Needed");

Conclusion: we need treatment cell to have 61 clicks out of 1000 to get a singificant test; if we don’t change the conversion rates, we need a sample of 3500 out of which 175 will be clicks and 3325 non clicks to get a significant test.