Recipe 7: Analysis of Variance with Resampling on USDA National Nutrient Database using Monte Carlo Simulation

Design of Experiments

ISYE 6020

Trevor Manzanares

Rensselaer Polytechnic Institute

11/6/14

Dataset under analysis includes data on Release 27 of the USDA National Nutrient Database for Standard Reference. There are 8,618 different foods included, but for the Analysis of Variance, this list will be truncated to 25 levels based on the first letter of the food name. Note, there was no food name starting with the letter “X”.

remove(list=ls())
ag = read.csv("C:/Users/Trevor/Documents/agriculture.csv", header=TRUE) 

#view first few lines
head(ag)

##   NDB_No                        Shrt_Desc Short_Description Water_.g.
## 1  15155        ABALONE,MIXED SPECIES,RAW                 A     74.56
## 2  15156         ABALONE,MXD SP,CKD,FRIED                 A     60.10
## 3   9427                      ABIYUCH,RAW                 A     79.90
## 4   9002                ACEROLA JUICE,RAW                 A     94.30
## 5   9001 ACEROLA,(WEST INDIAN CHERRY),RAW                 A     91.41
## 6  12060             ACORN FLOUR,FULL FAT                 A      6.00
##   Energ_Kcal Protein Lipid_Tot_.g. Ash_.g. Carbohydrt_.g. Fiber_TD_.g.
## 1        105   17.10          0.76    1.57           6.01          0.0
## 2        189   19.63          6.78    1.77          11.05          0.0
## 3         69    1.50          0.10    0.90          17.60          5.3
## 4         23    0.40          0.30    0.20           4.80          0.3
## 5         32    0.40          0.30    0.20           7.69          1.1
## 6        501    7.49         30.17    1.69          54.65           NA
##   Sugar_Tot_.g. Calcium_.mg. Iron_.mg. Magnesium_.mg. Phosphorus_.mg.
## 1          0.00           31      3.19             48             190
## 2            NA           37      3.80             56             217
## 3          8.55            8      1.61             24              47
## 4          4.50           10      0.50             12               9
## 5            NA           12      0.20             18              11
## 6            NA           43      1.21            110             103
##   Potassium_.mg. Sodium_.mg. Zinc_.mg. Copper_mg. Manganese_.mg.
## 1            250         301      0.82      0.196          0.040
## 2            284         591      0.95      0.228          0.070
## 3            304          20      0.31      0.057          0.182
## 4             97           3      0.10      0.086             NA
## 5            146           7      0.10      0.086             NA
## 6            712           0      0.64      0.611          1.743
##   Selenium_.µg. Vit_C_.mg. Thiamin_.mg. Riboflavin_.mg. Niacin_.mg.
## 1          44.8        2.0        0.190           0.100       1.500
## 2          51.8        1.8        0.220           0.130       1.900
## 3            NA       54.1           NA              NA          NA
## 4           0.1     1600.0        0.020           0.060       0.400
## 5           0.6     1677.6        0.020           0.060       0.400
## 6            NA        0.0        0.146           0.154       2.382
##   Panto_Acid_mg. Vit_B6_.mg. Folate_Tot_.µg. Folic_Acid_.µg.
## 1          3.000       0.150               5               0
## 2          2.870       0.150              14               9
## 3             NA          NA              NA              NA
## 4          0.205       0.004              14               0
## 5          0.309       0.009              14               0
## 6          0.931       0.688             114               0
##   Food_Folate_.µg. Folate_DFE_.µg. Choline_Tot_..mg. Vit_B12_.µg. Vit_A_IU
## 1                5               5                65         0.73        7
## 2                5              20                NA         0.69        5
## 3               NA              NA                NA           NA      100
## 4               14              14                NA         0.00      509
## 5               14              14                NA         0.00      767
## 6              114             114                NA         0.00       51
##   Vit_A_RAE Retinol_.µg. Alpha_Carot_.µg. Beta_Carot_.µg. Beta_Crypt_.µg.
## 1         2            2                0               0               0
## 2         2            2               NA              NA              NA
## 3         5            0                0              60              NA
## 4        25            0                0             305               0
## 5        38            0               NA              NA              NA
## 6         3            0               NA              NA              NA
##   Lycopene_.µg. Lut.Zea_..µg. Vit_E_.mg. Vit_D_µg Vit_D_IU Vit_K_.µg.
## 1             0             0       4.00        0        0       23.0
## 2            NA            NA         NA       NA       NA         NA
## 3            NA            NA         NA       NA       NA         NA
## 4             0            17       0.18       NA       NA        1.4
## 5            NA            NA         NA       NA       NA         NA
## 6            NA            NA         NA       NA       NA         NA
##   FA_Sat_.g. FA_Mono_.g. FA_Poly_.g. Cholestrl_.mg. GmWt_1 GmWt_Desc1
## 1      0.149       0.107       0.104             85     85       3 oz
## 2      1.646       2.741       1.676             94     85       3 oz
## 3      0.014          NA          NA             NA    114     .5 cup
## 4      0.068       0.082       0.090              0    242      1 cup
## 5      0.068       0.082       0.090              0     98      1 cup
## 6      3.923      19.110       5.813              0     28       1 oz
##   GmWt_2              GmWt_Desc2 Refuse_Pct
## 1     NA                                  0
## 2     NA                                  0
## 3     NA                                 NA
## 4   30.2                 1 fl oz          0
## 5    4.8 1 fruit, without refuse         20
## 6     NA                                  0

#observe the structure of the data
str(ag)

## 'data.frame':    8600 obs. of  54 variables:
##  $ NDB_No           : int  15155 15156 9427 9002 9001 12060 35182 12059 12058 35193 ...
##  $ Shrt_Desc        : Factor w/ 8595 levels "ABALONE,MIXED SPECIES,RAW",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Short_Description: Factor w/ 25 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Water_.g.        : num  74.6 60.1 79.9 94.3 91.4 ...
##  $ Energ_Kcal       : int  105 189 69 23 32 501 95 509 387 135 ...
##  $ Protein          : num  17.1 19.6 1.5 0.4 0.4 ...
##  $ Lipid_Tot_.g.    : num  0.76 6.78 0.1 0.3 0.3 ...
##  $ Ash_.g.          : num  1.57 1.77 0.9 0.2 0.2 1.69 0.72 1.78 1.35 1.32 ...
##  $ Carbohydrt_.g.   : num  6.01 11.05 17.6 4.8 7.69 ...
##  $ Fiber_TD_.g.     : num  0 0 5.3 0.3 1.1 NA 0.7 NA NA 10.6 ...
##  $ Sugar_Tot_.g.    : num  0 NA 8.55 4.5 NA ...
##  $ Calcium_.mg.     : int  31 37 8 10 12 43 14 54 41 460 ...
##  $ Iron_.mg.        : num  3.19 3.8 1.61 0.5 0.2 1.21 1 1.04 0.79 3.55 ...
##  $ Magnesium_.mg.   : int  48 56 24 12 18 110 12 82 62 39 ...
##  $ Phosphorus_.mg.  : int  190 217 47 9 11 103 62 103 79 9 ...
##  $ Potassium_.mg.   : int  250 284 304 97 146 712 110 709 539 59 ...
##  $ Sodium_.mg.      : int  301 591 20 3 7 0 130 0 0 13 ...
##  $ Zinc_.mg.        : num  0.82 0.95 0.31 0.1 0.1 0.64 1.6 0.67 0.51 0.25 ...
##  $ Copper_mg.       : num  0.196 0.228 0.057 0.086 0.086 0.611 0.03 0.818 0.621 0.112 ...
##  $ Manganese_.mg.   : num  0.04 0.07 0.182 NA NA ...
##  $ Selenium_.µg.    : num  44.8 51.8 NA 0.1 0.6 NA 8.3 NA NA 0.2 ...
##  $ Vit_C_.mg.       : num  2 1.8 54.1 1600 1677.6 ...
##  $ Thiamin_.mg.     : num  0.19 0.22 NA 0.02 0.02 0.146 0.175 0.149 0.112 0.012 ...
##  $ Riboflavin_.mg.  : num  0.1 0.13 NA 0.06 0.06 0.154 0.125 0.154 0.118 0.099 ...
##  $ Niacin_.mg.      : num  1.5 1.9 NA 0.4 0.4 ...
##  $ Panto_Acid_mg.   : num  3 2.87 NA 0.205 0.309 0.931 0.212 0.94 0.715 0.041 ...
##  $ Vit_B6_.mg.      : num  0.15 0.15 NA 0.004 0.009 0.688 0.055 0.695 0.528 0.087 ...
##  $ Folate_Tot_.µg.  : int  5 14 NA 14 14 114 33 115 87 3 ...
##  $ Folic_Acid_.µg.  : int  0 9 NA 0 0 0 15 0 0 0 ...
##  $ Food_Folate_.µg. : int  5 5 NA 14 14 114 18 115 87 3 ...
##  $ Folate_DFE_.µg.  : int  5 20 NA 14 14 114 44 115 87 3 ...
##  $ Choline_Tot_..mg.: num  65 NA NA NA NA NA NA NA NA 8.8 ...
##  $ Vit_B12_.µg.     : num  0.73 0.69 NA 0 0 0 0.68 0 0 0 ...
##  $ Vit_A_IU         : int  7 5 100 509 767 51 0 0 39 113 ...
##  $ Vit_A_RAE        : int  2 2 5 25 38 3 0 0 2 6 ...
##  $ Retinol_.µg.     : int  2 2 0 0 0 0 0 0 0 0 ...
##  $ Alpha_Carot_.µg. : int  0 NA 0 0 NA NA NA NA NA 0 ...
##  $ Beta_Carot_.µg.  : int  0 NA 60 305 NA NA NA NA NA 68 ...
##  $ Beta_Crypt_.µg.  : int  0 NA NA 0 NA NA NA NA NA 0 ...
##  $ Lycopene_.µg.    : int  0 NA NA 0 NA NA NA NA NA 0 ...
##  $ Lut.Zea_..µg.    : int  0 NA NA 17 NA NA NA NA NA NA ...
##  $ Vit_E_.mg.       : num  4 NA NA 0.18 NA NA 0.3 NA NA 0.36 ...
##  $ Vit_D_µg         : num  0 NA NA NA NA NA NA NA NA 0 ...
##  $ Vit_D_IU         : int  0 NA NA NA NA NA NA NA NA 0 ...
##  $ Vit_K_.µg.       : num  23 NA NA 1.4 NA NA 0 NA NA 4.9 ...
##  $ FA_Sat_.g.       : num  0.149 1.646 0.014 0.068 0.068 ...
##  $ FA_Mono_.g.      : num  0.107 2.741 NA 0.082 0.082 ...
##  $ FA_Poly_.g.      : num  0.104 1.676 NA 0.09 0.09 ...
##  $ Cholestrl_.mg.   : int  85 94 NA 0 0 0 20 0 0 0 ...
##  $ GmWt_1           : int  85 85 114 242 98 28 NA 28 28 NA ...
##  $ GmWt_Desc1       : Factor w/ 902 levels "",".083 package",..: 825 825 35 158 158 370 1 370 370 1 ...
##  $ GmWt_2           : num  NA NA NA 30.2 4.8 NA NA NA NA NA ...
##  $ GmWt_Desc2       : Factor w/ 911 levels "",".083 package",..: 1 1 1 314 331 1 1 1 1 1 ...
##  $ Refuse_Pct       : int  0 0 NA 0 20 0 0 38 38 0 ...

#8618 observations of 54 variables
attach(ag)

The data are tabluated into 54 columns, with detailed information on food type, related nutional information for each type, and common household measures. This project will test the effects of the factor “Short_Description” (25 levels) on the response, “Protein”, measured in grams for each serving size. Summary statistics are as follows:

summary(ag)

##      NDB_No     
##  Min.   : 1001  
##  1st Qu.: 8692  
##  Median :14304  
##  Mean   :15473  
##  3rd Qu.:20056  
##  Max.   :93600  
##                 
##                                                         Shrt_Desc   
##  BABYFOOD,MEAT,BF,STR                                        :   2  
##  BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA:   2  
##  OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT    :   2  
##  POPCORN,OIL-POPPED,LOFAT                                    :   2  
##  ZWIEBACK                                                    :   2  
##  ABALONE,MIXED SPECIES,RAW                                   :   1  
##  (Other)                                                     :8589  
##  Short_Description   Water_.g.       Energ_Kcal     Protein     
##  B      :1649      Min.   :  0.0   Min.   :  0   Min.   : 0.00  
##  C      :1601      1st Qu.: 29.9   1st Qu.: 93   1st Qu.: 2.49  
##  P      : 969      Median : 63.2   Median :191   Median : 8.30  
##  S      : 867      Mean   : 54.1   Mean   :227   Mean   :11.54  
##  L      : 408      3rd Qu.: 77.5   3rd Qu.:336   3rd Qu.:19.99  
##  M      : 407      Max.   :100.0   Max.   :902   Max.   :88.32  
##  (Other):2699      NA's   :6                                    
##  Lipid_Tot_.g.       Ash_.g.     Carbohydrt_.g.    Fiber_TD_.g. 
##  Min.   :  0.00   Min.   : 0.0   Min.   :  0.00   Min.   : 0.0  
##  1st Qu.:  1.00   1st Qu.: 0.9   1st Qu.:  0.04   1st Qu.: 0.0  
##  Median :  5.28   Median : 1.2   Median :  8.98   Median : 0.7  
##  Mean   : 10.66   Mean   : 1.8   Mean   : 21.85   Mean   : 2.2  
##  3rd Qu.: 13.94   3rd Qu.: 2.1   3rd Qu.: 33.11   3rd Qu.: 2.5  
##  Max.   :100.00   Max.   :99.8   Max.   :100.00   Max.   :79.0  
##                   NA's   :332                     NA's   :643   
##  Sugar_Tot_.g.   Calcium_.mg.    Iron_.mg.      Magnesium_.mg. 
##  Min.   : 0.0   Min.   :   0   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 0.0   1st Qu.:  10   1st Qu.:  0.57   1st Qu.: 13.0  
##  Median : 1.7   Median :  21   Median :  1.38   Median : 21.0  
##  Mean   : 8.5   Mean   :  76   Mean   :  2.74   Mean   : 35.5  
##  3rd Qu.: 9.0   3rd Qu.:  67   3rd Qu.:  2.60   3rd Qu.: 30.5  
##  Max.   :99.8   Max.   :7364   Max.   :123.60   Max.   :781.0  
##  NA's   :1925   NA's   :341    NA's   :134      NA's   :669    
##  Phosphorus_.mg. Potassium_.mg.   Sodium_.mg.      Zinc_.mg.   
##  Min.   :   0    Min.   :    0   Min.   :    0   Min.   : 0.0  
##  1st Qu.:  53    1st Qu.:  130   1st Qu.:   42   1st Qu.: 0.4  
##  Median : 149    Median :  232   Median :   90   Median : 1.0  
##  Mean   : 167    Mean   :  283   Mean   :  321   Mean   : 2.1  
##  3rd Qu.: 221    3rd Qu.:  337   3rd Qu.:  416   3rd Qu.: 3.0  
##  Max.   :9918    Max.   :16500   Max.   :38758   Max.   :91.0  
##  NA's   :559     NA's   :397     NA's   :70      NA's   :688   
##    Copper_mg.   Manganese_.mg.  Selenium_.µg.      Vit_C_.mg.    
##  Min.   : 0.0   Min.   :  0.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 0.1   1st Qu.:  0.0   1st Qu.:   1.5   1st Qu.:   0.0  
##  Median : 0.1   Median :  0.1   Median :  10.5   Median :   0.0  
##  Mean   : 0.2   Mean   :  0.7   Mean   :  15.8   Mean   :   8.7  
##  3rd Qu.: 0.2   3rd Qu.:  0.3   3rd Qu.:  24.5   3rd Qu.:   3.5  
##  Max.   :15.1   Max.   :328.0   Max.   :1917.0   Max.   :2400.0  
##  NA's   :1242   NA's   :2128    NA's   :1738     NA's   :780     
##   Thiamin_.mg.  Riboflavin_.mg.  Niacin_.mg.    Panto_Acid_mg.
##  Min.   : 0.0   Min.   : 0.0    Min.   :  0.0   Min.   : 0.0  
##  1st Qu.: 0.0   1st Qu.: 0.1    1st Qu.:  0.6   1st Qu.: 0.2  
##  Median : 0.1   Median : 0.2    Median :  2.6   Median : 0.4  
##  Mean   : 0.2   Mean   : 0.3    Mean   :  3.7   Mean   : 0.7  
##  3rd Qu.: 0.2   3rd Qu.: 0.3    3rd Qu.:  5.3   3rd Qu.: 0.7  
##  Max.   :23.4   Max.   :17.5    Max.   :127.5   Max.   :34.5  
##  NA's   :667    NA's   :645     NA's   :669     NA's   :2145  
##   Vit_B6_.mg.   Folate_Tot_.µg. Folic_Acid_.µg.  Food_Folate_.µg.
##  Min.   : 0.0   Min.   :   0    Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 0.1   1st Qu.:   5    1st Qu.:   0.0   1st Qu.:   4.0  
##  Median : 0.2   Median :  12    Median :   0.0   Median :   9.0  
##  Mean   : 0.3   Mean   :  53    Mean   :  23.8   Mean   :  24.9  
##  3rd Qu.: 0.4   3rd Qu.:  46    3rd Qu.:   0.0   3rd Qu.:  21.0  
##  Max.   :12.0   Max.   :3786    Max.   :2993.0   Max.   :2340.0  
##  NA's   :929    NA's   :1233    NA's   :1974     NA's   :1743    
##  Folate_DFE_.µg. Choline_Tot_..mg.  Vit_B12_.µg.     Vit_A_IU     
##  Min.   :   0    Min.   :   0      Min.   : 0.0   Min.   :     0  
##  1st Qu.:   4    1st Qu.:  10      1st Qu.: 0.0   1st Qu.:     0  
##  Median :  11    Median :  27      Median : 0.3   Median :    33  
##  Mean   :  66    Mean   :  45      Mean   : 1.4   Mean   :   744  
##  3rd Qu.:  42    3rd Qu.:  71      3rd Qu.: 1.6   3rd Qu.:   259  
##  Max.   :5881    Max.   :2403      Max.   :98.9   Max.   :100000  
##  NA's   :1993    NA's   :4070      NA's   :1179   NA's   :673     
##    Vit_A_RAE      Retinol_.µg.   Alpha_Carot_.µg. Beta_Carot_.µg.
##  Min.   :    0   Min.   :    0   Min.   :    0    Min.   :    0  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:    0    1st Qu.:    0  
##  Median :    4   Median :    0   Median :    0    Median :    0  
##  Mean   :  114   Mean   :   92   Mean   :   33    Mean   :  224  
##  3rd Qu.:   34   3rd Qu.:   12   3rd Qu.:    0    3rd Qu.:   20  
##  Max.   :30000   Max.   :30000   Max.   :14251    Max.   :42891  
##  NA's   :1516    NA's   :1799    NA's   :3341     NA's   :3248   
##  Beta_Crypt_.µg. Lycopene_.µg.   Lut.Zea_..µg.     Vit_E_.mg.   
##  Min.   :   0    Min.   :    0   Min.   :    0   Min.   :  0.0  
##  1st Qu.:   0    1st Qu.:    0   1st Qu.:    0   1st Qu.:  0.1  
##  Median :   0    Median :    0   Median :    0   Median :  0.3  
##  Mean   :  11    Mean   :  165   Mean   :  193   Mean   :  1.3  
##  3rd Qu.:   0    3rd Qu.:    0   3rd Qu.:   28   3rd Qu.:  0.8  
##  Max.   :6252    Max.   :46260   Max.   :19697   Max.   :149.4  
##  NA's   :3352    NA's   :3376    NA's   :3398    NA's   :2991   
##     Vit_D_µg       Vit_D_IU       Vit_K_.µg.     FA_Sat_.g.  
##  Min.   :  0    Min.   :    0   Min.   :   0   Min.   : 0.0  
##  1st Qu.:  0    1st Qu.:    0   1st Qu.:   0   1st Qu.: 0.2  
##  Median :  0    Median :    0   Median :   2   Median : 1.6  
##  Mean   :  1    Mean   :   23   Mean   :  16   Mean   : 3.6  
##  3rd Qu.:  0    3rd Qu.:    7   3rd Qu.:   5   3rd Qu.: 4.4  
##  Max.   :250    Max.   :10000   Max.   :1714   Max.   :95.6  
##  NA's   :3286   NA's   :3285    NA's   :3635   NA's   :331   
##   FA_Mono_.g.    FA_Poly_.g.   Cholestrl_.mg.       GmWt_1      
##  Min.   : 0.0   Min.   : 0.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.: 0.3   1st Qu.: 0.2   1st Qu.:   0.0   1st Qu.:  28.0  
##  Median : 2.0   Median : 0.7   Median :   5.0   Median :  85.0  
##  Mean   : 4.1   Mean   : 2.3   Mean   :  41.5   Mean   :  95.4  
##  3rd Qu.: 5.2   3rd Qu.: 2.1   3rd Qu.:  67.0   3rd Qu.: 130.0  
##  Max.   :83.7   Max.   :74.6   Max.   :3100.0   Max.   :1171.0  
##  NA's   :658    NA's   :651    NA's   :355      NA's   :246     
##      GmWt_Desc1       GmWt_2    
##  3 oz     :1377   Min.   :   0  
##  1 cup    :1160   1st Qu.:  38  
##  1 oz     : 903   Median : 163  
##  1 tbsp   : 342   Mean   : 248  
##  1 serving: 288   3rd Qu.: 301  
##  1 fl oz  : 246   Max.   :5717  
##  (Other)  :4284   NA's   :3817  
##                                                                     GmWt_Desc2  
##                                                                          :3817  
##  1 cup                                                                   : 293  
##  1 lb                                                                    : 266  
##  1 steak                                                                 : 263  
##  1 piece, cooked, excluding refuse (yield from 1 lb raw meat with refuse): 212  
##  1 oz                                                                    : 198  
##  (Other)                                                                 :3551  
##    Refuse_Pct   
##  Min.   : 0.00  
##  1st Qu.: 0.00  
##  Median : 0.00  
##  Mean   : 5.02  
##  3rd Qu.: 0.00  
##  Max.   :81.00  
##  NA's   :39

Randomization is not applicable here due to the nature of the data. The data were created by the USDA as part of the effort to acquire, evaluate, compile, and disseminate composition data on an exhaustive list of foods available via retail in the United States.

The hypothesis is that protein content depends on food type. Analysis of Variance (ANOVA) will be used to understand the effect of Short_Description on the protein content (g). The rationale for using ANOVA is because there is a numerical outcome (Protein content) and a categorical predictor variable (Short_Description).

For these data, there are no replicates or repeated measures. Each food type is measured for each nutrient only once and there are no duplicate names.

Blocking will not be utilized here since the objective is to focus on resampling.

Graph showing protein level by Short_Description :

par(mfrow=c(1,1))
plot(Protein~Short_Description,xlab="Short Description (first letter of food name)",ylab="Protein (g)")

plot of chunk unnamed-chunk-3

Create Initial ANOVA model:

model=lm(Protein~Short_Description, data=ag)
anova(model)

## Analysis of Variance Table
## 
## Response: Protein
##                     Df Sum Sq Mean Sq F value Pr(>F)    
## Short_Description   24 126136    5256    54.2 <2e-16 ***
## Residuals         8575 830877      97                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this particular model, we reject the H0 that Short_Description has no effect on the mean protein level (g). Thus, variation in the total protein level can be attributed solely to variation in the Short_Description. However, parameter estimation must be performed to ensure the model conforms to the critical assumptions of the ANOVA (normality, hetergeneity of variance, etc.).

Estimate Parameters using Shapiro-Wilk test of normality on sample of 500 observations from original dataset.

sampling = ag[sample(1:nrow(ag), 500, replace=FALSE),]
shapiro.test(sampling$Protein)

## 
##  Shapiro-Wilk normality test
## 
## data:  sampling$Protein
## W = 0.9025, p-value < 2.2e-16

qqnorm(sampling$Protein)
qqline(sampling$Protein)

plot of chunk unnamed-chunk-5

We reject the H0 that the sample of the response variable, Protein, is normally distributed. Thus, we attempt to transform the data:

shapiro.test(sampling$Protein^2)

## 
##  Shapiro-Wilk normality test
## 
## data:  sampling$Protein^2
## W = 0.7269, p-value < 2.2e-16

shapiro.test(sqrt(sampling$Protein))

## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(sampling$Protein)
## W = 0.9657, p-value = 2.074e-09

shapiro.test(log(sampling$Protein))

## 
##  Shapiro-Wilk normality test
## 
## data:  log(sampling$Protein)
## W = NaN, p-value = NA

shapiro.test(log10(sampling$Protein))

## 
##  Shapiro-Wilk normality test
## 
## data:  log10(sampling$Protein)
## W = NaN, p-value = NA

shapiro.test(1/sampling$Protein)

## 
##  Shapiro-Wilk normality test
## 
## data:  1/sampling$Protein
## W = NaN, p-value = NA

shapiro.test(1/sampling$Protein)

## 
##  Shapiro-Wilk normality test
## 
## data:  1/sampling$Protein
## W = NaN, p-value = NA

It appears that the data cannot be readily transformed into a normal distribution.

One of the primary assumptions of the analysis of variance is that the data are normally distributed for the variables of interest, and since they are not(and cannot be readily transformed into such), the results from the analysis of variance can potentially be invalid because the theoretical F-distribution may not apply. Therefore, we perform resampling via Monte Carlo simulation (assuming each level of Short_Description follows a normal distribution) to reproduce the theoretical distribution.

summary(aov(Protein~Short_Description, data=ag))

##                     Df Sum Sq Mean Sq F value Pr(>F)    
## Short_Description   24 126136    5256    54.2 <2e-16 ***
## Residuals         8575 830877      97                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Assuming the pooled variance of 97 applies to all groups, the theoretical F-distribution is calculated as follows:

with(ag,tapply(Protein,Short_Description,length))

##    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O 
##  193 1649 1601  106   90  356  190   72  157   27  296  408  407   43  228 
##    P    Q    R    S    T    U    V    W    Y    Z 
##  969    8  174  867  334   37  184  157   45    2

meanstar = mean(ag$Protein)
sdstar = sqrt(97)
simShort_Description = ag$Short_Description
runs = 1000 
Fstar = numeric(runs)
for (i in 1:runs) {
      A = rnorm(193, mean = meanstar, sd = sdstar)
      B = rnorm(1649, mean = meanstar, sd = sdstar)
      C = rnorm(1601, mean = meanstar, sd = sdstar)
      D = rnorm(106, mean = meanstar, sd = sdstar)
      E = rnorm(90, mean = meanstar, sd = sdstar)
      F = rnorm(356, mean = meanstar, sd = sdstar)
      G = rnorm(190, mean = meanstar, sd = sdstar)
      H = rnorm(72, mean = meanstar, sd = sdstar)
      I = rnorm(157, mean = meanstar, sd = sdstar)
      J = rnorm(27, mean = meanstar, sd = sdstar)
      K = rnorm(296, mean = meanstar, sd = sdstar)
      L = rnorm(408, mean = meanstar, sd = sdstar)
      M = rnorm(407, mean = meanstar, sd = sdstar)
      N = rnorm(43, mean = meanstar, sd = sdstar)
      O = rnorm(228, mean = meanstar, sd = sdstar)
      P = rnorm(969, mean = meanstar, sd = sdstar)
      Q = rnorm(8, mean = meanstar, sd = sdstar)
      R = rnorm(174, mean = meanstar, sd = sdstar)
      S = rnorm(867, mean = meanstar, sd = sdstar)
      T = rnorm(334, mean = meanstar, sd = sdstar)
      U = rnorm(37, mean = meanstar, sd = sdstar)
      V = rnorm(184, mean = meanstar, sd = sdstar)
      W = rnorm(157, mean = meanstar, sd = sdstar)
      Y = rnorm(45, mean = meanstar, sd = sdstar)
      Z = rnorm(2, mean = meanstar, sd = sdstar)
      simProtein = c(A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z)
      simdata = data.frame(simProtein,simShort_Description)
      Fstar[i] = oneway.test(simProtein~simShort_Description,data=simdata,var.equal=TRUE,)$statistic
  }

For each level of Short_Description, a random normal distribution of the Protein content is created with the appropriate number of levels. A dataframe is then created by combining these Protein distributions to the actual short descriptions and a one-way ANOVA is performed on this simulated dataset to determine the theoretical F-distribution after 1,000 repititions.

Then a histogram is created for the simulated F-distribution with DF=24 and DF=8575 for Short_Description and residuals, respectively:

Now the agriculture dataset will be used to get the actual F-distribution. Each level of Short_Description will be centered around a mean difference of zero, but the variance and shape of the distribution for each level of Short_Description will remain the same.

newmeanstar = with(ag, tapply(Protein,Short_Description,mean))
newmeanstar

##      A      B      C      D      E      F      G      H      I      J 
##  4.113 16.853  9.589  9.635 16.472 10.239 14.447 11.703  5.740  1.644 
##      K      L      M      N      O      P      Q      R      S      T 
##  7.338 17.768  7.938  8.524  6.812 12.025 16.000  7.062  8.248 13.130 
##      U      V      W      Y      Z 
## 13.558 15.533 11.922  6.632 10.100

grpA = ag$Protein[ag$Short_Description=="A"] - newmeanstar[1]
grpB = ag$Protein[ag$Short_Description=="B"] - newmeanstar[2]
grpC = ag$Protein[ag$Short_Description=="C"] - newmeanstar[3]
grpD = ag$Protein[ag$Short_Description=="D"] - newmeanstar[4]
grpE = ag$Protein[ag$Short_Description=="E"] - newmeanstar[5]
grpF = ag$Protein[ag$Short_Description=="F"] - newmeanstar[6]
grpG = ag$Protein[ag$Short_Description=="G"] - newmeanstar[7]
grpH = ag$Protein[ag$Short_Description=="H"] - newmeanstar[8]
grpI = ag$Protein[ag$Short_Description=="I"] - newmeanstar[9]
grpJ = ag$Protein[ag$Short_Description=="J"] - newmeanstar[10]
grpK = ag$Protein[ag$Short_Description=="K"] - newmeanstar[11]
grpL = ag$Protein[ag$Short_Description=="L"] - newmeanstar[12]
grpM = ag$Protein[ag$Short_Description=="M"] - newmeanstar[13]
grpN = ag$Protein[ag$Short_Description=="N"] - newmeanstar[14]
grpO = ag$Protein[ag$Short_Description=="O"] - newmeanstar[15]
grpP = ag$Protein[ag$Short_Description=="P"] - newmeanstar[16]
grpQ = ag$Protein[ag$Short_Description=="Q"] - newmeanstar[17]
grpR = ag$Protein[ag$Short_Description=="R"] - newmeanstar[18]
grpS = ag$Protein[ag$Short_Description=="S"] - newmeanstar[19]
grpT = ag$Protein[ag$Short_Description=="T"] - newmeanstar[20]
grpU = ag$Protein[ag$Short_Description=="U"] - newmeanstar[21]
grpV = ag$Protein[ag$Short_Description=="V"] - newmeanstar[22]
grpW = ag$Protein[ag$Short_Description=="W"] - newmeanstar[23]
grpY = ag$Protein[ag$Short_Description=="Y"] - newmeanstar[24]
grpZ = ag$Protein[ag$Short_Description=="Z"] - newmeanstar[25]

newsimShort_Description = ag$Short_Description

newFstar = numeric(runs)
for (i in 1:runs) {
  groupA = sample(grpA, size=193, replace=TRUE)
  groupB = sample(grpB, size=1649, replace=TRUE)
  groupC = sample(grpC, size=1601, replace=TRUE)
  groupD = sample(grpD, size=106, replace=TRUE)
  groupE = sample(grpE, size=90, replace=TRUE)
  groupF = sample(grpF, size=356, replace=TRUE)
  groupG = sample(grpG, size=190, replace=TRUE)
  groupH = sample(grpH, size=72, replace=TRUE)
  groupI = sample(grpI, size=157, replace=TRUE)
  groupJ = sample(grpJ, size=27, replace=TRUE)
  groupK = sample(grpK, size=296, replace=TRUE)
  groupL = sample(grpL, size=408, replace=TRUE)
  groupM = sample(grpM, size=407, replace=TRUE)
  groupN = sample(grpN, size=43, replace=TRUE)
  groupO = sample(grpO, size=228, replace=TRUE)
  groupP = sample(grpP, size=969, replace=TRUE)
  groupQ = sample(grpQ, size=8, replace=TRUE)
  groupR = sample(grpR, size=174, replace=TRUE)
  groupS = sample(grpS, size=867, replace=TRUE)
  groupT = sample(grpT, size=334, replace=TRUE)
  groupU = sample(grpU, size=37, replace=TRUE)
  groupV = sample(grpV, size=184, replace=TRUE)
  groupW = sample(grpW, size=157, replace=TRUE)
  groupY = sample(grpY, size=45, replace=TRUE)
  groupZ = sample(grpZ, size=2, replace=TRUE)
  newsimProtein = c(groupA,groupB,groupC,groupD,groupE,groupF,groupG,groupH,groupI,groupJ,groupK,groupL,groupM,groupN,groupO,groupP,groupQ,groupR,groupS,groupT,groupU,groupV,groupW,groupY,groupZ)
  newsimdata = data.frame(newsimProtein,newsimShort_Description)
  newFstar[i] = oneway.test(newsimProtein~newsimShort_Description,var.equal=TRUE,data = newsimdata)$statistic
  }

This bootstrapped F-distribution is still based on equal mean Protein quantities between levels of Short_Description, but normality is not assumed.

par(mfrow=c(1,2))
hist(Fstar,prob=TRUE,main="PDF of Theoretical F-distribution")
x = seq(.25,3,.25)
points(x,y=df(x,24,8575), type = "b", col = "red")

hist(newFstar, breaks=seq(0,3,.25), ylim=c(0,1.8), prob=TRUE, main = "PDF of Actual F-distribution")
newx=seq(.025,2.5,.25)
points(newx,y=df(newx,24,8575),type="b",col="red")

plot of chunk unnamed-chunk-10

This bootstrapped empirical F-distribution is different than the theoretical F-distribution because it doesn’t exactly match the analytic representation quite as well.

qf(.95,24,8575)

## [1] 1.519

quantile(newFstar,.95)

##  95% 
## 1.47

newmodel = lm(newsimProtein~newsimShort_Description,data=newsimdata)
anova(newmodel)

## Analysis of Variance Table
## 
## Response: newsimProtein
##                           Df Sum Sq Mean Sq F value Pr(>F)
## newsimShort_Description   24   3090   128.8    1.37   0.11
## Residuals               8575 803698    93.7

The F-statistic for DF=24 and DF=8575 at alpha = .05 is 1.52, but from the bootstrapped F-distribution it is 1.40. By re-running the Analysis of Variance on the resampled F-distribution, we fail to reject the H0 (p>.05) that Short_Description has no effect on the Protein content. Thus, variation in the response (protein) can only be attributed to randomization. In this case, performing ANOVA after resampling via bootstrapping yielded a much different result than the initial ANOVA. Initially, the H0 was easily rejected but after resampling, the H0 could not be rejected.

Diagnostics and Model Adequacy Checking:

par(mfrow=c(2,2))
plot(newmodel)

plot of chunk unnamed-chunk-12

The model appears to have a decent fit to the data since the points are evenly dispersed across the Cartesian plane of residual error and the fitted model. However, there are outliers toward the higher end that are not explained well by the model.

References:

U.S. Department of Agriculture, Agricultural Research Service. 2014. USDA National Nutrient Database for Standard Reference, Release 27. Nutrient Data Laboratory Home Page, “http://www.ars.usda.gov/ba/bhnrc/ndl”

King, William B. Resampling Techniques. Coastal Carolina University. “http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html”