Dataset under analysis includes data on Release 27 of the USDA National Nutrient Database for Standard Reference. There are 8,618 different foods included, but for the Analysis of Variance, this list will be truncated to 25 levels based on the first letter of the food name. Note, there was no food name starting with the letter “X”.
remove(list=ls())
ag = read.csv("C:/Users/Trevor/Documents/agriculture.csv", header=TRUE)
#view first few lines
head(ag)
## NDB_No Shrt_Desc Short_Description Water_.g.
## 1 15155 ABALONE,MIXED SPECIES,RAW A 74.56
## 2 15156 ABALONE,MXD SP,CKD,FRIED A 60.10
## 3 9427 ABIYUCH,RAW A 79.90
## 4 9002 ACEROLA JUICE,RAW A 94.30
## 5 9001 ACEROLA,(WEST INDIAN CHERRY),RAW A 91.41
## 6 12060 ACORN FLOUR,FULL FAT A 6.00
## Energ_Kcal Protein Lipid_Tot_.g. Ash_.g. Carbohydrt_.g. Fiber_TD_.g.
## 1 105 17.10 0.76 1.57 6.01 0.0
## 2 189 19.63 6.78 1.77 11.05 0.0
## 3 69 1.50 0.10 0.90 17.60 5.3
## 4 23 0.40 0.30 0.20 4.80 0.3
## 5 32 0.40 0.30 0.20 7.69 1.1
## 6 501 7.49 30.17 1.69 54.65 NA
## Sugar_Tot_.g. Calcium_.mg. Iron_.mg. Magnesium_.mg. Phosphorus_.mg.
## 1 0.00 31 3.19 48 190
## 2 NA 37 3.80 56 217
## 3 8.55 8 1.61 24 47
## 4 4.50 10 0.50 12 9
## 5 NA 12 0.20 18 11
## 6 NA 43 1.21 110 103
## Potassium_.mg. Sodium_.mg. Zinc_.mg. Copper_mg. Manganese_.mg.
## 1 250 301 0.82 0.196 0.040
## 2 284 591 0.95 0.228 0.070
## 3 304 20 0.31 0.057 0.182
## 4 97 3 0.10 0.086 NA
## 5 146 7 0.10 0.086 NA
## 6 712 0 0.64 0.611 1.743
## Selenium_.µg. Vit_C_.mg. Thiamin_.mg. Riboflavin_.mg. Niacin_.mg.
## 1 44.8 2.0 0.190 0.100 1.500
## 2 51.8 1.8 0.220 0.130 1.900
## 3 NA 54.1 NA NA NA
## 4 0.1 1600.0 0.020 0.060 0.400
## 5 0.6 1677.6 0.020 0.060 0.400
## 6 NA 0.0 0.146 0.154 2.382
## Panto_Acid_mg. Vit_B6_.mg. Folate_Tot_.µg. Folic_Acid_.µg.
## 1 3.000 0.150 5 0
## 2 2.870 0.150 14 9
## 3 NA NA NA NA
## 4 0.205 0.004 14 0
## 5 0.309 0.009 14 0
## 6 0.931 0.688 114 0
## Food_Folate_.µg. Folate_DFE_.µg. Choline_Tot_..mg. Vit_B12_.µg. Vit_A_IU
## 1 5 5 65 0.73 7
## 2 5 20 NA 0.69 5
## 3 NA NA NA NA 100
## 4 14 14 NA 0.00 509
## 5 14 14 NA 0.00 767
## 6 114 114 NA 0.00 51
## Vit_A_RAE Retinol_.µg. Alpha_Carot_.µg. Beta_Carot_.µg. Beta_Crypt_.µg.
## 1 2 2 0 0 0
## 2 2 2 NA NA NA
## 3 5 0 0 60 NA
## 4 25 0 0 305 0
## 5 38 0 NA NA NA
## 6 3 0 NA NA NA
## Lycopene_.µg. Lut.Zea_..µg. Vit_E_.mg. Vit_D_µg Vit_D_IU Vit_K_.µg.
## 1 0 0 4.00 0 0 23.0
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 0 17 0.18 NA NA 1.4
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## FA_Sat_.g. FA_Mono_.g. FA_Poly_.g. Cholestrl_.mg. GmWt_1 GmWt_Desc1
## 1 0.149 0.107 0.104 85 85 3 oz
## 2 1.646 2.741 1.676 94 85 3 oz
## 3 0.014 NA NA NA 114 .5 cup
## 4 0.068 0.082 0.090 0 242 1 cup
## 5 0.068 0.082 0.090 0 98 1 cup
## 6 3.923 19.110 5.813 0 28 1 oz
## GmWt_2 GmWt_Desc2 Refuse_Pct
## 1 NA 0
## 2 NA 0
## 3 NA NA
## 4 30.2 1 fl oz 0
## 5 4.8 1 fruit, without refuse 20
## 6 NA 0
#observe the structure of the data
str(ag)
## 'data.frame': 8600 obs. of 54 variables:
## $ NDB_No : int 15155 15156 9427 9002 9001 12060 35182 12059 12058 35193 ...
## $ Shrt_Desc : Factor w/ 8595 levels "ABALONE,MIXED SPECIES,RAW",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Short_Description: Factor w/ 25 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Water_.g. : num 74.6 60.1 79.9 94.3 91.4 ...
## $ Energ_Kcal : int 105 189 69 23 32 501 95 509 387 135 ...
## $ Protein : num 17.1 19.6 1.5 0.4 0.4 ...
## $ Lipid_Tot_.g. : num 0.76 6.78 0.1 0.3 0.3 ...
## $ Ash_.g. : num 1.57 1.77 0.9 0.2 0.2 1.69 0.72 1.78 1.35 1.32 ...
## $ Carbohydrt_.g. : num 6.01 11.05 17.6 4.8 7.69 ...
## $ Fiber_TD_.g. : num 0 0 5.3 0.3 1.1 NA 0.7 NA NA 10.6 ...
## $ Sugar_Tot_.g. : num 0 NA 8.55 4.5 NA ...
## $ Calcium_.mg. : int 31 37 8 10 12 43 14 54 41 460 ...
## $ Iron_.mg. : num 3.19 3.8 1.61 0.5 0.2 1.21 1 1.04 0.79 3.55 ...
## $ Magnesium_.mg. : int 48 56 24 12 18 110 12 82 62 39 ...
## $ Phosphorus_.mg. : int 190 217 47 9 11 103 62 103 79 9 ...
## $ Potassium_.mg. : int 250 284 304 97 146 712 110 709 539 59 ...
## $ Sodium_.mg. : int 301 591 20 3 7 0 130 0 0 13 ...
## $ Zinc_.mg. : num 0.82 0.95 0.31 0.1 0.1 0.64 1.6 0.67 0.51 0.25 ...
## $ Copper_mg. : num 0.196 0.228 0.057 0.086 0.086 0.611 0.03 0.818 0.621 0.112 ...
## $ Manganese_.mg. : num 0.04 0.07 0.182 NA NA ...
## $ Selenium_.µg. : num 44.8 51.8 NA 0.1 0.6 NA 8.3 NA NA 0.2 ...
## $ Vit_C_.mg. : num 2 1.8 54.1 1600 1677.6 ...
## $ Thiamin_.mg. : num 0.19 0.22 NA 0.02 0.02 0.146 0.175 0.149 0.112 0.012 ...
## $ Riboflavin_.mg. : num 0.1 0.13 NA 0.06 0.06 0.154 0.125 0.154 0.118 0.099 ...
## $ Niacin_.mg. : num 1.5 1.9 NA 0.4 0.4 ...
## $ Panto_Acid_mg. : num 3 2.87 NA 0.205 0.309 0.931 0.212 0.94 0.715 0.041 ...
## $ Vit_B6_.mg. : num 0.15 0.15 NA 0.004 0.009 0.688 0.055 0.695 0.528 0.087 ...
## $ Folate_Tot_.µg. : int 5 14 NA 14 14 114 33 115 87 3 ...
## $ Folic_Acid_.µg. : int 0 9 NA 0 0 0 15 0 0 0 ...
## $ Food_Folate_.µg. : int 5 5 NA 14 14 114 18 115 87 3 ...
## $ Folate_DFE_.µg. : int 5 20 NA 14 14 114 44 115 87 3 ...
## $ Choline_Tot_..mg.: num 65 NA NA NA NA NA NA NA NA 8.8 ...
## $ Vit_B12_.µg. : num 0.73 0.69 NA 0 0 0 0.68 0 0 0 ...
## $ Vit_A_IU : int 7 5 100 509 767 51 0 0 39 113 ...
## $ Vit_A_RAE : int 2 2 5 25 38 3 0 0 2 6 ...
## $ Retinol_.µg. : int 2 2 0 0 0 0 0 0 0 0 ...
## $ Alpha_Carot_.µg. : int 0 NA 0 0 NA NA NA NA NA 0 ...
## $ Beta_Carot_.µg. : int 0 NA 60 305 NA NA NA NA NA 68 ...
## $ Beta_Crypt_.µg. : int 0 NA NA 0 NA NA NA NA NA 0 ...
## $ Lycopene_.µg. : int 0 NA NA 0 NA NA NA NA NA 0 ...
## $ Lut.Zea_..µg. : int 0 NA NA 17 NA NA NA NA NA NA ...
## $ Vit_E_.mg. : num 4 NA NA 0.18 NA NA 0.3 NA NA 0.36 ...
## $ Vit_D_µg : num 0 NA NA NA NA NA NA NA NA 0 ...
## $ Vit_D_IU : int 0 NA NA NA NA NA NA NA NA 0 ...
## $ Vit_K_.µg. : num 23 NA NA 1.4 NA NA 0 NA NA 4.9 ...
## $ FA_Sat_.g. : num 0.149 1.646 0.014 0.068 0.068 ...
## $ FA_Mono_.g. : num 0.107 2.741 NA 0.082 0.082 ...
## $ FA_Poly_.g. : num 0.104 1.676 NA 0.09 0.09 ...
## $ Cholestrl_.mg. : int 85 94 NA 0 0 0 20 0 0 0 ...
## $ GmWt_1 : int 85 85 114 242 98 28 NA 28 28 NA ...
## $ GmWt_Desc1 : Factor w/ 902 levels "",".083 package",..: 825 825 35 158 158 370 1 370 370 1 ...
## $ GmWt_2 : num NA NA NA 30.2 4.8 NA NA NA NA NA ...
## $ GmWt_Desc2 : Factor w/ 911 levels "",".083 package",..: 1 1 1 314 331 1 1 1 1 1 ...
## $ Refuse_Pct : int 0 0 NA 0 20 0 0 38 38 0 ...
#8618 observations of 54 variables
attach(ag)
The data are tabluated into 54 columns, with detailed information on food type, related nutional information for each type, and common household measures. This project will test the effects of the factor “Short_Description” (25 levels) on the response, “Protein”, measured in grams for each serving size. Summary statistics are as follows:
summary(ag)
## NDB_No
## Min. : 1001
## 1st Qu.: 8692
## Median :14304
## Mean :15473
## 3rd Qu.:20056
## Max. :93600
##
## Shrt_Desc
## BABYFOOD,MEAT,BF,STR : 2
## BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA: 2
## OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT : 2
## POPCORN,OIL-POPPED,LOFAT : 2
## ZWIEBACK : 2
## ABALONE,MIXED SPECIES,RAW : 1
## (Other) :8589
## Short_Description Water_.g. Energ_Kcal Protein
## B :1649 Min. : 0.0 Min. : 0 Min. : 0.00
## C :1601 1st Qu.: 29.9 1st Qu.: 93 1st Qu.: 2.49
## P : 969 Median : 63.2 Median :191 Median : 8.30
## S : 867 Mean : 54.1 Mean :227 Mean :11.54
## L : 408 3rd Qu.: 77.5 3rd Qu.:336 3rd Qu.:19.99
## M : 407 Max. :100.0 Max. :902 Max. :88.32
## (Other):2699 NA's :6
## Lipid_Tot_.g. Ash_.g. Carbohydrt_.g. Fiber_TD_.g.
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 0.9 1st Qu.: 0.04 1st Qu.: 0.0
## Median : 5.28 Median : 1.2 Median : 8.98 Median : 0.7
## Mean : 10.66 Mean : 1.8 Mean : 21.85 Mean : 2.2
## 3rd Qu.: 13.94 3rd Qu.: 2.1 3rd Qu.: 33.11 3rd Qu.: 2.5
## Max. :100.00 Max. :99.8 Max. :100.00 Max. :79.0
## NA's :332 NA's :643
## Sugar_Tot_.g. Calcium_.mg. Iron_.mg. Magnesium_.mg.
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 10 1st Qu.: 0.57 1st Qu.: 13.0
## Median : 1.7 Median : 21 Median : 1.38 Median : 21.0
## Mean : 8.5 Mean : 76 Mean : 2.74 Mean : 35.5
## 3rd Qu.: 9.0 3rd Qu.: 67 3rd Qu.: 2.60 3rd Qu.: 30.5
## Max. :99.8 Max. :7364 Max. :123.60 Max. :781.0
## NA's :1925 NA's :341 NA's :134 NA's :669
## Phosphorus_.mg. Potassium_.mg. Sodium_.mg. Zinc_.mg.
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 53 1st Qu.: 130 1st Qu.: 42 1st Qu.: 0.4
## Median : 149 Median : 232 Median : 90 Median : 1.0
## Mean : 167 Mean : 283 Mean : 321 Mean : 2.1
## 3rd Qu.: 221 3rd Qu.: 337 3rd Qu.: 416 3rd Qu.: 3.0
## Max. :9918 Max. :16500 Max. :38758 Max. :91.0
## NA's :559 NA's :397 NA's :70 NA's :688
## Copper_mg. Manganese_.mg. Selenium_.µg. Vit_C_.mg.
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.1 1st Qu.: 0.0 1st Qu.: 1.5 1st Qu.: 0.0
## Median : 0.1 Median : 0.1 Median : 10.5 Median : 0.0
## Mean : 0.2 Mean : 0.7 Mean : 15.8 Mean : 8.7
## 3rd Qu.: 0.2 3rd Qu.: 0.3 3rd Qu.: 24.5 3rd Qu.: 3.5
## Max. :15.1 Max. :328.0 Max. :1917.0 Max. :2400.0
## NA's :1242 NA's :2128 NA's :1738 NA's :780
## Thiamin_.mg. Riboflavin_.mg. Niacin_.mg. Panto_Acid_mg.
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.1 1st Qu.: 0.6 1st Qu.: 0.2
## Median : 0.1 Median : 0.2 Median : 2.6 Median : 0.4
## Mean : 0.2 Mean : 0.3 Mean : 3.7 Mean : 0.7
## 3rd Qu.: 0.2 3rd Qu.: 0.3 3rd Qu.: 5.3 3rd Qu.: 0.7
## Max. :23.4 Max. :17.5 Max. :127.5 Max. :34.5
## NA's :667 NA's :645 NA's :669 NA's :2145
## Vit_B6_.mg. Folate_Tot_.µg. Folic_Acid_.µg. Food_Folate_.µg.
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.1 1st Qu.: 5 1st Qu.: 0.0 1st Qu.: 4.0
## Median : 0.2 Median : 12 Median : 0.0 Median : 9.0
## Mean : 0.3 Mean : 53 Mean : 23.8 Mean : 24.9
## 3rd Qu.: 0.4 3rd Qu.: 46 3rd Qu.: 0.0 3rd Qu.: 21.0
## Max. :12.0 Max. :3786 Max. :2993.0 Max. :2340.0
## NA's :929 NA's :1233 NA's :1974 NA's :1743
## Folate_DFE_.µg. Choline_Tot_..mg. Vit_B12_.µg. Vit_A_IU
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 4 1st Qu.: 10 1st Qu.: 0.0 1st Qu.: 0
## Median : 11 Median : 27 Median : 0.3 Median : 33
## Mean : 66 Mean : 45 Mean : 1.4 Mean : 744
## 3rd Qu.: 42 3rd Qu.: 71 3rd Qu.: 1.6 3rd Qu.: 259
## Max. :5881 Max. :2403 Max. :98.9 Max. :100000
## NA's :1993 NA's :4070 NA's :1179 NA's :673
## Vit_A_RAE Retinol_.µg. Alpha_Carot_.µg. Beta_Carot_.µg.
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 4 Median : 0 Median : 0 Median : 0
## Mean : 114 Mean : 92 Mean : 33 Mean : 224
## 3rd Qu.: 34 3rd Qu.: 12 3rd Qu.: 0 3rd Qu.: 20
## Max. :30000 Max. :30000 Max. :14251 Max. :42891
## NA's :1516 NA's :1799 NA's :3341 NA's :3248
## Beta_Crypt_.µg. Lycopene_.µg. Lut.Zea_..µg. Vit_E_.mg.
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.1
## Median : 0 Median : 0 Median : 0 Median : 0.3
## Mean : 11 Mean : 165 Mean : 193 Mean : 1.3
## 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.: 28 3rd Qu.: 0.8
## Max. :6252 Max. :46260 Max. :19697 Max. :149.4
## NA's :3352 NA's :3376 NA's :3398 NA's :2991
## Vit_D_µg Vit_D_IU Vit_K_.µg. FA_Sat_.g.
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.2
## Median : 0 Median : 0 Median : 2 Median : 1.6
## Mean : 1 Mean : 23 Mean : 16 Mean : 3.6
## 3rd Qu.: 0 3rd Qu.: 7 3rd Qu.: 5 3rd Qu.: 4.4
## Max. :250 Max. :10000 Max. :1714 Max. :95.6
## NA's :3286 NA's :3285 NA's :3635 NA's :331
## FA_Mono_.g. FA_Poly_.g. Cholestrl_.mg. GmWt_1
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.3 1st Qu.: 0.2 1st Qu.: 0.0 1st Qu.: 28.0
## Median : 2.0 Median : 0.7 Median : 5.0 Median : 85.0
## Mean : 4.1 Mean : 2.3 Mean : 41.5 Mean : 95.4
## 3rd Qu.: 5.2 3rd Qu.: 2.1 3rd Qu.: 67.0 3rd Qu.: 130.0
## Max. :83.7 Max. :74.6 Max. :3100.0 Max. :1171.0
## NA's :658 NA's :651 NA's :355 NA's :246
## GmWt_Desc1 GmWt_2
## 3 oz :1377 Min. : 0
## 1 cup :1160 1st Qu.: 38
## 1 oz : 903 Median : 163
## 1 tbsp : 342 Mean : 248
## 1 serving: 288 3rd Qu.: 301
## 1 fl oz : 246 Max. :5717
## (Other) :4284 NA's :3817
## GmWt_Desc2
## :3817
## 1 cup : 293
## 1 lb : 266
## 1 steak : 263
## 1 piece, cooked, excluding refuse (yield from 1 lb raw meat with refuse): 212
## 1 oz : 198
## (Other) :3551
## Refuse_Pct
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 5.02
## 3rd Qu.: 0.00
## Max. :81.00
## NA's :39
Randomization is not applicable here due to the nature of the data. The data were created by the USDA as part of the effort to acquire, evaluate, compile, and disseminate composition data on an exhaustive list of foods available via retail in the United States.
The hypothesis is that protein content depends on food type. Analysis of Variance (ANOVA) will be used to understand the effect of Short_Description on the protein content (g). The rationale for using ANOVA is because there is a numerical outcome (Protein content) and a categorical predictor variable (Short_Description).
For these data, there are no replicates or repeated measures. Each food type is measured for each nutrient only once and there are no duplicate names.
Blocking will not be utilized here since the objective is to focus on resampling.
Graph showing protein level by Short_Description :
par(mfrow=c(1,1))
plot(Protein~Short_Description,xlab="Short Description (first letter of food name)",ylab="Protein (g)")
Create Initial ANOVA model:
model=lm(Protein~Short_Description, data=ag)
anova(model)
## Analysis of Variance Table
##
## Response: Protein
## Df Sum Sq Mean Sq F value Pr(>F)
## Short_Description 24 126136 5256 54.2 <2e-16 ***
## Residuals 8575 830877 97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this particular model, we reject the H0 that Short_Description has no effect on the mean protein level (g). Thus, variation in the total protein level can be attributed solely to variation in the Short_Description. However, parameter estimation must be performed to ensure the model conforms to the critical assumptions of the ANOVA (normality, hetergeneity of variance, etc.).
Estimate Parameters using Shapiro-Wilk test of normality on sample of 500 observations from original dataset.
sampling = ag[sample(1:nrow(ag), 500, replace=FALSE),]
shapiro.test(sampling$Protein)
##
## Shapiro-Wilk normality test
##
## data: sampling$Protein
## W = 0.9025, p-value < 2.2e-16
qqnorm(sampling$Protein)
qqline(sampling$Protein)
We reject the H0 that the sample of the response variable, Protein, is normally distributed. Thus, we attempt to transform the data:
shapiro.test(sampling$Protein^2)
##
## Shapiro-Wilk normality test
##
## data: sampling$Protein^2
## W = 0.7269, p-value < 2.2e-16
shapiro.test(sqrt(sampling$Protein))
##
## Shapiro-Wilk normality test
##
## data: sqrt(sampling$Protein)
## W = 0.9657, p-value = 2.074e-09
shapiro.test(log(sampling$Protein))
##
## Shapiro-Wilk normality test
##
## data: log(sampling$Protein)
## W = NaN, p-value = NA
shapiro.test(log10(sampling$Protein))
##
## Shapiro-Wilk normality test
##
## data: log10(sampling$Protein)
## W = NaN, p-value = NA
shapiro.test(1/sampling$Protein)
##
## Shapiro-Wilk normality test
##
## data: 1/sampling$Protein
## W = NaN, p-value = NA
shapiro.test(1/sampling$Protein)
##
## Shapiro-Wilk normality test
##
## data: 1/sampling$Protein
## W = NaN, p-value = NA
It appears that the data cannot be readily transformed into a normal distribution.
One of the primary assumptions of the analysis of variance is that the data are normally distributed for the variables of interest, and since they are not(and cannot be readily transformed into such), the results from the analysis of variance can potentially be invalid because the theoretical F-distribution may not apply. Therefore, we perform resampling via Monte Carlo simulation (assuming each level of Short_Description follows a normal distribution) to reproduce the theoretical distribution.
summary(aov(Protein~Short_Description, data=ag))
## Df Sum Sq Mean Sq F value Pr(>F)
## Short_Description 24 126136 5256 54.2 <2e-16 ***
## Residuals 8575 830877 97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Assuming the pooled variance of 97 applies to all groups, the theoretical F-distribution is calculated as follows:
with(ag,tapply(Protein,Short_Description,length))
## A B C D E F G H I J K L M N O
## 193 1649 1601 106 90 356 190 72 157 27 296 408 407 43 228
## P Q R S T U V W Y Z
## 969 8 174 867 334 37 184 157 45 2
meanstar = mean(ag$Protein)
sdstar = sqrt(97)
simShort_Description = ag$Short_Description
runs = 1000
Fstar = numeric(runs)
for (i in 1:runs) {
A = rnorm(193, mean = meanstar, sd = sdstar)
B = rnorm(1649, mean = meanstar, sd = sdstar)
C = rnorm(1601, mean = meanstar, sd = sdstar)
D = rnorm(106, mean = meanstar, sd = sdstar)
E = rnorm(90, mean = meanstar, sd = sdstar)
F = rnorm(356, mean = meanstar, sd = sdstar)
G = rnorm(190, mean = meanstar, sd = sdstar)
H = rnorm(72, mean = meanstar, sd = sdstar)
I = rnorm(157, mean = meanstar, sd = sdstar)
J = rnorm(27, mean = meanstar, sd = sdstar)
K = rnorm(296, mean = meanstar, sd = sdstar)
L = rnorm(408, mean = meanstar, sd = sdstar)
M = rnorm(407, mean = meanstar, sd = sdstar)
N = rnorm(43, mean = meanstar, sd = sdstar)
O = rnorm(228, mean = meanstar, sd = sdstar)
P = rnorm(969, mean = meanstar, sd = sdstar)
Q = rnorm(8, mean = meanstar, sd = sdstar)
R = rnorm(174, mean = meanstar, sd = sdstar)
S = rnorm(867, mean = meanstar, sd = sdstar)
T = rnorm(334, mean = meanstar, sd = sdstar)
U = rnorm(37, mean = meanstar, sd = sdstar)
V = rnorm(184, mean = meanstar, sd = sdstar)
W = rnorm(157, mean = meanstar, sd = sdstar)
Y = rnorm(45, mean = meanstar, sd = sdstar)
Z = rnorm(2, mean = meanstar, sd = sdstar)
simProtein = c(A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z)
simdata = data.frame(simProtein,simShort_Description)
Fstar[i] = oneway.test(simProtein~simShort_Description,data=simdata,var.equal=TRUE,)$statistic
}
For each level of Short_Description, a random normal distribution of the Protein content is created with the appropriate number of levels. A dataframe is then created by combining these Protein distributions to the actual short descriptions and a one-way ANOVA is performed on this simulated dataset to determine the theoretical F-distribution after 1,000 repititions.
Then a histogram is created for the simulated F-distribution with DF=24 and DF=8575 for Short_Description and residuals, respectively:
Now the agriculture dataset will be used to get the actual F-distribution. Each level of Short_Description will be centered around a mean difference of zero, but the variance and shape of the distribution for each level of Short_Description will remain the same.
newmeanstar = with(ag, tapply(Protein,Short_Description,mean))
newmeanstar
## A B C D E F G H I J
## 4.113 16.853 9.589 9.635 16.472 10.239 14.447 11.703 5.740 1.644
## K L M N O P Q R S T
## 7.338 17.768 7.938 8.524 6.812 12.025 16.000 7.062 8.248 13.130
## U V W Y Z
## 13.558 15.533 11.922 6.632 10.100
grpA = ag$Protein[ag$Short_Description=="A"] - newmeanstar[1]
grpB = ag$Protein[ag$Short_Description=="B"] - newmeanstar[2]
grpC = ag$Protein[ag$Short_Description=="C"] - newmeanstar[3]
grpD = ag$Protein[ag$Short_Description=="D"] - newmeanstar[4]
grpE = ag$Protein[ag$Short_Description=="E"] - newmeanstar[5]
grpF = ag$Protein[ag$Short_Description=="F"] - newmeanstar[6]
grpG = ag$Protein[ag$Short_Description=="G"] - newmeanstar[7]
grpH = ag$Protein[ag$Short_Description=="H"] - newmeanstar[8]
grpI = ag$Protein[ag$Short_Description=="I"] - newmeanstar[9]
grpJ = ag$Protein[ag$Short_Description=="J"] - newmeanstar[10]
grpK = ag$Protein[ag$Short_Description=="K"] - newmeanstar[11]
grpL = ag$Protein[ag$Short_Description=="L"] - newmeanstar[12]
grpM = ag$Protein[ag$Short_Description=="M"] - newmeanstar[13]
grpN = ag$Protein[ag$Short_Description=="N"] - newmeanstar[14]
grpO = ag$Protein[ag$Short_Description=="O"] - newmeanstar[15]
grpP = ag$Protein[ag$Short_Description=="P"] - newmeanstar[16]
grpQ = ag$Protein[ag$Short_Description=="Q"] - newmeanstar[17]
grpR = ag$Protein[ag$Short_Description=="R"] - newmeanstar[18]
grpS = ag$Protein[ag$Short_Description=="S"] - newmeanstar[19]
grpT = ag$Protein[ag$Short_Description=="T"] - newmeanstar[20]
grpU = ag$Protein[ag$Short_Description=="U"] - newmeanstar[21]
grpV = ag$Protein[ag$Short_Description=="V"] - newmeanstar[22]
grpW = ag$Protein[ag$Short_Description=="W"] - newmeanstar[23]
grpY = ag$Protein[ag$Short_Description=="Y"] - newmeanstar[24]
grpZ = ag$Protein[ag$Short_Description=="Z"] - newmeanstar[25]
newsimShort_Description = ag$Short_Description
newFstar = numeric(runs)
for (i in 1:runs) {
groupA = sample(grpA, size=193, replace=TRUE)
groupB = sample(grpB, size=1649, replace=TRUE)
groupC = sample(grpC, size=1601, replace=TRUE)
groupD = sample(grpD, size=106, replace=TRUE)
groupE = sample(grpE, size=90, replace=TRUE)
groupF = sample(grpF, size=356, replace=TRUE)
groupG = sample(grpG, size=190, replace=TRUE)
groupH = sample(grpH, size=72, replace=TRUE)
groupI = sample(grpI, size=157, replace=TRUE)
groupJ = sample(grpJ, size=27, replace=TRUE)
groupK = sample(grpK, size=296, replace=TRUE)
groupL = sample(grpL, size=408, replace=TRUE)
groupM = sample(grpM, size=407, replace=TRUE)
groupN = sample(grpN, size=43, replace=TRUE)
groupO = sample(grpO, size=228, replace=TRUE)
groupP = sample(grpP, size=969, replace=TRUE)
groupQ = sample(grpQ, size=8, replace=TRUE)
groupR = sample(grpR, size=174, replace=TRUE)
groupS = sample(grpS, size=867, replace=TRUE)
groupT = sample(grpT, size=334, replace=TRUE)
groupU = sample(grpU, size=37, replace=TRUE)
groupV = sample(grpV, size=184, replace=TRUE)
groupW = sample(grpW, size=157, replace=TRUE)
groupY = sample(grpY, size=45, replace=TRUE)
groupZ = sample(grpZ, size=2, replace=TRUE)
newsimProtein = c(groupA,groupB,groupC,groupD,groupE,groupF,groupG,groupH,groupI,groupJ,groupK,groupL,groupM,groupN,groupO,groupP,groupQ,groupR,groupS,groupT,groupU,groupV,groupW,groupY,groupZ)
newsimdata = data.frame(newsimProtein,newsimShort_Description)
newFstar[i] = oneway.test(newsimProtein~newsimShort_Description,var.equal=TRUE,data = newsimdata)$statistic
}
This bootstrapped F-distribution is still based on equal mean Protein quantities between levels of Short_Description, but normality is not assumed.
par(mfrow=c(1,2))
hist(Fstar,prob=TRUE,main="PDF of Theoretical F-distribution")
x = seq(.25,3,.25)
points(x,y=df(x,24,8575), type = "b", col = "red")
hist(newFstar, breaks=seq(0,3,.25), ylim=c(0,1.8), prob=TRUE, main = "PDF of Actual F-distribution")
newx=seq(.025,2.5,.25)
points(newx,y=df(newx,24,8575),type="b",col="red")
This bootstrapped empirical F-distribution is different than the theoretical F-distribution because it doesn’t exactly match the analytic representation quite as well.
qf(.95,24,8575)
## [1] 1.519
quantile(newFstar,.95)
## 95%
## 1.47
newmodel = lm(newsimProtein~newsimShort_Description,data=newsimdata)
anova(newmodel)
## Analysis of Variance Table
##
## Response: newsimProtein
## Df Sum Sq Mean Sq F value Pr(>F)
## newsimShort_Description 24 3090 128.8 1.37 0.11
## Residuals 8575 803698 93.7
The F-statistic for DF=24 and DF=8575 at alpha = .05 is 1.52, but from the bootstrapped F-distribution it is 1.40. By re-running the Analysis of Variance on the resampled F-distribution, we fail to reject the H0 (p>.05) that Short_Description has no effect on the Protein content. Thus, variation in the response (protein) can only be attributed to randomization. In this case, performing ANOVA after resampling via bootstrapping yielded a much different result than the initial ANOVA. Initially, the H0 was easily rejected but after resampling, the H0 could not be rejected.
Diagnostics and Model Adequacy Checking:
par(mfrow=c(2,2))
plot(newmodel)
The model appears to have a decent fit to the data since the points are evenly dispersed across the Cartesian plane of residual error and the fitted model. However, there are outliers toward the higher end that are not explained well by the model.
References:
U.S. Department of Agriculture, Agricultural Research Service. 2014. USDA National Nutrient Database for Standard Reference, Release 27. Nutrient Data Laboratory Home Page, “http://www.ars.usda.gov/ba/bhnrc/ndl”
King, William B. Resampling Techniques. Coastal Carolina University. “http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html”