Data Science Practicum - Week 2 - Confidence + Causality

Magnesium is an important micronutrient, especially for pregnant women. Unfortunately, unless you’re looking at a vitamin supplement, you have no idea how much magnesium is in your food. Until now!

There are lots of data sets on data.gov. Download the USDA food nutrition database, release SR22.
The USDA compiles summaries of nutrition experiments. Lots and lots of them. This data set contains the results of 1.5 million experiments for over seven thousand foods to determine the quantities of over a hundred nutrients. We want to build a model of how much magnesium is in a given food item as a function of its listed nutrition facts and its food category. The results come as several files, but for our purposes the important ones are:

nut.def <- read.table("sr22/NUTR_DEF.txt", sep = "^", quote = "~")
names(nut.def) <- c("nutrient.id", "nutrient.units", "INFOFOODS.tag", "nutrient.description", 
    "nutrient.value.decimal.places", "sort.order")
head(nut.def)
##   nutrient.id nutrient.units INFOFOODS.tag        nutrient.description
## 1         203              g        PROCNT                     Protein
## 2         204              g           FAT           Total lipid (fat)
## 3         205              g        CHOCDF Carbohydrate, by difference
## 4         207              g           ASH                         Ash
## 5         208           kcal    ENERC_KCAL                      Energy
## 6         209              g        STARCH                      Starch
##   nutrient.value.decimal.places sort.order
## 1                             2        600
## 2                             2        800
## 3                             2       1100
## 4                             2       1000
## 5                             0        300
## 6                             2       2200
summary(nut.def)
##   nutrient.id  nutrient.units INFOFOODS.tag nutrient.description
##  Min.   :203   g      :91            :  8   Energy :  2         
##  1st Qu.:322   mg     :29     VITD   :  2   10:0   :  1         
##  Median :511   mcg    :17     ALA_G  :  1   12:0   :  1         
##  Mean   :495   IU     : 2     ALC    :  1   13:0   :  1         
##  3rd Qu.:634   kJ     : 1     ARG_G  :  1   14:0   :  1         
##  Max.   :858   kcal   : 1     (Other):129   14:1   :  1         
##                (Other): 2     NA's   :  1   (Other):136         
##  nutrient.value.decimal.places   sort.order   
##  Min.   :0.00                  Min.   :  100  
##  1st Qu.:1.00                  1st Qu.: 7050  
##  Median :3.00                  Median :10900  
##  Mean   :2.12                  Mean   :10481  
##  3rd Qu.:3.00                  3rd Qu.:14800  
##  Max.   :3.00                  Max.   :18400
food.def <- read.table("sr22/FOOD_DES.txt", sep = "^", quote = "~")
names(food.def) <- c("food.id", "food.group.id", "food.long.description", "food.short.desc", 
    "comon.name", "manufacturer", "complete.profile", "inedible.parts", "percent.inedible", 
    "scientific.name", "nitrogen.protein.factor", "protein.calories.factor", 
    "fat.calories.factor", "carb.calories.factor")
head(food.def)
##   food.id food.group.id      food.long.description
## 1    1001           100             Butter, salted
## 2    1002           100 Butter, whipped, with salt
## 3    1003           100      Butter oil, anhydrous
## 4    1004           100               Cheese, blue
## 5    1005           100              Cheese, brick
## 6    1006           100               Cheese, brie
##            food.short.desc comon.name manufacturer complete.profile
## 1         BUTTER,WITH SALT                                        Y
## 2 BUTTER,WHIPPED,WITH SALT                                        Y
## 3     BUTTER OIL,ANHYDROUS                                        Y
## 4              CHEESE,BLUE                                        Y
## 5             CHEESE,BRICK                                        Y
## 6              CHEESE,BRIE                                        Y
##   inedible.parts percent.inedible scientific.name nitrogen.protein.factor
## 1                               0                                    6.38
## 2                               0                                    6.38
## 3                               0                                    6.38
## 4                               0                                    6.38
## 5                               0                                    6.38
## 6                               0                                    6.38
##   protein.calories.factor fat.calories.factor carb.calories.factor
## 1                    4.27                8.79                 3.87
## 2                    4.27                8.79                 3.87
## 3                    4.27                8.79                 3.87
## 4                    4.27                8.79                 3.87
## 5                    4.27                8.79                 3.87
## 6                    4.27                8.79                 3.87
summary(food.def)
##     food.id      food.group.id 
##  Min.   : 1001   Min.   : 100  
##  1st Qu.: 8128   1st Qu.: 700  
##  Median :13327   Median :1100  
##  Mean   :14265   Mean   :1256  
##  3rd Qu.:18498   3rd Qu.:1700  
##  Max.   :93600   Max.   :3600  
##                                
##                                            food.long.description
##  AMARANTH FLAKES                                      :   1     
##  APPLEBEE'S, 9 oz house sirloin steak                 :   1     
##  APPLEBEE'S, Double Crunch Shrimp                     :   1     
##  APPLEBEE'S, French fries                             :   1     
##  APPLEBEE'S, KRAFT, Macaroni & Cheese, from kid's menu:   1     
##  APPLEBEE'S, chicken fingers, from kids' menu         :   1     
##  (Other)                                              :7533     
##                                                  food.short.desc
##  BABYFOOD,MEAT,BF,STR                                    :   2  
##  CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA         :   2  
##  OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT:   2  
##  POPCORN,OIL-POPPED,LOFAT                                :   2  
##  ABALONE,MIXED SPECIES,RAW                               :   1  
##  ABALONE,MXD SP,CKD,FRIED                                :   1  
##  (Other)                                                 :7529  
##                  comon.name                   manufacturer 
##                       :6905                         :6306  
##  KFC                  :  23   Campbell Soup Co.     : 318  
##  family style         :  18   Kellogg, Co.          : 187  
##  hamburger            :  14   The Quaker Oats, Co.  : 110  
##  buffalo              :  13   McDonald's Corporation:  84  
##  soft drink, pop, soda:  13   General Mills Inc.    :  51  
##  (Other)              : 553   (Other)               : 483  
##  complete.profile                    inedible.parts percent.inedible
##   :4644                                     :6078   Min.   : 0.00   
##  Y:2895           Bone                      : 387   1st Qu.: 0.00   
##                   Bone and connective tissue:  87   Median : 0.00   
##                   Connective tissue         :  62   Mean   : 4.68   
##                   Shells                    :  28   3rd Qu.: 0.00   
##                   Separable fat             :  26   Max.   :99.00   
##                   (Other)                   : 871   NA's   :36      
##                  scientific.name nitrogen.protein.factor
##                          :6826   Min.   :0.0            
##  Phaseolus vulgaris      :  20   1st Qu.:6.2            
##  Struthio camelus        :  18   Median :6.2            
##  Bison bison             :  13   Mean   :5.9            
##  Bos taurus              :  12   3rd Qu.:6.2            
##  Dromaius novaehollandiae:  12   Max.   :6.9            
##  (Other)                 : 638   NA's   :1908           
##  protein.calories.factor fat.calories.factor carb.calories.factor
##  Min.   :0.0             Min.   :0.0         Min.   :0.0         
##  1st Qu.:3.4             1st Qu.:8.4         1st Qu.:3.8         
##  Median :4.0             Median :8.8         Median :3.9         
##  Mean   :3.7             Mean   :8.7         Mean   :3.9         
##  3rd Qu.:4.3             3rd Qu.:9.0         3rd Qu.:4.0         
##  Max.   :4.4             Max.   :9.0         Max.   :4.3         
##  NA's   :2750            NA's   :2654        NA's   :2757
food.group <- read.table("sr22/FD_GROUP.txt", sep = "^", quote = "~")
names(food.group) <- c("food.group.id", "food.group")
food.group
##    food.group.id                        food.group
## 1            100            Dairy and Egg Products
## 2            200                  Spices and Herbs
## 3            300                        Baby Foods
## 4            400                     Fats and Oils
## 5            500                  Poultry Products
## 6            600        Soups, Sauces, and Gravies
## 7            700       Sausages and Luncheon Meats
## 8            800                 Breakfast Cereals
## 9            900           Fruits and Fruit Juices
## 10          1000                     Pork Products
## 11          1100 Vegetables and Vegetable Products
## 12          1200             Nut and Seed Products
## 13          1300                     Beef Products
## 14          1400                         Beverages
## 15          1500    Finfish and Shellfish Products
## 16          1600       Legumes and Legume Products
## 17          1700     Lamb, Veal, and Game Products
## 18          1800                    Baked Products
## 19          1900                            Sweets
## 20          2000           Cereal Grains and Pasta
## 21          2100                        Fast Foods
## 22          2200    Meals, Entrees, and Sidedishes
## 23          2500                            Snacks
## 24          3500                      Ethnic Foods
## 25          3600                  Restaurant Foods
raw.data <- read.table("sr22/NUT_DATA.txt", sep = "^", quote = "~")
names(raw.data) <- c("food.id", "nutrient.id", "amount", "data.points", "stderr", 
    "data.type", "data.derivation", "imputed.value.nid", "added.nutrient", "number.of.studies", 
    "min", "max", "degrees.of.freedom", "errorlower.95%", "error.upper.95%", 
    "stats.comments", "confidence.code")
head(raw.data)
##   food.id nutrient.id amount data.points stderr data.type data.derivation
## 1    1001         203   0.85          16  0.074         1                
## 2    1001         204  81.11         580  0.065         1                
## 3    1001         205   0.06           0     NA         4              NC
## 4    1001         207   2.11          35  0.054         1                
## 5    1001         208 717.00           0     NA         4              NC
## 6    1001         221   0.00           0     NA         7                
##   imputed.value.nid added.nutrient number.of.studies min max
## 1                NA                               NA  NA  NA
## 2                NA                               NA  NA  NA
## 3                NA                               NA  NA  NA
## 4                NA                               NA  NA  NA
## 5                NA                               NA  NA  NA
## 6                NA                               NA  NA  NA
##   degrees.of.freedom errorlower.95% error.upper.95% stats.comments
## 1                 NA             NA              NA               
## 2                 NA             NA              NA               
## 3                 NA             NA              NA               
## 4                 NA             NA              NA               
## 5                 NA             NA              NA               
## 6                 NA             NA              NA               
##   confidence.code
## 1                
## 2                
## 3                
## 4                
## 5                
## 6
summary(raw.data)
##     food.id       nutrient.id      amount        data.points    
##  Min.   : 1001   Min.   :203   Min.   :     0   Min.   :   0.0  
##  1st Qu.: 8218   1st Qu.:309   1st Qu.:     0   1st Qu.:   0.0  
##  Median :12166   Median :421   Median :     0   Median :   0.0  
##  Mean   :14081   Mean   :440   Mean   :    53   Mean   :   2.8  
##  3rd Qu.:18338   3rd Qu.:607   3rd Qu.:     4   3rd Qu.:   1.0  
##  Max.   :93600   Max.   :858   Max.   :100000   Max.   :2526.0  
##                                                                 
##      stderr         data.type     data.derivation  imputed.value.nid
##  Min.   :    0    Min.   : 1.00          :203705   Min.   : 1001    
##  1st Qu.:    0    1st Qu.: 1.00   A      : 90574   1st Qu.: 5641    
##  Median :    0    Median : 1.00   Z      : 45787   Median :10068    
##  Mean   :    4    Mean   : 3.08   NC     : 30317   Mean   :10658    
##  3rd Qu.:    1    3rd Qu.: 4.00   FLA    : 17300   3rd Qu.:14154    
##  Max.   :10927    Max.   :13.00   MC     : 16896   Max.   :44260    
##  NA's   :463511                   (Other):129963   NA's   :513070   
##  added.nutrient number.of.studies      min              max        
##   :530398       Min.   : 0        Min.   :    0    Min.   :    0   
##  N:  1699       1st Qu.: 1        1st Qu.:    0    1st Qu.:    0   
##  Y:  2445       Median : 1        Median :    0    Median :    0   
##                 Mean   : 1        Mean   :   31    Mean   :   54   
##                 3rd Qu.: 1        3rd Qu.:    4    3rd Qu.:    6   
##                 Max.   :22        Max.   :19729    Max.   :56000   
##                 NA's   :483599    NA's   :486210   NA's   :486210  
##  degrees.of.freedom errorlower.95%    error.upper.95%  stats.comments  
##  Min.   :   0       Min.   :-125682   Min.   :     0          :486966  
##  1st Qu.:   2       1st Qu.:      0   1st Qu.:     0   2, 3   : 19795  
##  Median :   3       Median :      0   Median :     1   1      :  9457  
##  Mean   :   5       Mean   :     10   Mean   :    97   1, 2, 3:  8318  
##  3rd Qu.:   5       3rd Qu.:      6   3rd Qu.:    14   4      :  5358  
##  Max.   :1299       Max.   :  27219   Max.   :151995   2      :  3369  
##  NA's   :500126     NA's   :505393    NA's   :505393   (Other):  1279  
##  confidence.code
##   :533173       
##  A:    14       
##  B:    32       
##  C:   799       
##  D:   524

The first column in NUT_DATA.txt is the food id code. The second column is the nutrient id, which corresponds to the NUTR_DEF.txt file. The third column is the quantity of the listed nutrient per 100g of the listed food. Please forgive the bad formatting and give a moment of thanks to the powers that be for even releasing this data set.
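
Because every amount is quoted per 100 g, converting to a real portion is a simple rescaling. Here is a minimal sketch, assuming a hypothetical helper called amount.in.serving (it is not part of the data set or of R) and an assumed 14 g serving. The 2 mg per 100 g figure is the magnesium value this data set lists for food 1001 (Butter, salted).

```r
# Hypothetical helper: scale a per-100 g amount to an arbitrary portion size.
amount.in.serving <- function(amount.per.100g, serving.grams) {
  amount.per.100g * serving.grams / 100
}

# Magnesium in salted butter: 2 mg per 100 g (food.id 1001, nutrient.id 304).
# The 14 g serving is an assumed example value, not something from the data.
butter.mg <- amount.in.serving(2, 14)
butter.mg
```

The units of the result are whatever nutrient.units says for that nutrient, so the same helper works for grams, milligrams, or micrograms.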

If you're wondering where to get all the variable names, notice the file called sr22_doc.pdf that comes with the data. Documentation? Amazing! Starting on page 23, you can find tables discussing what the various columns mean.

# Let's get this data all into one table so we can refer to things by
# name, rather than number
data <- merge(raw.data, nut.def)
data <- merge(data, food.def)
data <- merge(data, food.group)
head(data)
##   food.group.id food.id nutrient.id amount data.points stderr data.type
## 1           100    1001         203  0.850          16  0.074         1
## 2           100    1001         205  0.060           0     NA         4
## 3           100    1001         454  0.300           1     NA         1
## 4           100    1001         619  0.315           0     NA         1
## 5           100    1001         405  0.034           9  0.004         1
## 6           100    1001         502  0.038           0     NA         1
##   data.derivation imputed.value.nid added.nutrient number.of.studies min
## 1                                NA                               NA  NA
## 2              NC                NA                               NA  NA
## 3               A                NA                               NA  NA
## 4              AS                NA                               NA  NA
## 5                                NA                               NA  NA
## 6                                NA                               NA  NA
##   max degrees.of.freedom errorlower.95% error.upper.95% stats.comments
## 1  NA                 NA             NA              NA               
## 2  NA                 NA             NA              NA               
## 3  NA                 NA             NA              NA               
## 4  NA                 NA             NA              NA               
## 5  NA                 NA             NA              NA               
## 6  NA                 NA             NA              NA               
##   confidence.code nutrient.units INFOFOODS.tag        nutrient.description
## 1                              g        PROCNT                     Protein
## 2                              g        CHOCDF Carbohydrate, by difference
## 3                             mg          BETN                     Betaine
## 4                              g         F18D3       18:3 undifferentiated
## 5                             mg          RIBF                  Riboflavin
## 6                              g         THR_G                   Threonine
##   nutrient.value.decimal.places sort.order food.long.description
## 1                             2        600        Butter, salted
## 2                             2       1100        Butter, salted
## 3                             1       7270        Butter, salted
## 4                             3      13900        Butter, salted
## 5                             3       6500        Butter, salted
## 6                             3      16400        Butter, salted
##    food.short.desc comon.name manufacturer complete.profile inedible.parts
## 1 BUTTER,WITH SALT                                        Y               
## 2 BUTTER,WITH SALT                                        Y               
## 3 BUTTER,WITH SALT                                        Y               
## 4 BUTTER,WITH SALT                                        Y               
## 5 BUTTER,WITH SALT                                        Y               
## 6 BUTTER,WITH SALT                                        Y               
##   percent.inedible scientific.name nitrogen.protein.factor
## 1                0                                    6.38
## 2                0                                    6.38
## 3                0                                    6.38
## 4                0                                    6.38
## 5                0                                    6.38
## 6                0                                    6.38
##   protein.calories.factor fat.calories.factor carb.calories.factor
## 1                    4.27                8.79                 3.87
## 2                    4.27                8.79                 3.87
## 3                    4.27                8.79                 3.87
## 4                    4.27                8.79                 3.87
## 5                    4.27                8.79                 3.87
## 6                    4.27                8.79                 3.87
##               food.group
## 1 Dairy and Egg Products
## 2 Dairy and Egg Products
## 3 Dairy and Egg Products
## 4 Dairy and Egg Products
## 5 Dairy and Egg Products
## 6 Dairy and Egg Products
summary(data)
##  food.group.id     food.id       nutrient.id      amount      
##  Min.   : 100   Min.   : 1001   Min.   :203   Min.   :     0  
##  1st Qu.: 800   1st Qu.: 8218   1st Qu.:309   1st Qu.:     0  
##  Median :1100   Median :12166   Median :421   Median :     0  
##  Mean   :1238   Mean   :14081   Mean   :440   Mean   :    53  
##  3rd Qu.:1700   3rd Qu.:18338   3rd Qu.:607   3rd Qu.:     4  
##  Max.   :3600   Max.   :93600   Max.   :858   Max.   :100000  
##                                                               
##   data.points         stderr         data.type     data.derivation 
##  Min.   :   0.0   Min.   :    0    Min.   : 1.00          :203705  
##  1st Qu.:   0.0   1st Qu.:    0    1st Qu.: 1.00   A      : 90574  
##  Median :   0.0   Median :    0    Median : 1.00   Z      : 45787  
##  Mean   :   2.8   Mean   :    4    Mean   : 3.08   NC     : 30317  
##  3rd Qu.:   1.0   3rd Qu.:    1    3rd Qu.: 4.00   FLA    : 17300  
##  Max.   :2526.0   Max.   :10927    Max.   :13.00   MC     : 16896  
##                   NA's   :463511                   (Other):129963  
##  imputed.value.nid added.nutrient number.of.studies      min        
##  Min.   : 1001      :530398       Min.   : 0        Min.   :    0   
##  1st Qu.: 5641     N:  1699       1st Qu.: 1        1st Qu.:    0   
##  Median :10068     Y:  2445       Median : 1        Median :    0   
##  Mean   :10658                    Mean   : 1        Mean   :   31   
##  3rd Qu.:14154                    3rd Qu.: 1        3rd Qu.:    4   
##  Max.   :44260                    Max.   :22        Max.   :19729   
##  NA's   :513070                   NA's   :483599    NA's   :486210  
##       max         degrees.of.freedom errorlower.95%    error.upper.95% 
##  Min.   :    0    Min.   :   0       Min.   :-125682   Min.   :     0  
##  1st Qu.:    0    1st Qu.:   2       1st Qu.:      0   1st Qu.:     0  
##  Median :    0    Median :   3       Median :      0   Median :     1  
##  Mean   :   54    Mean   :   5       Mean   :     10   Mean   :    97  
##  3rd Qu.:    6    3rd Qu.:   5       3rd Qu.:      6   3rd Qu.:    14  
##  Max.   :56000    Max.   :1299       Max.   :  27219   Max.   :151995  
##  NA's   :486210   NA's   :500126     NA's   :505393    NA's   :505393  
##  stats.comments   confidence.code nutrient.units      INFOFOODS.tag   
##         :486966    :533173        g      :287449   VITD      :  8308  
##  2, 3   : 19795   A:    14        mg     :136068             :  8221  
##  1      :  9457   B:    32        mcg    : 72202   CHOCDF    :  7538  
##  1, 2, 3:  8318   C:   799        IU     : 11301   ENERC_KCAL:  7538  
##  4      :  5358   D:   524        kJ     :  7538   ENERC_KJ  :  7538  
##  2      :  3369                   kcal   :  7538   (Other)   :487943  
##  (Other):  1279                   (Other): 12446   NA's      :  7456  
##                   nutrient.description nutrient.value.decimal.places
##  Energy                     : 15076    Min.   :0.00                 
##  Carbohydrate, by difference:  7538    1st Qu.:0.00                 
##  Protein                    :  7538    Median :3.00                 
##  Total lipid (fat)          :  7538    Mean   :1.92                 
##  Water                      :  7534    3rd Qu.:3.00                 
##  Ash                        :  7533    Max.   :3.00                 
##  (Other)                    :481785                                 
##    sort.order   
##  Min.   :  100  
##  1st Qu.: 6100  
##  Median : 7920  
##  Mean   : 9426  
##  3rd Qu.:14000  
##  Max.   :18400  
##                 
##                                                       food.long.description
##  Seeds, pumpkin and squash seed kernels, roasted, with salt added:   131   
##  Seeds, pumpkin and squash seed kernels, roasted, without salt   :   131   
##  Fast foods, chicken tenders                                     :   130   
##  Snacks, bagel chips                                             :   130   
##  Snacks, pita chips, salted                                      :   130   
##  Cereals ready-to-eat, KELLOGG, KELLOGG'S RICE KRISPIES          :   129   
##  (Other)                                                         :533761   
##                                                  food.short.desc  
##  POPCORN,OIL-POPPED,LOFAT                                :   196  
##  OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT:   192  
##  BABYFOOD,MEAT,BF,STR                                    :   177  
##  PUMPKIN&SQUASH SD KRNLS,RSTD,W/SALT                     :   131  
##  PUMPKIN&SQUASH SD KRNLS,RSTD,WO/SALT                    :   131  
##  FAST FOODS,CHICK TENDERS                                :   130  
##  (Other)                                                 :533585  
##         comon.name                     manufacturer    complete.profile
##              :478260                         :480811    :283554        
##  KFC         :  2235   Kellogg, Co.          :  7867   Y:250988        
##  family style:  1767   The Quaker Oats, Co.  :  6418                   
##  hamburger   :  1449   Campbell Soup Co.     :  5655                   
##  buffalo     :  1032   McDonald's Corporation:  3975                   
##  Sweetpotato :  1016   General Mills Inc.    :  3743                   
##  (Other)     : 48783   (Other)               : 26073                   
##                     inedible.parts   percent.inedible
##                            :416219   Min.   : 0.0    
##  Bone                      : 29270   1st Qu.: 0.0    
##  Bone and connective tissue:  8021   Median : 0.0    
##  Connective tissue         :  5192   Mean   : 5.3    
##  Separable fat             :  2441   3rd Qu.: 0.0    
##  Shells                    :  2315   Max.   :99.0    
##  (Other)                   : 71084   NA's   :2828    
##                  scientific.name   nitrogen.protein.factor
##                          :484938   Min.   :0              
##  Phaseolus vulgaris      :  1536   1st Qu.:6              
##  Struthio camelus        :  1389   Median :6              
##  Dromaius novaehollandiae:  1036   Mean   :6              
##  Bison bison             :  1032   3rd Qu.:6              
##  Agaricus bisporus       :  1027   Max.   :7              
##  (Other)                 : 43584   NA's   :120117         
##  protein.calories.factor fat.calories.factor carb.calories.factor
##  Min.   :0               Min.   :0           Min.   :0           
##  1st Qu.:3               1st Qu.:8           1st Qu.:4           
##  Median :4               Median :9           Median :4           
##  Mean   :4               Mean   :9           Mean   :4           
##  3rd Qu.:4               3rd Qu.:9           3rd Qu.:4           
##  Max.   :4               Max.   :9           Max.   :4           
##  NA's   :166426          NA's   :158354      NA's   :166763      
##                              food.group    
##  Vegetables and Vegetable Products: 60447  
##  Beef Products                    : 43697  
##  Baked Products                   : 36192  
##  Pork Products                    : 30940  
##  Poultry Products                 : 29542  
##  Breakfast Cereals                : 27665  
##  (Other)                          :306059
cat("total data in memory:", object.size(data)/2^20, " MB\n")
## total data in memory: 95.15  MB
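
A note on why the three merge() calls above work without naming a key: with no by argument, merge() joins on whatever column names the two data frames share, and by default it keeps only rows that match in both. The sketch below illustrates this on toy tables. The toy.* names are made up for illustration, and the values are the butter and blue cheese figures from this data set.

```r
# Toy stand-ins for NUT_DATA, NUTR_DEF and FOOD_DES (a few rows each)
toy.raw <- data.frame(food.id     = c(1001, 1001, 1004, 1004),
                      nutrient.id = c(203, 304, 203, 304),
                      amount      = c(0.85, 2, 21.40, 23))
toy.nut <- data.frame(nutrient.id = c(203, 304),
                      nutrient.description = c("Protein", "Magnesium, Mg"))
toy.food <- data.frame(food.id = c(1001, 1004),
                       food.long.description = c("Butter, salted", "Cheese, blue"))

# The implicit join key is the set of shared column names
shared.key <- intersect(names(toy.raw), names(toy.nut))
shared.key

toy <- merge(toy.raw, toy.nut)   # joins on nutrient.id
toy <- merge(toy, toy.food)      # joins on food.id

# merge() is an inner join by default, so an unchanged row count is a
# cheap check that no measurements were silently dropped
stopifnot(nrow(toy) == nrow(toy.raw))
toy
```

If the row count had shrunk, rerunning with merge(..., all.x = TRUE) would keep the unmatched measurements and fill the missing labels with NA.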

Part 1

Reformat the data set so that we have one row per food item, with columns for:

# How many mg magnesium could we expect in foods in general?
magnesium <- subset(data, nutrient.description == "Magnesium, Mg")  # or nutrient.id == 304, per the docs
hist(magnesium$amount)

[Figure: histogram of magnesium$amount]
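
A histogram answers the "in general" question by eye. One way to put numbers on it is a mean with a confidence interval. This is only a sketch: it uses the six magnesium values (mg per 100 g) that appear for the first six dairy foods in this data set, which is far too small and too narrow a sample to speak for foods in general. The names mg.sample and ci are made up for illustration; with the real data you would pass magnesium$amount instead.

```r
# Magnesium amounts (mg per 100 g) for the first six foods, butters and
# cheeses. A tiny illustrative sample, not a representative one.
mg.sample <- c(2, 2, 0, 23, 24, 20)

# Point estimate of the typical amount
mean(mg.sample)

# 95% confidence interval for the mean from a one-sample t-test.
# This treats the values as independent draws from one population,
# which is an assumption, not something the data guarantees.
ci <- t.test(mg.sample)$conf.int
ci
```

With only six foods the interval is wide, which is the point: the width shrinks as more foods are included.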

library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
## 
## rename, round_any
# One measurement per nutrient per column, NA for missing
pivoted <- cast(data, food.id ~ nutrient.id, value = "amount")

# Give columns nice names
map <- unique(data.frame(data$nutrient.id, data$nutrient.description))
names(pivoted) <- c("food.id", as.character(map[order(map$data.nutrient.id), 
    ]$data.nutrient.description))
names(pivoted) <- gsub("[^a-zA-Z0-9]", ".", names(pivoted))
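
The renaming above relies on the pivoted columns and the sorted map lining up position by position. As a hedged alternative, the sketch below does the same long-to-wide reshape in base R on a toy table and looks each column name up by nutrient id with match(), so column order does not matter. The toy.* names are made up for illustration; the values are the butter and blue cheese figures from this data set.

```r
# Long format: one row per (food, nutrient) measurement
toy.long <- data.frame(food.id     = c(1001, 1001, 1004, 1004),
                       nutrient.id = c(203, 304, 203, 304),
                       amount      = c(0.85, 2, 21.40, 23))

# Name lookup table, deliberately NOT sorted by nutrient.id
toy.names <- data.frame(nutrient.id = c(304, 203),
                        nutrient.description = c("Magnesium, Mg", "Protein"),
                        stringsAsFactors = FALSE)

# Wide format: foods in rows, nutrients in columns, NA where a
# food/nutrient combination was never measured
wide <- tapply(toy.long$amount,
               list(toy.long$food.id, toy.long$nutrient.id),
               FUN = function(x) x[1])

# Rename columns by id lookup rather than by position
ids <- as.numeric(colnames(wide))
colnames(wide) <- as.character(
  toy.names$nutrient.description[match(ids, toy.names$nutrient.id)])
wide
```

The same match() lookup could be applied to the columns of pivoted, which would protect the labels if the column order ever differed from the sorted map.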

head(pivoted)
##   food.id Protein Total.lipid..fat. Carbohydrate..by.difference  Ash
## 1    1001    0.85             81.11                        0.06 2.11
## 2    1002    0.85             81.11                        0.06 2.11
## 3    1003    0.28             99.48                        0.00 0.00
## 4    1004   21.40             28.74                        2.34 5.11
## 5    1005   23.24             29.68                        2.79 3.18
## 6    1006   20.75             27.68                        0.45 2.70
##   Energy Starch Sucrose Glucose..dextrose. Fructose Lactose Maltose
## 1    717     NA      NA                 NA       NA      NA      NA
## 2    717     NA      NA                 NA       NA      NA      NA
## 3    876     NA      NA                 NA       NA      NA      NA
## 4    353     NA      NA                 NA       NA      NA      NA
## 5    371     NA      NA                 NA       NA      NA      NA
## 6    334     NA      NA                 NA       NA      NA      NA
##   Alcohol..ethyl Water Adjusted.Protein Caffeine Theobromine Energy
## 1              0 15.87               NA        0           0   2999
## 2              0 15.87               NA        0           0   2999
## 3              0  0.24               NA        0           0   3664
## 4              0 42.41               NA        0           0   1477
## 5              0 41.11               NA        0           0   1552
## 6              0 48.42               NA        0           0   1396
##   Sugars..total Galactose Fiber..total.dietary Calcium..Ca Iron..Fe
## 1          0.06        NA                    0          24     0.02
## 2          0.06        NA                    0          24     0.16
## 3          0.00        NA                    0           4     0.00
## 4          0.50        NA                    0         528     0.31
## 5          0.51        NA                    0         674     0.43
## 6          0.45        NA                    0         184     0.50
##   Magnesium..Mg Phosphorus..P Potassium..K Sodium..Na Zinc..Zn Copper..Cu
## 1             2            24           24        576     0.09      0.000
## 2             2            23           26        827     0.05      0.016
## 3             0             3            5          2     0.01      0.001
## 4            23           387          256       1395     2.66      0.040
## 5            24           451          136        560     2.60      0.024
## 6            20           188          152        629     2.38      0.019
##   Fluoride..F Manganese..Mn Selenium..Se Vitamin.A..IU Retinol
## 1         2.8         0.000          1.0          2499     671
## 2         2.8         0.004          1.0          2499     671
## 3          NA         0.000          0.0          3069     824
## 4          NA         0.009         14.5           763     192
## 5          NA         0.012         14.5          1080     286
## 6          NA         0.034         14.5           592     173
##   Vitamin.A..RAE Carotene..beta Carotene..alpha
## 1            684            158               0
## 2            684            158               0
## 3            840            193               0
## 4            198             74               0
## 5            292             76               0
## 6            174              9               0
##   Vitamin.E..alpha.tocopherol. Vitamin.D Vitamin.D2..ergocalciferol.
## 1                         2.32        60                          NA
## 2                         2.32        60                          NA
## 3                         2.80        73                          NA
## 4                         0.25        21                          NA
## 5                         0.26        22                          NA
## 6                         0.24        20                          NA
##   Vitamin.D3..cholecalciferol. Vitamin.D..D2...D3. Cryptoxanthin..beta
## 1                          1.5                 1.5                   0
## 2                          1.5                 1.5                   0
## 3                          1.8                 1.8                   0
## 4                          0.5                 0.5                   0
## 5                          0.5                 0.5                   0
## 6                          0.5                 0.5                   0
##   Lycopene Lutein...zeaxanthin Tocopherol..beta Tocopherol..gamma
## 1        0                   0                0                 0
## 2        0                   0               NA                NA
## 3        0                   0               NA                NA
## 4        0                   0               NA                NA
## 5        0                   0               NA                NA
## 6        0                   0               NA                NA
##   Tocopherol..delta Vitamin.C..total.ascorbic.acid Thiamin Riboflavin
## 1                 0                              0   0.005      0.034
## 2                NA                              0   0.005      0.034
## 3                NA                              0   0.001      0.005
## 4                NA                              0   0.029      0.382
## 5                NA                              0   0.014      0.351
## 6                NA                              0   0.070      0.520
##   Niacin Pantothenic.acid Vitamin.B.6 Folate..total Vitamin.B.12
## 1  0.042            0.110       0.003             3         0.17
## 2  0.042            0.110       0.003             3         0.13
## 3  0.003            0.010       0.001             0         0.01
## 4  1.016            1.729       0.166            36         1.22
## 5  0.118            0.288       0.065            20         1.26
## 6  0.380            0.690       0.235            65         1.65
##   Choline..total Vitamin.K..phylloquinone. Folic.acid Folate..food
## 1           18.8                       7.0          0            3
## 2           18.8                       7.0          0            3
## 3           22.3                       8.6          0            0
## 4           15.4                       2.4          0           36
## 5           15.4                       2.5          0           20
## 6           15.4                       2.3          0           65
##   Folate..DFE Betaine Tryptophan Threonine Isoleucine Leucine Lysine
## 1           3     0.3      0.012     0.038      0.051   0.083  0.067
## 2           3     0.3      0.012     0.038      0.051   0.083  0.067
## 3           0      NA      0.004     0.013      0.017   0.027  0.022
## 4          36      NA      0.312     0.785      1.124   1.919  1.852
## 5          20      NA      0.324     0.882      1.137   2.244  2.124
## 6          65      NA      0.322     0.751      1.015   1.929  1.851
##   Methionine Cystine Phenylalanine Tyrosine Valine Arginine Histidine
## 1      0.021   0.008         0.041    0.041  0.057    0.031     0.023
## 2      0.021   0.008         0.041    0.041  0.057    0.031     0.023
## 3      0.007   0.003         0.014    0.014  0.019    0.010     0.008
## 4      0.584   0.107         1.087    1.295  1.556    0.711     0.758
## 5      0.565   0.131         1.231    1.115  1.472    0.874     0.823
## 6      0.592   0.114         1.158    1.200  1.340    0.735     0.716
##   Alanine Aspartic.acid Glutamic.acid Glycine Proline Serine
## 1   0.029         0.064         0.178   0.018   0.082  0.046
## 2   0.029         0.064         0.178   0.018   0.082  0.046
## 3   0.010         0.021         0.059   0.006   0.027  0.015
## 4   0.644         1.436         5.179   0.406   2.100  1.120
## 5   0.670         1.588         5.515   0.437   2.575  1.289
## 6   0.859         1.350         4.387   0.397   2.459  1.168
##   Hydroxyproline Vitamin.E..added Vitamin.B.12..added Cholesterol
## 1             NA                0                   0         215
## 2             NA                0                   0         219
## 3             NA                0                   0         256
## 4             NA                0                   0          75
## 5             NA                0                   0          94
## 6             NA                0                   0         100
##   Fatty.acids..total.trans Fatty.acids..total.saturated   4.0   6.0   8.0
## 1                       NA                        51.37 3.226 2.007 1.190
## 2                       NA                        50.49 2.630 1.557 0.906
## 3                       NA                        61.92 3.226 1.910 1.112
## 4                       NA                        18.67 0.658 0.361 0.247
## 5                       NA                        18.76 0.914 0.373 0.299
## 6                       NA                        17.41 0.564 0.323 0.297
##    10.0  12.0   14.0   16.0   18.0  20.0 18.1.undifferentiated
## 1 2.529 2.587  7.436 21.697  9.999 0.138                19.961
## 2 2.034 2.277  8.157 21.334  9.829    NA                20.405
## 3 2.495 2.793 10.005 26.166 12.056    NA                25.026
## 4 0.601 0.491  3.301  9.153  3.235    NA                 6.622
## 5 0.585 0.482  3.227  8.655  3.455    NA                 7.401
## 6 0.673 0.504  3.065  8.246  2.880    NA                 6.563
##   18.2.undifferentiated 18.3.undifferentiated 20.4.undifferentiated
## 1                 2.728                 0.315                     0
## 2                 1.832                 1.180                     0
## 3                 2.247                 1.447                     0
## 4                 0.536                 0.264                     0
## 5                 0.491                 0.293                     0
## 6                 0.513                 0.313                     0
##   22.6.n.3..DHA. 22.0 14.1 16.1.undifferentiated 18.4 20.1 20.5.n.3..EPA.
## 1              0   NA   NA                 0.961    0  0.1              0
## 2              0   NA   NA                 1.816    0  0.0              0
## 3              0   NA   NA                 2.228    0  0.0              0
## 4              0   NA   NA                 0.816    0  0.0              0
## 5              0   NA   NA                 0.817    0  0.0              0
## 6              0   NA   NA                 1.007    0  0.0              0
##   22.1.undifferentiated 22.5.n.3..DPA. Phytosterols Stigmasterol
## 1                     0              0           NA            0
## 2                     0              0           NA           NA
## 3                     0              0           NA           NA
## 4                     0              0           NA           NA
## 5                     0              0           NA           NA
## 6                     0              0           NA           NA
##   Campesterol Beta.sitosterol Fatty.acids..total.monounsaturated
## 1           0               4                             21.021
## 2          NA              NA                             23.426
## 3          NA              NA                             28.732
## 4          NA              NA                              7.778
## 5          NA              NA                              8.598
## 6          NA              NA                              8.013
##   Fatty.acids..total.polyunsaturated 15.0 17.0 24.0 16.1.t 18.1.t 22.1.t
## 1                              3.043   NA 0.56   NA     NA  2.982     NA
## 2                              3.012   NA   NA   NA     NA     NA     NA
## 3                              3.694   NA   NA   NA     NA     NA     NA
## 4                              0.800   NA   NA   NA     NA     NA     NA
## 5                              0.784   NA   NA   NA     NA     NA     NA
## 6                              0.826   NA   NA   NA     NA     NA     NA
##   18.2.t.not.further.defined 18.2.i 18.2.t.t 18.2.CLAs 24.1.c 20.2.n.6.c.c
## 1                         NA  0.296       NA     0.267     NA           NA
## 2                         NA     NA       NA        NA     NA           NA
## 3                         NA     NA       NA        NA     NA           NA
## 4                         NA     NA       NA        NA     NA           NA
## 5                         NA     NA       NA        NA     NA           NA
## 6                         NA     NA       NA        NA     NA           NA
##   16.1.c 18.1.c 18.2.n.6.c.c 22.1.c 18.3.n.6.c.c.c 17.1
## 1  0.961  16.98        2.166     NA             NA   NA
## 2     NA     NA           NA     NA             NA   NA
## 3     NA     NA           NA     NA             NA   NA
## 4     NA     NA           NA     NA             NA   NA
## 5     NA     NA           NA     NA             NA   NA
## 6     NA     NA           NA     NA             NA   NA
##   20.3.undifferentiated Fatty.acids..total.trans.monoenoic
## 1                    NA                                 NA
## 2                    NA                                 NA
## 3                    NA                                 NA
## 4                    NA                                 NA
## 5                    NA                                 NA
## 6                    NA                                 NA
##   Fatty.acids..total.trans.polyenoic 13.0 15.1 18.3.n.3.c.c.c..ALA.
## 1                                 NA   NA   NA                0.315
## 2                                 NA   NA   NA                   NA
## 3                                 NA   NA   NA                   NA
## 4                                 NA   NA   NA                   NA
## 5                                 NA   NA   NA                   NA
## 6                                 NA   NA   NA                   NA
##   20.3.n.3 20.3.n.6 20.4.n.6 18.3i 21.5 22.4
## 1       NA       NA       NA    NA   NA   NA
## 2       NA       NA       NA    NA   NA   NA
## 3       NA       NA       NA    NA   NA   NA
## 4       NA       NA       NA    NA   NA   NA
## 5       NA       NA       NA    NA   NA   NA
## 6       NA       NA       NA    NA   NA   NA
# This is R, so we just leave the variable in

Part 2

Start to look at pairwise scatterplots of magnesium versus different features. Do any stand out to you? Are there certain things that seem to come along often with magnesium?

corr <- as.data.frame(cor(pivoted$Magnesium..Mg, pivoted[, names(pivoted)[2:ncol(pivoted)]], 
    use = "pairwise.complete.obs"))
names(corr) <- names(pivoted)[2:ncol(pivoted)]
sorted <- t(sort(corr))
sorted
##                                          [,1]
## Water                              -0.4329084
## 18.2.i                             -0.3286633
## Campesterol                        -0.2042567
## 20.3.n.6                           -0.1978139
## 18.2.t.not.further.defined         -0.1655530
## 18.2.CLAs                          -0.1540811
## Hydroxyproline                     -0.1496551
## Stigmasterol                       -0.1414723
## 18.2.t.t                           -0.1404232
## Fatty.acids..total.trans.polyenoic -0.1351530
## 16.1.t                             -0.1340378
## Fatty.acids..total.trans.monoenoic -0.1220644
## Fluoride..F                        -0.1205641
## 18.1.t                             -0.1202294
## Fatty.acids..total.trans           -0.1125682
## 21.5                               -0.1105754
## 16.1.c                             -0.1075911
## 15.0                               -0.1064847
## 17.0                               -0.1005923
## 14.1                               -0.0989674
## 22.4                               -0.0931966
## 18.3.n.3.c.c.c..ALA.               -0.0861194
## Cholesterol                        -0.0802892
## 20.4.undifferentiated              -0.0719026
## 13.0                               -0.0678139
## 16.1.undifferentiated              -0.0656362
## 20.2.n.6.c.c                       -0.0655167
## 18.3.n.6.c.c.c                     -0.0610707
## Tocopherol..delta                  -0.0509159
## Alcohol..ethyl                     -0.0433235
## 17.1                               -0.0414351
## 20.3.n.3                           -0.0410460
## Glucose..dextrose.                 -0.0365333
## 6.0                                -0.0346117
## Lactose                            -0.0319846
## 14.0                               -0.0316003
## 18.0                               -0.0287940
## 4.0                                -0.0280181
## Fatty.acids..total.saturated       -0.0271370
## 18.4                               -0.0194737
## 22.5.n.3..DPA.                     -0.0191399
## 16.0                               -0.0160485
## Vitamin.D                          -0.0150318
## Vitamin.D..D2...D3.                -0.0149966
## 10.0                               -0.0114870
## 20.5.n.3..EPA.                     -0.0108475
## 12.0                               -0.0103500
## Sodium..Na                         -0.0088947
## Fructose                           -0.0067045
## 22.6.n.3..DHA.                     -0.0041905
## Vitamin.B.12                       -0.0015001
## Carotene..alpha                     0.0005921
## 8.0                                 0.0012007
## 22.1.undifferentiated               0.0032097
## Retinol                             0.0036551
## 18.1.c                              0.0060274
## 15.1                                0.0061960
## Lycopene                            0.0066968
## 22.1.c                              0.0101937
## Vitamin.D2..ergocalciferol.         0.0103552
## 18.3i                               0.0113164
## 20.3.undifferentiated               0.0147643
## 20.1                                0.0169916
## Vitamin.A..RAE                      0.0186488
## Phytosterols                        0.0283239
## Lysine                              0.0338393
## Cryptoxanthin..beta                 0.0412096
## 18.3.undifferentiated               0.0503513
## Starch                              0.0516952
## Lutein...zeaxanthin                 0.0523307
## Beta.sitosterol                     0.0562936
## 18.2.n.6.c.c                        0.0614297
## Carotene..beta                      0.0622167
## Maltose                             0.0646609
## Fatty.acids..total.monounsaturated  0.0668999
## Methionine                          0.0672961
## Total.lipid..fat.                   0.0749518
## Sugars..total                       0.0777664
## 18.1.undifferentiated               0.0799733
## Vitamin.A..IU                       0.0804608
## Sucrose                             0.0821277
## Galactose                           0.0899923
## Choline..total                      0.0976150
## 22.1.t                              0.1039422
## Vitamin.C..total.ascorbic.acid      0.1097878
## Alanine                             0.1142263
## 22.0                                0.1142871
## Folic.acid                          0.1234879
## Selenium..Se                        0.1249311
## Histidine                           0.1256993
## Glycine                             0.1273785
## Vitamin.D3..cholecalciferol.        0.1298490
## Vitamin.B.12..added                 0.1418295
## Vitamin.E..added                    0.1440923
## Caffeine                            0.1465752
## Isoleucine                          0.1558778
## Threonine                           0.1571095
## Fatty.acids..total.polyunsaturated  0.1588344
## 24.0                                0.1609758
## 24.1.c                              0.1625686
## 18.2.undifferentiated               0.1700506
## Proline                             0.1738391
## Leucine                             0.1766833
## Tocopherol..beta                    0.1807530
## Tocopherol..gamma                   0.1812013
## Vitamin.K..phylloquinone.           0.1829710
## 20.0                                0.1842858
## Tyrosine                            0.1946465
## Valine                              0.1965024
## Betaine                             0.1999070
## Zinc..Zn                            0.2175297
## Vitamin.E..alpha.tocopherol.        0.2223973
## Riboflavin                          0.2247110
## Pantothenic.acid                    0.2270824
## Protein                             0.2295670
## Folate..DFE                         0.2331226
## Niacin                              0.2544632
## Aspartic.acid                       0.2588573
## Ash                                 0.2721394
## Energy                              0.2782127
## Energy.1                            0.2782127
## Serine                              0.2816614
## Thiamin                             0.2817144
## Theobromine                         0.2874173
## Vitamin.B.6                         0.2878268
## Folate..total                       0.2889119
## Phenylalanine                       0.2900741
## Calcium..Ca                         0.2920401
## Tryptophan                          0.3065870
## Glutamic.acid                       0.3117075
## Cystine                             0.3206655
## Carbohydrate..by.difference         0.3223046
## Manganese..Mn                       0.3244848
## Folate..food                        0.3628731
## Copper..Cu                          0.3629089
## Arginine                            0.3694273
## Iron..Fe                            0.3918252
## Phosphorus..P                       0.4270172
## Potassium..K                        0.5108993
## Fiber..total.dietary                0.6323530
## 20.4.n.6                            0.7459222
## Adjusted.Protein                    1.0000000
## Magnesium..Mg                       1.0000000
plot(Magnesium..Mg ~ Adjusted.Protein, pivoted)

plot of chunk unnamed-chunk-10

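Well, that looks like no data. A correlation of exactly 1 usually just means that only a couple of foods have both values recorded, so don't read anything into it.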
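With use = "pairwise.complete.obs", each correlation is computed only from the rows where both columns are present, and with just two such rows the correlation is +1 or -1 by construction. Here is a minimal sketch of how you might check this, on a made-up data frame (toy, mg, sparse and dense are hypothetical names, not columns of pivoted):

```r
# Toy data: 'sparse' overlaps with 'mg' on only two rows.
toy <- data.frame(
  mg     = c(10, 25, 40, 12, 33, 8),
  sparse = c(NA, 1.2, NA, NA, 3.4, NA),
  dense  = c(1.0, 2.1, 4.2, 0.9, 2.5, 1.4)
)

# Number of rows where both the feature and mg are observed
n.pairs <- sapply(toy[, c("sparse", "dense")],
                  function(col) sum(!is.na(col) & !is.na(toy$mg)))

# Pairwise correlations, using the same 'use' setting as the table above
r <- cor(toy$mg, toy[, c("sparse", "dense")], use = "pairwise.complete.obs")

# Side by side: how many pairs each correlation is based on
data.frame(n.pairs = n.pairs, r = as.vector(r))
```

Running the same count on pivoted before trusting the entries at either end of the sorted table would show which correlations rest on almost no data.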

plot(pivoted$Magnesium..Mg, pivoted[, "20.4.n.6"])

plot of chunk unnamed-chunk-11

And this one's crap.

plot(Magnesium..Mg ~ Fiber..total.dietary, pivoted)

plot of chunk unnamed-chunk-12

…better.

plot(Magnesium..Mg ~ Potassium..K, pivoted)

plot of chunk unnamed-chunk-13

Not bad.

plot(Magnesium..Mg ~ Carbohydrate..by.difference, pivoted)

plot of chunk unnamed-chunk-14

Let's get serious about this:

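# Every feature against magnesium in one faceted plot (left commented out):
# library(ggplot2)
# library(reshape2)  # provides melt()
# ggplot(melt(pivoted, id.vars = c('food.id', 'Magnesium..Mg'))) +
#     aes(Magnesium..Mg, value) + geom_point() + facet_wrap(~variable)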

The model you’re going to create is basically one that fits the best line through all of these scatterplots at the same time, which is why it’s called a linear model. It says that the average outcome follows the equation for a line, y = mx + b. (That’s in two dimensions anyway: one dimension for the input and one for the output. A “line” with two dimensions of input has the equation y = mx + nw + b, which actually describes a plane.)
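To make the two-input case concrete, here is a small sketch on simulated data (not the USDA data). The true plane y = 2x + 3w + 1 and all the variable names are assumptions chosen for illustration; the point is only that lm() recovers m, n and b from noisy observations.

```r
set.seed(1)  # make the simulated noise reproducible

n.obs <- 200
x <- runif(n.obs, 0, 10)   # first input
w <- runif(n.obs, 0, 10)   # second input

# True plane: y = 2x + 3w + 1, plus a little noise
y <- 2 * x + 3 * w + 1 + rnorm(n.obs, sd = 0.5)

# Ordinary least squares fit of y on both inputs
plane <- lm(y ~ x + w)

# Estimates of b (the intercept), m (on x) and n (on w)
coef(plane)
```

With more inputs the picture stops being drawable, but the formula just gains one term per feature.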

# Pivoted label data
data <- read.csv("real_stuff.csv")

# Drop columns we already eliminated in Python
data$Vitamin.A..IU <- NULL
data$Fatty.acids..total.monounsaturated <- NULL
data$Fatty.acids..total.polyunsaturated <- NULL
data$Fatty.acids..total.trans.monoenoic <- NULL
data$Fatty.acids..total.trans.polyenoic <- NULL

# Load in food group data to use as indicator variables
id.group <- read.csv("id_group.csv")
merged <- merge(id.group, data)

Question 3

Now fit a linear model using the ordinary least squares algorithm. Recall that this model will tell you how much the average quantity of magnesium (in mg) changes for each one-unit increase in a feature, with the other features held fixed. So if the coefficient for fat is 2.1 (I have no idea what it should be), then you should expect an additional 2.1 mg of magnesium for every gram of fat. And if the coefficient for dairy is 5.5, then you should expect an additional 5.5 mg of magnesium on average just for being a dairy product.
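A coefficient "for being a dairy product" comes from an indicator (0/1) variable. In R, lm() builds these automatically from a factor, leaving out the first level as a baseline that is absorbed into the intercept, so every group coefficient is read relative to that baseline. A minimal sketch on a made-up data set (the foods data frame and its values are hypothetical, purely for illustration):

```r
# Hypothetical mini data set: three food groups, one numeric feature
foods <- data.frame(
  group = factor(c("Dairy", "Dairy", "Nuts", "Nuts", "Sweets", "Sweets")),
  fat   = c(3, 30, 50, 60, 1, 10),
  mg    = c(11, 25, 270, 160, 5, 30)
)

# The design matrix lm() builds: one 0/1 column per group except the
# first level ("Dairy" here), which is absorbed into the intercept
X <- model.matrix(mg ~ group + fat, foods)
colnames(X)

# Each group coefficient is the average shift relative to the baseline
# group at the same fat content
toy.fit <- lm(mg ~ group + fat, foods)
coef(toy.fit)
```

This is why a regression summary with a food.group factor lists one row per group except one: the missing group is the reference the others are compared against.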

You shouldn’t need to code up OLS; there are plenty of good OLS implementations in Python (StatsModels, ols.py and scikit-learn are all excellent choices), or you could load this data set into R and work there. Before you do that though, you should read ahead a bit. Also make sure you fit the regression without normalization and with an intercept, if those are options you can set.
How many mg is your model off by on average?
Let’s say you want a model that you can hold in your head. How good is each of the coefficients? We may be able to eliminate some of them by keeping the ones that are most strongly positive or negative and whose confidence intervals seem unlikely to include zero (that is, no effect).
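Both questions can be answered directly from a fitted lm object. The sketch below uses R's built-in mtcars data only so that it runs on its own; demo.fit and its formula are assumptions for illustration, and the same calls should apply unchanged to a magnesium model.

```r
# Built-in data, used only so the sketch runs on its own
demo.fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# "Off by on average": mean absolute error, in units of the outcome
mae <- mean(abs(residuals(demo.fit)))

# Root mean squared error penalises big misses more heavily
rmse <- sqrt(mean(residuals(demo.fit)^2))
c(MAE = mae, RMSE = rmse)

# 95% confidence intervals for each coefficient
ci <- confint(demo.fit, level = 0.95)

# Flag coefficients whose interval contains zero: candidates for removal
cbind(ci, includes.zero = ci[, 1] < 0 & ci[, 2] > 0)
```

The residual standard error printed by summary() is close to the RMSE here, but it divides by the residual degrees of freedom rather than the number of rows, so it comes out slightly larger.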

# No food group data
fit.no.group <- lm(Magnesium..Mg ~ . - food.id, data)
summary(fit.no.group)
## 
## Call:
## lm(formula = Magnesium..Mg ~ . - food.id, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -276.8  -16.4   -4.1    8.9  470.6 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.42e+01   4.90e+00   -4.94  9.6e-07 ***
## Protein                         2.39e+00   2.06e-01   11.62  < 2e-16 ***
## Total.lipid..fat.               8.45e-01   1.20e-01    7.04  4.1e-12 ***
## Carbohydrate..by.difference     3.15e-01   9.93e-02    3.18   0.0016 ** 
## Sugars..total                   2.04e-01   1.51e-01    1.36   0.1754    
## Fiber..total.dietary            5.67e+00   3.75e-01   15.14  < 2e-16 ***
## Calcium..Ca                     2.33e-02   8.50e-03    2.74   0.0063 ** 
## Iron..Fe                       -1.35e-01   2.57e-01   -0.53   0.5980    
## Sodium..Na                     -9.25e-03   3.75e-03   -2.47   0.0138 *  
## Vitamin.A..RAE                 -1.03e-03   9.23e-04   -1.12   0.2642    
## Vitamin.C..total.ascorbic.acid  3.84e-01   6.48e-02    5.93  4.6e-09 ***
## Cholesterol                    -1.55e-02   9.57e-03   -1.62   0.1050    
## Fatty.acids..total.trans       -1.64e+00   5.35e-01   -3.06   0.0023 ** 
## Fatty.acids..total.saturated   -7.24e-01   2.20e-01   -3.29   0.0011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 42.6 on 791 degrees of freedom
##   (6733 observations deleted due to missingness)
## Multiple R-squared: 0.507,   Adjusted R-squared: 0.499 
## F-statistic: 62.5 on 13 and 791 DF,  p-value: <2e-16
fit.no.group2 <- lm(Magnesium..Mg ~ . - food.id - Sugars..total - Iron..Fe - 
    Vitamin.A..RAE - Cholesterol, data)
summary(fit.no.group2)
## 
## Call:
## lm(formula = Magnesium..Mg ~ . - food.id - Sugars..total - Iron..Fe - 
##     Vitamin.A..RAE - Cholesterol, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -274.8  -16.2   -3.6    9.4  472.5 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -24.73264    4.83921   -5.11  4.0e-07 ***
## Protein                          2.30238    0.20319   11.33  < 2e-16 ***
## Total.lipid..fat.                0.84787    0.12003    7.06  3.5e-12 ***
## Carbohydrate..by.difference      0.39812    0.06345    6.27  5.8e-10 ***
## Fiber..total.dietary             5.52282    0.35172   15.70  < 2e-16 ***
## Calcium..Ca                      0.02278    0.00825    2.76   0.0059 ** 
## Sodium..Na                      -0.00923    0.00370   -2.49   0.0128 *  
## Vitamin.C..total.ascorbic.acid   0.36746    0.06087    6.04  2.4e-09 ***
## Fatty.acids..total.trans        -1.67752    0.53500   -3.14   0.0018 ** 
## Fatty.acids..total.saturated    -0.68258    0.21922   -3.11   0.0019 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 42.7 on 795 degrees of freedom
##   (6733 observations deleted due to missingness)
## Multiple R-squared: 0.502,   Adjusted R-squared: 0.496 
## F-statistic:   89 on 9 and 795 DF,  p-value: <2e-16
# With food group data
fit <- lm(Magnesium..Mg ~ . - X - food.id, merged)
summary(fit)
## 
## Call:
## lm(formula = Magnesium..Mg ~ . - X - food.id, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -220.84  -12.79   -0.04   10.50  285.92 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                  6.51e+00   1.50e+01    0.43
## food.groupBaked Products                     1.14e+01   1.51e+01    0.75
## food.groupBeef Products                     -6.27e+01   1.57e+01   -4.01
## food.groupBeverages                          2.42e+01   1.89e+01    1.28
## food.groupBreakfast Cereals                  1.95e+01   1.42e+01    1.37
## food.groupCereal Grains and Pasta            1.05e+01   1.54e+01    0.68
## food.groupDairy and Egg Products            -1.65e+01   2.86e+01   -0.58
## food.groupEthnic Foods                      -8.04e+01   3.81e+01   -2.11
## food.groupFast Foods                        -2.59e+01   1.51e+01   -1.72
## food.groupFats and Oils                     -2.04e+01   1.68e+01   -1.21
## food.groupFinfish and Shellfish Products    -1.39e+01   3.74e+01   -0.37
## food.groupFruits and Fruit Juices           -1.12e+01   2.85e+01   -0.39
## food.groupLamb, Veal, and Game Products     -5.88e+01   2.35e+01   -2.50
## food.groupLegumes and Legume Products       -3.31e+01   1.52e+01   -2.17
## food.groupMeals, Entrees, and Sidedishes    -1.56e+01   3.73e+01   -0.42
## food.groupNut and Seed Products              1.60e+02   1.86e+01    8.61
## food.groupPork Products                     -5.57e+01   1.53e+01   -3.63
## food.groupPoultry Products                  -5.92e+01   1.60e+01   -3.70
## food.groupSausages and Luncheon Meats       -4.17e+01   1.64e+01   -2.54
## food.groupSnacks                             5.28e+01   1.45e+01    3.63
## food.groupSoups, Sauces, and Gravies         1.59e+00   1.62e+01    0.10
## food.groupSpices and Herbs                   4.76e+01   1.87e+01    2.55
## food.groupSweets                             1.85e+01   1.55e+01    1.19
## food.groupVegetables and Vegetable Products -2.44e+00   1.73e+01   -0.14
## Protein                                      3.21e+00   2.51e-01   12.80
## Total.lipid..fat.                            3.72e-01   1.26e-01    2.96
## Carbohydrate..by.difference                 -4.29e-01   1.39e-01   -3.08
## Sugars..total                                2.28e-01   1.75e-01    1.30
## Fiber..total.dietary                         4.35e+00   3.38e-01   12.84
## Calcium..Ca                                  1.23e-02   7.30e-03    1.69
## Iron..Fe                                     4.39e-01   2.62e-01    1.68
## Sodium..Na                                  -4.78e-03   3.67e-03   -1.30
## Vitamin.A..RAE                              -6.89e-04   8.69e-04   -0.79
## Vitamin.C..total.ascorbic.acid               2.70e-01   5.59e-02    4.84
## Cholesterol                                  2.18e-03   8.08e-03    0.27
## Fatty.acids..total.trans                    -5.60e-01   4.49e-01   -1.25
## Fatty.acids..total.saturated                -2.96e-01   1.82e-01   -1.63
##                                             Pr(>|t|)    
## (Intercept)                                  0.66382    
## food.groupBaked Products                     0.45058    
## food.groupBeef Products                      6.7e-05 ***
## food.groupBeverages                          0.19962    
## food.groupBreakfast Cereals                  0.16973    
## food.groupCereal Grains and Pasta            0.49482    
## food.groupDairy and Egg Products             0.56251    
## food.groupEthnic Foods                       0.03525 *  
## food.groupFast Foods                         0.08586 .  
## food.groupFats and Oils                      0.22646    
## food.groupFinfish and Shellfish Products     0.70998    
## food.groupFruits and Fruit Juices            0.69380    
## food.groupLamb, Veal, and Game Products      0.01249 *  
## food.groupLegumes and Legume Products        0.02998 *  
## food.groupMeals, Entrees, and Sidedishes     0.67548    
## food.groupNut and Seed Products              < 2e-16 ***
## food.groupPork Products                      0.00030 ***
## food.groupPoultry Products                   0.00023 ***
## food.groupSausages and Luncheon Meats        0.01134 *  
## food.groupSnacks                             0.00030 ***
## food.groupSoups, Sauces, and Gravies         0.92147    
## food.groupSpices and Herbs                   0.01095 *  
## food.groupSweets                             0.23320    
## food.groupVegetables and Vegetable Products  0.88820    
## Protein                                      < 2e-16 ***
## Total.lipid..fat.                            0.00319 ** 
## Carbohydrate..by.difference                  0.00215 ** 
## Sugars..total                                0.19261    
## Fiber..total.dietary                         < 2e-16 ***
## Calcium..Ca                                  0.09150 .  
## Iron..Fe                                     0.09410 .  
## Sodium..Na                                   0.19338    
## Vitamin.A..RAE                               0.42784    
## Vitamin.C..total.ascorbic.acid               1.6e-06 ***
## Cholesterol                                  0.78695    
## Fatty.acids..total.trans                     0.21295    
## Fatty.acids..total.saturated                 0.10373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 34.7 on 768 degrees of freedom
##   (6733 observations deleted due to missingness)
## Multiple R-squared: 0.682,   Adjusted R-squared: 0.667 
## F-statistic: 45.7 on 36 and 768 DF,  p-value: <2e-16
fit2 <- lm(Magnesium..Mg ~ . - X - food.id - Sodium..Na - Sugars..total - Iron..Fe - 
    Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans - Fatty.acids..total.saturated - 
    Vitamin.A..RAE, merged)
summary(fit2)
## 
## Call:
## lm(formula = Magnesium..Mg ~ . - X - food.id - Sodium..Na - Sugars..total - 
##     Iron..Fe - Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans - 
##     Fatty.acids..total.saturated - Vitamin.A..RAE, data = merged)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -224.59  -11.60   -2.16   10.89  288.45 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                   4.20830   14.88253    0.28
## food.groupBaked Products                     10.12727   15.08122    0.67
## food.groupBeef Products                     -58.93361   15.37465   -3.83
## food.groupBeverages                          26.91634   18.79306    1.43
## food.groupBreakfast Cereals                  24.81798   13.81308    1.80
## food.groupCereal Grains and Pasta             8.81720   15.22526    0.58
## food.groupDairy and Egg Products            -13.44246   28.44947   -0.47
## food.groupEthnic Foods                      -76.10468   38.02391   -2.00
## food.groupFast Foods                        -25.78984   14.86540   -1.73
## food.groupFats and Oils                     -21.61226   16.48300   -1.31
## food.groupFinfish and Shellfish Products    -13.96120   37.39451   -0.37
## food.groupFruits and Fruit Juices            -7.64936   28.37866   -0.27
## food.groupLamb, Veal, and Game Products     -61.79421   21.48741   -2.88
## food.groupLegumes and Legume Products       -32.41212   15.09434   -2.15
## food.groupMeals, Entrees, and Sidedishes    -16.86988   37.28283   -0.45
## food.groupNut and Seed Products             167.86470   18.20938    9.22
## food.groupPork Products                     -54.66799   15.03012   -3.64
## food.groupPoultry Products                  -56.31371   15.70695   -3.59
## food.groupSausages and Luncheon Meats       -42.88334   15.87745   -2.70
## food.groupSnacks                             51.90391   14.48184    3.58
## food.groupSoups, Sauces, and Gravies          0.95879   15.90691    0.06
## food.groupSpices and Herbs                   50.40825   18.57745    2.71
## food.groupSweets                             24.74113   14.28421    1.73
## food.groupVegetables and Vegetable Products  -1.49833   17.27685   -0.09
## Protein                                       3.17839    0.24739   12.85
## Total.lipid..fat.                             0.24549    0.10026    2.45
## Carbohydrate..by.difference                  -0.32918    0.11305   -2.91
## Fiber..total.dietary                          4.25724    0.31860   13.36
## Calcium..Ca                                   0.01626    0.00695    2.34
## Vitamin.C..total.ascorbic.acid                0.30657    0.05128    5.98
##                                             Pr(>|t|)    
## (Intercept)                                  0.77743    
## food.groupBaked Products                     0.50209    
## food.groupBeef Products                      0.00014 ***
## food.groupBeverages                          0.15248    
## food.groupBreakfast Cereals                  0.07277 .  
## food.groupCereal Grains and Pasta            0.56268    
## food.groupDairy and Egg Products             0.63670    
## food.groupEthnic Foods                       0.04569 *  
## food.groupFast Foods                         0.08316 .  
## food.groupFats and Oils                      0.19018    
## food.groupFinfish and Shellfish Products     0.70899    
## food.groupFruits and Fruit Juices            0.78758    
## food.groupLamb, Veal, and Game Products      0.00414 ** 
## food.groupLegumes and Legume Products        0.03208 *  
## food.groupMeals, Entrees, and Sidedishes     0.65105    
## food.groupNut and Seed Products              < 2e-16 ***
## food.groupPork Products                      0.00029 ***
## food.groupPoultry Products                   0.00036 ***
## food.groupSausages and Luncheon Meats        0.00707 ** 
## food.groupSnacks                             0.00036 ***
## food.groupSoups, Sauces, and Gravies         0.95195    
## food.groupSpices and Herbs                   0.00681 ** 
## food.groupSweets                             0.08366 .  
## food.groupVegetables and Vegetable Products  0.93091    
## Protein                                      < 2e-16 ***
## Total.lipid..fat.                            0.01456 *  
## Carbohydrate..by.difference                  0.00370 ** 
## Fiber..total.dietary                         < 2e-16 ***
## Calcium..Ca                                  0.01954 *  
## Vitamin.C..total.ascorbic.acid               3.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 34.8 on 775 degrees of freedom
##   (6733 observations deleted due to missingness)
## Multiple R-squared: 0.678,   Adjusted R-squared: 0.666 
## F-statistic: 56.3 on 29 and 775 DF,  p-value: <2e-16
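
Before bootstrapping, it can help to see where the t value, p-value and a normal-theory confidence interval come from for a single row of the table above. The sketch below reuses the printed Protein row (estimate 3.17839, standard error 0.24739, 775 residual degrees of freedom). The variable names are made up for illustration.

```r
# Values copied from the Protein row of the summary() output above.
est <- 3.17839   # coefficient estimate
se  <- 0.24739   # standard error
df  <- 775       # residual degrees of freedom

# The t value is just the estimate measured in standard errors.
t_value <- est / se

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df)

# Normal-theory 95% confidence interval: estimate +/- critical value * SE.
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(round(t_value, 2))
print(ci)
```

This interval relies on the usual linear-model assumptions about the residuals. The bootstrap in the next step gets an interval from resampling instead.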

4. Create 1000 new simulated data sets by sampling rows with replacement from your original data set, refit the model on each one, and store the coefficients each time. This is called bootstrapping.

## ===========================================================================
##   X.Intercept. food.groupBaked.Products food.groupBeef.Products
## 1        4.208                   10.127                  -58.93
## 2        0.241                   18.534                  -61.92
## 3       -5.672                   32.055                  -57.46
## 4       -4.357                    4.924                  -37.28
## 5       13.491                    1.533                  -68.24
## 6        2.198                    6.814                  -56.64
##   food.groupBeverages food.groupBreakfast.Cereals
## 1              26.916                       24.82
## 2               6.291                       31.99
## 3              38.713                       43.44
## 4              31.011                       19.50
## 5              14.386                       14.84
## 6              35.029                       30.46
##   food.groupCereal.Grains.and.Pasta food.groupDairy.and.Egg.Products
## 1                             8.817                           -13.44
## 2                            14.842                           -28.35
## 3                            30.107                            10.87
## 4                            14.449                               NA
## 5                            -1.792                           -39.00
## 6                            11.234                           -27.45
##   food.groupEthnic.Foods food.groupFast.Foods food.groupFats.and.Oils
## 1                 -76.10              -25.790                 -21.612
## 2                     NA              -27.357                 -10.324
## 3                     NA              -16.229                 -23.643
## 4                     NA               -9.616                  -4.936
## 5                 -85.39              -37.258                 -21.218
## 6                     NA              -25.434                 -15.959
##   food.groupFinfish.and.Shellfish.Products
## 1                                  -13.961
## 2                                       NA
## 3                                   -5.738
## 4                                   -2.270
## 5                                  -21.961
## 6                                  -10.138
##   food.groupFruits.and.Fruit.Juices
## 1                            -7.649
## 2                             1.902
## 3                                NA
## 4                            -6.021
## 5                           -13.274
## 6                            -3.927
##   food.groupLamb..Veal..and.Game.Products
## 1                                  -61.79
## 2                                  -74.42
## 3                                  -66.82
## 4                                  -35.32
## 5                                  -74.88
## 6                                  -56.17
##   food.groupLegumes.and.Legume.Products
## 1                                -32.41
## 2                                -20.04
## 3                                -23.51
## 4                                -19.45
## 5                                -46.79
## 6                                -22.90
##   food.groupMeals..Entrees..and.Sidedishes food.groupNut.and.Seed.Products
## 1                                  -16.870                           167.9
## 2                                       NA                           216.7
## 3                                       NA                           145.9
## 4                                   -7.078                           181.3
## 5                                  -23.957                           114.2
## 6                                       NA                           114.6
##   food.groupPork.Products food.groupPoultry.Products
## 1                  -54.67                     -56.31
## 2                  -55.30                     -56.74
## 3                  -52.27                     -52.70
## 4                  -33.43                     -35.13
## 5                  -62.72                     -60.85
## 6                  -51.99                     -54.71
##   food.groupSausages.and.Luncheon.Meats food.groupSnacks
## 1                                -42.88            51.90
## 2                                -40.71            54.53
## 3                                -42.14            55.68
## 4                                -24.65            39.98
## 5                                -48.64            56.63
## 6                                -40.85            62.75
##   food.groupSoups..Sauces..and.Gravies food.groupSpices.and.Herbs
## 1                               0.9588                      50.41
## 2                              -0.2460                      24.31
## 3                               6.6205                      79.89
## 4                               9.0240                      55.66
## 5                              -7.3165                      69.01
## 6                               2.3970                      60.91
##   food.groupSweets food.groupVegetables.and.Vegetable.Products Protein
## 1            24.74                                      -1.498   3.178
## 2            25.12                                       2.867   3.440
## 3            32.85                                       7.864   3.441
## 4            24.74                                       3.680   2.651
## 5            19.78                                      -9.831   3.190
## 6            28.78                                      -2.939   3.174
##   Total.lipid..fat. Carbohydrate..by.difference Fiber..total.dietary
## 1            0.2455                     -0.3292                4.257
## 2            0.1475                     -0.2938                3.150
## 3            0.3957                     -0.4223                2.575
## 4            0.1373                     -0.1376                4.455
## 5            0.1375                     -0.3130                4.186
## 6            0.2003                     -0.3064                3.583
##   Calcium..Ca Vitamin.C..total.ascorbic.acid
## 1    0.016257                         0.3066
## 2    0.003855                         0.4023
## 3    0.058446                         0.3060
## 4    0.008491                         0.2410
## 5    0.007731                         0.3642
## 6   -0.010721                         0.6662
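
The code that produced the table above isn't shown, so here is a minimal sketch of one way to do the resampling in step 4. It is not necessarily the approach used for this handout. `bootstrap_coefs` is a hypothetical helper, and the built-in `mtcars` data stands in for `merged` so the example runs on its own. Coefficients are filled in by name, so a factor level that never gets drawn in a resample is left as `NA`, which is presumably why some food groups show `NA` above.

```r
# Hypothetical helper: bootstrap the coefficients of a linear model.
# Returns a matrix with one row per resample and one column per coefficient.
bootstrap_coefs <- function(formula, data, n_boot = 1000) {
  # Use the full-data fit to fix the set (and order) of coefficient names.
  full_names <- names(coef(lm(formula, data = data)))
  boot <- matrix(NA_real_, nrow = n_boot, ncol = length(full_names),
                 dimnames = list(NULL, full_names))
  for (i in seq_len(n_boot)) {
    # Sample row indices with replacement, same size as the original data.
    rows <- sample(nrow(data), replace = TRUE)
    fit <- lm(formula, data = data[rows, , drop = FALSE])
    est <- coef(fit)
    # Fill by name; coefficients missing from this fit stay NA.
    boot[i, names(est)] <- est
  }
  boot
}

# Stand-in data: mtcars with cyl as a factor (plays the role of food.group).
set.seed(1)
demo <- transform(mtcars, cyl = factor(cyl))
boot <- bootstrap_coefs(mpg ~ wt + hp + cyl, demo, n_boot = 200)
print(head(boot))
```

For the assignment you would pass your own model formula and `merged` instead of the stand-ins, and use `n_boot = 1000`.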

Question 5

If you take the 2.5th and 97.5th percentiles of each column in the array of bootstrapped coefficients, you’ve got a simulated 95% confidence interval for that coefficient. Which nutrients or categories don’t have 0 in their 95% confidence interval?
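
A rough sketch of the percentile interval follows. To keep it self-contained, the matrix `boot` is simulated and its two column names are invented. In the assignment it would be your own matrix of bootstrapped coefficients, one row per resample.

```r
# Simulated stand-in for a matrix of bootstrapped coefficients
# (rows = resamples, columns = coefficients). Column names are made up.
set.seed(42)
boot <- cbind(slope_a = rnorm(1000, mean = 3, sd = 0.25),
              slope_b = rnorm(1000, mean = 0, sd = 1))

# 2.5th and 97.5th percentiles of each column; na.rm = TRUE skips resamples
# in which a coefficient could not be estimated.
ci <- t(apply(boot, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))

# The interval excludes 0 when both ends lie on the same side of zero.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0

print(ci)
print(names(excludes_zero)[excludes_zero])
```

Here `slope_a` is centred well away from zero and `slope_b` is centred on it, so only `slope_a` should be reported.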

Let's take a visual look at this:

plot of chunk unnamed-chunk-20

Of those, which are the biggest? Write a sentence or two explaining what the best heuristics seem to be for buying food that’s high in magnesium. Which coefficients is the modeling process least certain about?

## ===========================================================================

plot of chunk unnamed-chunk-21
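
One possible way to answer the two ranking questions above is to summarise each coefficient by its bootstrap median and interval width. This is a sketch on simulated numbers, and the column names (`big_stable`, `small_stable`, `noisy`) are invented for illustration.

```r
# Simulated stand-in for bootstrapped coefficients; names are made up.
set.seed(7)
boot <- cbind(big_stable   = rnorm(1000, mean = 150, sd = 20),
              small_stable = rnorm(1000, mean = 3,   sd = 0.25),
              noisy        = rnorm(1000, mean = -10, sd = 35))

ci <- t(apply(boot, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))

summary_tab <- data.frame(
  median = apply(boot, 2, median, na.rm = TRUE),
  lower  = ci[, 1],
  upper  = ci[, 2]
)
summary_tab$width <- summary_tab$upper - summary_tab$lower
summary_tab$excludes_zero <- summary_tab$lower > 0 | summary_tab$upper < 0

# "Biggest": largest absolute median among intervals that exclude zero.
sig <- summary_tab[summary_tab$excludes_zero, ]
biggest <- sig[order(-abs(sig$median)), ]

# "Least certain": widest interval overall.
least_certain <- summary_tab[order(-summary_tab$width), ]

print(biggest)
print(least_certain)
```

Keep the units in mind when comparing sizes. A food-group coefficient is a shift relative to the baseline group, while a nutrient coefficient is a change per unit of that nutrient, so their raw magnitudes are not directly comparable.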

Bonus

library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
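# Classify each food as high-magnesium (Magnesium..Mg > 50) from every column
# left after the exclusions in the formula; na.omit drops incomplete rows.
# Vitamin.A..RAE is excluded twice, which is harmless.
rf <- randomForest(factor(Magnesium..Mg > 50) ~ . - X - food.id - Sodium..Na - 
    Sugars..total - Iron..Fe - Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans - 
    Fatty.acids..total.saturated - Vitamin.A..RAE, merged, na.action = na.omit)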
print(rf)
## 
## Call:
##  randomForest(formula = factor(Magnesium..Mg > 50) ~ . - X - food.id -      Sodium..Na - Sugars..total - Iron..Fe - Vitamin.A..RAE -      Cholesterol - Fatty.acids..total.trans - Fatty.acids..total.saturated -      Vitamin.A..RAE, data = merged, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.72%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   632   13     0.02016
## TRUE     25  135     0.15625
varImpPlot(rf)

plot of chunk unnamed-chunk-22
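
To read the confusion matrix printed above, it helps to redo the arithmetic by hand. The sketch below copies the four counts from the output (rows are the actual class, columns the predicted class) and recomputes the error rates. The "always predict FALSE" baseline is added here for comparison and is not part of the original output.

```r
# Counts copied from the printed confusion matrix.
conf <- matrix(c(632,  13,
                  25, 135),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))

# Overall out-of-bag error: misclassified foods over all foods.
oob_error <- (sum(conf) - sum(diag(conf))) / sum(conf)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Baseline for comparison: always predict the majority class (FALSE).
baseline_error <- 1 - max(rowSums(conf)) / sum(conf)

print(round(oob_error, 4))       # 0.0472, matching the 4.72% above
print(round(class_error, 5))
print(round(baseline_error, 4))
```

The forest is noticeably worse on the high-magnesium class (about 16% error) than on the low-magnesium class (about 2%), which is common when one class is much rarer than the other.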