Magnesium is an important micronutrient, especially for pregnant women. Unfortunately, unless you’re looking at a vitamin supplement, you have no idea how much magnesium is in your food. Until now!
There are lots of data sets on data.gov. Download the USDA food nutrition database (the National Nutrient Database for Standard Reference), release SR22.
The USDA compiles summaries of nutrition experiments. Lots and lots of them. This data set contains the results of 1.5 million experiments for over seven thousand foods to determine the quantities of over a hundred nutrients. We want to build a model of how much magnesium is in a given item of food as a function of its listed nutrition facts and the category of food it belongs to. The results come as several files, but for our purposes the important ones are the four loaded below: NUTR_DEF.txt (nutrient definitions), FOOD_DES.txt (food descriptions), FD_GROUP.txt (food group names), and NUT_DATA.txt (the measurements themselves).
nut.def <- read.table("sr22/NUTR_DEF.txt", sep = "^", quote = "~")
names(nut.def) <- c("nutrient.id", "nutrient.units", "INFOFOODS.tag", "nutrient.description",
"nutrient.value.decimal.places", "sort.order")
head(nut.def)
##   nutrient.id nutrient.units INFOFOODS.tag        nutrient.description
## 1         203              g        PROCNT                     Protein
## 2         204              g           FAT           Total lipid (fat)
## 3         205              g        CHOCDF Carbohydrate, by difference
## 4         207              g           ASH                         Ash
## 5         208           kcal    ENERC_KCAL                      Energy
## 6         209              g        STARCH                      Starch
##   nutrient.value.decimal.places sort.order
## 1                             2        600
## 2                             2        800
## 3                             2       1100
## 4                             2       1000
## 5                             0        300
## 6                             2       2200
summary(nut.def)
## nutrient.id nutrient.units INFOFOODS.tag nutrient.description
## Min. :203 g :91 : 8 Energy : 2
## 1st Qu.:322 mg :29 VITD : 2 10:0 : 1
## Median :511 mcg :17 ALA_G : 1 12:0 : 1
## Mean :495 IU : 2 ALC : 1 13:0 : 1
## 3rd Qu.:634 kJ : 1 ARG_G : 1 14:0 : 1
## Max. :858 kcal : 1 (Other):129 14:1 : 1
## (Other): 2 NA's : 1 (Other):136
## nutrient.value.decimal.places sort.order
## Min. :0.00 Min. : 100
## 1st Qu.:1.00 1st Qu.: 7050
## Median :3.00 Median :10900
## Mean :2.12 Mean :10481
## 3rd Qu.:3.00 3rd Qu.:14800
## Max. :3.00 Max. :18400
food.def <- read.table("sr22/FOOD_DES.txt", sep = "^", quote = "~")
names(food.def) <- c("food.id", "food.group.id", "food.long.description", "food.short.desc",
"comon.name", "manufacturer", "complete.profile", "inedible.parts", "percent.inedible",
"scientific.name", "nitrogen.protein.factor", "protein.calories.factor",
"fat.calories.factor", "carb.calories.factor")
head(food.def)
## food.id food.group.id food.long.description
## 1 1001 100 Butter, salted
## 2 1002 100 Butter, whipped, with salt
## 3 1003 100 Butter oil, anhydrous
## 4 1004 100 Cheese, blue
## 5 1005 100 Cheese, brick
## 6 1006 100 Cheese, brie
## food.short.desc comon.name manufacturer complete.profile
## 1 BUTTER,WITH SALT Y
## 2 BUTTER,WHIPPED,WITH SALT Y
## 3 BUTTER OIL,ANHYDROUS Y
## 4 CHEESE,BLUE Y
## 5 CHEESE,BRICK Y
## 6 CHEESE,BRIE Y
## inedible.parts percent.inedible scientific.name nitrogen.protein.factor
## 1 0 6.38
## 2 0 6.38
## 3 0 6.38
## 4 0 6.38
## 5 0 6.38
## 6 0 6.38
## protein.calories.factor fat.calories.factor carb.calories.factor
## 1 4.27 8.79 3.87
## 2 4.27 8.79 3.87
## 3 4.27 8.79 3.87
## 4 4.27 8.79 3.87
## 5 4.27 8.79 3.87
## 6 4.27 8.79 3.87
summary(food.def)
## food.id food.group.id
## Min. : 1001 Min. : 100
## 1st Qu.: 8128 1st Qu.: 700
## Median :13327 Median :1100
## Mean :14265 Mean :1256
## 3rd Qu.:18498 3rd Qu.:1700
## Max. :93600 Max. :3600
##
## food.long.description
## AMARANTH FLAKES : 1
## APPLEBEE'S, 9 oz house sirloin steak : 1
## APPLEBEE'S, Double Crunch Shrimp : 1
## APPLEBEE'S, French fries : 1
## APPLEBEE'S, KRAFT, Macaroni & Cheese, from kid's menu: 1
## APPLEBEE'S, chicken fingers, from kids' menu : 1
## (Other) :7533
## food.short.desc
## BABYFOOD,MEAT,BF,STR : 2
## CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA : 2
## OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT: 2
## POPCORN,OIL-POPPED,LOFAT : 2
## ABALONE,MIXED SPECIES,RAW : 1
## ABALONE,MXD SP,CKD,FRIED : 1
## (Other) :7529
## comon.name manufacturer
## :6905 :6306
## KFC : 23 Campbell Soup Co. : 318
## family style : 18 Kellogg, Co. : 187
## hamburger : 14 The Quaker Oats, Co. : 110
## buffalo : 13 McDonald's Corporation: 84
## soft drink, pop, soda: 13 General Mills Inc. : 51
## (Other) : 553 (Other) : 483
## complete.profile inedible.parts percent.inedible
## :4644 :6078 Min. : 0.00
## Y:2895 Bone : 387 1st Qu.: 0.00
## Bone and connective tissue: 87 Median : 0.00
## Connective tissue : 62 Mean : 4.68
## Shells : 28 3rd Qu.: 0.00
## Separable fat : 26 Max. :99.00
## (Other) : 871 NA's :36
## scientific.name nitrogen.protein.factor
## :6826 Min. :0.0
## Phaseolus vulgaris : 20 1st Qu.:6.2
## Struthio camelus : 18 Median :6.2
## Bison bison : 13 Mean :5.9
## Bos taurus : 12 3rd Qu.:6.2
## Dromaius novaehollandiae: 12 Max. :6.9
## (Other) : 638 NA's :1908
## protein.calories.factor fat.calories.factor carb.calories.factor
## Min. :0.0 Min. :0.0 Min. :0.0
## 1st Qu.:3.4 1st Qu.:8.4 1st Qu.:3.8
## Median :4.0 Median :8.8 Median :3.9
## Mean :3.7 Mean :8.7 Mean :3.9
## 3rd Qu.:4.3 3rd Qu.:9.0 3rd Qu.:4.0
## Max. :4.4 Max. :9.0 Max. :4.3
## NA's :2750 NA's :2654 NA's :2757
food.group <- read.table("sr22/FD_GROUP.txt", sep = "^", quote = "~")
names(food.group) <- c("food.group.id", "food.group")
food.group
## food.group.id food.group
## 1 100 Dairy and Egg Products
## 2 200 Spices and Herbs
## 3 300 Baby Foods
## 4 400 Fats and Oils
## 5 500 Poultry Products
## 6 600 Soups, Sauces, and Gravies
## 7 700 Sausages and Luncheon Meats
## 8 800 Breakfast Cereals
## 9 900 Fruits and Fruit Juices
## 10 1000 Pork Products
## 11 1100 Vegetables and Vegetable Products
## 12 1200 Nut and Seed Products
## 13 1300 Beef Products
## 14 1400 Beverages
## 15 1500 Finfish and Shellfish Products
## 16 1600 Legumes and Legume Products
## 17 1700 Lamb, Veal, and Game Products
## 18 1800 Baked Products
## 19 1900 Sweets
## 20 2000 Cereal Grains and Pasta
## 21 2100 Fast Foods
## 22 2200 Meals, Entrees, and Sidedishes
## 23 2500 Snacks
## 24 3500 Ethnic Foods
## 25 3600 Restaurant Foods
raw.data <- read.table("sr22/NUT_DATA.txt", sep = "^", quote = "~")
names(raw.data) <- c("food.id", "nutrient.id", "amount", "data.points", "stderr",
"data.type", "data.derivation", "imputed.value.nid", "added.nutrient", "number.of.studies",
"min", "max", "degrees.of.freedom", "errorlower.95%", "error.upper.95%",
"stats.comments", "confidence.code")
head(raw.data)
## food.id nutrient.id amount data.points stderr data.type data.derivation
## 1 1001 203 0.85 16 0.074 1
## 2 1001 204 81.11 580 0.065 1
## 3 1001 205 0.06 0 NA 4 NC
## 4 1001 207 2.11 35 0.054 1
## 5 1001 208 717.00 0 NA 4 NC
## 6 1001 221 0.00 0 NA 7
## imputed.value.nid added.nutrient number.of.studies min max
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## degrees.of.freedom errorlower.95% error.upper.95% stats.comments
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## confidence.code
## 1
## 2
## 3
## 4
## 5
## 6
summary(raw.data)
## food.id nutrient.id amount data.points
## Min. : 1001 Min. :203 Min. : 0 Min. : 0.0
## 1st Qu.: 8218 1st Qu.:309 1st Qu.: 0 1st Qu.: 0.0
## Median :12166 Median :421 Median : 0 Median : 0.0
## Mean :14081 Mean :440 Mean : 53 Mean : 2.8
## 3rd Qu.:18338 3rd Qu.:607 3rd Qu.: 4 3rd Qu.: 1.0
## Max. :93600 Max. :858 Max. :100000 Max. :2526.0
##
## stderr data.type data.derivation imputed.value.nid
## Min. : 0 Min. : 1.00 :203705 Min. : 1001
## 1st Qu.: 0 1st Qu.: 1.00 A : 90574 1st Qu.: 5641
## Median : 0 Median : 1.00 Z : 45787 Median :10068
## Mean : 4 Mean : 3.08 NC : 30317 Mean :10658
## 3rd Qu.: 1 3rd Qu.: 4.00 FLA : 17300 3rd Qu.:14154
## Max. :10927 Max. :13.00 MC : 16896 Max. :44260
## NA's :463511 (Other):129963 NA's :513070
## added.nutrient number.of.studies min max
## :530398 Min. : 0 Min. : 0 Min. : 0
## N: 1699 1st Qu.: 1 1st Qu.: 0 1st Qu.: 0
## Y: 2445 Median : 1 Median : 0 Median : 0
## Mean : 1 Mean : 31 Mean : 54
## 3rd Qu.: 1 3rd Qu.: 4 3rd Qu.: 6
## Max. :22 Max. :19729 Max. :56000
## NA's :483599 NA's :486210 NA's :486210
## degrees.of.freedom errorlower.95% error.upper.95% stats.comments
## Min. : 0 Min. :-125682 Min. : 0 :486966
## 1st Qu.: 2 1st Qu.: 0 1st Qu.: 0 2, 3 : 19795
## Median : 3 Median : 0 Median : 1 1 : 9457
## Mean : 5 Mean : 10 Mean : 97 1, 2, 3: 8318
## 3rd Qu.: 5 3rd Qu.: 6 3rd Qu.: 14 4 : 5358
## Max. :1299 Max. : 27219 Max. :151995 2 : 3369
## NA's :500126 NA's :505393 NA's :505393 (Other): 1279
## confidence.code
## :533173
## A: 14
## B: 32
## C: 799
## D: 524
The first column in NUT_DATA.txt is the food id code. The second column is the nutrient id, which corresponds to the NUTR_DEF.txt file. The third column is the quantity of the listed nutrient per 100g of the listed food. Please forgive the bad formatting and give a moment of thanks to the powers that be for even releasing this data set.
If you're wondering where to get all the variable names, notice the file called sr22_doc.pdf that comes with the data. Documentation? Amazing! Starting on page 23, you can find tables discussing what the various columns mean.
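If the `sep = "^", quote = "~"` arguments look odd, here is a small sketch of what they do. The two lines below are typed in by hand to mimic the layout of the first five columns of NUT_DATA.txt (carets between fields, tildes around text fields); the numbers are copied from the `head(raw.data)` output above, but the exact text of the raw file is an assumption on my part.

```r
# Two hand-typed lines in the caret-delimited, tilde-quoted layout (illustrative only)
sample.lines <- c("~01001~^~203~^0.85^16^0.074",
                  "~01001~^~204~^81.11^580^0.065")

# text= lets read.table parse strings instead of a file on disk
parsed <- read.table(text = sample.lines, sep = "^", quote = "~")
names(parsed) <- c("food.id", "nutrient.id", "amount", "data.points", "stderr")

# Quoted ids that look numeric are still converted to numbers
print(parsed)
str(parsed)
```

Note that the quoted id fields come out numeric, which matches the `summary(raw.data)` output, where `food.id` has a minimum of 1001.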
# Let's get this data all into one table so we can refer to things by
# name, rather than number
data <- merge(raw.data, nut.def)
data <- merge(data, food.def)
data <- merge(data, food.group)
head(data)
## food.group.id food.id nutrient.id amount data.points stderr data.type
## 1 100 1001 203 0.850 16 0.074 1
## 2 100 1001 205 0.060 0 NA 4
## 3 100 1001 454 0.300 1 NA 1
## 4 100 1001 619 0.315 0 NA 1
## 5 100 1001 405 0.034 9 0.004 1
## 6 100 1001 502 0.038 0 NA 1
## data.derivation imputed.value.nid added.nutrient number.of.studies min
## 1 NA NA NA
## 2 NC NA NA NA
## 3 A NA NA NA
## 4 AS NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## max degrees.of.freedom errorlower.95% error.upper.95% stats.comments
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## confidence.code nutrient.units INFOFOODS.tag nutrient.description
## 1 g PROCNT Protein
## 2 g CHOCDF Carbohydrate, by difference
## 3 mg BETN Betaine
## 4 g F18D3 18:3 undifferentiated
## 5 mg RIBF Riboflavin
## 6 g THR_G Threonine
## nutrient.value.decimal.places sort.order food.long.description
## 1 2 600 Butter, salted
## 2 2 1100 Butter, salted
## 3 1 7270 Butter, salted
## 4 3 13900 Butter, salted
## 5 3 6500 Butter, salted
## 6 3 16400 Butter, salted
## food.short.desc comon.name manufacturer complete.profile inedible.parts
## 1 BUTTER,WITH SALT Y
## 2 BUTTER,WITH SALT Y
## 3 BUTTER,WITH SALT Y
## 4 BUTTER,WITH SALT Y
## 5 BUTTER,WITH SALT Y
## 6 BUTTER,WITH SALT Y
## percent.inedible scientific.name nitrogen.protein.factor
## 1 0 6.38
## 2 0 6.38
## 3 0 6.38
## 4 0 6.38
## 5 0 6.38
## 6 0 6.38
## protein.calories.factor fat.calories.factor carb.calories.factor
## 1 4.27 8.79 3.87
## 2 4.27 8.79 3.87
## 3 4.27 8.79 3.87
## 4 4.27 8.79 3.87
## 5 4.27 8.79 3.87
## 6 4.27 8.79 3.87
## food.group
## 1 Dairy and Egg Products
## 2 Dairy and Egg Products
## 3 Dairy and Egg Products
## 4 Dairy and Egg Products
## 5 Dairy and Egg Products
## 6 Dairy and Egg Products
summary(data)
## food.group.id food.id nutrient.id amount
## Min. : 100 Min. : 1001 Min. :203 Min. : 0
## 1st Qu.: 800 1st Qu.: 8218 1st Qu.:309 1st Qu.: 0
## Median :1100 Median :12166 Median :421 Median : 0
## Mean :1238 Mean :14081 Mean :440 Mean : 53
## 3rd Qu.:1700 3rd Qu.:18338 3rd Qu.:607 3rd Qu.: 4
## Max. :3600 Max. :93600 Max. :858 Max. :100000
##
## data.points stderr data.type data.derivation
## Min. : 0.0 Min. : 0 Min. : 1.00 :203705
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 1.00 A : 90574
## Median : 0.0 Median : 0 Median : 1.00 Z : 45787
## Mean : 2.8 Mean : 4 Mean : 3.08 NC : 30317
## 3rd Qu.: 1.0 3rd Qu.: 1 3rd Qu.: 4.00 FLA : 17300
## Max. :2526.0 Max. :10927 Max. :13.00 MC : 16896
## NA's :463511 (Other):129963
## imputed.value.nid added.nutrient number.of.studies min
## Min. : 1001 :530398 Min. : 0 Min. : 0
## 1st Qu.: 5641 N: 1699 1st Qu.: 1 1st Qu.: 0
## Median :10068 Y: 2445 Median : 1 Median : 0
## Mean :10658 Mean : 1 Mean : 31
## 3rd Qu.:14154 3rd Qu.: 1 3rd Qu.: 4
## Max. :44260 Max. :22 Max. :19729
## NA's :513070 NA's :483599 NA's :486210
## max degrees.of.freedom errorlower.95% error.upper.95%
## Min. : 0 Min. : 0 Min. :-125682 Min. : 0
## 1st Qu.: 0 1st Qu.: 2 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 3 Median : 0 Median : 1
## Mean : 54 Mean : 5 Mean : 10 Mean : 97
## 3rd Qu.: 6 3rd Qu.: 5 3rd Qu.: 6 3rd Qu.: 14
## Max. :56000 Max. :1299 Max. : 27219 Max. :151995
## NA's :486210 NA's :500126 NA's :505393 NA's :505393
## stats.comments confidence.code nutrient.units INFOFOODS.tag
## :486966 :533173 g :287449 VITD : 8308
## 2, 3 : 19795 A: 14 mg :136068 : 8221
## 1 : 9457 B: 32 mcg : 72202 CHOCDF : 7538
## 1, 2, 3: 8318 C: 799 IU : 11301 ENERC_KCAL: 7538
## 4 : 5358 D: 524 kJ : 7538 ENERC_KJ : 7538
## 2 : 3369 kcal : 7538 (Other) :487943
## (Other): 1279 (Other): 12446 NA's : 7456
## nutrient.description nutrient.value.decimal.places
## Energy : 15076 Min. :0.00
## Carbohydrate, by difference: 7538 1st Qu.:0.00
## Protein : 7538 Median :3.00
## Total lipid (fat) : 7538 Mean :1.92
## Water : 7534 3rd Qu.:3.00
## Ash : 7533 Max. :3.00
## (Other) :481785
## sort.order
## Min. : 100
## 1st Qu.: 6100
## Median : 7920
## Mean : 9426
## 3rd Qu.:14000
## Max. :18400
##
## food.long.description
## Seeds, pumpkin and squash seed kernels, roasted, with salt added: 131
## Seeds, pumpkin and squash seed kernels, roasted, without salt : 131
## Fast foods, chicken tenders : 130
## Snacks, bagel chips : 130
## Snacks, pita chips, salted : 130
## Cereals ready-to-eat, KELLOGG, KELLOGG'S RICE KRISPIES : 129
## (Other) :533761
## food.short.desc
## POPCORN,OIL-POPPED,LOFAT : 196
## OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT: 192
## BABYFOOD,MEAT,BF,STR : 177
## PUMPKIN&SQUASH SD KRNLS,RSTD,W/SALT : 131
## PUMPKIN&SQUASH SD KRNLS,RSTD,WO/SALT : 131
## FAST FOODS,CHICK TENDERS : 130
## (Other) :533585
## comon.name manufacturer complete.profile
## :478260 :480811 :283554
## KFC : 2235 Kellogg, Co. : 7867 Y:250988
## family style: 1767 The Quaker Oats, Co. : 6418
## hamburger : 1449 Campbell Soup Co. : 5655
## buffalo : 1032 McDonald's Corporation: 3975
## Sweetpotato : 1016 General Mills Inc. : 3743
## (Other) : 48783 (Other) : 26073
## inedible.parts percent.inedible
## :416219 Min. : 0.0
## Bone : 29270 1st Qu.: 0.0
## Bone and connective tissue: 8021 Median : 0.0
## Connective tissue : 5192 Mean : 5.3
## Separable fat : 2441 3rd Qu.: 0.0
## Shells : 2315 Max. :99.0
## (Other) : 71084 NA's :2828
## scientific.name nitrogen.protein.factor
## :484938 Min. :0
## Phaseolus vulgaris : 1536 1st Qu.:6
## Struthio camelus : 1389 Median :6
## Dromaius novaehollandiae: 1036 Mean :6
## Bison bison : 1032 3rd Qu.:6
## Agaricus bisporus : 1027 Max. :7
## (Other) : 43584 NA's :120117
## protein.calories.factor fat.calories.factor carb.calories.factor
## Min. :0 Min. :0 Min. :0
## 1st Qu.:3 1st Qu.:8 1st Qu.:4
## Median :4 Median :9 Median :4
## Mean :4 Mean :9 Mean :4
## 3rd Qu.:4 3rd Qu.:9 3rd Qu.:4
## Max. :4 Max. :9 Max. :4
## NA's :166426 NA's :158354 NA's :166763
## food.group
## Vegetables and Vegetable Products: 60447
## Beef Products : 43697
## Baked Products : 36192
## Pork Products : 30940
## Poultry Products : 29542
## Breakfast Cereals : 27665
## (Other) :306059
cat("total data in memory:", object.size(data)/2^20, " MB\n")
## total data in memory: 95.15 MB
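The three `merge()` calls work without any `by` argument because `merge()` joins on whatever column names the two tables share. Here is a toy sketch of that behaviour; the two small tables are made up for illustration (the values echo the butter and blue cheese rows above), and the table names are hypothetical.

```r
# Toy long-format measurements and a toy nutrient lookup table (illustrative)
measurements <- data.frame(food.id = c(1001, 1001, 1004),
                           nutrient.id = c(203, 304, 304),
                           amount = c(0.85, 2, 23))
nutrients <- data.frame(nutrient.id = c(203, 304),
                        nutrient.description = c("Protein", "Magnesium, Mg"))

# The join key is the set of shared column names
shared <- intersect(names(measurements), names(nutrients))
print(shared)

# Inner join by default: rows without a match in both tables are dropped
joined <- merge(measurements, nutrients)
print(joined)
```

If you want to keep unmatched rows, `merge(..., all = TRUE)` would do that, at the cost of NA-filled columns.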
Reformat the data set so that we have one row per food item, with one column for each nutrient.
# How many mg magnesium could we expect in foods in general?
magnesium <- subset(data, nutrient.description == "Magnesium, Mg") # or nutrient.id == 304, per the docs
hist(magnesium$amount)
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
# One row per food, one column per nutrient; NA where a measurement is missing
pivoted <- cast(data, food.id ~ nutrient.id, value = "amount")
# Give columns nice names
map <- unique(data.frame(data$nutrient.id, data$nutrient.description))
names(pivoted) <- c("food.id", as.character(map[order(map$data.nutrient.id),
]$data.nutrient.description))
names(pivoted) <- gsub("[^a-zA-Z0-9]", ".", names(pivoted))
head(pivoted)
## food.id Protein Total.lipid..fat. Carbohydrate..by.difference Ash
## 1 1001 0.85 81.11 0.06 2.11
## 2 1002 0.85 81.11 0.06 2.11
## 3 1003 0.28 99.48 0.00 0.00
## 4 1004 21.40 28.74 2.34 5.11
## 5 1005 23.24 29.68 2.79 3.18
## 6 1006 20.75 27.68 0.45 2.70
## Energy Starch Sucrose Glucose..dextrose. Fructose Lactose Maltose
## 1 717 NA NA NA NA NA NA
## 2 717 NA NA NA NA NA NA
## 3 876 NA NA NA NA NA NA
## 4 353 NA NA NA NA NA NA
## 5 371 NA NA NA NA NA NA
## 6 334 NA NA NA NA NA NA
## Alcohol..ethyl Water Adjusted.Protein Caffeine Theobromine Energy
## 1 0 15.87 NA 0 0 2999
## 2 0 15.87 NA 0 0 2999
## 3 0 0.24 NA 0 0 3664
## 4 0 42.41 NA 0 0 1477
## 5 0 41.11 NA 0 0 1552
## 6 0 48.42 NA 0 0 1396
## Sugars..total Galactose Fiber..total.dietary Calcium..Ca Iron..Fe
## 1 0.06 NA 0 24 0.02
## 2 0.06 NA 0 24 0.16
## 3 0.00 NA 0 4 0.00
## 4 0.50 NA 0 528 0.31
## 5 0.51 NA 0 674 0.43
## 6 0.45 NA 0 184 0.50
## Magnesium..Mg Phosphorus..P Potassium..K Sodium..Na Zinc..Zn Copper..Cu
## 1 2 24 24 576 0.09 0.000
## 2 2 23 26 827 0.05 0.016
## 3 0 3 5 2 0.01 0.001
## 4 23 387 256 1395 2.66 0.040
## 5 24 451 136 560 2.60 0.024
## 6 20 188 152 629 2.38 0.019
## Fluoride..F Manganese..Mn Selenium..Se Vitamin.A..IU Retinol
## 1 2.8 0.000 1.0 2499 671
## 2 2.8 0.004 1.0 2499 671
## 3 NA 0.000 0.0 3069 824
## 4 NA 0.009 14.5 763 192
## 5 NA 0.012 14.5 1080 286
## 6 NA 0.034 14.5 592 173
## Vitamin.A..RAE Carotene..beta Carotene..alpha
## 1 684 158 0
## 2 684 158 0
## 3 840 193 0
## 4 198 74 0
## 5 292 76 0
## 6 174 9 0
## Vitamin.E..alpha.tocopherol. Vitamin.D Vitamin.D2..ergocalciferol.
## 1 2.32 60 NA
## 2 2.32 60 NA
## 3 2.80 73 NA
## 4 0.25 21 NA
## 5 0.26 22 NA
## 6 0.24 20 NA
## Vitamin.D3..cholecalciferol. Vitamin.D..D2...D3. Cryptoxanthin..beta
## 1 1.5 1.5 0
## 2 1.5 1.5 0
## 3 1.8 1.8 0
## 4 0.5 0.5 0
## 5 0.5 0.5 0
## 6 0.5 0.5 0
## Lycopene Lutein...zeaxanthin Tocopherol..beta Tocopherol..gamma
## 1 0 0 0 0
## 2 0 0 NA NA
## 3 0 0 NA NA
## 4 0 0 NA NA
## 5 0 0 NA NA
## 6 0 0 NA NA
## Tocopherol..delta Vitamin.C..total.ascorbic.acid Thiamin Riboflavin
## 1 0 0 0.005 0.034
## 2 NA 0 0.005 0.034
## 3 NA 0 0.001 0.005
## 4 NA 0 0.029 0.382
## 5 NA 0 0.014 0.351
## 6 NA 0 0.070 0.520
## Niacin Pantothenic.acid Vitamin.B.6 Folate..total Vitamin.B.12
## 1 0.042 0.110 0.003 3 0.17
## 2 0.042 0.110 0.003 3 0.13
## 3 0.003 0.010 0.001 0 0.01
## 4 1.016 1.729 0.166 36 1.22
## 5 0.118 0.288 0.065 20 1.26
## 6 0.380 0.690 0.235 65 1.65
## Choline..total Vitamin.K..phylloquinone. Folic.acid Folate..food
## 1 18.8 7.0 0 3
## 2 18.8 7.0 0 3
## 3 22.3 8.6 0 0
## 4 15.4 2.4 0 36
## 5 15.4 2.5 0 20
## 6 15.4 2.3 0 65
## Folate..DFE Betaine Tryptophan Threonine Isoleucine Leucine Lysine
## 1 3 0.3 0.012 0.038 0.051 0.083 0.067
## 2 3 0.3 0.012 0.038 0.051 0.083 0.067
## 3 0 NA 0.004 0.013 0.017 0.027 0.022
## 4 36 NA 0.312 0.785 1.124 1.919 1.852
## 5 20 NA 0.324 0.882 1.137 2.244 2.124
## 6 65 NA 0.322 0.751 1.015 1.929 1.851
## Methionine Cystine Phenylalanine Tyrosine Valine Arginine Histidine
## 1 0.021 0.008 0.041 0.041 0.057 0.031 0.023
## 2 0.021 0.008 0.041 0.041 0.057 0.031 0.023
## 3 0.007 0.003 0.014 0.014 0.019 0.010 0.008
## 4 0.584 0.107 1.087 1.295 1.556 0.711 0.758
## 5 0.565 0.131 1.231 1.115 1.472 0.874 0.823
## 6 0.592 0.114 1.158 1.200 1.340 0.735 0.716
## Alanine Aspartic.acid Glutamic.acid Glycine Proline Serine
## 1 0.029 0.064 0.178 0.018 0.082 0.046
## 2 0.029 0.064 0.178 0.018 0.082 0.046
## 3 0.010 0.021 0.059 0.006 0.027 0.015
## 4 0.644 1.436 5.179 0.406 2.100 1.120
## 5 0.670 1.588 5.515 0.437 2.575 1.289
## 6 0.859 1.350 4.387 0.397 2.459 1.168
## Hydroxyproline Vitamin.E..added Vitamin.B.12..added Cholesterol
## 1 NA 0 0 215
## 2 NA 0 0 219
## 3 NA 0 0 256
## 4 NA 0 0 75
## 5 NA 0 0 94
## 6 NA 0 0 100
## Fatty.acids..total.trans Fatty.acids..total.saturated 4.0 6.0 8.0
## 1 NA 51.37 3.226 2.007 1.190
## 2 NA 50.49 2.630 1.557 0.906
## 3 NA 61.92 3.226 1.910 1.112
## 4 NA 18.67 0.658 0.361 0.247
## 5 NA 18.76 0.914 0.373 0.299
## 6 NA 17.41 0.564 0.323 0.297
## 10.0 12.0 14.0 16.0 18.0 20.0 18.1.undifferentiated
## 1 2.529 2.587 7.436 21.697 9.999 0.138 19.961
## 2 2.034 2.277 8.157 21.334 9.829 NA 20.405
## 3 2.495 2.793 10.005 26.166 12.056 NA 25.026
## 4 0.601 0.491 3.301 9.153 3.235 NA 6.622
## 5 0.585 0.482 3.227 8.655 3.455 NA 7.401
## 6 0.673 0.504 3.065 8.246 2.880 NA 6.563
## 18.2.undifferentiated 18.3.undifferentiated 20.4.undifferentiated
## 1 2.728 0.315 0
## 2 1.832 1.180 0
## 3 2.247 1.447 0
## 4 0.536 0.264 0
## 5 0.491 0.293 0
## 6 0.513 0.313 0
## 22.6.n.3..DHA. 22.0 14.1 16.1.undifferentiated 18.4 20.1 20.5.n.3..EPA.
## 1 0 NA NA 0.961 0 0.1 0
## 2 0 NA NA 1.816 0 0.0 0
## 3 0 NA NA 2.228 0 0.0 0
## 4 0 NA NA 0.816 0 0.0 0
## 5 0 NA NA 0.817 0 0.0 0
## 6 0 NA NA 1.007 0 0.0 0
## 22.1.undifferentiated 22.5.n.3..DPA. Phytosterols Stigmasterol
## 1 0 0 NA 0
## 2 0 0 NA NA
## 3 0 0 NA NA
## 4 0 0 NA NA
## 5 0 0 NA NA
## 6 0 0 NA NA
## Campesterol Beta.sitosterol Fatty.acids..total.monounsaturated
## 1 0 4 21.021
## 2 NA NA 23.426
## 3 NA NA 28.732
## 4 NA NA 7.778
## 5 NA NA 8.598
## 6 NA NA 8.013
## Fatty.acids..total.polyunsaturated 15.0 17.0 24.0 16.1.t 18.1.t 22.1.t
## 1 3.043 NA 0.56 NA NA 2.982 NA
## 2 3.012 NA NA NA NA NA NA
## 3 3.694 NA NA NA NA NA NA
## 4 0.800 NA NA NA NA NA NA
## 5 0.784 NA NA NA NA NA NA
## 6 0.826 NA NA NA NA NA NA
## 18.2.t.not.further.defined 18.2.i 18.2.t.t 18.2.CLAs 24.1.c 20.2.n.6.c.c
## 1 NA 0.296 NA 0.267 NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## 16.1.c 18.1.c 18.2.n.6.c.c 22.1.c 18.3.n.6.c.c.c 17.1
## 1 0.961 16.98 2.166 NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## 20.3.undifferentiated Fatty.acids..total.trans.monoenoic
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Fatty.acids..total.trans.polyenoic 13.0 15.1 18.3.n.3.c.c.c..ALA.
## 1 NA NA NA 0.315
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 20.3.n.3 20.3.n.6 20.4.n.6 18.3i 21.5 22.4
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
# This is R, so we just leave the variable in
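If the long-to-wide pivot feels like magic, here is the same idea on a toy table. This sketch swaps in base R's `tapply()` instead of `cast()` from the reshape package so it runs with no extra packages; the data are made up, and food 3 deliberately has no magnesium row so you can see where the NA comes from.

```r
# Toy long-format table: one row per (food, nutrient) measurement (made-up values)
long <- data.frame(food.id = c(1, 1, 2, 2, 3),
                   nutrient = c("Protein", "Magnesium", "Protein", "Magnesium", "Protein"),
                   amount = c(0.85, 2, 21.4, 23, 5))

# One row per food, one column per nutrient; combinations never measured become NA
wide <- tapply(long$amount,
               list(food.id = long$food.id, nutrient = long$nutrient),
               mean)
print(wide)
```

The result is a matrix rather than a data frame; `as.data.frame(wide)` would convert it if needed.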
Start to look at pairwise scatterplots of magnesium versus different features. Do any stand out to you? Are there certain things that seem to come along often with magnesium?
corr <- as.data.frame(cor(pivoted$Magnesium..Mg, pivoted[, names(pivoted)[2:ncol(pivoted)]],
use = "pairwise.complete.obs"))
names(corr) <- names(pivoted)[2:ncol(pivoted)]
sorted <- t(sort(corr))
sorted
## [,1]
## Water -0.4329084
## 18.2.i -0.3286633
## Campesterol -0.2042567
## 20.3.n.6 -0.1978139
## 18.2.t.not.further.defined -0.1655530
## 18.2.CLAs -0.1540811
## Hydroxyproline -0.1496551
## Stigmasterol -0.1414723
## 18.2.t.t -0.1404232
## Fatty.acids..total.trans.polyenoic -0.1351530
## 16.1.t -0.1340378
## Fatty.acids..total.trans.monoenoic -0.1220644
## Fluoride..F -0.1205641
## 18.1.t -0.1202294
## Fatty.acids..total.trans -0.1125682
## 21.5 -0.1105754
## 16.1.c -0.1075911
## 15.0 -0.1064847
## 17.0 -0.1005923
## 14.1 -0.0989674
## 22.4 -0.0931966
## 18.3.n.3.c.c.c..ALA. -0.0861194
## Cholesterol -0.0802892
## 20.4.undifferentiated -0.0719026
## 13.0 -0.0678139
## 16.1.undifferentiated -0.0656362
## 20.2.n.6.c.c -0.0655167
## 18.3.n.6.c.c.c -0.0610707
## Tocopherol..delta -0.0509159
## Alcohol..ethyl -0.0433235
## 17.1 -0.0414351
## 20.3.n.3 -0.0410460
## Glucose..dextrose. -0.0365333
## 6.0 -0.0346117
## Lactose -0.0319846
## 14.0 -0.0316003
## 18.0 -0.0287940
## 4.0 -0.0280181
## Fatty.acids..total.saturated -0.0271370
## 18.4 -0.0194737
## 22.5.n.3..DPA. -0.0191399
## 16.0 -0.0160485
## Vitamin.D -0.0150318
## Vitamin.D..D2...D3. -0.0149966
## 10.0 -0.0114870
## 20.5.n.3..EPA. -0.0108475
## 12.0 -0.0103500
## Sodium..Na -0.0088947
## Fructose -0.0067045
## 22.6.n.3..DHA. -0.0041905
## Vitamin.B.12 -0.0015001
## Carotene..alpha 0.0005921
## 8.0 0.0012007
## 22.1.undifferentiated 0.0032097
## Retinol 0.0036551
## 18.1.c 0.0060274
## 15.1 0.0061960
## Lycopene 0.0066968
## 22.1.c 0.0101937
## Vitamin.D2..ergocalciferol. 0.0103552
## 18.3i 0.0113164
## 20.3.undifferentiated 0.0147643
## 20.1 0.0169916
## Vitamin.A..RAE 0.0186488
## Phytosterols 0.0283239
## Lysine 0.0338393
## Cryptoxanthin..beta 0.0412096
## 18.3.undifferentiated 0.0503513
## Starch 0.0516952
## Lutein...zeaxanthin 0.0523307
## Beta.sitosterol 0.0562936
## 18.2.n.6.c.c 0.0614297
## Carotene..beta 0.0622167
## Maltose 0.0646609
## Fatty.acids..total.monounsaturated 0.0668999
## Methionine 0.0672961
## Total.lipid..fat. 0.0749518
## Sugars..total 0.0777664
## 18.1.undifferentiated 0.0799733
## Vitamin.A..IU 0.0804608
## Sucrose 0.0821277
## Galactose 0.0899923
## Choline..total 0.0976150
## 22.1.t 0.1039422
## Vitamin.C..total.ascorbic.acid 0.1097878
## Alanine 0.1142263
## 22.0 0.1142871
## Folic.acid 0.1234879
## Selenium..Se 0.1249311
## Histidine 0.1256993
## Glycine 0.1273785
## Vitamin.D3..cholecalciferol. 0.1298490
## Vitamin.B.12..added 0.1418295
## Vitamin.E..added 0.1440923
## Caffeine 0.1465752
## Isoleucine 0.1558778
## Threonine 0.1571095
## Fatty.acids..total.polyunsaturated 0.1588344
## 24.0 0.1609758
## 24.1.c 0.1625686
## 18.2.undifferentiated 0.1700506
## Proline 0.1738391
## Leucine 0.1766833
## Tocopherol..beta 0.1807530
## Tocopherol..gamma 0.1812013
## Vitamin.K..phylloquinone. 0.1829710
## 20.0 0.1842858
## Tyrosine 0.1946465
## Valine 0.1965024
## Betaine 0.1999070
## Zinc..Zn 0.2175297
## Vitamin.E..alpha.tocopherol. 0.2223973
## Riboflavin 0.2247110
## Pantothenic.acid 0.2270824
## Protein 0.2295670
## Folate..DFE 0.2331226
## Niacin 0.2544632
## Aspartic.acid 0.2588573
## Ash 0.2721394
## Energy 0.2782127
## Energy.1 0.2782127
## Serine 0.2816614
## Thiamin 0.2817144
## Theobromine 0.2874173
## Vitamin.B.6 0.2878268
## Folate..total 0.2889119
## Phenylalanine 0.2900741
## Calcium..Ca 0.2920401
## Tryptophan 0.3065870
## Glutamic.acid 0.3117075
## Cystine 0.3206655
## Carbohydrate..by.difference 0.3223046
## Manganese..Mn 0.3244848
## Folate..food 0.3628731
## Copper..Cu 0.3629089
## Arginine 0.3694273
## Iron..Fe 0.3918252
## Phosphorus..P 0.4270172
## Potassium..K 0.5108993
## Fiber..total.dietary 0.6323530
## 20.4.n.6 0.7459222
## Adjusted.Protein 1.0000000
## Magnesium..Mg 1.0000000
plot(Magnesium..Mg ~ Adjusted.Protein, pivoted)
Well, that looks like no data.
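A correlation of exactly 1 next to a nearly empty plot suggests that only a handful of foods have both values recorded. With `use = "pairwise.complete.obs"`, each correlation is computed from whatever pairs happen to be complete, and with just two complete pairs the result is always +1 or -1. A small sketch on made-up data, counting the pairs behind each correlation:

```r
# Made-up data: 'sparse' is measured for only two foods
toy <- data.frame(mg     = c(2, 23, 24, 20, 160, 45),
                  fiber  = c(0, 0.5, 1, 0.4, 9, 3),
                  sparse = c(NA, NA, 1.2, NA, 3.4, NA))

# Correlation of every other column with mg, using only complete pairs
r <- sapply(toy[-1], function(col) cor(toy$mg, col, use = "pairwise.complete.obs"))

# How many complete pairs each of those correlations is based on
n <- sapply(toy[-1], function(col) sum(!is.na(toy$mg) & !is.na(col)))

print(data.frame(r = r, n = n))
```

Reading the correlation table together with a count like `n` makes it easier to ignore the impressive-looking numbers that rest on two or three foods.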
plot(pivoted$Magnesium..Mg, pivoted[, "20.4.n.6"])
And this one's crap.
plot(Magnesium..Mg ~ Fiber..total.dietary, pivoted)
…better.
plot(Magnesium..Mg ~ Potassium..K, pivoted)
Not bad.
plot(Magnesium..Mg ~ Carbohydrate..by.difference, pivoted)
Let's get serious about this:
# library(ggplot2)
# (ggplot(melt(pivoted, id.vars = c('food.id', 'Magnesium..Mg'))) +
#   aes(Magnesium..Mg, value) + geom_point() + facet_wrap(~variable))
The model you’re going to create is basically one that fits the best line through all of these scatterplots at the same time, which is why it’s called a linear model. You’re going to make a model that says the average outcome follows the equation for a line, y = mx + b. (That’s in two dimensions anyway: one for the input and one for the output. With two inputs, x and w, the equation becomes y = mx + nw + b, which actually describes a plane rather than a line.)
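To make the two-input case concrete, here is a tiny sketch with invented numbers. The outcome is built exactly from y = 3x + 0.5w + 10 with no noise, so `lm()` should hand back m = 3, n = 0.5 and b = 10.

```r
# Invented inputs; y is constructed exactly from the plane y = 3x + 0.5w + 10
x <- c(1, 2, 3, 4, 5, 6)
w <- c(2, 1, 4, 3, 6, 5)
y <- 3 * x + 0.5 * w + 10

# Ordinary least squares with two inputs and an intercept
plane.fit <- lm(y ~ x + w)

# (Intercept) is b, the x coefficient is m, the w coefficient is n
print(coef(plane.fit))
```

Real data never sit exactly on the plane, of course; the fitted plane is then the one that minimizes the squared vertical distances to the points.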
# Pivoted label data
data <- read.csv("real_stuff.csv")
# Drop columns we already eliminated in Python
data$Vitamin.A..IU <- NULL
data$Fatty.acids..total.monounsaturated <- NULL
data$Fatty.acids..total.polyunsaturated <- NULL
data$Fatty.acids..total.trans.monoenoic <- NULL
data$Fatty.acids..total.trans.polyenoic <- NULL
# Load in food group data to use as indicator variables
id.group <- read.csv("id_group.csv")
merged <- merge(id.group, data)
Now fit a linear model using ordinary least squares. Recall that each coefficient tells you how much the average quantity of magnesium (in mg per 100 g of food) changes for each increase of 1 in that feature, holding the other features fixed. So if the coefficient for fat is 2.1 (I have no idea what it should be), then you should expect an additional 2.1mg of magnesium for every gram of fat. And if the coefficient for dairy is 5.5, then you should expect an additional 5.5mg of magnesium on average just for being a dairy product (relative to whichever food group serves as the baseline).
You shouldn’t need to code up OLS yourself; there are plenty of good OLS implementations in Python (StatsModels, ols.py and scikit-learn are all excellent choices), or you could load this data set into R and work there. Before you do that, though, you should read ahead a bit. Also make sure you fit the regression without normalization and with an intercept, if those are options you can set.
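Here is a toy sketch of how a food group turns into indicator coefficients in R. The data are invented so that the answer is known in advance: magnesium is 10 + 2 per gram of fat, plus 50 for nuts and 5 for snacks. R's default coding drops the first factor level (here "Dairy") and treats it as the baseline, so the other group coefficients are offsets from it.

```r
# Invented data: mg = 10 + 2 * fat, plus 50 for Nuts and 5 for Snacks
groups <- data.frame(group = factor(c("Dairy", "Dairy", "Nuts", "Nuts", "Snacks", "Snacks")),
                     fat   = c(1, 3, 2, 4, 1, 5),
                     mg    = c(12, 16, 64, 68, 17, 25))

# The factor is expanded into 0/1 indicator columns; Dairy is the baseline
print(model.matrix(mg ~ group + fat, groups))

group.fit <- lm(mg ~ group + fat, groups)
print(coef(group.fit))
```

Notice there is no `groupDairy` coefficient: a dairy food gets only the intercept plus its fat term, and the other groups are read as "this much more (or less) than dairy".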
How many mg is your model off by on average?
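One way to answer that is the mean absolute residual, with the root mean squared error as a companion that punishes big misses harder. The helper names `mae` and `rmse` are my own, and the four-point data set is invented so the arithmetic can be checked by hand; this measures error on the training data only.

```r
# Hypothetical helpers: average size of the model's misses, in the outcome's units
mae  <- function(model) mean(abs(residuals(model)))
rmse <- function(model) sqrt(mean(residuals(model)^2))

# Invented data; the least squares line is y = 0.5 + 0.8x,
# so the residuals are -0.3, 0.9, -0.9, 0.3
toy <- data.frame(x = 1:4, y = c(1, 3, 2, 4))
toy.fit <- lm(y ~ x, toy)

print(mae(toy.fit))   # 0.6
print(rmse(toy.fit))  # sqrt(0.45), about 0.67
```

The same helpers could be applied to a fitted model such as `fit.no.group` once it exists, giving the answer in mg of magnesium per 100 g.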
Let’s say you want a model that you can hold in your head. How good is each of the coefficients? We may be able to eliminate some of them by keeping the ones that are most strongly positive or negative and whose plausible range seems unlikely to include zero (a coefficient of zero means the feature has no effect).
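In R, `confint()` gives a range for each coefficient, which makes the "unlikely to include zero" check mechanical. A sketch on simulated data, where `strong` really drives the outcome and `noise` does not (both names and the data are invented for illustration):

```r
set.seed(42)
n <- 50

# Simulated data: y depends on 'strong' only; 'noise' is unrelated to y
sim <- data.frame(strong = rnorm(n), noise = rnorm(n))
sim$y <- 5 * sim$strong + rnorm(n)

sim.fit <- lm(y ~ strong + noise, sim)

# 95% confidence interval for each coefficient
ci <- confint(sim.fit, level = 0.95)

# TRUE where the whole interval sits on one side of zero
excludes.zero <- ci[, 1] > 0 | ci[, 2] < 0
print(cbind(ci, excludes.zero))
```

Coefficients whose interval straddles zero are candidates for dropping, though with many features some will land on the wrong side by chance, so treat this as a screening step rather than a verdict.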
# No food group data
fit.no.group <- lm(Magnesium..Mg ~ . - food.id, data)
summary(fit.no.group)
##
## Call:
## lm(formula = Magnesium..Mg ~ . - food.id, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -276.8 -16.4 -4.1 8.9 470.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.42e+01 4.90e+00 -4.94 9.6e-07 ***
## Protein 2.39e+00 2.06e-01 11.62 < 2e-16 ***
## Total.lipid..fat. 8.45e-01 1.20e-01 7.04 4.1e-12 ***
## Carbohydrate..by.difference 3.15e-01 9.93e-02 3.18 0.0016 **
## Sugars..total 2.04e-01 1.51e-01 1.36 0.1754
## Fiber..total.dietary 5.67e+00 3.75e-01 15.14 < 2e-16 ***
## Calcium..Ca 2.33e-02 8.50e-03 2.74 0.0063 **
## Iron..Fe -1.35e-01 2.57e-01 -0.53 0.5980
## Sodium..Na -9.25e-03 3.75e-03 -2.47 0.0138 *
## Vitamin.A..RAE -1.03e-03 9.23e-04 -1.12 0.2642
## Vitamin.C..total.ascorbic.acid 3.84e-01 6.48e-02 5.93 4.6e-09 ***
## Cholesterol -1.55e-02 9.57e-03 -1.62 0.1050
## Fatty.acids..total.trans -1.64e+00 5.35e-01 -3.06 0.0023 **
## Fatty.acids..total.saturated -7.24e-01 2.20e-01 -3.29 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.6 on 791 degrees of freedom
## (6733 observations deleted due to missingness)
## Multiple R-squared: 0.507, Adjusted R-squared: 0.499
## F-statistic: 62.5 on 13 and 791 DF, p-value: <2e-16
fit.no.group2 <- lm(Magnesium..Mg ~ . - food.id - Sugars..total - Iron..Fe -
Vitamin.A..RAE - Cholesterol, data)
summary(fit.no.group2)
##
## Call:
## lm(formula = Magnesium..Mg ~ . - food.id - Sugars..total - Iron..Fe -
## Vitamin.A..RAE - Cholesterol, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -274.8 -16.2 -3.6 9.4 472.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.73264 4.83921 -5.11 4.0e-07 ***
## Protein 2.30238 0.20319 11.33 < 2e-16 ***
## Total.lipid..fat. 0.84787 0.12003 7.06 3.5e-12 ***
## Carbohydrate..by.difference 0.39812 0.06345 6.27 5.8e-10 ***
## Fiber..total.dietary 5.52282 0.35172 15.70 < 2e-16 ***
## Calcium..Ca 0.02278 0.00825 2.76 0.0059 **
## Sodium..Na -0.00923 0.00370 -2.49 0.0128 *
## Vitamin.C..total.ascorbic.acid 0.36746 0.06087 6.04 2.4e-09 ***
## Fatty.acids..total.trans -1.67752 0.53500 -3.14 0.0018 **
## Fatty.acids..total.saturated -0.68258 0.21922 -3.11 0.0019 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.7 on 795 degrees of freedom
## (6733 observations deleted due to missingness)
## Multiple R-squared: 0.502, Adjusted R-squared: 0.496
## F-statistic: 89 on 9 and 795 DF, p-value: <2e-16
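Both summaries report "6733 observations deleted due to missingness". That is `lm()` silently dropping every row with an NA in any variable used by the model, which is why only about 800 foods are left. A toy sketch (invented numbers) of how to check how many rows a fit really used and which columns are responsible:

```r
# Invented data with a couple of missing values
gaps <- data.frame(mg      = c(2, 23, 24, 20, 160, 45),
                   protein = c(0.85, 21.4, NA, 20.75, 30, 12),
                   fiber   = c(0, 0, 0, NA, 6, 3))

gaps.fit <- lm(mg ~ protein + fiber, gaps)

# Rows actually used by the fit versus rows with no NA at all
used <- nobs(gaps.fit)
complete <- sum(complete.cases(gaps))
print(c(rows = nrow(gaps), used = used, complete = complete))

# Number of missing values per column, to see what is costing rows
print(colSums(is.na(gaps)))
```

Dropping a sparsely measured feature can therefore raise the number of usable foods, which is worth keeping in mind when comparing models fitted on different sets of columns.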
# With food group data
fit <- lm(Magnesium..Mg ~ . - X - food.id, merged)
summary(fit)
##
## Call:
## lm(formula = Magnesium..Mg ~ . - X - food.id, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -220.84 -12.79 -0.04 10.50 285.92
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 6.51e+00 1.50e+01 0.43
## food.groupBaked Products 1.14e+01 1.51e+01 0.75
## food.groupBeef Products -6.27e+01 1.57e+01 -4.01
## food.groupBeverages 2.42e+01 1.89e+01 1.28
## food.groupBreakfast Cereals 1.95e+01 1.42e+01 1.37
## food.groupCereal Grains and Pasta 1.05e+01 1.54e+01 0.68
## food.groupDairy and Egg Products -1.65e+01 2.86e+01 -0.58
## food.groupEthnic Foods -8.04e+01 3.81e+01 -2.11
## food.groupFast Foods -2.59e+01 1.51e+01 -1.72
## food.groupFats and Oils -2.04e+01 1.68e+01 -1.21
## food.groupFinfish and Shellfish Products -1.39e+01 3.74e+01 -0.37
## food.groupFruits and Fruit Juices -1.12e+01 2.85e+01 -0.39
## food.groupLamb, Veal, and Game Products -5.88e+01 2.35e+01 -2.50
## food.groupLegumes and Legume Products -3.31e+01 1.52e+01 -2.17
## food.groupMeals, Entrees, and Sidedishes -1.56e+01 3.73e+01 -0.42
## food.groupNut and Seed Products 1.60e+02 1.86e+01 8.61
## food.groupPork Products -5.57e+01 1.53e+01 -3.63
## food.groupPoultry Products -5.92e+01 1.60e+01 -3.70
## food.groupSausages and Luncheon Meats -4.17e+01 1.64e+01 -2.54
## food.groupSnacks 5.28e+01 1.45e+01 3.63
## food.groupSoups, Sauces, and Gravies 1.59e+00 1.62e+01 0.10
## food.groupSpices and Herbs 4.76e+01 1.87e+01 2.55
## food.groupSweets 1.85e+01 1.55e+01 1.19
## food.groupVegetables and Vegetable Products -2.44e+00 1.73e+01 -0.14
## Protein 3.21e+00 2.51e-01 12.80
## Total.lipid..fat. 3.72e-01 1.26e-01 2.96
## Carbohydrate..by.difference -4.29e-01 1.39e-01 -3.08
## Sugars..total 2.28e-01 1.75e-01 1.30
## Fiber..total.dietary 4.35e+00 3.38e-01 12.84
## Calcium..Ca 1.23e-02 7.30e-03 1.69
## Iron..Fe 4.39e-01 2.62e-01 1.68
## Sodium..Na -4.78e-03 3.67e-03 -1.30
## Vitamin.A..RAE -6.89e-04 8.69e-04 -0.79
## Vitamin.C..total.ascorbic.acid 2.70e-01 5.59e-02 4.84
## Cholesterol 2.18e-03 8.08e-03 0.27
## Fatty.acids..total.trans -5.60e-01 4.49e-01 -1.25
## Fatty.acids..total.saturated -2.96e-01 1.82e-01 -1.63
## Pr(>|t|)
## (Intercept) 0.66382
## food.groupBaked Products 0.45058
## food.groupBeef Products 6.7e-05 ***
## food.groupBeverages 0.19962
## food.groupBreakfast Cereals 0.16973
## food.groupCereal Grains and Pasta 0.49482
## food.groupDairy and Egg Products 0.56251
## food.groupEthnic Foods 0.03525 *
## food.groupFast Foods 0.08586 .
## food.groupFats and Oils 0.22646
## food.groupFinfish and Shellfish Products 0.70998
## food.groupFruits and Fruit Juices 0.69380
## food.groupLamb, Veal, and Game Products 0.01249 *
## food.groupLegumes and Legume Products 0.02998 *
## food.groupMeals, Entrees, and Sidedishes 0.67548
## food.groupNut and Seed Products < 2e-16 ***
## food.groupPork Products 0.00030 ***
## food.groupPoultry Products 0.00023 ***
## food.groupSausages and Luncheon Meats 0.01134 *
## food.groupSnacks 0.00030 ***
## food.groupSoups, Sauces, and Gravies 0.92147
## food.groupSpices and Herbs 0.01095 *
## food.groupSweets 0.23320
## food.groupVegetables and Vegetable Products 0.88820
## Protein < 2e-16 ***
## Total.lipid..fat. 0.00319 **
## Carbohydrate..by.difference 0.00215 **
## Sugars..total 0.19261
## Fiber..total.dietary < 2e-16 ***
## Calcium..Ca 0.09150 .
## Iron..Fe 0.09410 .
## Sodium..Na 0.19338
## Vitamin.A..RAE 0.42784
## Vitamin.C..total.ascorbic.acid 1.6e-06 ***
## Cholesterol 0.78695
## Fatty.acids..total.trans 0.21295
## Fatty.acids..total.saturated 0.10373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.7 on 768 degrees of freedom
## (6733 observations deleted due to missingness)
## Multiple R-squared: 0.682, Adjusted R-squared: 0.667
## F-statistic: 45.7 on 36 and 768 DF, p-value: <2e-16
# Refit with food group, dropping seven of the nutrient predictors
fit2 <- lm(Magnesium..Mg ~ . - X - food.id - Sodium..Na - Sugars..total - Iron..Fe -
Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans - Fatty.acids..total.saturated,
merged)
summary(fit2)
##
## Call:
## lm(formula = Magnesium..Mg ~ . - X - food.id - Sodium..Na - Sugars..total -
## Iron..Fe - Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans -
## Fatty.acids..total.saturated, data = merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.59 -11.60 -2.16 10.89 288.45
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.20830 14.88253 0.28
## food.groupBaked Products 10.12727 15.08122 0.67
## food.groupBeef Products -58.93361 15.37465 -3.83
## food.groupBeverages 26.91634 18.79306 1.43
## food.groupBreakfast Cereals 24.81798 13.81308 1.80
## food.groupCereal Grains and Pasta 8.81720 15.22526 0.58
## food.groupDairy and Egg Products -13.44246 28.44947 -0.47
## food.groupEthnic Foods -76.10468 38.02391 -2.00
## food.groupFast Foods -25.78984 14.86540 -1.73
## food.groupFats and Oils -21.61226 16.48300 -1.31
## food.groupFinfish and Shellfish Products -13.96120 37.39451 -0.37
## food.groupFruits and Fruit Juices -7.64936 28.37866 -0.27
## food.groupLamb, Veal, and Game Products -61.79421 21.48741 -2.88
## food.groupLegumes and Legume Products -32.41212 15.09434 -2.15
## food.groupMeals, Entrees, and Sidedishes -16.86988 37.28283 -0.45
## food.groupNut and Seed Products 167.86470 18.20938 9.22
## food.groupPork Products -54.66799 15.03012 -3.64
## food.groupPoultry Products -56.31371 15.70695 -3.59
## food.groupSausages and Luncheon Meats -42.88334 15.87745 -2.70
## food.groupSnacks 51.90391 14.48184 3.58
## food.groupSoups, Sauces, and Gravies 0.95879 15.90691 0.06
## food.groupSpices and Herbs 50.40825 18.57745 2.71
## food.groupSweets 24.74113 14.28421 1.73
## food.groupVegetables and Vegetable Products -1.49833 17.27685 -0.09
## Protein 3.17839 0.24739 12.85
## Total.lipid..fat. 0.24549 0.10026 2.45
## Carbohydrate..by.difference -0.32918 0.11305 -2.91
## Fiber..total.dietary 4.25724 0.31860 13.36
## Calcium..Ca 0.01626 0.00695 2.34
## Vitamin.C..total.ascorbic.acid 0.30657 0.05128 5.98
## Pr(>|t|)
## (Intercept) 0.77743
## food.groupBaked Products 0.50209
## food.groupBeef Products 0.00014 ***
## food.groupBeverages 0.15248
## food.groupBreakfast Cereals 0.07277 .
## food.groupCereal Grains and Pasta 0.56268
## food.groupDairy and Egg Products 0.63670
## food.groupEthnic Foods 0.04569 *
## food.groupFast Foods 0.08316 .
## food.groupFats and Oils 0.19018
## food.groupFinfish and Shellfish Products 0.70899
## food.groupFruits and Fruit Juices 0.78758
## food.groupLamb, Veal, and Game Products 0.00414 **
## food.groupLegumes and Legume Products 0.03208 *
## food.groupMeals, Entrees, and Sidedishes 0.65105
## food.groupNut and Seed Products < 2e-16 ***
## food.groupPork Products 0.00029 ***
## food.groupPoultry Products 0.00036 ***
## food.groupSausages and Luncheon Meats 0.00707 **
## food.groupSnacks 0.00036 ***
## food.groupSoups, Sauces, and Gravies 0.95195
## food.groupSpices and Herbs 0.00681 **
## food.groupSweets 0.08366 .
## food.groupVegetables and Vegetable Products 0.93091
## Protein < 2e-16 ***
## Total.lipid..fat. 0.01456 *
## Carbohydrate..by.difference 0.00370 **
## Fiber..total.dietary < 2e-16 ***
## Calcium..Ca 0.01954 *
## Vitamin.C..total.ascorbic.acid 3.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.8 on 775 degrees of freedom
## (6733 observations deleted due to missingness)
## Multiple R-squared: 0.678, Adjusted R-squared: 0.666
## F-statistic: 56.3 on 29 and 775 DF, p-value: <2e-16
## ===========================================================================
## X.Intercept. food.groupBaked.Products food.groupBeef.Products
## 1 4.208 10.127 -58.93
## 2 0.241 18.534 -61.92
## 3 -5.672 32.055 -57.46
## 4 -4.357 4.924 -37.28
## 5 13.491 1.533 -68.24
## 6 2.198 6.814 -56.64
## food.groupBeverages food.groupBreakfast.Cereals
## 1 26.916 24.82
## 2 6.291 31.99
## 3 38.713 43.44
## 4 31.011 19.50
## 5 14.386 14.84
## 6 35.029 30.46
## food.groupCereal.Grains.and.Pasta food.groupDairy.and.Egg.Products
## 1 8.817 -13.44
## 2 14.842 -28.35
## 3 30.107 10.87
## 4 14.449 NA
## 5 -1.792 -39.00
## 6 11.234 -27.45
## food.groupEthnic.Foods food.groupFast.Foods food.groupFats.and.Oils
## 1 -76.10 -25.790 -21.612
## 2 NA -27.357 -10.324
## 3 NA -16.229 -23.643
## 4 NA -9.616 -4.936
## 5 -85.39 -37.258 -21.218
## 6 NA -25.434 -15.959
## food.groupFinfish.and.Shellfish.Products
## 1 -13.961
## 2 NA
## 3 -5.738
## 4 -2.270
## 5 -21.961
## 6 -10.138
## food.groupFruits.and.Fruit.Juices
## 1 -7.649
## 2 1.902
## 3 NA
## 4 -6.021
## 5 -13.274
## 6 -3.927
## food.groupLamb..Veal..and.Game.Products
## 1 -61.79
## 2 -74.42
## 3 -66.82
## 4 -35.32
## 5 -74.88
## 6 -56.17
## food.groupLegumes.and.Legume.Products
## 1 -32.41
## 2 -20.04
## 3 -23.51
## 4 -19.45
## 5 -46.79
## 6 -22.90
## food.groupMeals..Entrees..and.Sidedishes food.groupNut.and.Seed.Products
## 1 -16.870 167.9
## 2 NA 216.7
## 3 NA 145.9
## 4 -7.078 181.3
## 5 -23.957 114.2
## 6 NA 114.6
## food.groupPork.Products food.groupPoultry.Products
## 1 -54.67 -56.31
## 2 -55.30 -56.74
## 3 -52.27 -52.70
## 4 -33.43 -35.13
## 5 -62.72 -60.85
## 6 -51.99 -54.71
## food.groupSausages.and.Luncheon.Meats food.groupSnacks
## 1 -42.88 51.90
## 2 -40.71 54.53
## 3 -42.14 55.68
## 4 -24.65 39.98
## 5 -48.64 56.63
## 6 -40.85 62.75
## food.groupSoups..Sauces..and.Gravies food.groupSpices.and.Herbs
## 1 0.9588 50.41
## 2 -0.2460 24.31
## 3 6.6205 79.89
## 4 9.0240 55.66
## 5 -7.3165 69.01
## 6 2.3970 60.91
## food.groupSweets food.groupVegetables.and.Vegetable.Products Protein
## 1 24.74 -1.498 3.178
## 2 25.12 2.867 3.440
## 3 32.85 7.864 3.441
## 4 24.74 3.680 2.651
## 5 19.78 -9.831 3.190
## 6 28.78 -2.939 3.174
## Total.lipid..fat. Carbohydrate..by.difference Fiber..total.dietary
## 1 0.2455 -0.3292 4.257
## 2 0.1475 -0.2938 3.150
## 3 0.3957 -0.4223 2.575
## 4 0.1373 -0.1376 4.455
## 5 0.1375 -0.3130 4.186
## 6 0.2003 -0.3064 3.583
## Calcium..Ca Vitamin.C..total.ascorbic.acid
## 1 0.016257 0.3066
## 2 0.003855 0.4023
## 3 0.058446 0.3060
## 4 0.008491 0.2410
## 5 0.007731 0.3642
## 6 -0.010721 0.6662
If you take the 2.5th and 97.5th percentiles of each column in the array of coefficients, you get a simulated 95% confidence interval for that coefficient. Which nutrients or categories don’t have 0 in their 95% confidence interval?
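The percentile step can be sketched as follows. This is an illustration under assumptions, not the code that produced the table above: it uses the built-in `mtcars` data in place of `merged`, resamples rows with replacement, and the names `n.boot`, `fit.once`, and `excludes.zero` are hypothetical.

```r
# mtcars stands in for the merged nutrient data (an assumption).
set.seed(1)
n.boot <- 200

# Refit the model on one data frame and return its coefficients.
fit.once <- function(d) coef(lm(mpg ~ wt + hp, data = d))

# One row of coefficients per resample (rows drawn with replacement).
coefs <- t(replicate(n.boot,
                     fit.once(mtcars[sample(nrow(mtcars), replace = TRUE), ])))

# 2.5th and 97.5th percentiles of each column give the simulated 95% interval.
# na.rm = TRUE skips resamples in which a coefficient could not be estimated.
ci <- apply(coefs, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)

# TRUE where the interval lies entirely on one side of zero.
excludes.zero <- ci["2.5%", ] > 0 | ci["97.5%", ] < 0
print(t(ci))
print(excludes.zero)
```

The `na.rm = TRUE` argument matters for a table like the one above, where some food-group coefficients are `NA` in some rows.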
Let's take a visual look at this:
Of those, which are the biggest? Write a sentence or two explaining what the best heuristics seem to be for buying food that’s high in magnesium. Which coefficients is the modeling process least certain about?
## ===========================================================================
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
# Classify foods as above or below 50 units of magnesium, using the same predictors as fit2
rf <- randomForest(factor(Magnesium..Mg > 50) ~ . - X - food.id - Sodium..Na -
Sugars..total - Iron..Fe - Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans -
Fatty.acids..total.saturated, merged, na.action = na.omit)
print(rf)
##
## Call:
## randomForest(formula = factor(Magnesium..Mg > 50) ~ . - X - food.id - Sodium..Na - Sugars..total - Iron..Fe - Vitamin.A..RAE - Cholesterol - Fatty.acids..total.trans - Fatty.acids..total.saturated, data = merged, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.72%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 632 13 0.02016
## TRUE 25 135 0.15625
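The error figures in this printout follow directly from the four counts in the confusion matrix. A small sketch of the arithmetic, with the counts copied by hand from the output above (the object names `conf`, `class.error`, and `oob.error` are illustrative, and rows are taken to be the true class):

```r
# Counts copied from the confusion matrix printed above
# (rows: true class, columns: out-of-bag prediction).
conf <- matrix(c(632, 13,
                 25, 135),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))

# Per-class error: misclassified cases divided by the row total.
class.error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: all misclassified cases divided by all cases.
oob.error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class.error, 5))      # 0.02016 and 0.15625, as in the printout
print(round(100 * oob.error, 2))  # 4.72
```

Only 160 of the 805 foods are in the high-magnesium class, so always predicting FALSE would already be right about 80% of the time. The 15.6% error on the TRUE class is therefore the more informative number.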
varImpPlot(rf)