As a person of many talents, you’re ready to take on a different job: nutrition analysis! Your goal is to analyze the sugar content of a sample of foods from around the world.

A large dataset called food.csv is ready for you in the working directory. Instead of the usual read.csv(), however, you’re going to use fread() from the data.table package, which is typically faster on large files. By default, fread() returns a data table, but since you’re used to working with data frames, you can have it return one by setting data.table = FALSE.

# Load data.table
library(data.table)
# Import food.csv as a data frame: food
food <- fread("food.csv", data.table = FALSE)
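To see what data.table = FALSE changes, here is a minimal sketch that reads the same file both ways and compares the classes of the results. The temporary file and its three columns are a hypothetical stand-in for food.csv, not the real data.

```r
library(data.table)

# Hypothetical toy file standing in for food.csv (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("code,product_name,sugars_100g",
             "100030,Jam,60.5",
             "100050,Chocolate,45"), tmp)

# Default behaviour: fread() returns a data.table
as_dt <- fread(tmp)

# With data.table = FALSE, fread() returns a plain data frame
as_df <- fread(tmp, data.table = FALSE)

class(as_dt)
class(as_df)
```

A data table also inherits from data.frame, so most data frame code works on either object. Asking for a plain data frame simply keeps the familiar subsetting behaviour.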

Examining the data

Let’s get an idea of what the dataset looks like in order to know how to proceed.

# View summary of food
summary(food)
       V1              code            url              creator            created_t        
 Min.   :   1.0   Min.   :100030   Length:1500        Length:1500        Min.   :1.332e+09  
 1st Qu.: 375.8   1st Qu.:124975   Class :character   Class :character   1st Qu.:1.394e+09  
 Median : 750.5   Median :149514   Mode  :character   Mode  :character   Median :1.425e+09  
 Mean   : 750.5   Mean   :149613                                         Mean   :1.414e+09  
 3rd Qu.:1125.2   3rd Qu.:174506                                         3rd Qu.:1.436e+09  
 Max.   :1500.0   Max.   :199880                                         Max.   :1.453e+09  
 created_datetime   last_modified_t     last_modified_datetime product_name       generic_name      
 Length:1500        Min.   :1.340e+09   Length:1500            Length:1500        Length:1500       
 Class :character   1st Qu.:1.424e+09   Class :character       Class :character   Class :character  
 Mode  :character   Median :1.437e+09   Mode  :character       Mode  :character   Mode  :character  
                    Mean   :1.430e+09                                                               
                    3rd Qu.:1.446e+09                                                               
                    Max.   :1.453e+09                                                               
   quantity          packaging         packaging_tags        brands          brands_tags       
 Length:1500        Length:1500        Length:1500        Length:1500        Length:1500       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
  categories        categories_tags    categories_en        origins          origins_tags      
 Length:1500        Length:1500        Length:1500        Length:1500        Length:1500       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
 manufacturing_places manufacturing_places_tags    labels          labels_tags         labels_en        
 Length:1500          Length:1500               Length:1500        Length:1500        Length:1500       
 Class :character     Class :character          Class :character   Class :character   Class :character  
 Mode  :character     Mode  :character          Mode  :character   Mode  :character   Mode  :character  
                                                                                                        
                                                                                                        
                                                                                                        
  emb_codes         emb_codes_tags     first_packaging_code_geo  cities        cities_tags       
 Length:1500        Length:1500        Length:1500              Mode:logical   Length:1500       
 Class :character   Class :character   Class :character         NA's:1500      Class :character  
 Mode  :character   Mode  :character   Mode  :character                        Mode  :character  
                                                                                                 
                                                                                                 
                                                                                                 
 purchase_places       stores           countries         countries_tags     countries_en      
 Length:1500        Length:1500        Length:1500        Length:1500        Length:1500       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
 ingredients_text    allergens         allergens_en      traces          traces_tags       
 Length:1500        Length:1500        Mode:logical   Length:1500        Length:1500       
 Class :character   Class :character   NA's:1500      Class :character   Class :character  
 Mode  :character   Mode  :character                  Mode  :character   Mode  :character  
                                                                                           
                                                                                           
                                                                                           
  traces_en         serving_size       no_nutriments   additives_n      additives        
 Length:1500        Length:1500        Mode:logical   Min.   : 0.000   Length:1500       
 Class :character   Class :character   NA's:1500      1st Qu.: 0.000   Class :character  
 Mode  :character   Mode  :character                  Median : 1.000   Mode  :character  
                                                      Mean   : 1.846                     
                                                      3rd Qu.: 3.000                     
                                                      Max.   :17.000                     
 additives_tags     additives_en       ingredients_from_palm_oil_n ingredients_from_palm_oil
 Length:1500        Length:1500        Min.   :0.0000              Mode:logical             
 Class :character   Class :character   1st Qu.:0.0000              NA's:1500                
 Mode  :character   Mode  :character   Median :0.0000                                       
                                       Mean   :0.0487                                       
                                       3rd Qu.:0.0000                                       
                                       Max.   :1.0000                                       
 ingredients_from_palm_oil_tags ingredients_that_may_be_from_palm_oil_n
 Length:1500                    Min.   :0.0000                         
 Class :character               1st Qu.:0.0000                         
 Mode  :character               Median :0.0000                         
                                Mean   :0.1379                         
                                3rd Qu.:0.0000                         
                                Max.   :4.0000                         
 ingredients_that_may_be_from_palm_oil ingredients_that_may_be_from_palm_oil_tags nutrition_grade_uk
 Mode:logical                          Length:1500                                Mode:logical      
 NA's:1500                             Class :character                           NA's:1500         
                                       Mode  :character                                             
                                                                                                    
                                                                                                    
                                                                                                    
 nutrition_grade_fr pnns_groups_1      pnns_groups_2         states          states_tags       
 Length:1500        Length:1500        Length:1500        Length:1500        Length:1500       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
  states_en         main_category      main_category_en    image_url         image_small_url   
 Length:1500        Length:1500        Length:1500        Length:1500        Length:1500       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
  energy_100g     energy_from_fat_100g    fat_100g      saturated_fat_100g butyric_acid_100g
 Min.   :   0.0   Min.   :   0.00      Min.   :  0.00   Min.   : 0.000     Mode:logical     
 1st Qu.: 369.8   1st Qu.:  35.98      1st Qu.:  0.90   1st Qu.: 0.200     NA's:1500        
 Median : 966.5   Median : 237.00      Median :  6.00   Median : 1.700                      
 Mean   :1083.2   Mean   : 668.41      Mean   : 13.39   Mean   : 4.874                      
 3rd Qu.:1641.5   3rd Qu.: 974.00      3rd Qu.: 20.00   3rd Qu.: 6.500                      
 Max.   :3700.0   Max.   :2900.00      Max.   :100.00   Max.   :57.000                      
 caproic_acid_100g caprylic_acid_100g capric_acid_100g lauric_acid_100g myristic_acid_100g
 Mode:logical      Mode:logical       Mode:logical     Mode:logical     Mode:logical      
 NA's:1500         NA's:1500          NA's:1500        NA's:1500        NA's:1500         
                                                                                          
                                                                                          
                                                                                          
                                                                                          
 palmitic_acid_100g stearic_acid_100g arachidic_acid_100g behenic_acid_100g lignoceric_acid_100g
 Mode:logical       Mode:logical      Mode:logical        Mode:logical      Mode:logical        
 NA's:1500          NA's:1500         NA's:1500           NA's:1500         NA's:1500           
                                                                                                
                                                                                                
                                                                                                
                                                                                                
 cerotic_acid_100g montanic_acid_100g melissic_acid_100g monounsaturated_fat_100g polyunsaturated_fat_100g
 Mode:logical      Mode:logical       Mode:logical       Min.   : 0.00            Min.   : 0.400          
 NA's:1500         NA's:1500          NA's:1500          1st Qu.: 3.87            1st Qu.: 1.653          
                                                         Median : 9.50            Median : 3.900          
                                                         Mean   :19.77            Mean   : 9.986          
                                                         3rd Qu.:29.00            3rd Qu.:12.700          
                                                         Max.   :75.00            Max.   :46.200          
 omega_3_fat_100g alpha_linolenic_acid_100g eicosapentaenoic_acid_100g docosahexaenoic_acid_100g
 Min.   : 0.033   Min.   :0.0800            Min.   :0.721              Min.   :1.09             
 1st Qu.: 1.300   1st Qu.:0.0905            1st Qu.:0.721              1st Qu.:1.09             
 Median : 3.000   Median :0.1010            Median :0.721              Median :1.09             
 Mean   : 3.726   Mean   :0.1737            Mean   :0.721              Mean   :1.09             
 3rd Qu.: 3.200   3rd Qu.:0.2205            3rd Qu.:0.721              3rd Qu.:1.09             
 Max.   :12.400   Max.   :0.3400            Max.   :0.721              Max.   :1.09             
 omega_6_fat_100g linoleic_acid_100g arachidonic_acid_100g gamma_linolenic_acid_100g
 Min.   :0.25     Min.   :0.5000     Mode:logical          Mode:logical             
 1st Qu.:0.25     1st Qu.:0.5165     NA's:1500             NA's:1500                
 Median :0.25     Median :0.5330                                                    
 Mean   :0.25     Mean   :0.5330                                                    
 3rd Qu.:0.25     3rd Qu.:0.5495                                                    
 Max.   :0.25     Max.   :0.5660                                                    
 dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g elaidic_acid_100g gondoic_acid_100g
 Mode:logical                     Mode:logical     Mode:logical    Mode:logical      Mode:logical     
 NA's:1500                        NA's:1500        NA's:1500       NA's:1500         NA's:1500        
                                                                                                      
                                                                                                      
                                                                                                      
                                                                                                      
 mead_acid_100g erucic_acid_100g nervonic_acid_100g trans_fat_100g   cholesterol_100g carbohydrates_100g
 Mode:logical   Mode:logical     Mode:logical       Min.   :0.0000   Min.   :0.0000   Min.   :  0.000   
 NA's:1500      NA's:1500        NA's:1500          1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  3.792   
                                                    Median :0.0000   Median :0.0000   Median : 13.500   
                                                    Mean   :0.0105   Mean   :0.0265   Mean   : 27.958   
                                                    3rd Qu.:0.0000   3rd Qu.:0.0026   3rd Qu.: 55.000   
                                                    Max.   :0.1000   Max.   :0.4300   Max.   :100.000   
  sugars_100g     sucrose_100g   glucose_100g   fructose_100g   lactose_100g   maltose_100g  
 Min.   :  0.00   Mode:logical   Mode:logical   Min.   :100    Min.   :0.000   Mode:logical  
 1st Qu.:  1.00   NA's:1500      NA's:1500      1st Qu.:100    1st Qu.:0.250   NA's:1500     
 Median :  4.05                                 Median :100    Median :0.500                 
 Mean   : 12.66                                 Mean   :100    Mean   :2.933                 
 3rd Qu.: 14.70                                 3rd Qu.:100    3rd Qu.:4.400                 
 Max.   :100.00                                 Max.   :100    Max.   :8.300                 
 maltodextrins_100g  starch_100g     polyols_100g     fiber_100g     proteins_100g     casein_100g  
 Mode:logical       Min.   : 0.00   Min.   : 8.60   Min.   : 0.000   Min.   : 0.000   Min.   :1.1   
 NA's:1500          1st Qu.: 9.45   1st Qu.:59.10   1st Qu.: 0.500   1st Qu.: 1.500   1st Qu.:1.1   
                    Median :39.50   Median :67.00   Median : 1.750   Median : 6.000   Median :1.1   
                    Mean   :30.73   Mean   :56.06   Mean   : 2.823   Mean   : 7.563   Mean   :1.1   
                    3rd Qu.:42.85   3rd Qu.:69.80   3rd Qu.: 3.500   3rd Qu.:10.675   3rd Qu.:1.1   
                    Max.   :71.00   Max.   :70.00   Max.   :46.700   Max.   :61.000   Max.   :1.1   
 serum_proteins_100g nucleotides_100g   salt_100g         sodium_100g       alcohol_100g  
 Mode:logical        Mode:logical     Min.   :  0.0000   Min.   : 0.0000   Min.   : 0.00  
 NA's:1500           NA's:1500        1st Qu.:  0.0438   1st Qu.: 0.0172   1st Qu.: 0.00  
                                      Median :  0.4498   Median : 0.1771   Median : 5.50  
                                      Mean   :  1.1205   Mean   : 0.4409   Mean   :10.07  
                                      3rd Qu.:  1.1938   3rd Qu.: 0.4700   3rd Qu.:13.00  
                                      Max.   :102.0000   Max.   :40.0000   Max.   :50.00  
 vitamin_a_100g   beta_carotene_100g vitamin_d_100g  vitamin_e_100g   vitamin_k_100g vitamin_c_100g 
 Min.   :0.0000   Mode:logical       Min.   :0e+00   Min.   :0.0005   Min.   :0      Min.   :0.000  
 1st Qu.:0.0000   NA's:1500          1st Qu.:0e+00   1st Qu.:0.0021   1st Qu.:0      1st Qu.:0.002  
 Median :0.0001                      Median :0e+00   Median :0.0044   Median :0      Median :0.019  
 Mean   :0.0003                      Mean   :0e+00   Mean   :0.0069   Mean   :0      Mean   :0.025  
 3rd Qu.:0.0006                      3rd Qu.:0e+00   3rd Qu.:0.0097   3rd Qu.:0      3rd Qu.:0.030  
 Max.   :0.0013                      Max.   :1e-04   Max.   :0.0320   Max.   :0      Max.   :0.217  
 vitamin_b1_100g  vitamin_b2_100g  vitamin_pp_100g  vitamin_b6_100g  vitamin_b9_100g vitamin_b12_100g
 Min.   :0.0001   Min.   :0.0002   Min.   :0.0006   Min.   :0.0001   Min.   :0e+00   Min.   :0       
 1st Qu.:0.0003   1st Qu.:0.0003   1st Qu.:0.0033   1st Qu.:0.0002   1st Qu.:0e+00   1st Qu.:0       
 Median :0.0004   Median :0.0009   Median :0.0069   Median :0.0008   Median :1e-04   Median :0       
 Mean   :0.0006   Mean   :0.0011   Mean   :0.0086   Mean   :0.0112   Mean   :1e-04   Mean   :0       
 3rd Qu.:0.0010   3rd Qu.:0.0013   3rd Qu.:0.0140   3rd Qu.:0.0012   3rd Qu.:2e-04   3rd Qu.:0       
 Max.   :0.0013   Max.   :0.0066   Max.   :0.0160   Max.   :0.2000   Max.   :2e-04   Max.   :0       
  biotin_100g   pantothenic_acid_100g  silica_100g    bicarbonate_100g potassium_100g   chloride_100g   
 Min.   :0      Min.   :0.0000        Min.   :8e-04   Min.   :0.0006   Min.   :0.0000   Min.   :0.0003  
 1st Qu.:0      1st Qu.:0.0007        1st Qu.:8e-04   1st Qu.:0.0678   1st Qu.:0.0650   1st Qu.:0.0006  
 Median :0      Median :0.0020        Median :8e-04   Median :0.1350   Median :0.1940   Median :0.0009  
 Mean   :0      Mean   :0.0027        Mean   :8e-04   Mean   :0.1692   Mean   :0.3288   Mean   :0.0144  
 3rd Qu.:0      3rd Qu.:0.0051        3rd Qu.:8e-04   3rd Qu.:0.2535   3rd Qu.:0.3670   3rd Qu.:0.0214  
 Max.   :0      Max.   :0.0060        Max.   :8e-04   Max.   :0.3720   Max.   :1.4300   Max.   :0.0420  
  calcium_100g    phosphorus_100g    iron_100g      magnesium_100g     zinc_100g       copper_100g   
 Min.   :0.0000   Min.   :0.0430   Min.   :0.0000   Min.   :0.0000   Min.   :0.0005   Min.   :0e+00  
 1st Qu.:0.0450   1st Qu.:0.1938   1st Qu.:0.0012   1st Qu.:0.0670   1st Qu.:0.0009   1st Qu.:1e-04  
 Median :0.1200   Median :0.3185   Median :0.0042   Median :0.1040   Median :0.0017   Median :1e-04  
 Mean   :0.2040   Mean   :0.3777   Mean   :0.0045   Mean   :0.1066   Mean   :0.0016   Mean   :1e-04  
 3rd Qu.:0.1985   3rd Qu.:0.4340   3rd Qu.:0.0077   3rd Qu.:0.1300   3rd Qu.:0.0022   3rd Qu.:1e-04  
 Max.   :1.0000   Max.   :1.1550   Max.   :0.0137   Max.   :0.3330   Max.   :0.0026   Max.   :1e-04  
 manganese_100g fluoride_100g  selenium_100g  chromium_100g  molybdenum_100g  iodine_100g   caffeine_100g 
 Min.   :0      Min.   :0      Min.   :0      Mode:logical   Mode:logical    Min.   :0      Mode:logical  
 1st Qu.:0      1st Qu.:0      1st Qu.:0      NA's:1500      NA's:1500       1st Qu.:0      NA's:1500     
 Median :0      Median :0      Median :0                                     Median :0                    
 Mean   :0      Mean   :0      Mean   :0                                     Mean   :0                    
 3rd Qu.:0      3rd Qu.:0      3rd Qu.:0                                     3rd Qu.:0                    
 Max.   :0      Max.   :0      Max.   :0                                     Max.   :0                    
 taurine_100g   ph_100g        fruits_vegetables_nuts_100g collagen_meat_protein_ratio_100g   cocoa_100g  
 Mode:logical   Mode:logical   Min.   : 2.00               Min.   :12.00                    Min.   :30    
 NA's:1500      NA's:1500      1st Qu.:11.25               1st Qu.:13.50                    1st Qu.:47    
                               Median :42.00               Median :15.00                    Median :60    
                               Mean   :36.88               Mean   :15.67                    Mean   :57    
                               3rd Qu.:52.25               3rd Qu.:17.50                    3rd Qu.:70    
                               Max.   :80.00               Max.   :20.00                    Max.   :81    
 chlorophyl_100g carbon_footprint_100g nutrition_score_fr_100g nutrition_score_uk_100g
 Mode:logical    Min.   : 12.00        Min.   :-12.000         Min.   :-12.000        
 NA's:1500       1st Qu.: 97.42        1st Qu.:  1.000         1st Qu.:  0.000        
                 Median :182.85        Median :  7.000         Median :  6.000        
                 Mean   :131.18        Mean   :  7.941         Mean   :  7.631        
                 3rd Qu.:190.78        3rd Qu.: 15.000         3rd Qu.: 16.000        
                 Max.   :198.70        Max.   : 28.000         Max.   : 28.000        
 [ reached getOption("max.print") -- omitted 1 row ]
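The summary shows many columns, such as cities and allergens_en, reported as Mode:logical with NA's:1500, meaning every value is missing. One way to list such columns programmatically is sketched below. The data frame toy and its values are hypothetical stand-ins for food.

```r
# Toy data frame standing in for food (hypothetical values)
toy <- data.frame(
  code        = c(100030L, 100050L, 100079L),
  sugars_100g = c(60.5, NA, 45),
  cities      = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for columns in which every value is missing
all_na <- colSums(is.na(toy)) == nrow(toy)

# Names of the completely empty columns
names(toy)[all_na]

# Share of missing values in each column
colMeans(is.na(toy))
```

Running the same two expressions on food would flag the all-NA columns without scanning the printed summary by eye.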
# View head of food
head(food)
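With 160 variables, the output of head() on the full data frame is very wide. A possible way to keep the preview readable is to select a few columns of interest first, as in the sketch below. The data frame toy and its values are hypothetical. Only the column names are taken from the dataset.

```r
# Toy data frame standing in for food (hypothetical values)
toy <- data.frame(
  code         = c(100030L, 100050L, 100079L, 100094L),
  product_name = c("Jam", "Chocolate", "Fruit paste", "Soy cream"),
  countries_en = c("France", "Australia", "France", "Spain"),
  sugars_100g  = c(60.5, NA, 45, 1.2),
  stringsAsFactors = FALSE
)

# Preview only the columns relevant to the sugar analysis, first 3 rows
peek <- head(toy[, c("product_name", "sugars_100g")], n = 3)
peek

# Number of rows and columns in the full data frame
dim(toy)
```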

# View structure of food
str(food)
'data.frame':   1500 obs. of  160 variables:
 $ V1                                        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ code                                      : int  100030 100050 100079 100094 100124 100136 100194 100221 100257 100258 ...
 $ url                                       : chr  "http://world-en.openfoodfacts.org/product/3222475745867/confiture-de-fraise-fraise-des-bois-au-sucre-de-canne-casino-delices" "http://world-en.openfoodfacts.org/product/5410976880110/guylian-sea-shells-selection" "http://world-en.openfoodfacts.org/product/3264750423503/pates-de-fruits-aromatisees-jacquot" "http://world-en.openfoodfacts.org/product/8006040247001/nata-vegetal-a-base-de-soja-valsoia" ...
 $ creator                                   : chr  "sebleouf" "foodorigins" "domdom26" "javichu" ...
 $ created_t                                 : int  1424747544 1450316429 1428674916 1420416591 1420501121 1437983923 1442420988 1435686217 1436991777 1400516512 ...
 $ created_datetime                          : chr  "2015-02-24T03:12:24Z" "2015-12-17T01:40:29Z" "2015-04-10T14:08:36Z" "2015-01-05T00:09:51Z" ...
 $ last_modified_t                           : int  1438445887 1450817956 1428739289 1420417876 1445700917 1445577476 1442420988 1451405288 1436991779 1437236856 ...
 $ last_modified_datetime                    : chr  "2015-08-01T16:18:07Z" "2015-12-22T20:59:16Z" "2015-04-11T08:01:29Z" "2015-01-05T00:31:16Z" ...
 $ product_name                              : chr  "Confiture de fraise fraise des bois au sucre de canne" "Guylian Sea Shells Selection" "Pâtes de fruits aromatisées" "Nata vegetal a base de soja &quot;Valsoia&quot;" ...
 $ generic_name                              : chr  "" "" "Pâtes de fruits" "Nata vegetal a base de soja" ...
 $ quantity                                  : chr  "265 g" "375g" "1 kg" "200 ml" ...
 $ packaging                                 : chr  "Bocal,Verre" "Plastic,Box" "Carton,plastique" "Tetra Brik" ...
 $ packaging_tags                            : chr  "bocal,verre" "plastic,box" "carton,plastique" "tetra-brik" ...
 $ brands                                    : chr  "Casino Délices" "Guylian" "Jacquot" "Valsoia,//Propiedad de://,Valsoia S.p.A." ...
 $ brands_tags                               : chr  "casino-delices" "guylian" "jacquot" "valsoia,propiedad-de,valsoia-s-p-a" ...
 $ categories                                : chr  "Aliments et boissons à base de végétaux,Aliments d'origine végétale,Aliments à base de fruits et de légu"| __truncated__ "Chocolate" "pâtes de fruits" "Alimentos y bebidas de origen vegetal,Alimentos de origen vegetal,Natas vegetales,Natas vegetales a base de soj"| __truncated__ ...
 $ categories_tags                           : chr  "en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:breakfasts,en:s"| __truncated__ "en:sugary-snacks,en:chocolates" "en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:sugary-snacks,e"| __truncated__ "en:plant-based-foods-and-beverages,en:plant-based-foods,en:plant-based-creams,en:plant-based-creams-for-cooking"| __truncated__ ...
 $ categories_en                             : chr  "Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Breakfasts,Spreads,Fruits b"| __truncated__ "Sugary snacks,Chocolates" "Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Sugary snacks,Confectioneri"| __truncated__ "Plant-based foods and beverages,Plant-based foods,Plant-based creams,Plant-based creams for cooking,Soy-based c"| __truncated__ ...
 $ origins                                   : chr  "" "" "" "" ...
 $ origins_tags                              : chr  "" "" "" "" ...
 $ manufacturing_places                      : chr  "France" "Belgium" "" "Italia" ...
 $ manufacturing_places_tags                 : chr  "france" "belgium" "" "italia" ...
 $ labels                                    : chr  "" "" "" "Vegetariano,Vegano,Sin gluten,Sin OMG,Sin lactosa" ...
 $ labels_tags                               : chr  "" "" "" "en:vegetarian,en:vegan,en:gluten-free,en:no-gmos,en:no-lactose" ...
 $ labels_en                                 : chr  "" "" "" "Vegetarian,Vegan,Gluten-free,No GMOs,No lactose" ...
 $ emb_codes                                 : chr  "EMB 78015" "" "" "" ...
 $ emb_codes_tags                            : chr  "emb-78015" "" "" "" ...
 $ first_packaging_code_geo                  : chr  "48.983333,2.066667" "" "" "" ...
 $ cities                                    : logi  NA NA NA NA NA NA ...
 $ cities_tags                               : chr  "andresy-yvelines-france" "" "" "" ...
 $ purchase_places                           : chr  "Lyon,France" "NSW,Australia" "France" "Madrid,España" ...
 $ stores                                    : chr  "Casino" "" "" "El Corte Inglés" ...
 $ countries                                 : chr  "France" "Australia" "France" "España" ...
 $ countries_tags                            : chr  "en:france" "en:australia" "en:france" "en:spain" ...
 $ countries_en                              : chr  "France" "Australia" "France" "Spain" ...
 $ ingredients_text                          : chr  "Sucre de canne, fraises 40 g, fraises des bois 14 g, gélifiant : pectines de fruits, jus de citron concentré."| __truncated__ "" "Pulpe de pommes 50% , sucre, sirop de glucose, gélifiant : pectine, acidifiant : acide citrique, arômes, colo"| __truncated__ "Extracto de soja (78%) (agua, semillas de soja 8,3%), grasas vegetales, jarabe de glucosa, dextrosa, emulsionan"| __truncated__ ...
 $ allergens                                 : chr  "" "" "" "" ...
 $ allergens_en                              : logi  NA NA NA NA NA NA ...
 $ traces                                    : chr  "Lait,Fruits à coque" "" "" "" ...
 $ traces_tags                               : chr  "en:milk,en:nuts" "" "" "" ...
 $ traces_en                                 : chr  "Milk,Nuts" "" "" "" ...
 $ serving_size                              : chr  "15 g" "" "" "" ...
 $ no_nutriments                             : logi  NA NA NA NA NA NA ...
 $ additives_n                               : int  1 NA 2 5 0 NA NA 0 NA 1 ...
 $ additives                                 : chr  "[ sucre-de-canne -> fr:sucre-de-canne  ]  [ sucre-de -> fr:sucre-de  ]  [ sucre -> fr:sucre  ]  [ fraises-40-g "| __truncated__ "" "[ pulpe-de-pommes-50 -> fr:pulpe-de-pommes-50  ]  [ pulpe-de-pommes -> fr:pulpe-de-pommes  ]  [ pulpe-de -> fr:"| __truncated__ "[ extracto-de-soja -> es:extracto-de-soja  ]  [ 78 -> es:78  ]  [ agua -> es:agua  ]  [ semillas-de-soja-8 -> e"| __truncated__ ...
 $ additives_tags                            : chr  "en:e440" "" "en:e440,en:e330" "en:e471,en:e415,en:e407,en:e412,en:e306" ...
 $ additives_en                              : chr  "E440 - Pectins" "" "E440 - Pectins,E330 - Citric acid" "E471 - Mono- and diglycerides of fatty acids,E415 - Xanthan gum,E407 - Carrageenan,E412 - Guar gum,E306 - Tocop"| __truncated__ ...
 $ ingredients_from_palm_oil_n               : int  0 NA 0 0 0 NA NA 0 NA 0 ...
 $ ingredients_from_palm_oil                 : logi  NA NA NA NA NA NA ...
 $ ingredients_from_palm_oil_tags            : chr  "" "" "" "" ...
 $ ingredients_that_may_be_from_palm_oil_n   : int  0 NA 0 1 0 NA NA 0 NA 0 ...
 $ ingredients_that_may_be_from_palm_oil     : logi  NA NA NA NA NA NA ...
 $ ingredients_that_may_be_from_palm_oil_tags: chr  "" "" "" "e471-mono-et-diglycerides-d-acides-gras-alimentaires" ...
 $ nutrition_grade_uk                        : logi  NA NA NA NA NA NA ...
 $ nutrition_grade_fr                        : chr  "d" "" "" "d" ...
 $ pnns_groups_1                             : chr  "Sugary snacks" "Sugary snacks" "Fruits and vegetables" "unknown" ...
 $ pnns_groups_2                             : chr  "Sweets" "Chocolate products" "Fruits" "unknown" ...
 $ states                                    : chr  "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be"| __truncated__ "en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-b"| __truncated__ "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be"| __truncated__ "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-compl"| __truncated__ ...
 $ states_tags                               : chr  "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-com"| __truncated__ "en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-c"| __truncated__ "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-com"| __truncated__ "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed"| __truncated__ ...
 $ states_en                                 : chr  "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characte"| __truncated__ "To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Cha"| __truncated__ "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characte"| __truncated__ "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristic"| __truncated__ ...
 $ main_category                             : chr  "en:plant-based-foods-and-beverages" "en:sugary-snacks" "en:plant-based-foods-and-beverages" "en:plant-based-foods-and-beverages" ...
 $ main_category_en                          : chr  "Plant-based foods and beverages" "Sugary snacks" "Plant-based foods and beverages" "Plant-based foods and beverages" ...
 $ image_url                                 : chr  "http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.400.jpg" "http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.400.jpg" "http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.400.jpg" "http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.400.jpg" ...
 $ image_small_url                           : chr  "http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.200.jpg" "http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.200.jpg" "http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.200.jpg" "http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.200.jpg" ...
 $ energy_100g                               : num  918 NA NA 766 2359 ...
 $ energy_from_fat_100g                      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ fat_100g                                  : num  0 NA NA 16.7 45.5 NA NA 25 NA 4 ...
 $ saturated_fat_100g                        : num  0 NA NA 9.9 5.2 NA NA 17 NA 0.54 ...
 $ butyric_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ caproic_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ caprylic_acid_100g                        : logi  NA NA NA NA NA NA ...
 $ capric_acid_100g                          : logi  NA NA NA NA NA NA ...
 $ lauric_acid_100g                          : logi  NA NA NA NA NA NA ...
 $ myristic_acid_100g                        : logi  NA NA NA NA NA NA ...
 $ palmitic_acid_100g                        : logi  NA NA NA NA NA NA ...
 $ stearic_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ arachidic_acid_100g                       : logi  NA NA NA NA NA NA ...
 $ behenic_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ lignoceric_acid_100g                      : logi  NA NA NA NA NA NA ...
 $ cerotic_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ montanic_acid_100g                        : logi  NA NA NA NA NA NA ...
 $ melissic_acid_100g                        : logi  NA NA NA NA NA NA ...
 $ monounsaturated_fat_100g                  : num  NA NA NA 2.9 9.5 NA NA NA NA NA ...
 $ polyunsaturated_fat_100g                  : num  NA NA NA 3.9 32.8 NA NA NA NA NA ...
 $ omega_3_fat_100g                          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ alpha_linolenic_acid_100g                 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ eicosapentaenoic_acid_100g                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ docosahexaenoic_acid_100g                 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ omega_6_fat_100g                          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ linoleic_acid_100g                        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ arachidonic_acid_100g                     : logi  NA NA NA NA NA NA ...
 $ gamma_linolenic_acid_100g                 : logi  NA NA NA NA NA NA ...
 $ dihomo_gamma_linolenic_acid_100g          : logi  NA NA NA NA NA NA ...
 $ omega_9_fat_100g                          : logi  NA NA NA NA NA NA ...
 $ oleic_acid_100g                           : logi  NA NA NA NA NA NA ...
 $ elaidic_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ gondoic_acid_100g                         : logi  NA NA NA NA NA NA ...
 $ mead_acid_100g                            : logi  NA NA NA NA NA NA ...
 $ erucic_acid_100g                          : logi  NA NA NA NA NA NA ...
  [list output truncated]

Inspecting variables

The str(), head(), and summary() functions are designed to give you some information about a dataset without being overwhelming. However, this dataset is so large and has so many variables that even these outputs seemed pretty intimidating!

The glimpse() function from the dplyr package often formats information in a more approachable way.

Yet another option is to just look at the column names to see what kinds of data you have. As you look at the names, pay particular attention to any pairs that look like duplicates.

# Load dplyr
library(dplyr)

# View a glimpse of food
glimpse(food)

# View column names of food
names(food)
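
With this many columns, scanning names by eye is error-prone. As a rough sketch, you could group names that differ only by a suffix. The suffixes `_tags` and `_en` are an assumption based on names such as states, states_tags, and states_en in the str() output; the short `cols` vector below stands in for names(food).

```r
# Stand-in for names(food): a few column names taken from the str() output
cols <- c("states", "states_tags", "states_en",
          "main_category", "main_category_en", "image_url")

# Strip a trailing "_tags" or "_en" to get a base name for each column
base <- sub("_(tags|en)$", "", cols)

# Base names that occur more than once mark groups of related columns
dup_bases <- unique(base[duplicated(base)])

# List the columns belonging to each candidate group
groups <- split(cols, base)[dup_bases]
print(groups)
```

This only flags candidates. Whether two columns really hold the same information still has to be checked by looking at their values.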

Removing duplicate info

That’s a lot of variables. To summarize, there’s some information on what was added and when (1:9), meta information about the food (10:17, 22:27), where it came from (18:21, 28:34), what it’s made of (35:52), nutrition grades (53:54), some unclear columns (55:63), and some nutritional information (64:159).

There are also many different pairs of columns that contain duplicate information.

# Define vector of duplicate cols (don't change)
duplicates <- c(4, 6, 11, 13, 15, 17, 18, 20, 22, 
                24, 25, 28, 32, 34, 36, 38, 40, 
                44, 46, 48, 51, 54, 65, 158)

# Remove duplicates from food: food2
food2 <- food[ ,-duplicates]
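
Negative indices drop columns by position, which is compact but hard to audit. A minimal sketch of the same idea by name, on a made-up data frame (`toy` and `drop_cols` are hypothetical and not part of the exercise):

```r
# Made-up data frame with one redundant column
toy <- data.frame(code = 1:3,
                  states = c("a", "b", "c"),
                  states_en = c("A", "B", "C"),
                  sugars_100g = c(1.5, NA, 0))

# Drop by name: keep every column whose name is not in drop_cols
drop_cols <- c("states_en")
toy_by_name <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Drop by position, as in the exercise: states_en is column 3
toy_by_position <- toy[, -3]

names(toy_by_name)
```

Both forms give the same result here. The name-based form keeps working if the column order changes, while the positional form does not.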

Removing useless info

Your dataset is much more manageable already.

In addition to duplicate columns, there are many columns containing information that you just can’t use. For example, the first few columns contain internal codes that don’t have any meaning to us. There are also some column names that aren’t clear enough to tell what they contain.

All of these columns can be deleted.

# Define useless vector (don't change)
useless <- c(1, 2, 3, 32:41)

# Remove useless columns from food2: food3
food3 <- food2[ ,-useless]
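
Note that the positions in useless refer to food2, not to the original food: once columns are removed, everything to their right shifts left. A small illustration on a made-up data frame (`toy`, `step1`, and `step2` are hypothetical names):

```r
# Made-up data frame with five columns
toy <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)

# First pass: drop positions 2 and 4 (columns b and d)
step1 <- toy[, -c(2, 4)]

# Second pass: position 2 now refers to column c, not b
step2 <- step1[, -2]

names(step1)
names(step2)
```

This is why the two index vectors in the exercise should not be changed or applied in a different order.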

Finding columns

Looking much nicer! Recall from the first exercise that you are assuming you will be analyzing the sugar content of these foods. Therefore, your next step is to look at a summary of the nutrition information.

All of the columns with nutrition info contain the character string “100g” as part of their name, which makes it easy to identify them.

# Load stringr package
library(stringr)

# Create logical vector flagging nutrition columns: nutrition
nutrition <- str_detect(names(food3), "100g")

# View a summary of nutrition columns
summary(food3[nutrition])
 energy_from_fat_100g    fat_100g      saturated_fat_100g butyric_acid_100g caproic_acid_100g
 Min.   :   0.00      Min.   :  0.00   Min.   : 0.000     Mode:logical      Mode:logical     
 1st Qu.:  35.98      1st Qu.:  0.90   1st Qu.: 0.200     NA's:1500         NA's:1500        
 Median : 237.00      Median :  6.00   Median : 1.700                                        
 Mean   : 668.41      Mean   : 13.39   Mean   : 4.874                                        
 3rd Qu.: 974.00      3rd Qu.: 20.00   3rd Qu.: 6.500                                        
 Max.   :2900.00      Max.   :100.00   Max.   :57.000                                        
 NA's   :1486         NA's   :708      NA's   :797                                           
 caprylic_acid_100g capric_acid_100g lauric_acid_100g myristic_acid_100g palmitic_acid_100g
 Mode:logical       Mode:logical     Mode:logical     Mode:logical       Mode:logical      
 NA's:1500          NA's:1500        NA's:1500        NA's:1500          NA's:1500         
                                                                                           
                                                                                           
                                                                                           
                                                                                           
                                                                                           
 stearic_acid_100g arachidic_acid_100g behenic_acid_100g lignoceric_acid_100g cerotic_acid_100g
 Mode:logical      Mode:logical        Mode:logical      Mode:logical         Mode:logical     
 NA's:1500         NA's:1500           NA's:1500         NA's:1500            NA's:1500        
                                                                                               
                                                                                               
                                                                                               
                                                                                               
                                                                                               
 montanic_acid_100g melissic_acid_100g monounsaturated_fat_100g polyunsaturated_fat_100g omega_3_fat_100g
 Mode:logical       Mode:logical       Min.   : 0.00            Min.   : 0.400           Min.   : 0.033  
 NA's:1500          NA's:1500          1st Qu.: 3.87            1st Qu.: 1.653           1st Qu.: 1.300  
                                       Median : 9.50            Median : 3.900           Median : 3.000  
                                       Mean   :19.77            Mean   : 9.986           Mean   : 3.726  
                                       3rd Qu.:29.00            3rd Qu.:12.700           3rd Qu.: 3.200  
                                       Max.   :75.00            Max.   :46.200           Max.   :12.400  
                                       NA's   :1465             NA's   :1464             NA's   :1491    
 alpha_linolenic_acid_100g eicosapentaenoic_acid_100g docosahexaenoic_acid_100g omega_6_fat_100g
 Min.   :0.0800            Min.   :0.721              Min.   :1.09              Min.   :0.25    
 1st Qu.:0.0905            1st Qu.:0.721              1st Qu.:1.09              1st Qu.:0.25    
 Median :0.1010            Median :0.721              Median :1.09              Median :0.25    
 Mean   :0.1737            Mean   :0.721              Mean   :1.09              Mean   :0.25    
 3rd Qu.:0.2205            3rd Qu.:0.721              3rd Qu.:1.09              3rd Qu.:0.25    
 Max.   :0.3400            Max.   :0.721              Max.   :1.09              Max.   :0.25    
 NA's   :1497              NA's   :1499               NA's   :1499              NA's   :1499    
 linoleic_acid_100g arachidonic_acid_100g gamma_linolenic_acid_100g dihomo_gamma_linolenic_acid_100g
 Min.   :0.5000     Mode:logical          Mode:logical              Mode:logical                    
 1st Qu.:0.5165     NA's:1500             NA's:1500                 NA's:1500                       
 Median :0.5330                                                                                     
 Mean   :0.5330                                                                                     
 3rd Qu.:0.5495                                                                                     
 Max.   :0.5660                                                                                     
 NA's   :1498                                                                                       
 omega_9_fat_100g oleic_acid_100g elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
 Mode:logical     Mode:logical    Mode:logical      Mode:logical      Mode:logical   Mode:logical    
 NA's:1500        NA's:1500       NA's:1500         NA's:1500         NA's:1500      NA's:1500       
                                                                                                     
                                                                                                     
                                                                                                     
                                                                                                     
                                                                                                     
 nervonic_acid_100g trans_fat_100g   cholesterol_100g carbohydrates_100g  sugars_100g     sucrose_100g  
 Mode:logical       Min.   :0.0000   Min.   :0.0000   Min.   :  0.000    Min.   :  0.00   Mode:logical  
 NA's:1500          1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  3.792    1st Qu.:  1.00   NA's:1500     
                    Median :0.0000   Median :0.0000   Median : 13.500    Median :  4.05                 
                    Mean   :0.0105   Mean   :0.0265   Mean   : 27.958    Mean   : 12.66                 
                    3rd Qu.:0.0000   3rd Qu.:0.0026   3rd Qu.: 55.000    3rd Qu.: 14.70                 
                    Max.   :0.1000   Max.   :0.4300   Max.   :100.000    Max.   :100.00                 
                    NA's   :1481     NA's   :1477     NA's   :708        NA's   :788                    
 glucose_100g   fructose_100g   lactose_100g   maltose_100g   maltodextrins_100g  starch_100g   
 Mode:logical   Min.   :100    Min.   :0.000   Mode:logical   Mode:logical       Min.   : 0.00  
 NA's:1500      1st Qu.:100    1st Qu.:0.250   NA's:1500      NA's:1500          1st Qu.: 9.45  
                Median :100    Median :0.500                                     Median :39.50  
                Mean   :100    Mean   :2.933                                     Mean   :30.73  
                3rd Qu.:100    3rd Qu.:4.400                                     3rd Qu.:42.85  
                Max.   :100    Max.   :8.300                                     Max.   :71.00  
                NA's   :1499   NA's   :1497                                      NA's   :1493   
  polyols_100g     fiber_100g     proteins_100g     casein_100g   serum_proteins_100g nucleotides_100g
 Min.   : 8.60   Min.   : 0.000   Min.   : 0.000   Min.   :1.1    Mode:logical        Mode:logical    
 1st Qu.:59.10   1st Qu.: 0.500   1st Qu.: 1.500   1st Qu.:1.1    NA's:1500           NA's:1500       
 Median :67.00   Median : 1.750   Median : 6.000   Median :1.1                                        
 Mean   :56.06   Mean   : 2.823   Mean   : 7.563   Mean   :1.1                                        
 3rd Qu.:69.80   3rd Qu.: 3.500   3rd Qu.:10.675   3rd Qu.:1.1                                        
 Max.   :70.00   Max.   :46.700   Max.   :61.000   Max.   :1.1                                        
 NA's   :1491    NA's   :994      NA's   :710      NA's   :1499                                       
   salt_100g         sodium_100g       alcohol_100g   vitamin_a_100g   beta_carotene_100g vitamin_d_100g 
 Min.   :  0.0000   Min.   : 0.0000   Min.   : 0.00   Min.   :0.0000   Mode:logical       Min.   :0e+00  
 1st Qu.:  0.0438   1st Qu.: 0.0172   1st Qu.: 0.00   1st Qu.:0.0000   NA's:1500          1st Qu.:0e+00  
 Median :  0.4498   Median : 0.1771   Median : 5.50   Median :0.0001                      Median :0e+00  
 Mean   :  1.1205   Mean   : 0.4409   Mean   :10.07   Mean   :0.0003                      Mean   :0e+00  
 3rd Qu.:  1.1938   3rd Qu.: 0.4700   3rd Qu.:13.00   3rd Qu.:0.0006                      3rd Qu.:0e+00  
 Max.   :102.0000   Max.   :40.0000   Max.   :50.00   Max.   :0.0013                      Max.   :1e-04  
 NA's   :780        NA's   :780       NA's   :1433    NA's   :1477                        NA's   :1485   
 vitamin_e_100g   vitamin_k_100g vitamin_c_100g  vitamin_b1_100g  vitamin_b2_100g  vitamin_pp_100g 
 Min.   :0.0005   Min.   :0      Min.   :0.000   Min.   :0.0001   Min.   :0.0002   Min.   :0.0006  
 1st Qu.:0.0021   1st Qu.:0      1st Qu.:0.002   1st Qu.:0.0003   1st Qu.:0.0003   1st Qu.:0.0033  
 Median :0.0044   Median :0      Median :0.019   Median :0.0004   Median :0.0009   Median :0.0069  
 Mean   :0.0069   Mean   :0      Mean   :0.025   Mean   :0.0006   Mean   :0.0011   Mean   :0.0086  
 3rd Qu.:0.0097   3rd Qu.:0      3rd Qu.:0.030   3rd Qu.:0.0010   3rd Qu.:0.0013   3rd Qu.:0.0140  
 Max.   :0.0320   Max.   :0      Max.   :0.217   Max.   :0.0013   Max.   :0.0066   Max.   :0.0160  
 NA's   :1478     NA's   :1498   NA's   :1459    NA's   :1478     NA's   :1483     NA's   :1484    
 vitamin_b6_100g  vitamin_b9_100g vitamin_b12_100g  biotin_100g   pantothenic_acid_100g  silica_100g   
 Min.   :0.0001   Min.   :0e+00   Min.   :0        Min.   :0      Min.   :0.0000        Min.   :8e-04  
 1st Qu.:0.0002   1st Qu.:0e+00   1st Qu.:0        1st Qu.:0      1st Qu.:0.0007        1st Qu.:8e-04  
 Median :0.0008   Median :1e-04   Median :0        Median :0      Median :0.0020        Median :8e-04  
 Mean   :0.0112   Mean   :1e-04   Mean   :0        Mean   :0      Mean   :0.0027        Mean   :8e-04  
 3rd Qu.:0.0012   3rd Qu.:2e-04   3rd Qu.:0        3rd Qu.:0      3rd Qu.:0.0051        3rd Qu.:8e-04  
 Max.   :0.2000   Max.   :2e-04   Max.   :0        Max.   :0      Max.   :0.0060        Max.   :8e-04  
 NA's   :1481     NA's   :1483    NA's   :1489     NA's   :1498   NA's   :1486          NA's   :1499   
 bicarbonate_100g potassium_100g   chloride_100g     calcium_100g    phosphorus_100g    iron_100g     
 Min.   :0.0006   Min.   :0.0000   Min.   :0.0003   Min.   :0.0000   Min.   :0.0430   Min.   :0.0000  
 1st Qu.:0.0678   1st Qu.:0.0650   1st Qu.:0.0006   1st Qu.:0.0450   1st Qu.:0.1938   1st Qu.:0.0012  
 Median :0.1350   Median :0.1940   Median :0.0009   Median :0.1200   Median :0.3185   Median :0.0042  
 Mean   :0.1692   Mean   :0.3288   Mean   :0.0144   Mean   :0.2040   Mean   :0.3777   Mean   :0.0045  
 3rd Qu.:0.2535   3rd Qu.:0.3670   3rd Qu.:0.0214   3rd Qu.:0.1985   3rd Qu.:0.4340   3rd Qu.:0.0077  
 Max.   :0.3720   Max.   :1.4300   Max.   :0.0420   Max.   :1.0000   Max.   :1.1550   Max.   :0.0137  
 NA's   :1497     NA's   :1487     NA's   :1497     NA's   :1449     NA's   :1488     NA's   :1463    
 magnesium_100g     zinc_100g       copper_100g    manganese_100g fluoride_100g  selenium_100g 
 Min.   :0.0000   Min.   :0.0005   Min.   :0e+00   Min.   :0      Min.   :0      Min.   :0     
 1st Qu.:0.0670   1st Qu.:0.0009   1st Qu.:1e-04   1st Qu.:0      1st Qu.:0      1st Qu.:0     
 Median :0.1040   Median :0.0017   Median :1e-04   Median :0      Median :0      Median :0     
 Mean   :0.1066   Mean   :0.0016   Mean   :1e-04   Mean   :0      Mean   :0      Mean   :0     
 3rd Qu.:0.1300   3rd Qu.:0.0022   3rd Qu.:1e-04   3rd Qu.:0      3rd Qu.:0      3rd Qu.:0     
 Max.   :0.3330   Max.   :0.0026   Max.   :1e-04   Max.   :0      Max.   :0      Max.   :0     
 NA's   :1479     NA's   :1493     NA's   :1498    NA's   :1499   NA's   :1498   NA's   :1499  
 chromium_100g  molybdenum_100g  iodine_100g   caffeine_100g  taurine_100g   ph_100g       
 Mode:logical   Mode:logical    Min.   :0      Mode:logical   Mode:logical   Mode:logical  
 NA's:1500      NA's:1500       1st Qu.:0      NA's:1500      NA's:1500      NA's:1500     
                                Median :0                                                  
                                Mean   :0                                                  
                                3rd Qu.:0                                                  
                                Max.   :0                                                  
                                NA's   :1499                                               
 fruits_vegetables_nuts_100g collagen_meat_protein_ratio_100g   cocoa_100g   chlorophyl_100g
 Min.   : 2.00               Min.   :12.00                    Min.   :30     Mode:logical   
 1st Qu.:11.25               1st Qu.:13.50                    1st Qu.:47     NA's:1500      
 Median :42.00               Median :15.00                    Median :60                    
 Mean   :36.88               Mean   :15.67                    Mean   :57                    
 3rd Qu.:52.25               3rd Qu.:17.50                    3rd Qu.:70                    
 Max.   :80.00               Max.   :20.00                    Max.   :81                    
 NA's   :1470                NA's   :1497                     NA's   :1491                  
 nutrition_score_fr_100g nutrition_score_uk_100g
 Min.   :-12.000         Min.   :-12.000        
 1st Qu.:  1.000         1st Qu.:  0.000        
 Median :  7.000         Median :  6.000        
 Mean   :  7.941         Mean   :  7.631        
 3rd Qu.: 15.000         3rd Qu.: 16.000        
 Max.   : 28.000         Max.   : 28.000        
 NA's   :825             NA's   :825            
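
The summary is long, so it can help to reduce it to one number per column: the share of missing values. The sketch below uses a made-up data frame (`toy`) and base R's grepl() in place of str_detect() so that it runs without extra packages.

```r
# Made-up data frame with one non-nutrition column and three nutrition columns
toy <- data.frame(product_name = c("w", "x", "y", "z"),
                  fat_100g     = c(0, NA, 16.7, NA),
                  sugars_100g  = c(1, NA, NA, NA),
                  sucrose_100g = c(NA, NA, NA, NA))

# Flag the nutrition columns by name
is_nutrition <- grepl("100g", names(toy))

# Share of NA values in each nutrition column (0 = complete, 1 = empty)
na_share <- colMeans(is.na(toy[is_nutrition]))
print(na_share)

# Columns that contain no data at all
all_missing <- names(na_share)[na_share == 1]
print(all_missing)
```

Applied to food3, the same two lines would list the columns that appear above with NA's:1500.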

Replacing missing values

Unfortunately, the summary revealed that the nutrition data are mostly NA values. After consulting with the lab technician, it appears that much of the data is missing because the food just doesn’t have those nutrients.

But all is not lost! The lab tech also said that for sugar content, zero values are sometimes entered explicitly, but sometimes the values are just left empty to denote a zero.

We can replace all NA values with zeroes in the sugars_100g column and make histograms to visualize the result. Then, we will exclude the observations which have no sugar to see how the distribution changes.

# Find indices of sugar NA values: missing
missing <- is.na(food3$sugars_100g)

# Replace NA values with 0
food3$sugars_100g[missing] <- 0

# Create first histogram
hist(food3$sugars_100g, breaks = 100)


# Create food4
food4 <- food3[food3$sugars_100g > 0, ]

# Create second histogram
hist(food4$sugars_100g, breaks = 100)

Excluding the observations which don’t contain any sugar, you can better visualize what the underlying distribution looks like.
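
To see how much the zero replacement changes the picture, it may help to count how many values were filled in. A minimal sketch on a made-up vector (`sugars` is hypothetical and stands in for food3$sugars_100g before the replacement):

```r
# Made-up sugar values with two missing entries and one explicit zero
sugars <- c(12.5, NA, 0, 4.05, NA, 100)

# Count the missing values before overwriting them
n_filled <- sum(is.na(sugars))

# Replace NA with 0, following the lab technician's explanation
sugars_filled <- sugars
sugars_filled[is.na(sugars_filled)] <- 0

# Keep only the observations with some sugar
sugars_positive <- sugars_filled[sugars_filled > 0]

n_filled
length(sugars_positive)
```

Treating NA as zero is only justified here because of what the lab technician said about the sugar column. For other columns, a missing value may simply mean the value was never measured.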

Dealing with messy data

How many of these foods come in some sort of plastic packaging?

The dataset has information about packaging, but there’s a bit of a problem: it’s stored in several different languages (Spanish, French, and English). This takes messy data to a whole new level! There is no R package to selectively translate, but what if you could just work with the messy data directly?

The root word for plastic is the same in English (plastic), French (plastique), and Spanish (plastico). To get a general idea of how many of these foods are packaged in plastic, we can look through the packaging column for the string “plasti”.

# Find entries containing "plasti": plastic
plastic <- str_detect(food3$packaging, "plasti")

# Print the sum of plastic
print(sum(plastic))
[1] 232
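
The match above is case-sensitive, so an entry such as "Plastique" would not be counted. Whether such entries exist in the packaging column is an assumption. As a sketch, base R's grepl() with ignore.case = TRUE shows the difference on a made-up vector (`packaging` here is hypothetical, not the real column):

```r
# Made-up packaging entries in mixed case, including a missing value
packaging <- c("Plastique", "carton", NA, "bolsa de plastico", "PLASTIC bottle")

# Case-sensitive match, comparable to the pattern used in the exercise
strict <- grepl("plasti", packaging)

# Case-insensitive match
relaxed <- grepl("plasti", packaging, ignore.case = TRUE)

sum(strict)
sum(relaxed)
```

grepl() returns FALSE for NA inputs, so the sum is always defined. str_detect() returns NA for them instead, in which case sum() would need na.rm = TRUE. Accented spellings such as the Spanish "plástico" would still be missed by either pattern.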