What Makes Chocolate Tasty?

We all eat chocolate from time to time, but have you ever considered what makes one kind of chocolate better than another? Is it the concentration of cocoa? Does it matter which company makes it or where that company is located? Below, we look at a dataset of chocolate ratings from Kaggle (kaggle.com).

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ---------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rpart); library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart.plot); library(e1071); library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

chocolatedata_raw <- read_csv("chocolatedata.csv")
## Parsed with column specification:
## cols(
##   `CompanyNA
## (Maker-if known)` = col_character(),
##   `Specific Bean Origin
## or Bar Name` = col_character(),
##   REF = col_character(),
##   `Review
## Date` = col_character(),
##   `Cocoa
## Percent` = col_character(),
##   `Company
## Location` = col_character(),
##   Rating = col_character(),
##   `Bean
## Type` = col_character(),
##   `Broad Bean
## Origin` = col_character()
## )
#Work on a copy so the raw import stays untouched
chocolatedata_clean <- chocolatedata_raw
write.csv(chocolatedata_raw, file="chocolatedata_raw.csv")
write.csv(chocolatedata_clean, file="chocolatedata_clean.csv")

Project Proposal


In this study, we explore how several variables relate to chocolate ratings. The data contains the company (the maker of the chocolate), the specific bean origin or bar name, review date, cocoa percent, company location, rating, bean type, and broad bean origin. The broad question of interest is how specific qualities of chocolate (origin, company details, cocoa percent, etc.) affect ratings. Does chocolate earn a higher rating when it is produced by a company based in Colombia or Madagascar? Does a higher cocoa percent relate to a higher rating? How does company location affect chocolate ratings? This study aims to identify which factors are associated with higher chocolate ratings and where the chocolate actually comes from.
The higher the percentage of cocoa in chocolate, the more bitter it tastes. For a chocolate that is considered “sweetened,” the highest percent of cocoa that can be used is 84 percent. Normal milk chocolate lists only ten percent cocoa among its ingredients. According to Ecole Chocolat, “percentage doesn’t let you know if the beans themselves were of good quality.” Even if a bar has a high percentage of cocoa, that doesn’t necessarily mean it tastes better. Interestingly, most chocolates in the data had a cocoa percentage of over 50.
According to World Atlas, the top eight cocoa-producing countries are Mexico, Ecuador, Brazil, Cameroon, Nigeria, Indonesia, Ghana, and Côte d’Ivoire. Many of these countries are represented in the data, along with many others not on that list. The data records origin at two levels: the broad bean origin is the general country or region where the cacao was grown, and the specific bean origin is a more precise region or the name of the bar. In addition, many companies that sell the chocolate are located nowhere near where their beans were harvested.
This data comes from Kaggle, a platform that hosts public datasets uploaded by companies and users and runs competitions for statisticians and data miners. The data has nine columns and almost 1800 rows, which should be enough to support meaningful results.

Data Cleaning Time!

The data imported from Kaggle had many character columns that could not be analyzed easily. Below, we convert them to numeric variables and correct spelling and organizational errors.

#Remove first row with repeat labels
chocolatedata_clean = chocolatedata_clean[-1,]
View(chocolatedata_clean)

#Check to make sure there aren't columns with too many NA's
colMeans(is.na(chocolatedata_clean))
##       CompanyNA\n(Maker-if known) Specific Bean Origin\nor Bar Name 
##                        0.00000000                        0.00000000 
##                               REF                      Review\nDate 
##                        0.00000000                        0.00000000 
##                    Cocoa\nPercent                 Company\nLocation 
##                        0.00000000                        0.00000000 
##                            Rating                        Bean\nType 
##                        0.00000000                        0.49470752 
##                Broad Bean\nOrigin 
##                        0.04122563
#Change misspellings
chocolatedata_clean[chocolatedata_clean == "Eucador"] <- "Ecuador"
chocolatedata_clean[chocolatedata_clean == "Domincan Republic"] <- "Dominican Republic"
chocolatedata_clean[chocolatedata_clean == "Niacragua"] <- "Nicaragua"

#Make data numeric
chocolatedata_clean$REF <- as.numeric(as.character(chocolatedata_clean$REF))
chocolatedata_clean$Rating <- as.numeric(as.character(chocolatedata_clean$Rating))
chocolatedata_clean$`Review
Date` <- as.numeric(as.character(chocolatedata_clean$`Review
Date`))

#Make percent into a decimal
chocolatedata_clean$`Cocoa
Percent` <- as.numeric(sub("%", "",chocolatedata_clean$`Cocoa
Percent`,fixed=TRUE))/100
#Assign numbers to factors to make Company Location (loc) numeric
chocolatedata_clean <- chocolatedata_clean %>%
  mutate(loc = recode(
    chocolatedata_clean$`Company
Location`,
      "France" = 1,
      "U.S.A." = 2,
      "Fiji" = 3,
      "Ecuador" = 4,
      "Mexico" = 5,
      "Switzerland" = 6,
      "Netherlands" = 7,
      "Spain" = 8,
      "Peru" = 9,
      "Canada" = 10,
      "Italy" = 11,
      "Brazil" = 12,
      "U.K." = 13,
      "Australia" = 14,
      "Wales" = 15,
      "Belgium"= 16,
      "Germany"= 17,
      "Russia"= 18,
      "Puerto Rico"= 19,
      "Venezuela"=20,
      "Columbia"=21,
      "Japan"=22,
      "New Zealand"=23,
      "Costa Rico"=24,
      "South Korea"=25,
      "Amsterdam"=26,
      "Scotland"=27,
      "Martinique"=28,
      "Sao Tome"=29,
      "Argentina"=30,
      "Guatemala"=31,
      "South Africa"=32,
      "Bolivia"=33,
      "St. Lucia"=34,
      "Portugal"=35,
      "Singapore"=36,
      "Vietnam"=37,
      "Grenada"=38,
      "Israel"=39,
      "India"=40,
      "Czech Republic"=41,
      "Dominican Republic"=42,
      "Finland"=43,
      "Madagascar"=44,
      "Philippines"=45,
      "Sweden"=46,
      "Poland"=47,
      "Austria"=48,
      "Honduras"=49,
      "Nicaragua"=50,
      "Lithuania"=51,
      "Chile"=52,
      "Ghana"=53,
      "Iceland"=54,
      "Hungary"=55,
      "Denmark"=56,
      "Suriname"=57,
      "Ireland"=58
  ))
#Assign numbers to factors to make Bean Type (beantype) numeric
chocolatedata_clean <- chocolatedata_clean %>%
  mutate(beantype = recode(
    chocolatedata_clean$`Bean
Type`,
      "Amazon"=1,
      "Amazon mix"=2,
      "Amazon, ICS"=3,
      "Beniano"=4,
      "Blend"=5,
      "Blend-Forastero,Criollo"=6,
      "CCN51"=7,
      "ciol"=8,
      "ciol (Arriba)"=9,
      "Criollo"=10,
      "Criollo (Amarru)"=11,
      "Criollo (Ocumare 61)"=12,
      "Criollo (Ocumare 67)"=13,
      "Criollo (Ocumare 77)"=14,
      "Criollo (Ocumare)"=15,
      "Criollo (Porcela)"=16,
      "Criollo (Wild)"=17,
      "Criollo, +"=18,
      "Criollo, Forastero"=19,
      "Criollo, Trinitario"=20,
      "EET"=21,
      "Forastero"=22,
      "Forastero (Amelodo)"=23,
      "Forastero (Arriba)"=24,
      "Forastero (Arriba) ASS"=25,
      "Forastero (Arriba) ASSS"=26,
      "Forastero (Catongo)"=27,
      "Forastero (ciol)"=28,
      "Forastero (Parazinho)"=29,
      "Forastero(Arriba, CCN)"=30,
      "Forastero, Trinitario"=31,
      "Mati"=32,
      "Trinitario"=33,
      "Trinitario (85% Criollo)"=34,
      "Trinitario (Amelodo)"=35,
      "Trinitario (Scavi)"=36,
      "Trinitario, ciol"=37,
      "Trinitario, Criollo"=38,
      "Trinitario, Forastero"=39,
      "Trinitario, TCGA"=40
  ))
#Assign numbers to factors to make Broad Bean Origin (bborigin) numeric
chocolatedata_clean <- chocolatedata_clean %>%
  mutate(bborigin = recode(
    chocolatedata_clean$`Broad Bean
Origin`,
      "Africa, Carribean, C. Am."=1,
      "Australia"=2,
      "Belize"=3,
      "Bolivia"=4,
      "Brazil"=5,
      "Burma"=6,
      "Cameroon"=7,
      "Carribean"=8,
      "Carribean(DR/Jam/Tri)"=9,
      "Central and S. America"=10,
      "Colombia"=11,
      "Colombia, Ecuador"=12,
      "Congo"=13,
      "Cost Rica, Ven"=14,
      "Costa Rica"=15,
      "Cuba"=16,
      "Dom. Rep., Madagascar"=17,
      "Domincan Republic"=18,
      "Dominican Rep., Bali"=19,
      "Dominican Republic"=20,
      "DR, Ecuador, Peru"=21,
      "Ecuador"=22,
      "Ecuador, Costa Rica"=23,
      "Ecuador, Mad., PNG"=24,
      "El Salvador"=25,
      "Fiji"=26,
      "Gabon"=27,
      "Ghana"=28,
      "GhaNA& Madagascar"=29,
      "Ghana, Domin. Rep"=30,
      "Ghana, Pama, Ecuador"=31,
      "Gre., PNG, Haw., Haiti, Mad"=32,
      "Greda"=33,
      "Guatemala"=34,
      "Haiti"=35,
      "Hawaii"=36,
      "Honduras"=37,
      "India"=38,
      "Indonesia"=39,
      "Indonesia, Gha"=40,
      "Ivory Coast"=41,
      "Jamaica"=42,
      "Liberia"=43,
      "Mad., Java, PNG"=44,
      "Madagascar"=45,
      "Madagascar & Ecuador"=46,
      "Malaysia"=47,
      "Martinique"=48,
      "Mexico"=49,
      "Nicaragua"=50,
      "Nigeria"=51,
      "Pama"=52,
      "Papua New Guinea"=53,
      "Peru"=54,
      "Peru(SMartin,Pangoa,ciol)"=55,
      "Peru, Belize"=56,
      "Peru, Dom. Rep"=57,
      "Peru, Ecuador"=58,
      "Peru, Ecuador, Venezuela"=59,
      "Peru, Mad., Dom. Rep."=60,
      "Peru, Madagascar"=61,
      "Philippines"=62,
      "PNG, Vanuatu, Mad"=63,
      "Principe"=64,
      "Puerto Rico"=65,
      "Samoa"=66,
      "Sao Tome"=67,
      "Sao Tome & Principe"=68,
      "Solomon Islands"=69,
      "South America"=70,
      "South America, Africa"=71,
      "Sri Lanka"=72,
      "St. Lucia"=73,
      "Suriname"=74,
      "Tanzania"=75,
      "Tobago"=76,
      "Togo"=77,
      "Trinidad"=78,
      "Trinidad, Ecuador"=79,
      "Trinidad, Tobago"=80,
      "Trinidad-Tobago"=81,
      "Uganda"=82,
      "Vanuatu"=83,
      "Ven, Bolivia, D.R."=84,
      "Ven, Trinidad, Ecuador"=85,
      "Ven., Indonesia, Ecuad."=86,
      "Ven., Trinidad, Mad."=87,
      "Ven.,Ecu.,Peru,Nic."=88,
      "Venez,Africa,Brasil,Peru,Mex"=89,
      "Venezuela"=90,
      "Venezuela, Carribean"=91,
      "Venezuela, Dom. Rep."=92,
      "Venezuela, Gha"=93,
      "Venezuela, Java"=94,
      "Venezuela, Trinidad"=95,
      "Venezuela/ Gha"=96,
      "Vietnam"=97,
      "West Africa"=98,
      "Guat., D.R., Peru, Mad., PNG"=99
  ))
#Assign numbers to factors to make Companies/Maker (companies) numeric
chocolatedata_clean <- chocolatedata_clean %>%
  mutate(companies = recode(
    chocolatedata_clean$`CompanyNA
(Maker-if known)`,
      "A. Morin"=1,
      "Acalli"=2,
      "Adi"=3,
      "Aequare (Gianduja)"=3, #NOTE: same code as "Adi" above, so these two makers are indistinguishable in companies
      "Ah Cacao"=4,
      "Akesson's (Pralus)"=5,
      "Alain Ducasse"=6,
      "Alexandre"=7,
      "Altus aka Cao Artisan"=8,
      "Amano"=9,
      "Amatller (Simon Coll)"=10,
      "Amazona"=11,
      "Ambrosia"=12,
      "Amedei"=13,
      "AMMA"=14,
      "Anahata"=15,
      "Animas"=16,
      "Ara"=17,
      "Arete"=18,
      "Artisan du Chocolat"=19,
      "Artisan du Chocolat (Casa Luker)"=20,
      "Askinosie"=21,
      "Bahen & Co."=22,
      "Bakau"=23,
      "Bar Au Chocolat"=24,
      "Baravelli's"=25,
      "Batch"=26,
      "Beau Cacao"=27,
      "Beehive"=28,
      "Belcolade"=29,
      "Bellflower"=30,
      "Belyzium"=31,
      "Benoit Nihant"=32,
      "Bernachon"=33,
      "Beschle (Felchlin)"=34,
      "Bisou"=35,
      "Bittersweet Origins"=36,
      "Black Mountain"=37,
      "Black River (A. Morin)"=38,
      "Blanxart"=39,
      "Blue Bandana"=40,
      "Bonnat"=41,
      "Bouga Cacao (Tulicorp)"=42,
      "Bowler Man"=43,
      "Brasstown aka It's Chocolate"=44,
      "Brazen"=45,
      "Breeze Mill"=46,
      "Bright"=47,
      "Britarev"=48,
      "Bronx Grrl Chocolate"=49,
      "Burnt Fork Bend"=50,
      "Cacao Arabuco"=51,
      "Cacao Atlanta"=52,
      "Cacao Barry"=53,
      "Cacao de Origen"=54,
      "Cacao de Origin"=55,
      "Cacao Hunters"=56,
      "Cacao Market"=57,
      "Cacao Prieto"=58,
      "Cacao Sampaka"=59,
      "Cacao Store"=60,
      "Cacaosuyo (Theobroma Inversiones)"=61,
      "Cacaoyere (Ecuatoriana)"=62,
      "Callebaut"=63,
      "C-Amaro"=64,
      "Cao"=65,
      "Caoni (Tulicorp)"=66,
      "Captain Pembleton"=67,
      "Caribeans"=68,
      "Carlotta Chocolat"=69,
      "Castronovo"=70,
      "Cello"=71,
      "Cemoi"=72,
      "Chaleur B"=73,
      "Charm School"=74,
      "Chchukululu (Tulicorp)"=75,
      "Chequessett"=76,
      "Chloe Chocolat"=77,
      "Chocablog"=78,
      "Choco Del Sol"=79,
      "Choco Dong"=80,
      "Chocolarder"=81,
      "Chocola'te"=82,
      "Chocolate Alchemist-Philly"=83,
      "Chocolate Con Amor"=84,
      "Chocolate Conspiracy"=85,
      "Chocolate Makers"=86,
      "Chocolate Tree, The"=87,
      "Chocolats Privilege"=88,
      "ChocoReko"=89,
      "Chocovic"=90,
      "Chocovivo"=91,
      "Choklat"=92,
      "Chokolat Elot (Girard)"=93,
      "Choocsol"=94,
      "Christopher Morel (Felchlin)"=95,
      "Chuao Chocolatier"=96,
      "Chuao Chocolatier (Pralus)"=97,
      "Claudio Corallo"=98,
      "Cloudforest"=99,
      "Coleman & Davis"=100,
      "Compania de Chocolate (Salgado)"=101,
      "Condor"=102,
      "Confluence"=103,
      "Coppeneur"=104,
      "Cote d' Or (Kraft)"=105,
      "Cravve"=106,
      "Creo"=107,
      "Daintree"=108,
      "Dalloway"=109,
      "Damson"=110,
      "Dandelion"=111,
      "Danta"=112,
      "DAR"=113,
      "Dark Forest"=114,
      "Davis"=115,
      "De Mendes"=116,
      "De Villiers"=117,
      "Dean and Deluca (Belcolade)"=118,
      "Debauve & Gallais (Michel Cluizel)"=119,
      "Desbarres"=120,
      "DeVries"=121,
      "Dick Taylor"=122,
      "Doble & Bignall"=123,
      "Dole (Guittard)"=124,
      "Dolfin (Belcolade)"=125,
      "Domori"=126,
      "Dormouse"=127,
      "Duffy's"=128,
      "Dulcinea"=129,
      "Durand"=130,
      "Durci"=131,
      "East Van Roasters"=132,
      "Eau de Rose"=133,
      "Eclat (Felchlin)"=134,
      "Edelmond"=135,
      "El Ceibo"=136,
      "El Rey"=137,
      "Emerald Estate"=138,
      "Emily's"=139,
      "ENNA"=140,
      "Enric Rovira (Claudio Corallo)"=141,
      "Erithaj (A. Morin)"=142,
      "Escazu"=143,
      "Ethel's Artisan (Mars)"=144,
      "Ethereal"=145,
      "Fearless (AMMA)"=146,
      "Feitoria Cacao"=147,
      "Felchlin"=148,
      "Finca"=149,
      "Forever Cacao"=150,
      "Forteza (Cortes)"=151,
      "Fossa"=152,
      "Franceschi"=153,
      "Frederic Blondeel"=154,
      "French Broad"=155,
      "Fresco"=156,
      "Friis Holm"=157,
      "Friis Holm (Bonnat)"=158,
      "Fruition"=159,
      "Garden Island"=160,
      "Georgia Ramon"=161,
      "Glennmade"=162,
      "Goodnow Farms"=163,
      "Grand Place"=164,
      "Green & Black's (ICAM)"=165,
      "Green Bean to Bar"=166,
      "Grenada Chocolate Co."=167,
      "Guido Castagna"=168,
      "Guittard"=169,
      "Habitual"=170,
      "Hachez"=171,
      "Hacienda El Castillo"=172,
      "Haigh"=173,
      "Harper Macaw"=174,
      "Heilemann"=175,
      "Heirloom Cacao Preservation (Brasstown)"=176,
      "Heirloom Cacao Preservation (Fruition)"=177,
      "Heirloom Cacao Preservation (Guittard)"=178,
      "Heirloom Cacao Preservation (Manoa)"=179,
      "Heirloom Cacao Preservation (Millcreek)"=180,
      "Heirloom Cacao Preservation (Mindo)"=181,
      "Heirloom Cacao Preservation (Zokoko)"=182,
      "hello cocoa"=183,
      "hexx"=184,
      "Hogarth"=185,
      "Hoja Verde (Tulicorp)"=186,
      "Holy Cacao"=187,
      "Honest"=188,
      "Hotel Chocolat"=189,
      "Hotel Chocolat (Coppeneur)"=190,
      "Hummingbird"=191,
      "Idilio (Felchlin)"=192,
      "Indah"=193,
      "Indaphoria"=194,
      "Indi"=195,
      "iQ Chocolate"=196,
      "Isidro"=197,
      "Izard"=198,
      "Jacque Torres"=199,
      "Jordis"=200,
      "Just Good Chocolate"=201,
      "Kah Kow"=202,
      "Kakao"=203,
      "Kallari (Ecuatoriana)"=204,
      "Kaoka (Cemoi)"=205,
      "Kerchner"=206,
      "Ki' Xocolatl"=207,
      "Kiskadee"=208,
      "Kto"=209,
      "K'ul"=210,
      "Kyya"=211,
      "L.A. Burdick (Felchlin)"=212,
      "La Chocolaterie Nanairo"=213,
      "La Maison du Chocolat (Valrhona)"=214,
      "La Oroquidea"=215,
      "La Pepa de Oro"=216,
      "Laia aka Chat-Noir"=217,
      "Lajedo do Ouro"=218,
      "Lake Champlain (Callebaut)"=219,
      "L'Amourette"=220,
      "Letterpress"=221,
      "Levy"=222,
      "Lilla"=223,
      "Lillie Belle"=224,
      "Lindt & Sprungli"=225,
      "Loiza"=226,
      "Lonohana"=227,
      "Love Bar"=228,
      "Luker"=229,
      "Machu Picchu Trading Co."=230,
      "Madecasse (Cinagra)"=231,
      "Madre"=232,
      "Maglio"=233,
      "Majani"=234,
      "Malagasy (Chocolaterie Robert)"=235,
      "Malagos"=236,
      "Malie Kai (Guittard)"=237,
      "Malmo"=238,
      "Mana"=239,
      "Manifesto Cacao"=240,
      "Manoa"=241,
      "Manufaktura Czekolady"=242,
      "Map Chocolate"=243,
      "Marana"=244,
      "Marigold's Finest"=245,
      "Marou"=246,
      "Mars"=247,
      "Marsatta"=248,
      "Martin Mayer"=249,
      "Mast Brothers"=250,
      "Matale"=251,
      "Maverick"=252,
      "Mayacama"=253,
      "Meadowlands"=254,
      "Menakao (aka Cinagra)"=255,
      "Mesocacao"=256,
      "Metiisto"=257,
      "Metropolitan"=258,
      "Michel Cluizel"=259,
      "Middlebury"=260,
      "Millcreek Cacao Roasters"=261,
      "Mindo"=262,
      "Minimal"=263,
      "Mission"=264,
      "Mita"=265,
      "Moho"=266,
      "Molucca"=267,
      "Momotombo"=268,
      "Monarque"=269,
      "Monsieur Truffe"=270,
      "Montecristi"=271,
      "Muchomas (Mesocacao)"=272,
      "Mutari"=273,
      "Nahua"=274,
      "Naive"=275,
      "Na�ve"=276,
      "Nanea"=277,
      "Nathan Miller"=278,
      "Neuhaus (Callebaut)"=279,
      "Nibble"=280,
      "Night Owl"=281,
      "Noble Bean aka Jerjobo"=282,
      "Noir d' Ebine"=283,
      "Nova Monda"=284,
      "Nuance"=285,
      "Nugali"=286,
      "Oakland Chocolate Co."=287,
      "Obolo"=288,
      "Ocelot"=289,
      "Ocho"=290,
      "Ohiyo"=291,
      "Oialla by Bojessen (Malmo)"=292,
      "Olive and Sinclair"=293,
      "Olivia"=294,
      "Omanhene"=295,
      "Omnom"=296,
      "organicfair"=297,
      "Original Beans (Felchlin)"=298,
      "Original Hawaiin Chocolate Factory"=299,
      "Orquidea"=300,
      "Pacari"=301,
      "Palette de Bine"=302,
      "Pangea"=303,
      "Park 75"=304,
      "Parliament"=305,
      "Pascha"=306,
      "Patric"=307,
      "Paul Young"=308,
      "Peppalo"=309,
      "Pierre Marcolini"=310,
      "Pinellas"=311,
      "Pitch Dark"=312,
      "Pomm (aka Dead Dog)"=313,
      "Potomac"=314,
      "Pralus"=315,
      "Pump Street Bakery"=316,
      "Pura Delizia"=317,
      "Q Chocolate"=318,
      "Quetzalli (Wolter)"=319,
      "Raaka"=320,
      "Rain Republic"=321,
      "Rancho San Jacinto"=322,
      "Ranger"=323,
      "Raoul Boulanger"=324,
      "Raw Cocoa"=325,
      "Republica del Cacao (aka Confecta)"=326,
      "Ritual"=327,
      "Roasting Masters"=328,
      "Robert (aka Chocolaterie Robert)"=329,
      "Rococo (Grenada Chocolate Co.)"=330,
      "Rogue"=331,
      "Rozsavolgyi"=332,
      "S.A.I.D."=333,
      "Sacred"=334,
      "Salgado"=335,
      "Santander (Compania Nacional)"=336,
      "Santome"=337,
      "Scharffen Berger"=338,
      "Seaforth"=339,
      "Shark Mountain"=340,
      "Shark's"=341,
      "Shattel"=342,
      "Shattell"=343,
      "Sibu"=344,
      "Sibu Sura"=345,
      "Silvio Bessone"=346,
      "Sirene"=347,
      "Sjolinds"=348,
      "Smooth Chocolator, The"=349,
      "Snake & Butterfly"=350,
      "Sol Cacao"=351,
      "Solkiki"=352,
      "Solomons Gold"=353,
      "Solstice"=354,
      "Soma"=355,
      "Somerville"=356,
      "Soul"=357,
      "Spagnvola"=358,
      "Spencer"=359,
      "Sprungli (Felchlin)"=360,
      "SRSLY"=361,
      "Starchild"=362,
      "Stella (aka Bernrain)"=363,
      "Stone Grindz"=364,
      "StRita Supreme"=365,
      "Sublime Origins"=366,
      "Summerbird"=367,
      "Suruca Chocolate"=368,
      "Svenska Kakaobolaget"=369,
      "Szanto Tibor"=370,
      "Tabal"=371,
      "Tablette (aka Vanillabeans)"=372,
      "Tan Ban Skrati"=373,
      "Taza"=374,
      "TCHO"=375,
      "Tejas"=376,
      "Terroir"=377,
      "The Barn"=378,
      "Theo"=379,
      "Theobroma"=380,
      "Timo A. Meyer"=381,
      "To'ak (Ecuatoriana)"=382,
      "Tobago Estate (Pralus)"=383,
      "Tocoti"=384,
      "Treehouse"=385,
      "Tsara (Cinagra)"=386,
      "twenty-four blackbirds"=387,
      "Two Ravens"=388,
      "Un Dimanche A Paris"=389,
      "Undone"=390,
      "Upchurch"=391,
      "Urzi"=392,
      "Valrhona"=393,
      "Vanleer (Barry Callebaut)"=394,
      "Vao Vao (Chocolaterie Robert)"=395,
      "Vicuna"=396,
      "Videri"=397,
      "Vietcacao (A. Morin)"=398,
      "Vintage Plantations"=399,
      "Vintage Plantations (Tulicorp)"=400,
      "Violet Sky"=401,
      "Vivra"=402,
      "Wellington Chocolate Factory"=403,
      "Whittakers"=404,
      "Wilkie's Organic"=405,
      "Willie's Cacao"=406,
      "Wm"=407,
      "Woodblock"=408,
      "Xocolat"=409,
      "Xocolla"=410,
      "Zak's"=411,
      "Zart Pralinen"=412,
      "Zokoko"=413,
      "Zotter"=414
  ))
#Group company locations by continent (key follows the block)
chocolatedata_clean <- chocolatedata_clean %>%
  mutate(loc_continent = recode(
    chocolatedata_clean$`Company
Location`,
    "France" = 1,
    "U.S.A." = 5,
    "Fiji" = 4,
    "Ecuador" = 6,
    "Mexico" = 5,
    "Switzerland" = 1,
    "Netherlands" = 1,
    "Spain" = 1,
    "Peru" = 6,
    "Canada" = 5,
    "Italy" = 1,
    "Brazil" = 6,
    "U.K." = 1,
    "Australia" = 4,
    "Wales" = 1,
    "Belgium"= 1,
    "Germany"= 1,
    "Russia"= 3,
    "Puerto Rico"= 6,
    "Venezuela"=6,
    "Columbia"=6,
    "Japan"=3,
    "New Zealand"=4,
    "Costa Rico"=6,
    "South Korea"=3,
    "Amsterdam"=1,
    "Scotland"=1,
    "Martinique"=6,
    "Sao Tome"=2,
    "Argentina"=6,
    "Guatemala"=6,
    "South Africa"=2,
    "Bolivia"=6,
    "St. Lucia"=6,
    "Portugal"=1, #Portugal is in Europe (was coded 6)
    "Singapore"=3,
    "Vietnam"=3,
    "Grenada"=6,
    "Israel"=3, #Israel is in Asia (was coded 2)
    "India"=3,
    "Czech Republic"=1,
    "Dominican Republic"=6,
    "Finland"=1,
    "Madagascar"=2,
    "Philippines"=3,
    "Sweden"=1,
    "Poland"=1,
    "Austria"=1,
    "Honduras"=6,
    "Nicaragua"=6,
    "Lithuania"=1,
    "Chile"=6,
    "Ghana"=2,
    "Iceland"=1,
    "Hungary"=1,
    "Denmark"=1,
    "Suriname"=6,
    "Ireland"=1
  ))
#1 -- Europe; 2 -- Africa; 3 -- Asia; 4 -- Australia/Oceania; 5 -- North America; 6 -- South America (also used here for Central America and the Caribbean); 7 -- Antarctica (unused)

Leave ‘Specific Bean Origin or Bar Name’ alone because responses are highly individual, with few duplicates.

write.csv(chocolatedata_clean, file="chocolatedata_clean.csv")

Tidy Data Explanation


At first glance, the raw data set looked relatively clean. However, a lot of cleaning was needed. We cleaned up the data in several ways to make it easier to read, manipulate, and analyze. These changes included converting data to numeric, changing percents to decimals, and correcting spelling errors.
We started by removing the repeated labels in the first row of data. For some reason, when the data was imported into R, the column labels were duplicated as a row. This row was repetitive and unneeded. Throughout the data there were several “country” names that were spelled incorrectly. We corrected the ones we found to make the data easier to work with in the future. While working with the data, it was also hard to remember how the original makers spelled the column heads, which contain line breaks. The recoded columns we added have short names (loc, beantype, bborigin, companies, and loc_continent), so we do not have to look back at the data set every time we want to change or manipulate something.
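The spelling corrections described above can also be driven by a small lookup vector applied to one column at a time, which avoids touching unrelated columns. This is only a sketch: the `origin` vector is made-up sample data and `fixes` is a hypothetical helper name, not something from the original analysis.

```r
# Hypothetical lookup of misspelled -> corrected names (the three fixes used in this project)
fixes <- c("Eucador" = "Ecuador",
           "Domincan Republic" = "Dominican Republic",
           "Niacragua" = "Nicaragua")

# Made-up sample column standing in for one of the country columns
origin <- c("Eucador", "Peru", "Niacragua", "Ecuador")

# Replace only the entries that appear in the lookup; everything else is left alone
hit <- origin %in% names(fixes)
origin[hit] <- unname(fixes[origin[hit]])

print(origin)
```

Keeping the corrections in one named vector makes it easy to add a new misspelling later without writing another assignment line.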
One column gives the percentage of cocoa in each chocolate. In the raw data, this was stored as text with a percent sign. For our project, a decimal is more useful, so we stripped the percent sign and divided by 100 to make the values easier to use throughout our analysis.
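As a small illustration of the percent-to-decimal step, the sketch below applies the same idea to a made-up vector (`cocoa_raw` is sample data invented for this example, not rows from the dataset).

```r
# Made-up sample of cocoa percent strings as they might look after import
cocoa_raw <- c("63%", "70%", "72.5%", "100%")

# Drop the literal percent sign, convert to numeric, and rescale to the 0-1 range
cocoa_decimal <- as.numeric(sub("%", "", cocoa_raw, fixed = TRUE)) / 100

print(cocoa_decimal)
```

Using `fixed = TRUE` treats `%` as a plain character rather than a regular expression, which is all this conversion needs.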
Lastly, several columns in the dataset were not numeric, which makes it difficult to build graphs and tables. When the data was imported into RStudio, the numeric columns were read in as character, which is messy to work with: REF (a value linked to when the review was entered into the database; higher means more recent), Rating (the expert rating for the chocolate), and Review Date. We therefore converted them to numeric. This will be important later on, as the dataset is relatively large and these columns are valuable. Numeric data can be used to set up graphs that compare specific columns, and without the conversion R would have displayed an error message. We also assigned a number to each distinct value in the text columns (company, company location, bean type, and broad bean origin).
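The character-to-numeric conversion can be applied to several columns in one step and then checked for values that failed to convert. The sketch below assumes a tiny made-up data frame (`reviews`) with simplified column names; it is an illustration, not the project's data.

```r
# Made-up rows with numeric information stored as character, as happened on import
reviews <- data.frame(REF = c("1876", "1676", "1680"),
                      Rating = c("3.75", "2.75", "3"),
                      Review_Date = c("2016", "2015", "2015"),
                      stringsAsFactors = FALSE)

# Convert all three columns at once
num_cols <- c("REF", "Rating", "Review_Date")
reviews[num_cols] <- lapply(reviews[num_cols], as.numeric)

# Count NAs per column; a non-zero count would mean some text could not be parsed as a number
print(colSums(is.na(reviews[num_cols])))
str(reviews)
```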
This process was long and tedious. Because one of the columns had more than 400 distinct values, it took a long time to type each one out in the recode function in R. In the future, we would like to find a faster way to do this. Figuring out the syntax for recode raised lots of issues, but we eventually worked it out. Before landing on recode, we found that nested ifelse statements stopped working once they were nested more than about 50 levels deep. Now that the data is clean, it will be easier to manipulate.
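One faster alternative to typing every value into recode is to let R number the distinct values itself by converting the column to a factor. The sketch below uses a made-up `locations` vector; note that the codes it produces follow alphabetical order, so they would not match the hand-typed codes used in this project.

```r
# Made-up sample of company locations
locations <- c("France", "U.S.A.", "France", "Fiji")

# Fix the level order explicitly so the codes are reproducible
loc_levels <- sort(unique(locations))

# Each distinct value gets the integer position of its level
loc_code <- as.integer(factor(locations, levels = loc_levels))

# Keep a key so the numbers can be translated back to names later
loc_key <- data.frame(code = seq_along(loc_levels),
                      location = loc_levels,
                      stringsAsFactors = FALSE)

print(loc_code)
print(loc_key)
```

The same three lines work whether a column has 4 distinct values or 400, and the key table replaces the need to scroll back through a long recode call.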
There were many things that needed to be done to our data to make it usable. Now that we have everything organized, we can run tests on our data and make charts and graphs to look for correlations. Because the data is observational, the analysis can suggest relationships but cannot prove causation.

Data Analysis

We have made numerous graphs and charts to study the data we have acquired. Look below to see what we found!

Country Location (loc) on Rating of Chocolate

boxplot(chocolatedata_clean$Rating~chocolatedata_clean$loc, main="Country Location vs. Rating", xlab="Country Location", ylab="Rating", col=chocolatedata_clean$`Review
Date`) #Doesn't show much of an impact on ratings

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `loc`)) +
  geom_jitter() +
  geom_smooth(method = "lm") #Locations have similar rating distributions; see below for breakout by continent

ggplot(data=chocolatedata_clean,
       aes(y = `Rating`, x = `loc`)) +
  geom_bar(stat = "identity")+
  xlab("Country Location") #Bars stack every rating, so the tallest bar (the U.S., number 2) reflects the most reviews, not the highest ratings

Cocoa Percent on Rating of Chocolate

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `Cocoa
Percent`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Cocoa Percent") #As the cocoa percent increases, the rating of the chocolate tends to decrease.

Bean Type and Rating of Chocolate

boxplot(chocolatedata_clean$Rating~chocolatedata_clean$beantype, main="Bean Type vs. Rating", xlab="Bean Type", ylab="Rating", col=chocolatedata_clean$`Review
Date`) #Box colors are taken from the first values of Review Date, so they carry no meaning; there doesn't seem to be much correlation between bean type and rating

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `beantype`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Bean Type") #Doesn't show much correlation between the two variables

Company and Rating of Chocolate

boxplot(chocolatedata_clean$Rating~chocolatedata_clean$companies, main="Company vs. Rating", xlab="Company", ylab="Rating")

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `companies`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Companies")

ggplot(data=chocolatedata_clean,
       aes(y = `Rating`, x = `companies`, color=`loc`)) +
  geom_bar(stat = "identity")+
  xlab("Company") #Bars stack every rating, so the tallest bar (Soma) reflects the most reviews, not the highest ratings.

Broad Bean Origin and Rating of Chocolate

boxplot(chocolatedata_clean$Rating~chocolatedata_clean$bborigin, main="Broad Bean Origin vs. Rating", xlab="BB Origin", ylab="Rating")

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `bborigin`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Broad Bean Origin") #Doesn't show much of an effect either

REF and Rating of Chocolate

boxplot(chocolatedata_clean$Rating~chocolatedata_clean$REF, main="REF vs. Rating", xlab="REF", ylab="Rating")

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `REF`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("REF") #As REF goes up, the rating of the chocolate goes up as well

Continent and Rating of Chocolate

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `loc_continent`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Continent") #didn't show too much--boxplots are better

ggplot(chocolatedata_clean,
       aes(y = `Rating`, x = `loc_continent`)) +
  geom_boxplot() +
  facet_wrap(~`loc_continent`)

These boxplots show that Europe, Africa, Asia, Australia, North America, and South America had similar ratings but differ in range and interquartile range (IQR). The center line of each box is the median. Europe has more outliers than the other continents and a median of around 3.25. Africa has fewer outliers and a median of around 3.12; its IQR is larger, at around 0.75. Asia has no outliers, a median of 3.25 (the same as Europe), and a small IQR. Australia has the highest median at 3.45 and a range of 1.5. North America and South America are extremely similar: each has a median of 3.3 and one outlier. In conclusion, Australia had the highest median rating at 3.45 but also had the fewest data points available to analyze.
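Rather than reading the median, IQR, and group size off the plot, the same numbers can be computed directly. The sketch below assumes a small made-up data frame (`ratings`, with continent names written out); the values are invented for illustration and are not the figures quoted above.

```r
# Made-up ratings for two continents
ratings <- data.frame(continent = c("Europe", "Europe", "Europe", "Australia", "Australia"),
                      Rating = c(3.00, 3.25, 3.50, 3.50, 3.75),
                      stringsAsFactors = FALSE)

# Per-continent median, interquartile range, and number of reviews
med <- tapply(ratings$Rating, ratings$continent, median)
iqr <- tapply(ratings$Rating, ratings$continent, IQR)
n   <- tapply(ratings$Rating, ratings$continent, length)

continent_summary <- data.frame(continent = names(med),
                                median = as.vector(med),
                                iqr = as.vector(iqr),
                                n = as.vector(n))
print(continent_summary)
```

Including the count next to each median makes it clear when a continent's summary rests on only a handful of reviews.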

Discussion – Part 1

This data didn’t show as many correlations as we had hoped. The two factors that appeared to affect chocolate rating the most were the chocolate’s cocoa percent and its REF. As we hypothesized, the higher the cocoa percent in the chocolate, the lower its rating. This is likely because of the bitterness that comes with a high concentration of cocoa.
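A relationship like the one between cocoa percent and rating can be checked numerically as well as visually. The sketch below runs a Pearson correlation test on made-up values (`cocoa` and `rating` are invented for illustration), so its output says nothing about the real dataset.

```r
# Made-up cocoa fractions and ratings, chosen only to illustrate the call
cocoa  <- c(0.55, 0.65, 0.70, 0.75, 0.85, 1.00)
rating <- c(3.50, 3.50, 3.75, 3.25, 3.00, 2.00)

# Pearson correlation with a test of whether it differs from zero
check <- cor.test(cocoa, rating)

print(check$estimate) # sign and strength of the linear relationship
print(check$p.value)  # evidence against a correlation of zero
```

On the real data, the same call would take the Cocoa Percent and Rating columns of chocolatedata_clean in place of the two sample vectors.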

In addition, we found that the REF was related to the rating of the chocolate. As the REF went up, the rating also went up. The REF is the reference number assigned to each review, with higher values for more recent entries, so more recent reviews tended to score higher. There are a couple of possible explanations. For example, reviewers could have been craving chocolate more during the more recent tests and rated it higher because of this. The U.S. has the highest number of ratings collected of the countries studied, and Soma has the highest number of ratings collected of the companies studied. As a next step, we run a regression model to see if we can find anything else. Below you will find a regression model, plus trees, relating rating to several other variables.
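One caveat when putting number-coded categories such as loc into a regression: lm treats a numeric column as a quantity, so code 2 is assumed to sit halfway between codes 1 and 3. Wrapping the column in factor() fits a separate effect per category instead. The sketch below shows the difference on made-up data (`toy` is invented for illustration).

```r
# Made-up data: three locations coded 1-3, two ratings each
toy <- data.frame(Rating = c(3.00, 3.25, 3.50, 3.75, 2.75, 3.00),
                  loc = c(1, 1, 2, 2, 3, 3))

# Numeric code: a single slope, which assumes the codes are evenly spaced quantities
fit_numeric <- lm(Rating ~ loc, data = toy)

# Factor: one coefficient per non-baseline category, with no ordering assumed
fit_factor <- lm(Rating ~ factor(loc), data = toy)

print(coef(fit_numeric))
print(coef(fit_factor))
```

With hundreds of companies, a factor would add hundreds of coefficients, so whether this is practical depends on how many reviews each category has.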

Regression

#Predict the rating of the chocolate
chocolatereg <- chocolatedata_clean [c(3, 5, 7, 10, 11, 12, 13, 14)]

set.seed(5678)
trainIndex <- createDataPartition(chocolatereg$Rating, p = .65, list = FALSE)
chocTrain <- chocolatereg[ trainIndex,]
chocTest  <- chocolatereg[-trainIndex,]
choc.lm <- lm(Rating ~ ., chocTrain, na.action=na.omit)
choc.lm
## 
## Call:
## lm(formula = Rating ~ ., data = chocTrain, na.action = na.omit)
## 
## Coefficients:
##       (Intercept)                REF  `Cocoa\\nPercent`  
##         4.380e+00          7.908e-05         -1.546e+00  
##               loc           beantype           bborigin  
##        -1.594e-03         -1.324e-03          2.071e-04  
##         companies      loc_continent  
##         8.026e-05         -2.655e-02
# round_any() comes from the plyr package, which was never attached above;
# calling it as plyr::round_any() avoids the "could not find function" error
choc.lm.predict <- plyr::round_any(predict(choc.lm, chocTrain, na.action=na.pass), .25)
table(chocTrain$Rating, choc.lm.predict, useNA = "always")
choc.lm.predict2 <- round(choc.lm.predict, 0)
table(choc.lm.predict2)
table(chocTrain$Rating, choc.lm.predict2, useNA = "always")
plot(jitter(chocTrain$Rating,1), jitter(choc.lm.predict2,1), pch=20)
chocTr.lm.predict <- predict(choc.lm, newdata=chocTest, na.action=na.pass)
plot(jitter(chocTest$Rating,.5), jitter(chocTr.lm.predict,.5))

chocTr.lm.predict2 <- round(chocTr.lm.predict, digits=0)
table(chocTr.lm.predict2)
## chocTr.lm.predict2
##   3 
## 272
table( chocTr.lm.predict2, chocTest$Rating, useNA = "always" )
##                   
## chocTr.lm.predict2 1.5 1.75  2 2.25 2.5 2.75  3 3.25 3.5 3.75  4  5 <NA>
##               3      2    1  7    2  12   36 62   31  65   30 24  0    0
##               <NA>   2    0  7    5  27   56 60   72  72   33 20  1    0
choc.rpart <- rpart(Rating~.,chocolatereg, maxdepth=8, na.action=na.rpart) #na.action takes out if important
## Error in `[.data.frame`(m, labs): undefined columns selected
choc.rpart      # The output is a bit confusing
## Error in eval(expr, envir, enclos): object 'choc.rpart' not found
choc.rpart <- rpart(Rating~ chocolatereg$`Cocoa
Percent` + chocolatereg$loc_continent, data=chocolatereg, method="class")
rpart.plot(choc.rpart, digits=3, cex=.7, extra=2, under=TRUE)

prp(choc.rpart, digits=3, extra=101, cex=.7, box.palette = "PuBu")  #cooler

rpart.plot(choc.rpart, digits=3, extra=101, cex=.7, box.palette = "PuBu")

summary(choc.rpart)
## Call:
## rpart(formula = Rating ~ chocolatereg$`Cocoa\nPercent` + chocolatereg$loc_continent, 
##     data = chocolatereg, method = "class")
##   n= 1795 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.01069138      0 1.0000000 1.000000 0.01247619
## 2 0.01000000      1 0.9893086 1.005702 0.01238338
## 
## Variable importance
## chocolatereg$`Cocoa\\nPercent` 
##                            100 
## 
## Node number 1: 1795 observations,    complexity param=0.01069138
##   predicted class=3.5   expected loss=0.7816156  P(node) =1
##     class counts:     4    10     3    32    14   127   259   341   303   392   210    98     2
##    probabilities: 0.002 0.006 0.002 0.018 0.008 0.071 0.144 0.190 0.169 0.218 0.117 0.055 0.001 
##   left son=2 (184 obs) right son=3 (1611 obs)
##   Primary splits:
##       chocolatereg$`Cocoa\nPercent` < 0.785 to the right, improve=4.946878, (0 missing)
##       chocolatereg$loc_continent     < 1.5   to the right, improve=2.064035, (32 missing)
## 
## Node number 2: 184 observations
##   predicted class=2.75  expected loss=0.7717391  P(node) =0.102507
##     class counts:     1     6     2    10     2    17    42    38    31    27     6     2     0
##    probabilities: 0.005 0.033 0.011 0.054 0.011 0.092 0.228 0.207 0.168 0.147 0.033 0.011 0.000 
## 
## Node number 3: 1611 observations
##   predicted class=3.5   expected loss=0.7734327  P(node) =0.897493
##     class counts:     3     4     1    22    12   110   217   303   272   365   204    96     2
##    probabilities: 0.002 0.002 0.001 0.014 0.007 0.068 0.135 0.188 0.169 0.227 0.127 0.060 0.001

Discussion – Part 2

This regression analysis showed several things about our chocolate data. We first selected the variables that were of importance to us. Next, we split the data into a training set (65%) and a testing set. We fit a linear model (lm) on the training set and tabulated its predictions against the actual ratings to see how the ratings were distributed. Next, we plotted our rounded training predictions, and they look fairly normally distributed, with most of the ratings at 3 and spreading out from there. We then plotted our testing data. The scatter suggests that the predictions follow the center of the ratings reasonably well. The table of our test data, however, shows that every non-missing prediction rounds to 3 (272 rows) and that 355 rows have no prediction (NA), while the actual ratings are mostly between 2.75 and 3.75. Very little chocolate was given a low score. Lastly, we made some trees. If the cocoa percent was 0.785 or above, the predicted rating was 2.75, which is 0.75 lower than the 3.5 predicted for bars below 0.785. The remaining tree plots show the same tree with different display options.
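A scatter plot and a rounded table make it hard to say how well the model predicts. One common check is to compare the test-set error of the model with the error from always guessing the training mean. The sketch below uses simulated data and base R only: `toy`, `cocoa_percent`, and the `rmse` helper are assumptions for illustration, and `sample()` stands in for caret's `createDataPartition()`.

```r
set.seed(5678)

# Simulated data (invented): rating falls as cocoa percent rises, plus noise
n <- 200
toy <- data.frame(cocoa_percent = runif(n, 0.55, 1.00))
toy$Rating <- 4.4 - 1.5 * toy$cocoa_percent + rnorm(n, sd = 0.4)

# 65/35 train/test split, mirroring the proportion used above
train_rows <- sample(n, size = round(0.65 * n))
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]

# Fit on the training rows, predict on the held-out rows
toy_lm <- lm(Rating ~ cocoa_percent, data = toy_train)
pred   <- predict(toy_lm, newdata = toy_test)

# Hypothetical helper: root mean squared error, skipping missing predictions
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

rmse_model    <- rmse(toy_test$Rating, pred)
rmse_baseline <- rmse(toy_test$Rating, mean(toy_train$Rating))
print(c(model = rmse_model, baseline = rmse_baseline))
```

If the model's error is not clearly below the baseline, the predictors add little beyond the average rating, which is what a table of predictions that all round to 3 would suggest.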

From running the summary, we found that at the root node a split on cocoa percent had an improvement score of 4.95, compared with 2.06 for a split on continent, so the tree split only on cocoa percent. Our data has shown us many things about chocolate, but mainly that it is more complex than we thought. Many variables play a role in the rating of chocolate, including cocoa percent, REF, country, and continent.
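The improvement scores and variable importance can also be pulled out of a fitted tree directly, without reading through the full summary. This is a minimal sketch on simulated data, assuming a classification tree like ours. The `toy` data frame and its column names are invented, and the 0.785 cutoff is built into the simulation only so that the example has a split to find.

```r
library(rpart)

set.seed(5678)

# Simulated data (invented): bars at or above 0.785 cocoa get lower rating classes
n <- 300
toy <- data.frame(
  cocoa_percent = runif(n, 0.55, 1.00),
  loc_continent = sample(1:6, n, replace = TRUE)
)
low  <- sample(c(2.5, 2.75, 3), n, replace = TRUE)
high <- sample(c(3.25, 3.5, 3.75), n, replace = TRUE)
toy$Rating <- factor(ifelse(toy$cocoa_percent >= 0.785, low, high))

# Columns are named directly in the formula rather than written as toy$column
toy_tree <- rpart(Rating ~ cocoa_percent + loc_continent,
                  data = toy, method = "class")

# Variable used for the root split
root_var <- as.character(toy_tree$frame$var[1])
print(root_var)

# Importance per variable, and the candidate splits with their improve scores
print(toy_tree$variable.importance)
print(toy_tree$splits)
```

Naming the columns in the formula and passing the data frame through `data =` keeps the variable labels short in the output and lets `predict()` work on new data.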