We all eat chocolate from time to time, but have you ever considered what makes one kind of chocolate better than another? Is it the concentration of cocoa? Does it matter which company makes it or where that company is located? Below, we explore a dataset from kaggle.com containing ratings of different chocolates.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ---------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rpart); library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart.plot); library(e1071); library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
chocolatedata_raw <- read_csv("chocolatedata.csv")
## Parsed with column specification:
## cols(
## `CompanyNA
## (Maker-if known)` = col_character(),
## `Specific Bean Origin
## or Bar Name` = col_character(),
## REF = col_character(),
## `Review
## Date` = col_character(),
## `Cocoa
## Percent` = col_character(),
## `Company
## Location` = col_character(),
## Rating = col_character(),
## `Bean
## Type` = col_character(),
## `Broad Bean
## Origin` = col_character()
## )
chocolatedata_clean <- read_csv("chocolatedata.csv")
## Parsed with column specification:
## cols(
## `CompanyNA
## (Maker-if known)` = col_character(),
## `Specific Bean Origin
## or Bar Name` = col_character(),
## REF = col_character(),
## `Review
## Date` = col_character(),
## `Cocoa
## Percent` = col_character(),
## `Company
## Location` = col_character(),
## Rating = col_character(),
## `Bean
## Type` = col_character(),
## `Broad Bean
## Origin` = col_character()
## )
write.csv(chocolatedata_raw, file="chocolatedata_raw.csv")
write.csv(chocolatedata_clean, file="chocolatedata_clean.csv")
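The column names in this file contain embedded line breaks (for example `Cocoa\nPercent`), which is why they have to be wrapped in backticks across two lines whenever they are used. A minimal sketch of tidying such names right after import is shown here; the data frame `toy` and its values are made up for illustration and are not taken from the dataset.

```r
# Toy data frame whose header names contain line breaks,
# mimicking the imported file (values are made up).
toy <- data.frame(a = c("63%", "70%"), b = c("3.75", "2.75"),
                  stringsAsFactors = FALSE)
names(toy) <- c("Cocoa\nPercent", "Rating")

# Collapse any run of whitespace (including "\n") into one underscore.
names(toy) <- gsub("\\s+", "_", names(toy))
names(toy)
```

With names like `Cocoa_Percent`, columns can be referenced without backticks. The rest of this report keeps the original names.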
The data imported from Kaggle contained many text (categorical) columns that couldn’t be analyzed easily. Below, we convert them to numeric variables and correct spelling and organizational errors.
#Remove first row with repeat labels
chocolatedata_clean = chocolatedata_clean[-1,]
View(chocolatedata_clean)
#Check to make sure there aren't columns with too many NA's
colMeans(is.na(chocolatedata_clean))
## CompanyNA\n(Maker-if known) Specific Bean Origin\nor Bar Name
## 0.00000000 0.00000000
## REF Review\nDate
## 0.00000000 0.00000000
## Cocoa\nPercent Company\nLocation
## 0.00000000 0.00000000
## Rating Bean\nType
## 0.00000000 0.49470752
## Broad Bean\nOrigin
## 0.04122563
#Change misspellings
chocolatedata_clean[chocolatedata_clean == "Eucador"] <- "Ecuador"
chocolatedata_clean[chocolatedata_clean == "Domincan Republic"] <- "Dominican Republic"
chocolatedata_clean[chocolatedata_clean == "Niacragua"] <- "Nicaragua"
#Make data numeric
chocolatedata_clean$REF <- as.numeric(as.character(chocolatedata_clean$REF))
chocolatedata_clean$Rating <- as.numeric(as.character(chocolatedata_clean$Rating))
chocolatedata_clean$`Review
Date` <- as.numeric(as.character(chocolatedata_clean$`Review
Date`))
#Make percent into a decimal
chocolatedata_clean$`Cocoa
Percent` <- as.numeric(sub("%", "",chocolatedata_clean$`Cocoa
Percent`,fixed=TRUE))/100
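The percent conversion can be checked on a few values in isolation. This is a small sketch with made-up strings in the same style as the Cocoa Percent column; `pct_chr` and `pct_dec` are illustrative names.

```r
# Made-up percent strings in the same style as the Cocoa Percent column.
pct_chr <- c("63%", "70%", "72.5%", "100%")

# Strip the literal "%" sign, convert to numeric, and rescale to 0-1.
pct_dec <- as.numeric(sub("%", "", pct_chr, fixed = TRUE)) / 100
pct_dec
```

Using `fixed = TRUE` treats `"%"` as a plain character rather than a regular expression, which matches the call used on the real column.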
#Assign numbers to factors to make Company Location (loc) numeric
chocolatedata_clean <- chocolatedata_clean %>%
mutate(loc = recode(
chocolatedata_clean$`Company
Location`,
"France" = 1,
"U.S.A." = 2,
"Fiji" = 3,
"Ecuador" = 4,
"Mexico" = 5,
"Switzerland" = 6,
"Netherlands" = 7,
"Spain" = 8,
"Peru" = 9,
"Canada" = 10,
"Italy" = 11,
"Brazil" = 12,
"U.K." = 13,
"Australia" = 14,
"Wales" = 15,
"Belgium"= 16,
"Germany"= 17,
"Russia"= 18,
"Puerto Rico"= 19,
"Venezuela"=20,
"Columbia"=21,
"Japan"=22,
"New Zealand"=23,
"Costa Rico"=24,
"South Korea"=25,
"Amsterdam"=26,
"Scotland"=27,
"Martinique"=28,
"Sao Tome"=29,
"Argentina"=30,
"Guatemala"=31,
"South Africa"=32,
"Bolivia"=33,
"St. Lucia"=34,
"Portugal"=35,
"Singapore"=36,
"Vietnam"=37,
"Grenada"=38,
"Israel"=39,
"India"=40,
"Czech Republic"=41,
"Dominican Republic"=42,
"Finland"=43,
"Madagascar"=44,
"Philippines"=45,
"Sweden"=46,
"Poland"=47,
"Austria"=48,
"Honduras"=49,
"Nicaragua"=50,
"Lithuania"=51,
"Chile"=52,
"Ghana"=53,
"Iceland"=54,
"Hungary"=55,
"Denmark"=56,
"Suriname"=57,
"Ireland"=58
))
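Typing out one number per country works, but the same kind of integer coding can also be produced automatically. Here is a minimal sketch, assuming the particular numbering does not matter: the codes follow alphabetical order, so they would not match the hand-assigned numbers. The vector `loc_chr` is made up for illustration.

```r
# Made-up company locations.
loc_chr <- c("France", "U.S.A.", "Fiji", "U.S.A.", "France")

# factor() collects the distinct values as levels (sorted alphabetically);
# as.integer() returns the position of each value among those levels.
loc_fac <- factor(loc_chr)
loc_num <- as.integer(loc_fac)

levels(loc_fac)
loc_num
```

Keeping `loc_fac` as a factor would also let `lm()` and `rpart()` treat the countries as categories rather than as points on a numeric scale.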
#Assign numbers to factors to make Bean Type (beantype) numeric
chocolatedata_clean <- chocolatedata_clean %>%
mutate(beantype = recode(
chocolatedata_clean$`Bean
Type`,
"Amazon"=1,
"Amazon mix"=2,
"Amazon, ICS"=3,
"Beniano"=4,
"Blend"=5,
"Blend-Forastero,Criollo"=6,
"CCN51"=7,
"ciol"=8,
"ciol (Arriba)"=9,
"Criollo"=10,
"Criollo (Amarru)"=11,
"Criollo (Ocumare 61)"=12,
"Criollo (Ocumare 67)"=13,
"Criollo (Ocumare 77)"=14,
"Criollo (Ocumare)"=15,
"Criollo (Porcela)"=16,
"Criollo (Wild)"=17,
"Criollo, +"=18,
"Criollo, Forastero"=19,
"Criollo, Trinitario"=20,
"EET"=21,
"Forastero"=22,
"Forastero (Amelodo)"=23,
"Forastero (Arriba)"=24,
"Forastero (Arriba) ASS"=25,
"Forastero (Arriba) ASSS"=26,
"Forastero (Catongo)"=27,
"Forastero (ciol)"=28,
"Forastero (Parazinho)"=29,
"Forastero(Arriba, CCN)"=30,
"Forastero, Trinitario"=31,
"Mati"=32,
"Trinitario"=33,
"Trinitario (85% Criollo)"=34,
"Trinitario (Amelodo)"=35,
"Trinitario (Scavi)"=36,
"Trinitario, ciol"=37,
"Trinitario, Criollo"=38,
"Trinitario, Forastero"=39,
"Trinitario, TCGA"=40
))
#Assign numbers to factors to make Broad Bean Origin (bborigin) numeric
chocolatedata_clean <- chocolatedata_clean %>%
mutate(bborigin = recode(
chocolatedata_clean$`Broad Bean
Origin`,
"Africa, Carribean, C. Am."=1,
"Australia"=2,
"Belize"=3,
"Bolivia"=4,
"Brazil"=5,
"Burma"=6,
"Cameroon"=7,
"Carribean"=8,
"Carribean(DR/Jam/Tri)"=9,
"Central and S. America"=10,
"Colombia"=11,
"Colombia, Ecuador"=12,
"Congo"=13,
"Cost Rica, Ven"=14,
"Costa Rica"=15,
"Cuba"=16,
"Dom. Rep., Madagascar"=17,
"Domincan Republic"=18,
"Dominican Rep., Bali"=19,
"Dominican Republic"=20,
"DR, Ecuador, Peru"=21,
"Ecuador"=22,
"Ecuador, Costa Rica"=23,
"Ecuador, Mad., PNG"=24,
"El Salvador"=25,
"Fiji"=26,
"Gabon"=27,
"Ghana"=28,
"GhaNA& Madagascar"=29,
"Ghana, Domin. Rep"=30,
"Ghana, Pama, Ecuador"=31,
"Gre., PNG, Haw., Haiti, Mad"=32,
"Greda"=33,
"Guatemala"=34,
"Haiti"=35,
"Hawaii"=36,
"Honduras"=37,
"India"=38,
"Indonesia"=39,
"Indonesia, Gha"=40,
"Ivory Coast"=41,
"Jamaica"=42,
"Liberia"=43,
"Mad., Java, PNG"=44,
"Madagascar"=45,
"Madagascar & Ecuador"=46,
"Malaysia"=47,
"Martinique"=48,
"Mexico"=49,
"Nicaragua"=50,
"Nigeria"=51,
"Pama"=52,
"Papua New Guinea"=53,
"Peru"=54,
"Peru(SMartin,Pangoa,ciol)"=55,
"Peru, Belize"=56,
"Peru, Dom. Rep"=57,
"Peru, Ecuador"=58,
"Peru, Ecuador, Venezuela"=59,
"Peru, Mad., Dom. Rep."=60,
"Peru, Madagascar"=61,
"Philippines"=62,
"PNG, Vanuatu, Mad"=63,
"Principe"=64,
"Puerto Rico"=65,
"Samoa"=66,
"Sao Tome"=67,
"Sao Tome & Principe"=68,
"Solomon Islands"=69,
"South America"=70,
"South America, Africa"=71,
"Sri Lanka"=72,
"St. Lucia"=73,
"Suriname"=74,
"Tanzania"=75,
"Tobago"=76,
"Togo"=77,
"Trinidad"=78,
"Trinidad, Ecuador"=79,
"Trinidad, Tobago"=80,
"Trinidad-Tobago"=81,
"Uganda"=82,
"Vanuatu"=83,
"Ven, Bolivia, D.R."=84,
"Ven, Trinidad, Ecuador"=85,
"Ven., Indonesia, Ecuad."=86,
"Ven., Trinidad, Mad."=87,
"Ven.,Ecu.,Peru,Nic."=88,
"Venez,Africa,Brasil,Peru,Mex"=89,
"Venezuela"=90,
"Venezuela, Carribean"=91,
"Venezuela, Dom. Rep."=92,
"Venezuela, Gha"=93,
"Venezuela, Java"=94,
"Venezuela, Trinidad"=95,
"Venezuela/ Gha"=96,
"Vietnam"=97,
"West Africa"=98,
"Guat., D.R., Peru, Mad., PNG"=99
))
#Assign numbers to factors to make Companies/Maker (companies) numeric
chocolatedata_clean <- chocolatedata_clean %>%
mutate(companies = recode(
chocolatedata_clean$`CompanyNA
(Maker-if known)`,
"A. Morin"=1,
"Acalli"=2,
"Adi"=3,
"Aequare (Gianduja)"=3,
"Ah Cacao"=4,
"Akesson's (Pralus)"=5,
"Alain Ducasse"=6,
"Alexandre"=7,
"Altus aka Cao Artisan"=8,
"Amano"=9,
"Amatller (Simon Coll)"=10,
"Amazona"=11,
"Ambrosia"=12,
"Amedei"=13,
"AMMA"=14,
"Anahata"=15,
"Animas"=16,
"Ara"=17,
"Arete"=18,
"Artisan du Chocolat"=19,
"Artisan du Chocolat (Casa Luker)"=20,
"Askinosie"=21,
"Bahen & Co."=22,
"Bakau"=23,
"Bar Au Chocolat"=24,
"Baravelli's"=25,
"Batch"=26,
"Beau Cacao"=27,
"Beehive"=28,
"Belcolade"=29,
"Bellflower"=30,
"Belyzium"=31,
"Benoit Nihant"=32,
"Bernachon"=33,
"Beschle (Felchlin)"=34,
"Bisou"=35,
"Bittersweet Origins"=36,
"Black Mountain"=37,
"Black River (A. Morin)"=38,
"Blanxart"=39,
"Blue Bandana"=40,
"Bonnat"=41,
"Bouga Cacao (Tulicorp)"=42,
"Bowler Man"=43,
"Brasstown aka It's Chocolate"=44,
"Brazen"=45,
"Breeze Mill"=46,
"Bright"=47,
"Britarev"=48,
"Bronx Grrl Chocolate"=49,
"Burnt Fork Bend"=50,
"Cacao Arabuco"=51,
"Cacao Atlanta"=52,
"Cacao Barry"=53,
"Cacao de Origen"=54,
"Cacao de Origin"=55,
"Cacao Hunters"=56,
"Cacao Market"=57,
"Cacao Prieto"=58,
"Cacao Sampaka"=59,
"Cacao Store"=60,
"Cacaosuyo (Theobroma Inversiones)"=61,
"Cacaoyere (Ecuatoriana)"=62,
"Callebaut"=63,
"C-Amaro"=64,
"Cao"=65,
"Caoni (Tulicorp)"=66,
"Captain Pembleton"=67,
"Caribeans"=68,
"Carlotta Chocolat"=69,
"Castronovo"=70,
"Cello"=71,
"Cemoi"=72,
"Chaleur B"=73,
"Charm School"=74,
"Chchukululu (Tulicorp)"=75,
"Chequessett"=76,
"Chloe Chocolat"=77,
"Chocablog"=78,
"Choco Del Sol"=79,
"Choco Dong"=80,
"Chocolarder"=81,
"Chocola'te"=82,
"Chocolate Alchemist-Philly"=83,
"Chocolate Con Amor"=84,
"Chocolate Conspiracy"=85,
"Chocolate Makers"=86,
"Chocolate Tree, The"=87,
"Chocolats Privilege"=88,
"ChocoReko"=89,
"Chocovic"=90,
"Chocovivo"=91,
"Choklat"=92,
"Chokolat Elot (Girard)"=93,
"Choocsol"=94,
"Christopher Morel (Felchlin)"=95,
"Chuao Chocolatier"=96,
"Chuao Chocolatier (Pralus)"=97,
"Claudio Corallo"=98,
"Cloudforest"=99,
"Coleman & Davis"=100,
"Compania de Chocolate (Salgado)"=101,
"Condor"=102,
"Confluence"=103,
"Coppeneur"=104,
"Cote d' Or (Kraft)"=105,
"Cravve"=106,
"Creo"=107,
"Daintree"=108,
"Dalloway"=109,
"Damson"=110,
"Dandelion"=111,
"Danta"=112,
"DAR"=113,
"Dark Forest"=114,
"Davis"=115,
"De Mendes"=116,
"De Villiers"=117,
"Dean and Deluca (Belcolade)"=118,
"Debauve & Gallais (Michel Cluizel)"=119,
"Desbarres"=120,
"DeVries"=121,
"Dick Taylor"=122,
"Doble & Bignall"=123,
"Dole (Guittard)"=124,
"Dolfin (Belcolade)"=125,
"Domori"=126,
"Dormouse"=127,
"Duffy's"=128,
"Dulcinea"=129,
"Durand"=130,
"Durci"=131,
"East Van Roasters"=132,
"Eau de Rose"=133,
"Eclat (Felchlin)"=134,
"Edelmond"=135,
"El Ceibo"=136,
"El Rey"=137,
"Emerald Estate"=138,
"Emily's"=139,
"ENNA"=140,
"Enric Rovira (Claudio Corallo)"=141,
"Erithaj (A. Morin)"=142,
"Escazu"=143,
"Ethel's Artisan (Mars)"=144,
"Ethereal"=145,
"Fearless (AMMA)"=146,
"Feitoria Cacao"=147,
"Felchlin"=148,
"Finca"=149,
"Forever Cacao"=150,
"Forteza (Cortes)"=151,
"Fossa"=152,
"Franceschi"=153,
"Frederic Blondeel"=154,
"French Broad"=155,
"Fresco"=156,
"Friis Holm"=157,
"Friis Holm (Bonnat)"=158,
"Fruition"=159,
"Garden Island"=160,
"Georgia Ramon"=161,
"Glennmade"=162,
"Goodnow Farms"=163,
"Grand Place"=164,
"Green & Black's (ICAM)"=165,
"Green Bean to Bar"=166,
"Grenada Chocolate Co."=167,
"Guido Castagna"=168,
"Guittard"=169,
"Habitual"=170,
"Hachez"=171,
"Hacienda El Castillo"=172,
"Haigh"=173,
"Harper Macaw"=174,
"Heilemann"=175,
"Heirloom Cacao Preservation (Brasstown)"=176,
"Heirloom Cacao Preservation (Fruition)"=177,
"Heirloom Cacao Preservation (Guittard)"=178,
"Heirloom Cacao Preservation (Manoa)"=179,
"Heirloom Cacao Preservation (Millcreek)"=180,
"Heirloom Cacao Preservation (Mindo)"=181,
"Heirloom Cacao Preservation (Zokoko)"=182,
"hello cocoa"=183,
"hexx"=184,
"Hogarth"=185,
"Hoja Verde (Tulicorp)"=186,
"Holy Cacao"=187,
"Honest"=188,
"Hotel Chocolat"=189,
"Hotel Chocolat (Coppeneur)"=190,
"Hummingbird"=191,
"Idilio (Felchlin)"=192,
"Indah"=193,
"Indaphoria"=194,
"Indi"=195,
"iQ Chocolate"=196,
"Isidro"=197,
"Izard"=198,
"Jacque Torres"=199,
"Jordis"=200,
"Just Good Chocolate"=201,
"Kah Kow"=202,
"Kakao"=203,
"Kallari (Ecuatoriana)"=204,
"Kaoka (Cemoi)"=205,
"Kerchner"=206,
"Ki' Xocolatl"=207,
"Kiskadee"=208,
"Kto"=209,
"K'ul"=210,
"Kyya"=211,
"L.A. Burdick (Felchlin)"=212,
"La Chocolaterie Nanairo"=213,
"La Maison du Chocolat (Valrhona)"=214,
"La Oroquidea"=215,
"La Pepa de Oro"=216,
"Laia aka Chat-Noir"=217,
"Lajedo do Ouro"=218,
"Lake Champlain (Callebaut)"=219,
"L'Amourette"=220,
"Letterpress"=221,
"Levy"=222,
"Lilla"=223,
"Lillie Belle"=224,
"Lindt & Sprungli"=225,
"Loiza"=226,
"Lonohana"=227,
"Love Bar"=228,
"Luker"=229,
"Machu Picchu Trading Co."=230,
"Madecasse (Cinagra)"=231,
"Madre"=232,
"Maglio"=233,
"Majani"=234,
"Malagasy (Chocolaterie Robert)"=235,
"Malagos"=236,
"Malie Kai (Guittard)"=237,
"Malmo"=238,
"Mana"=239,
"Manifesto Cacao"=240,
"Manoa"=241,
"Manufaktura Czekolady"=242,
"Map Chocolate"=243,
"Marana"=244,
"Marigold's Finest"=245,
"Marou"=246,
"Mars"=247,
"Marsatta"=248,
"Martin Mayer"=249,
"Mast Brothers"=250,
"Matale"=251,
"Maverick"=252,
"Mayacama"=253,
"Meadowlands"=254,
"Menakao (aka Cinagra)"=255,
"Mesocacao"=256,
"Metiisto"=257,
"Metropolitan"=258,
"Michel Cluizel"=259,
"Middlebury"=260,
"Millcreek Cacao Roasters"=261,
"Mindo"=262,
"Minimal"=263,
"Mission"=264,
"Mita"=265,
"Moho"=266,
"Molucca"=267,
"Momotombo"=268,
"Monarque"=269,
"Monsieur Truffe"=270,
"Montecristi"=271,
"Muchomas (Mesocacao)"=272,
"Mutari"=273,
"Nahua"=274,
"Naive"=275,
"Naïve"=276,
"Nanea"=277,
"Nathan Miller"=278,
"Neuhaus (Callebaut)"=279,
"Nibble"=280,
"Night Owl"=281,
"Noble Bean aka Jerjobo"=282,
"Noir d' Ebine"=283,
"Nova Monda"=284,
"Nuance"=285,
"Nugali"=286,
"Oakland Chocolate Co."=287,
"Obolo"=288,
"Ocelot"=289,
"Ocho"=290,
"Ohiyo"=291,
"Oialla by Bojessen (Malmo)"=292,
"Olive and Sinclair"=293,
"Olivia"=294,
"Omanhene"=295,
"Omnom"=296,
"organicfair"=297,
"Original Beans (Felchlin)"=298,
"Original Hawaiin Chocolate Factory"=299,
"Orquidea"=300,
"Pacari"=301,
"Palette de Bine"=302,
"Pangea"=303,
"Park 75"=304,
"Parliament"=305,
"Pascha"=306,
"Patric"=307,
"Paul Young"=308,
"Peppalo"=309,
"Pierre Marcolini"=310,
"Pinellas"=311,
"Pitch Dark"=312,
"Pomm (aka Dead Dog)"=313,
"Potomac"=314,
"Pralus"=315,
"Pump Street Bakery"=316,
"Pura Delizia"=317,
"Q Chocolate"=318,
"Quetzalli (Wolter)"=319,
"Raaka"=320,
"Rain Republic"=321,
"Rancho San Jacinto"=322,
"Ranger"=323,
"Raoul Boulanger"=324,
"Raw Cocoa"=325,
"Republica del Cacao (aka Confecta)"=326,
"Ritual"=327,
"Roasting Masters"=328,
"Robert (aka Chocolaterie Robert)"=329,
"Rococo (Grenada Chocolate Co.)"=330,
"Rogue"=331,
"Rozsavolgyi"=332,
"S.A.I.D."=333,
"Sacred"=334,
"Salgado"=335,
"Santander (Compania Nacional)"=336,
"Santome"=337,
"Scharffen Berger"=338,
"Seaforth"=339,
"Shark Mountain"=340,
"Shark's"=341,
"Shattel"=342,
"Shattell"=343,
"Sibu"=344,
"Sibu Sura"=345,
"Silvio Bessone"=346,
"Sirene"=347,
"Sjolinds"=348,
"Smooth Chocolator, The"=349,
"Snake & Butterfly"=350,
"Sol Cacao"=351,
"Solkiki"=352,
"Solomons Gold"=353,
"Solstice"=354,
"Soma"=355,
"Somerville"=356,
"Soul"=357,
"Spagnvola"=358,
"Spencer"=359,
"Sprungli (Felchlin)"=360,
"SRSLY"=361,
"Starchild"=362,
"Stella (aka Bernrain)"=363,
"Stone Grindz"=364,
"StRita Supreme"=365,
"Sublime Origins"=366,
"Summerbird"=367,
"Suruca Chocolate"=368,
"Svenska Kakaobolaget"=369,
"Szanto Tibor"=370,
"Tabal"=371,
"Tablette (aka Vanillabeans)"=372,
"Tan Ban Skrati"=373,
"Taza"=374,
"TCHO"=375,
"Tejas"=376,
"Terroir"=377,
"The Barn"=378,
"Theo"=379,
"Theobroma"=380,
"Timo A. Meyer"=381,
"To'ak (Ecuatoriana)"=382,
"Tobago Estate (Pralus)"=383,
"Tocoti"=384,
"Treehouse"=385,
"Tsara (Cinagra)"=386,
"twenty-four blackbirds"=387,
"Two Ravens"=388,
"Un Dimanche A Paris"=389,
"Undone"=390,
"Upchurch"=391,
"Urzi"=392,
"Valrhona"=393,
"Vanleer (Barry Callebaut)"=394,
"Vao Vao (Chocolaterie Robert)"=395,
"Vicuna"=396,
"Videri"=397,
"Vietcacao (A. Morin)"=398,
"Vintage Plantations"=399,
"Vintage Plantations (Tulicorp)"=400,
"Violet Sky"=401,
"Vivra"=402,
"Wellington Chocolate Factory"=403,
"Whittakers"=404,
"Wilkie's Organic"=405,
"Willie's Cacao"=406,
"Wm"=407,
"Woodblock"=408,
"Xocolat"=409,
"Xocolla"=410,
"Zak's"=411,
"Zart Pralinen"=412,
"Zokoko"=413,
"Zotter"=414
))
#TRY TO DIVIDE COUNTRIES BY CONTINENT
chocolatedata_clean <- chocolatedata_clean %>%
mutate(loc_continent = recode(
chocolatedata_clean$`Company
Location`,
"France" = 1,
"U.S.A." = 5,
"Fiji" = 4,
"Ecuador" = 6,
"Mexico" = 5,
"Switzerland" = 1,
"Netherlands" = 1,
"Spain" = 1,
"Peru" = 6,
"Canada" = 5,
"Italy" = 1,
"Brazil" = 6,
"U.K." = 1,
"Australia" = 4,
"Wales" = 1,
"Belgium"= 1,
"Germany"= 1,
"Russia"= 3,
"Puerto Rico"= 6,
"Venezuela"=6,
"Columbia"=6,
"Japan"=3,
"New Zealand"=4,
"Costa Rico"=6,
"South Korea"=3,
"Amsterdam"=1,
"Scotland"=1,
"Martinique"=6,
"Sao Tome"=2,
"Argentina"=6,
"Guatemala"=6,
"South Africa"=2,
"Bolivia"=6,
"St. Lucia"=6,
"Portugal"=1,
"Singapore"=3,
"Vietnam"=3,
"Grenada"=6,
"Israel"=3,
"India"=3,
"Czech Republic"=1,
"Dominican Republic"=6,
"Finland"=1,
"Madagascar"=2,
"Philippines"=3,
"Sweden"=1,
"Poland"=1,
"Austria"=1,
"Honduras"=6,
"Nicaragua"=6,
"Lithuania"=1,
"Chile"=6,
"Ghana"=2,
"Iceland"=1,
"Hungary"=1,
"Denmark"=1,
"Suriname"=6,
"Ireland"=1
))
#1 -- Europe; 2 -- Africa; 3 -- Asia; 4 -- Australia and Oceania; 5 -- North America; 6 -- South America, Central America, and the Caribbean; 7 -- Antarctica (unused)
write.csv(chocolatedata_clean, file="chocolatedata_clean.csv")
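A continent grouping like this can also be written as a lookup table, which keeps readable labels instead of numbers. The sketch below uses a named character vector; only a handful of countries are listed, and `continent_of` and the input values are made up for illustration.

```r
# Lookup table: names are countries, values are continent labels.
# Only a few countries are listed here for illustration.
continent_of <- c("France" = "Europe", "U.S.A." = "North America",
                  "Fiji" = "Oceania", "Japan" = "Asia", "Ghana" = "Africa")

# Made-up input, including one value that is not in the table.
loc_chr <- c("Japan", "France", "Atlantis", "Ghana")

# Index the table by name; unname() drops the country names from the result.
loc_continent_chr <- unname(continent_of[loc_chr])
loc_continent_chr  # countries missing from the table come back as NA
```

The `NA` returned for unknown values makes it easy to spot countries that still need an entry.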
We have made numerous graphs and charts to study the data we have acquired. Look below to see what we found!
boxplot(chocolatedata_clean$Rating~chocolatedata_clean$loc, main="Country Location vs. Rating", xlab="Country Location", ylab="Rating", col=chocolatedata_clean$`Review
Date`) #Doesn't show much of an impact on ratings
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `loc`)) +
geom_jitter() +
geom_smooth(method = "lm") #Locations have similar rating distributions; see below for breakout by continent
ggplot(data=chocolatedata_clean,
aes(y = `Rating`, x = `loc`)) +
geom_bar(stat = "identity")+
xlab("Country Location") #Bars add up every rating, so the U.S. (number 2) has the tallest bar because it has the most reviews
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = chocolatedata_clean$`Cocoa
Percent`)) +
geom_jitter() +
geom_smooth(method = "lm") +
xlab("Cocoa Percent") #This shows that as the cocoa percent increases, the rating of the chocolate tends to decrease.
boxplot(chocolatedata_clean$Rating~chocolatedata_clean$beantype, main="Bean Type vs. Rating", xlab="Bean Type", ylab="Rating", col=chocolatedata_clean$`Review
Date`) #Box colors are taken from the review date values and don't group the boxes; there doesn't seem to be much correlation between bean type and rating though
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `beantype`)) +
geom_jitter() +
geom_smooth(method = "lm") +
xlab("Bean Type") #Doesn't show much correlation between the two variables
boxplot(chocolatedata_clean$Rating~chocolatedata_clean$companies, main="Company vs. Rating", xlab="Company", ylab="Rating")
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `companies`)) +
geom_jitter() +
geom_smooth(method = "lm") +
xlab("Companies")
ggplot(data=chocolatedata_clean,
aes(y = `Rating`, x = `companies`, color=`loc`)) +
geom_bar(stat = "identity")+
xlab("Company") #The company with the tallest bar (the largest total of ratings, i.e. the most reviews) was Soma.
boxplot(chocolatedata_clean$Rating~chocolatedata_clean$bborigin, main="Broad Bean Origin vs. Rating", xlab="BB Origin", ylab="Rating")
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `bborigin`)) +
geom_jitter() +
geom_smooth(method = "lm") +
xlab("Broad Bean Origin") #Doesn't show much of an effect either
boxplot(chocolatedata_clean$Rating~chocolatedata_clean$REF, main="REF vs. Rating", xlab="REF", ylab="Rating")
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `REF`)) +
geom_jitter() +
geom_smooth(method = "lm") +
xlab("REF") #As REF goes up, the rating of the chocolate goes up as well
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `loc_continent`)) +
geom_jitter() +
geom_smooth(method = "lm") +
xlab("Continent") #didn't show too much--boxplots are better
ggplot(chocolatedata_clean,
aes(y = `Rating`, x = `loc_continent`)) +
geom_boxplot() +
facet_wrap(~`loc_continent`)
These boxplots show that Europe, Africa, Asia, Australia, North America, and South America had similar ratings but differ in range and interquartile range (IQR). Europe has more outliers than the other continents and a median of around 3.25. Africa has fewer outliers and a median of around 3.12. Its interquartile range is larger, at around 0.75. Asia has no outliers, a median of 3.25 (the same as Europe), and a small IQR. Australia has the highest median at around 3.45 and a range of 1.5. North America and South America are extremely similar: each has a median of around 3.3 and one outlier. In conclusion, Australia had the highest median rating at around 3.45, but also had the fewest data points available to analyze.
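The center line of a boxplot marks the median, so numeric summaries are a useful check on values read off the plots. The sketch below computes them per group on made-up ratings (not the real data); `toy` and its values are hypothetical.

```r
# Made-up ratings for two groups (not the real data).
toy <- data.frame(
  continent = rep(c("Europe", "Africa"), each = 6),
  rating    = c(3.00, 3.25, 3.25, 3.50, 3.75, 1.00,
                2.75, 3.00, 3.00, 3.25, 3.50, 3.50)
)

# Median, mean, and interquartile range per group.
med <- tapply(toy$rating, toy$continent, median)
avg <- tapply(toy$rating, toy$continent, mean)
iqr <- tapply(toy$rating, toy$continent, IQR)

# Number of points a boxplot would draw as outliers in each group.
n_out <- tapply(toy$rating, toy$continent,
                function(x) length(boxplot.stats(x)$out))

cbind(median = med, mean = avg, IQR = iqr, outliers = n_out)
```

In this toy example the single low rating in the Europe group pulls its mean below its median, which is why the two should not be used interchangeably.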
This data didn’t show as many correlations as we had hoped. The two biggest factors we found to affect chocolate rating were the chocolate’s cocoa percent and its REF. As we hypothesized, the higher the cocoa percent in the chocolate, the lower the rating it got. This is likely because of the bitterness that comes with a high concentration of cocoa.
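A trend like this can be quantified with a correlation test instead of being judged from a scatterplot alone. The following sketch uses simulated data, so the numbers it prints say nothing about the real chocolate ratings; `cocoa` and `rating` are generated only for illustration.

```r
# Simulated data, for illustration only: ratings drift down as cocoa rises.
set.seed(1)
cocoa  <- runif(200, min = 0.5, max = 1.0)
rating <- 4.4 - 1.5 * cocoa + rnorm(200, sd = 0.4)

# Pearson correlation with a significance test.
ct <- cor.test(cocoa, rating)
ct$estimate   # correlation coefficient
ct$p.value    # small values suggest the trend is not just noise

# Slope of the fitted line, comparable to geom_smooth(method = "lm").
coef(lm(rating ~ cocoa))["cocoa"]
```

Running the same two calls on the Rating and Cocoa Percent columns would give a number to report alongside the plot.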
In addition, we found that the REF had an influence on the rating of the chocolate. As the REF went up, the rating also went up. The REF is the reference number assigned to each review, with higher numbers corresponding to more recent reviews. One possible explanation is that reviewers were craving chocolate more in the more recent tests and rated it higher because of this. The U.S. had the highest number of ratings collected of the countries studied. Soma had the highest number of ratings collected of the companies studied. To see if we could find anything else, we ran a regression model. Below you will find a regression model plus trees that predict rating from several other variables.
#Predict the rating of the chocolate
chocolatereg <- chocolatedata_clean[c(3, 5, 7, 10, 11, 12, 13, 14)] #Keep REF, Cocoa Percent, Rating, loc, beantype, bborigin, companies, loc_continent
set.seed(5678)
trainIndex <- createDataPartition(chocolatereg$Rating, p = .65, list = FALSE)
chocTrain <- chocolatereg[ trainIndex,]
chocTest <- chocolatereg[-trainIndex,]
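The idea behind the training and testing split can be sketched in base R. Note that this is a plain random split, not the procedure `createDataPartition()` uses, which also balances the outcome across the two sets; the data frame `toy` is made up for illustration.

```r
set.seed(5678)

# Made-up data: 20 rows with an id and a rating.
toy <- data.frame(id = 1:20, rating = rep(c(2.5, 3, 3.5, 4), times = 5))

# Draw 65% of the row numbers at random for training.
train_rows <- sample(nrow(toy), size = round(0.65 * nrow(toy)))
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]   # everything not drawn goes to testing

nrow(toy_train)
nrow(toy_test)
```

No row appears in both sets, so the test rows can be used to check a model fitted on the training rows.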
choc.lm <- lm(Rating ~ ., chocTrain, na.action=na.omit)
choc.lm
##
## Call:
## lm(formula = Rating ~ ., data = chocTrain, na.action = na.omit)
##
## Coefficients:
## (Intercept) REF `Cocoa\\nPercent`
## 4.380e+00 7.908e-05 -1.546e+00
## loc beantype bborigin
## -1.594e-03 -1.324e-03 2.071e-04
## companies loc_continent
## 8.026e-05 -2.655e-02
choc.lm.predict <- round(predict(choc.lm, chocTrain, na.action=na.pass) / .25) * .25 #round to the nearest 0.25 in base R; round_any() is from plyr, which isn't loaded
table(chocTrain$Rating, choc.lm.predict, useNA = "always" )
choc.lm.predict2 <- round(choc.lm.predict, 0)
table(choc.lm.predict2) #Got it.
table(chocTrain$Rating, choc.lm.predict2, useNA = "always" )
plot(jitter(chocTrain$Rating,1), jitter(choc.lm.predict2,1), pch=20)
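Ratings in this dataset move in quarter-point steps, so it can help to round predictions to the nearest 0.25 before comparing them with actual ratings. A base R sketch of that rounding follows; the vector `pred` holds made-up values.

```r
# Made-up predictions on the rating scale.
pred <- c(2.87, 3.12, 3.13, 3.40, 3.62)

# Divide by the step, round to a whole number of steps, multiply back.
step   <- 0.25
pred_q <- round(pred / step) * step
pred_q
```

The plyr package offers `round_any()` for the same purpose, but plyr is not among the libraries loaded at the top of this report.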
chocTr.lm.predict <- predict(choc.lm, newdata=chocTest, na.action=na.pass)
plot(jitter(chocTest$Rating,.5), jitter(chocTr.lm.predict,.5))
chocTr.lm.predict2 <- round(chocTr.lm.predict, digits=0)
table(chocTr.lm.predict2)
## chocTr.lm.predict2
## 3
## 272
table( chocTr.lm.predict2, chocTest$Rating, useNA = "always" )
##
## chocTr.lm.predict2 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4 5 <NA>
## 3 2 1 7 2 12 36 62 31 65 30 24 0 0
## <NA> 2 0 7 5 27 56 60 72 72 33 20 1 0
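Because every rounded test prediction lands on 3, the table alone says little about how close the unrounded predictions are. A numeric error measure is one way to check this. The sketch below uses made-up `actual` and `predicted` vectors and compares the model error with a baseline that always predicts the mean rating.

```r
# Made-up actual and predicted ratings.
actual    <- c(3.00, 3.50, 2.75, 4.00, 3.25)
predicted <- c(3.25, 3.25, 3.00, 3.50, 3.25)

err  <- actual - predicted
rmse <- sqrt(mean(err^2))   # root mean squared error
mae  <- mean(abs(err))      # mean absolute error

# Baseline: always predict the mean rating.
rmse_base <- sqrt(mean((actual - mean(actual))^2))

c(rmse = rmse, mae = mae, rmse_baseline = rmse_base)
```

A model is only useful if its error is clearly below the baseline. For the real data, `actual` would be `chocTest$Rating` and `predicted` would be `chocTr.lm.predict`, with `na.rm = TRUE` added to the `mean()` calls because some predictions are missing.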
choc.rpart <- rpart(Rating~., setNames(chocolatereg, make.names(names(chocolatereg))), maxdepth=8, na.action=na.rpart) #make.names() turns `Cocoa\nPercent` into Cocoa.Percent, since rpart can't look up column names that need backticks; na.rpart only drops rows missing the Rating or every predictor
choc.rpart # The output is a bit confusing
choc.rpart <- rpart(Rating~ chocolatereg$`Cocoa
Percent` + chocolatereg$loc_continent, data=chocolatereg, method="class")
rpart.plot(choc.rpart, digits=3, cex=.7, extra=2, under=TRUE)
prp(choc.rpart, digits=3, extra=101, cex=.7, box.palette = "PuBu") #cooler
rpart.plot(choc.rpart, digits=3, extra=101, cex=.7, box.palette = "PuBu")
summary(choc.rpart)
## Call:
## rpart(formula = Rating ~ chocolatereg$`Cocoa\nPercent` + chocolatereg$loc_continent,
## data = chocolatereg, method = "class")
## n= 1795
##
## CP nsplit rel error xerror xstd
## 1 0.01069138 0 1.0000000 1.000000 0.01247619
## 2 0.01000000 1 0.9893086 1.005702 0.01238338
##
## Variable importance
## chocolatereg$`Cocoa\\nPercent`
## 100
##
## Node number 1: 1795 observations, complexity param=0.01069138
## predicted class=3.5 expected loss=0.7816156 P(node) =1
## class counts: 4 10 3 32 14 127 259 341 303 392 210 98 2
## probabilities: 0.002 0.006 0.002 0.018 0.008 0.071 0.144 0.190 0.169 0.218 0.117 0.055 0.001
## left son=2 (184 obs) right son=3 (1611 obs)
## Primary splits:
## chocolatereg$`Cocoa\nPercent` < 0.785 to the right, improve=4.946878, (0 missing)
## chocolatereg$loc_continent < 1.5 to the right, improve=2.064035, (32 missing)
##
## Node number 2: 184 observations
## predicted class=2.75 expected loss=0.7717391 P(node) =0.102507
## class counts: 1 6 2 10 2 17 42 38 31 27 6 2 0
## probabilities: 0.005 0.033 0.011 0.054 0.011 0.092 0.228 0.207 0.168 0.147 0.033 0.011 0.000
##
## Node number 3: 1611 observations
## predicted class=3.5 expected loss=0.7734327 P(node) =0.897493
## class counts: 3 4 1 22 12 110 217 303 272 365 204 96 2
## probabilities: 0.002 0.002 0.001 0.014 0.007 0.068 0.135 0.188 0.169 0.227 0.127 0.060 0.001
This regression model showed several things about our chocolate data. We first selected the variables that were of importance to us. Next, we split the data into training and testing sets. We fit a linear model on the training set, generated predictions, and tabulated how many chocolates received each rating. Next, we plotted our rounded predictions, and they looked fairly normally distributed, with most of the ratings at 3 and spreading out from there. We then plotted our testing data. The scatter suggests the ratings were predicted fairly well, although every rounded test prediction comes out at 3. The table of our test data shows that most ratings fell between 2.75 and 3.75; few chocolates were scored low. Lastly, we made some trees. If the cocoa percent was 0.785 or above, the predicted rating was 2.75, which is .75 lower than the 3.5 predicted when it was below 0.785. The next trees show predictions for the other variables given cocoa percent.
From running the summary, we found that the best split on cocoa percent had an improvement score of 4.95, compared with 2.06 for the best split on continent, so the tree split only on cocoa percent. Our data has shown us many things about chocolate, but mainly that it is more complex than we thought. Many variables play a role in the rating of chocolate, including cocoa percent, REF, country, and continent.
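The improvement scores and variable importance can also be pulled out of a fitted tree directly rather than read from the printed summary. The sketch below uses the built-in `mtcars` data as a stand-in for the chocolate data. It assumes, based on the rpart object documentation as I understand it, that the `improve` column of `fit$splits` is where the `improve=` figures in `summary()` come from.

```r
library(rpart)  # also loaded at the top of this report

# Built-in mtcars data stands in for the chocolate data here.
fit <- rpart(mpg ~ wt + hp, data = mtcars)

# Rows cover primary, competing, and surrogate splits. For primary and
# competing splits the "improve" column holds the improvement score;
# for surrogate rows it holds agreement with the primary split instead.
fit$splits[, "improve"]

# Overall importance of each predictor across the tree.
fit$variable.importance
```

On the chocolate tree, the same two expressions applied to `choc.rpart` would return the values discussed in this section.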