\(\textbf{Display of the first 6 rows of our flavors dataset with variables cocoa percent, rating, and bean type.}\)
| cocoa_percent | rating | bean_type_new | ID |
|---|---|---|---|
| 63 | 3.75 | other | 1 |
| 70 | 2.75 | other | 2 |
| 70 | 3.00 | other | 3 |
| 70 | 3.50 | other | 4 |
| 70 | 3.50 | other | 5 |
| 70 | 2.75 | Criollo | 6 |
\(\textbf{Context of Data:}\) In 2005, Brady Brelinski went to a chocolate tasting that ignited in him a passion for exploring the world of high-quality chocolate. His website FlavorsofCacao.com, the source of our dataset, aims to identify and categorize the flavors tasted in plain origin dark chocolate based solely on the country of origin of the cacao. The author focuses mainly on plain dark chocolate with “an aim of appreciating the flavors of the cacao when made into chocolate.” The dataset is his attempt to establish a flavor profile for chocolates sourced from all over the world.
\(\textbf{Source of Data:}\) The author tasted, catalogued, and rated over 1,800 plain dark chocolate bars. Each row in the dataset corresponds to a chocolate bar that Brady himself reviewed. The rating represents an experience with one chocolate bar from one batch and is based solely on flavor, not on health benefits, social missions, or organic status. His rating scale is posted immediately below:
5 = Elite (Transcending beyond the ordinary limits)
4 = Premium (Superior flavor development, character and style)
3 = Satisfactory (3.0) to praiseworthy (3.75) (Well made with special qualities)
2 = Disappointing (Passable but contains at least one significant flaw)
1 = Unpleasant (Mostly unpalatable)
\(\textbf{Research Question:}\) FlavorsofCacao.com was originally conceived as a project to identify and categorize the flavors you’d taste in plain origin chocolate based solely on the country of origin of the cacao (the main ingredient of chocolate). For our project, we decided not to compare ratings across countries of origin because 1) country information is unavailable for some chocolates and 2) the dataset includes a broad array of countries, which would be difficult to compare all at once. Instead, we ask whether chocolate bars with certain cacao percentages and bean types have higher ratings than other chocolate bars.
\(\textbf{Limitations of the data:}\)
Not all the rows are complete because the author is still compiling data about the chocolate bars.
In cleaning up our dataset, we kept the three most prevalent bean types: Criollo, Forastero, and Trinitario. We realized afterward that some chocolate bars came from two different bean types (e.g. Criollo, Trinitario). To simplify things, we decided to classify these two-bean bars as just one type. Therefore, some rows in our cleaned-up dataset (i.e. flavors) that show only one bean type may in reality have two. However, we believed this would not significantly affect our analysis.
The data comes from one person, the author of FlavorsofCacao.com. This is both a strength and a weakness. On the one hand, all chocolate bars are rated by the same standard. On the other hand, because the ratings come from a single person, it is hard to tell whether they are accurate and reliable, even though he is a chocolate connoisseur.
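The bean-type simplification described above can be sketched in code. This is a minimal Python illustration, not the cleaning code we actually ran: the helper name `collapse_bean_type` is hypothetical, the label "Blend" is a made-up example of a less common type, and the rule that the first-listed main type wins for two-bean bars is an assumption, since the report does not record which rule was used.

```python
# Hypothetical helper: collapse a raw bean-type label into one of four levels.
MAIN_TYPES = ("Criollo", "Forastero", "Trinitario")


def collapse_bean_type(raw):
    """Return 'Criollo', 'Forastero', 'Trinitario', or 'other'.

    Assumption: when a label names several main types, the one that
    appears first in the label is kept.
    """
    if raw is None:
        return "other"  # missing bean type
    label = raw.lower()
    # Record where each main type occurs in the label (find returns -1 if absent).
    hits = [(label.find(t.lower()), t) for t in MAIN_TYPES]
    hits = [(pos, t) for pos, t in hits if pos >= 0]
    if not hits:
        return "other"  # none of the three main types is mentioned
    return min(hits)[1]  # main type with the earliest position


# Labels quoted in this report, plus a made-up "Blend" and a missing value.
examples = ["Criollo, Trinitario", "Forastero (Arriba)", "Criollo (Porcelana)", "Blend", None]
collapsed = [collapse_bean_type(x) for x in examples]
print(collapsed)
```

Under this rule a "Criollo, Trinitario" bar is counted as Criollo, which is exactly the kind of information loss this limitation describes.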
## Observations: 1,795
## Variables: 4
## $ cocoa_percent <dbl> 63, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, ...
## $ rating <dbl> 3.75, 2.75, 3.00, 3.50, 3.50, 2.75, 3.50, 3.50, ...
## $ bean_type_new <fct> other, other, other, other, other, Criollo, othe...
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## rating bean_type_new cocoa_percent
## Min. :1.000 Criollo :175 Min. : 42.0
## 1st Qu.:2.875 Forastero :195 1st Qu.: 70.0
## Median :3.250 other :949 Median : 70.0
## Mean :3.186 Trinitario:476 Mean : 71.7
## 3rd Qu.:3.500 3rd Qu.: 75.0
## Max. :5.000 Max. :100.0
## rating cocoa_percent
## rating 1.0000000 -0.1648202
## cocoa_percent -0.1648202 1.0000000
There’s a negative relationship between cocoa percentage and chocolate bar rating: as the percentage of cocoa increases, the chocolate bar rating tends to decrease. The relationship is weak, however, given the correlation coefficient of -0.165.
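For readers who want to see how such a correlation coefficient is computed, here is a minimal Python sketch of the Pearson formula. It is an illustration only: the function name `pearson_r` is our own, and the input is just the six rows displayed in the table at the top of this report, so the value it produces is not the full-data coefficient of -0.165.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# First six rows of the flavors table, for illustration only.
cocoa_percent = [63, 70, 70, 70, 70, 70]
rating = [3.75, 2.75, 3.00, 3.50, 3.50, 2.75]

r_six_rows = pearson_r(cocoa_percent, rating)
print(round(r_six_rows, 3))
```

With only six rows, a single 63% bar drives the result, which is why the coefficient from the full 1,795 rows is the one to interpret.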
Forastero chocolate bars have the lowest median rating, 3, while the other three bean types have approximately the same median rating of 3.5.
The rating of Forastero chocolate drops most quickly as cocoa percentage increases, displaying the steepest slope of the four bean types.
\(\textbf{Components of our multiple regression model:}\)
In our multiple regression model, the variables we are looking at are cocoa percentage, bean type, and rating of dark chocolate bars.
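\(\textbf{Explanatory/predictor variables: }\) cocoa percentage (numerical) and bean type (categorical with four levels: Criollo, Forastero, Trinitario, and other)
\(\textbf{Outcome variable: }\) chocolate bar rating (numerical, on a scale from 1 to 5)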
| term | estimate | std_error | statistic | p_value | conf_low | conf_high |
|---|---|---|---|---|---|---|
| intercept | 4.163 | 0.131 | 31.801 | 0.000 | 3.907 | 4.420 |
| cocoa_percent | -0.012 | 0.002 | -7.112 | 0.000 | -0.016 | -0.009 |
| bean_type_newForastero | -0.150 | 0.049 | -3.065 | 0.002 | -0.246 | -0.054 |
| bean_type_newother | -0.117 | 0.039 | -3.039 | 0.002 | -0.193 | -0.042 |
| bean_type_newTrinitario | -0.021 | 0.041 | -0.505 | 0.614 | -0.102 | 0.060 |
\(\textbf{Interpretation of our Table using Statistical Language}\)
The modeling equation is: \(\widehat{rating} = b_0 + b_{cocoa~percent} \cdot cocoa~percent + b_{Forastero} \cdot 1[is~Forastero] + b_{other} \cdot 1[is~other] + b_{Trinitario} \cdot 1[is~Trinitario]\)
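Substituting the rounded estimates from the regression table gives the fitted equation. Criollo has no indicator term because it serves as the baseline level:

\(\widehat{rating} = 4.163 - 0.012 \cdot cocoa~percent - 0.150 \cdot 1[is~Forastero] - 0.117 \cdot 1[is~other] - 0.021 \cdot 1[is~Trinitario]\)

As a worked example, the fitted rating for a 70% Criollo bar is \(4.163 - 0.012 \cdot 70 = 3.323\), while for a 70% Forastero bar it is \(4.163 - 0.012 \cdot 70 - 0.150 = 3.173\). Because the table values are rounded to three decimals, these figures are approximate.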
\(\textbf{Components:}\)
\(\textbf{Tying in Results of Multiple Regression Table with our Exploratory Data Analysis:}\)
As initially observed in the exploratory data analysis, all four bean types displayed negative slopes in the graph relating bean type, cocoa percentage, and chocolate bar rating. The negative slope estimate for cocoa percentage in the regression table (-0.012) further supports our preliminary observation.
\(\textbf{Potential Limitations in our Analysis:}\)
One potential limitation of our analysis is that a majority of chocolate bars have cocoa percentages in the 70% range. There are fewer chocolate bars with cocoa percentages lower than 50% than there are bars with 100% cocoa. This unequal distribution of cocoa percentage may bias the estimated relationship between cocoa percentage and chocolate bar rating, possibly affecting the observed slope between the two variables.
In using the parallel slopes model, we assume that the associated effect of cocoa percent on chocolate bar rating is the same for the Criollo, Forastero, Trinitario, and other bean types, which is not necessarily true.
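The following Python sketch makes the parallel slopes assumption concrete. It uses the rounded estimates from the regression table; the names `predict_rating`, `INTERCEPT`, `SLOPE`, and `OFFSETS` are our own and are not part of the original analysis.

```python
# Rounded estimates copied from the regression table in this report.
INTERCEPT = 4.163  # intercept for the baseline level, Criollo
SLOPE = -0.012     # cocoa_percent slope, shared by every bean type
OFFSETS = {        # intercept shift of each bean type relative to Criollo
    "Criollo": 0.0,
    "Forastero": -0.150,
    "other": -0.117,
    "Trinitario": -0.021,
}


def predict_rating(cocoa_percent, bean_type):
    """Fitted rating under the parallel slopes model."""
    return INTERCEPT + OFFSETS[bean_type] + SLOPE * cocoa_percent


for bean in OFFSETS:
    at_70 = predict_rating(70, bean)
    at_80 = predict_rating(80, bean)
    # The change from 70% to 80% is the same for every bean type by construction.
    print(f"{bean:<10} 70%: {at_70:.3f}  80%: {at_80:.3f}  change: {at_80 - at_70:+.3f}")
```

Every bean type shows the same change for a ten-point increase in cocoa percentage; only the starting level differs. A model with an interaction between cocoa percentage and bean type would be needed to let the slopes differ.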
Additionally, the relationship between cocoa percentage and rating may not be linear. So, fitting a straight line in our regression model may not describe the relationship between rating and cocoa percentage well.
Another limitation of our analysis is that we grouped a variety of bean types into the category “other” in order to compare 4 levels instead of 12. Individual bean types may each have a certain relationship with rating, but grouping them all together in one level may show a different relationship (as in Simpson’s paradox).
In addition, there were different types of Criollo and Forastero beans (e.g. Forastero (Arriba), Criollo (Porcelana)), but in our analysis we grouped all Forastero beans together and all Criollo beans together. Therefore, we do not know if the specific type of Forastero bean may contain an outlier that influences the chocolate rating.
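In the exploratory analysis, where each bean type has its own fitted line, the rating of chocolate made with Forastero beans drops more quickly as the chocolate becomes darker (contains a higher percentage of cocoa) than the rating for the other three bean types. The parallel slopes model cannot capture this difference, because it assigns every bean type the same slope.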
Note: This section is to be skipped for the initial submission and completed for the resubmission.