1 Introduction

\(\textbf{Context of Data}\)

In 2005, Brady Belinski went to a chocolate tasting. This event ignited in him a passion for exploring the world of high-quality chocolate. His website, FlavorsofCacao.com, aims to identify and categorize the flavors tasted in plain origin dark chocolate based solely on the country of origin of the cocoa. The author focuses mainly on plain dark chocolate with “an aim of appreciating the flavors of the cocoa when made into chocolate.” He created the Flavors of Cacao dataset in an attempt to establish a flavor profile for various chocolates sourced from all over the world.

\(\textbf{Source of Data}\)

The author tasted, cataloged, and rated over 1,800 plain dark chocolate bars. Each row in the dataset corresponds to a chocolate bar Brady himself reviewed. The rating represents an experience with one chocolate bar from one batch and is based solely on flavor and not on health benefits, social missions, or organic status.

  • \(Rating~Scale\)
  1. Unpleasant (Mostly unpalatable)
  2. Disappointing (Passable but contains at least one significant flaw)
  3. Satisfactory to praiseworthy (3.75) (Well made with special qualities)
  4. Premium (Superior flavor development, character and style)
  5. Elite (Transcending beyond the ordinary limits)

\(\textbf{Research Question}\)

The FlavorsofCacao.com project set out to identify and categorize the flavors in chocolate based solely on the country of origin of the cocoa (the main ingredient of chocolate).

For our project, we decided not to compare flavor ratings with country of origin because 1) some chocolates did not list a country of origin, and 2) the dataset includes a broad array of countries, which would be difficult to compare all at once. Instead, we decided to investigate whether chocolate bars with certain cocoa percentages and bean types have higher ratings than other chocolate bars.

\(\textbf{Limitations of the Data}\)

  1. Not all the rows are complete because the author is still compiling data for the chocolate bars.

  2. In cleaning up our data set, we kept the 3 most prevalent bean types: Criollo, Forastero, and Trinitario. We realized afterward that some chocolate bars were made from two different bean types (e.g. Criollo, Trinitario). To simplify things, we classified each two-bean bar as a single type, namely whichever type was listed first. Therefore, some of the rows in our cleaned-up dataset (i.e. flavors), which show only one bean type, may in reality have two bean types. However, we believed this would not significantly affect our analysis.

  3. The data comes from one person, the author of FlavorsofCacao.com, which has both pros and cons. We do not need to worry that the personal preferences of multiple raters will bias the rating. But because these ratings are from one person—even if he is a chocolate connoisseur—it is hard to tell if they can be generalized.


2 Exploratory Data Analysis

Raw Data Glimpse

cocoa_percent  rating  bean_type_new  ID
63             3.75    other          1
70             2.75    other          2
70             3.00    other          3
70             3.50    other          4
70             3.50    other          5
70             2.75    Criollo        6

Summary Statistics

count  avg_rating  avg_percent  IQR
1795   3.185933    71.69833     0.625

Summary Statistics by Type

bean_type_new  count  avg_rating  avg_percent  IQR
Criollo        175    3.267143    71.90857     0.50
Forastero      195    3.112821    72.27692     0.75
other          949    3.154373    71.54584     0.75
Trinitario     476    3.248950    71.68803     0.50
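
As a quick consistency check on the two summary tables, the overall average rating should equal the count-weighted average of the per-bean averages. The sketch below is in Python and hard-codes the figures from the tables above. The variable names are illustrative and are not taken from the original analysis.

```python
# Per-bean (count, average rating), copied from "Summary Statistics by Type".
by_type = {
    "Criollo": (175, 3.267143),
    "Forastero": (195, 3.112821),
    "other": (949, 3.154373),
    "Trinitario": (476, 3.248950),
}

# The group counts should add up to the overall count.
total_count = sum(count for count, _ in by_type.values())

# The count-weighted mean of the group averages should match the overall average.
weighted_avg = sum(count * avg for count, avg in by_type.values()) / total_count

print("total count:", total_count)
print("count-weighted average rating:", round(weighted_avg, 4))
```

Both values agree with the overall "Summary Statistics" row (1795 bars, average rating of about 3.186), so the grouped and ungrouped summaries describe the same set of rows.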

Correlation Matrix

##                   rating cocoa_percent
## rating         1.0000000    -0.1648202
## cocoa_percent -0.1648202     1.0000000
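
For readers unfamiliar with how an entry of the correlation matrix is obtained, here is a minimal Python sketch of the Pearson correlation coefficient. It is applied only to the six rows shown in the raw data glimpse, so its result will not match the full-data value of -0.165. The function name `pearson_r` is our own choice for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The six rows from "Raw Data Glimpse" (NOT the full 1,795-row dataset).
cocoa_percent = [63, 70, 70, 70, 70, 70]
rating = [3.75, 2.75, 3.00, 3.50, 3.50, 2.75]

r = pearson_r(cocoa_percent, rating)
print("correlation on the six glimpse rows:", r)
```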

The box plot shows the median cocoa percentage of the chocolate bars across different bean types. The bean type “other” has the largest IQR, likely because so many different beans are grouped under that category. Forastero chocolate bars have the lowest median rating of 3, while the other three bean types have approximately the same median rating of 3.5. Most points appear to be in the 60-80% cocoa percentage range.

The faceted scatterplot shows that the “other” category also contains the most points, followed by Trinitario, Forastero, and Criollo. Forastero chocolate bars also tend to have a wider range of cocoa percentages than the Criollo and Trinitario bean types.

In the colored scatter plot above, each point represents a chocolate bar that Brady Belinski tasted. Its x coordinate represents the cocoa percentage of the bar, its y coordinate is the rating Belinski assigned, and its color indicates the bean type of the bar.

The colored scatter plot shows a negative relationship between cocoa percentage and chocolate bar rating. As the percentage of cocoa increases, there is an associated decrease in chocolate bar rating. Given the correlation coefficient of -0.165, the relationship between the variables is weak. Additionally, the rating of Forastero chocolate drops most quickly as cocoa percentage increases, displaying the steepest slope of the four bean types.


3 Multiple Regression

\(\textbf{Components of our Multiple Regression Model}\)

In our multiple regression model, we are looking at variables cocoa percentage, bean type, and rating of dark chocolate bars.

\(\textbf{Explanatory/Predictor Variables}\)

  • numerical variable: cocoa percentage
  • categorical variable: bean type (Criollo, Forastero, Trinitario, and Other)

\(\textbf{Outcome Variable}\)

  • numerical variable: rating (1-5, 1= Unpleasant, 5= Elite)

\(\textbf{Usage of Interaction Slopes Model for Multiple Regression}\)

We decided to use the interaction slopes model over the parallel slopes model for multiple regression because, when we looked at the confidence intervals for the interaction between cocoa percent and bean type, at least one interval did not include 0 (namely, the interval for Forastero relative to the baseline Criollo). This suggests that the associated effect of cocoa percentage on chocolate bar rating is truly different for Forastero than for Criollo.

term                                   estimate  std_error  statistic  p_value  conf_low  conf_high
intercept                              3.797     0.458      8.287      0.000    2.898     4.695
cocoa_percent                          -0.007    0.006      -1.159     0.247    -0.020    0.005
bean_type_newForastero                 1.070     0.545      1.965      0.050    0.002     2.139
bean_type_newother                     0.120     0.488      0.246      0.806    -0.837    1.077
bean_type_newTrinitario                0.026     0.545      0.048      0.962    -1.044    1.096
cocoa_percent:bean_type_newForastero   -0.017    0.008      -2.244     0.025    -0.032    -0.002
cocoa_percent:bean_type_newother       -0.003    0.007      -0.486     0.627    -0.017    0.010
cocoa_percent:bean_type_newTrinitario  -0.001    0.008      -0.085     0.932    -0.015    0.014

3.1 Statistical Interpretation

\(\textbf{Interpretation of our Table using Statistical Language}\)

The modeling equation according to the interaction model is:

\(\widehat{rating} = b_0 + b_{cocoa~percent} * cocoa~percent~+\) \(b_{Forastero} * 1[is~Forastero]~+ b_{cocoa~percent,~Forastero} * cocoa~percent~ * 1[is~Forastero]~+\) \(b_{other} * 1[is~other]~+ b_{cocoa~percent,~other} * cocoa~percent~ * 1[is~other]~+\) \(~ b_{Trinitario} * 1[is~Trinitario]~+ b_{cocoa~percent,~Trinitario} * cocoa~percent~ * 1[is~Trinitario]~\)

\(\textbf{Modeling~Equation~for~each~level}\)

  • Criollo:
    • \(\widehat{rating} = b_0 + b_{cocoa~percent}* cocoa~percent~\)
      • \(= 3.797 - 0.007 * cocoa~percent~\)
  • Forastero:
    • \(\widehat{rating} = b_0 + b_{cocoa~percent} * cocoa~percent~+~b_{Forastero} * 1[is~Forastero]~+~b_{cocoa~percent,~Forastero} * cocoa~percent~ * 1[is~Forastero]~\)
    • \(\widehat{rating} = (3.797 + 1.070) + (- 0.007 - 0.017) * cocoa~percent\)
  • other:
    • \(\widehat{rating} = b_0 + b_{cocoa~percent} * cocoa~percent~+~b_{other} * 1[is~other]~+~b_{cocoa~percent,~other} * cocoa~percent~ * 1[is~other]~\)
    • \(\widehat{rating} = (3.797 + 0.120) + (- 0.007 - 0.003) * cocoa~percent\)
  • Trinitario:
    • \(\widehat{rating} = b_0 + b_{cocoa~percent} * cocoa~percent~+~b_{Trinitario} * 1[is~Trinitario]~+~b_{cocoa~percent,~Trinitario} * cocoa~percent~ * 1[is~Trinitario]~\)
    • \(\widehat{rating} = (3.797 + 0.026) + (- 0.007 - 0.001) * cocoa~percent\)
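
The per-level equations above can be collapsed into one intercept and one slope per bean type. The Python sketch below does that arithmetic using the rounded coefficients from the regression table, then evaluates each line at 70% cocoa as an example. The names `line_for` and `predict_rating` are hypothetical helpers, not part of the original analysis, and the results inherit the three-decimal rounding of the table.

```python
# Baseline (Criollo) coefficients, copied from the regression table.
INTERCEPT = 3.797
SLOPE = -0.007

# (intercept offset, slope offset) for each bean type, relative to Criollo.
OFFSETS = {
    "Criollo": (0.0, 0.0),
    "Forastero": (1.070, -0.017),
    "other": (0.120, -0.003),
    "Trinitario": (0.026, -0.001),
}


def line_for(bean_type):
    """Return (intercept, slope) of the fitted line for one bean type."""
    d_intercept, d_slope = OFFSETS[bean_type]
    return INTERCEPT + d_intercept, SLOPE + d_slope


def predict_rating(cocoa_percent, bean_type):
    """Fitted rating for a bar with the given cocoa percentage and bean type."""
    intercept, slope = line_for(bean_type)
    return intercept + slope * cocoa_percent


for bean in OFFSETS:
    intercept, slope = line_for(bean)
    print(
        f"{bean}: rating_hat = {intercept:.3f} + ({slope:.3f}) * cocoa_percent; "
        f"at 70% cocoa: {predict_rating(70, bean):.3f}"
    )
```

Under these rounded coefficients, Forastero has the highest intercept (4.867) but also the steepest slope (-0.024), which is why its fitted line falls below the others at typical cocoa percentages.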

\(\textbf{Interpretations}\)

  • intercept: 3.797
    • The intercept is a starting value and indicates that a chocolate bar of 0% cocoa will have a rating of 3.797. Since there are no chocolate bars with this cocoa percentage (would it even be chocolate?), this value has no meaningful interpretation.
  • cocoa_percent: -0.007
    • The slope indicates that for every increase of 1 percentage point in cocoa, there is an associated decrease in rating of 0.007 for chocolate made from Criollo beans (the baseline bean type), accounting for all other variables in the model.
  • cocoa_percent:bean_type_newForastero: -0.017
    • The value -0.017 is the bump in slope for Forastero beans compared to Criollo beans. For every increase of 1 percentage point in cocoa, the rating for Forastero bars decreases by 0.017 more than that of Criollo bars, for a total slope of -0.024.
  • cocoa_percent:bean_type_newTrinitario: -0.001
    • The value -0.001 is the bump in slope for Trinitario beans compared to Criollo beans.

\(\textbf{Tying in Results of Multiple Regression Table with our Exploratory Data Analysis}\)

As initially observed in the exploratory data analysis, all four bean types displayed negative slopes in the graph representing the three variables of bean type, cocoa percentage, and chocolate bar rating. The negative slope values obtained from the regression table (-0.007 for Criollo, -0.024 for Forastero, -0.010 for other, and -0.008 for Trinitario) further support our preliminary observation.

\(\textbf{Potential Limitations in our Analysis}\)

One potential limitation of our analysis is that a majority of chocolate bars have cocoa percentages in the 70% range. There are fewer chocolate bars with cocoa percentages lower than 50% than there are bars with 100% cocoa. This unequal distribution of the percentage of cocoa in the bars may bias the relationship between cocoa percentage and chocolate bar rating, possibly affecting the observed slope between the two variables.

Additionally, the relationship between cocoa percentage and rating may not be linear. So, fitting a straight line in our regression model may not describe the relationship between rating and cocoa percentage well.

Another limitation of our analysis is that we grouped a variety of bean types in the category “other” to compare 4 levels instead of 12. Individual beans may have a certain relationship with rating, but grouping them all together in one level may show a different relationship (e.g. Simpson’s paradox).

In addition, there were different types of Criollo and Forastero beans (e.g. Forastero (Arriba), Criollo (Porcelana)), but in our analysis we grouped all Forastero beans together and all Criollo beans together. Therefore, we do not know if the specific type of Forastero bean may contain an outlier that influences the chocolate rating.

3.2 Non-statistical Interpretation

Under this model, as chocolate becomes darker (contains a higher percentage of cocoa), the rating of chocolate made with Forastero beans drops more quickly than that of the other three bean types.


4 Inference for Multiple Regression

\(\bf{Residual~Analysis}\)

\(Linearity:\) Although the spread of points is wide around the regression line, there is still a linear relationship between variables cocoa percentage and chocolate bar rating for all four bean types. There are no drastic patterns observed.

\(Residual~Scatter~Plot:\) Plots show constant spread across all bean types, with no obvious patterns, extreme outliers, or heteroskedasticity.

\(Residual~Histogram:\) Plot is roughly normal.

All the assumptions for inference are met.
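
To make the quantity being plotted concrete: a residual is the observed rating minus the rating fitted by the model. The Python sketch below computes residuals for the six rows in the raw data glimpse, using the rounded coefficients from the regression table. It is an illustration only: the residual plots themselves were based on the full dataset and unrounded coefficients, and the helper name `fitted_rating` is hypothetical.

```python
# (intercept, slope) per bean type, from the rounded regression table:
# baseline Criollo plus each bean type's intercept and slope offsets.
LINES = {
    "Criollo": (3.797, -0.007),
    "Forastero": (3.797 + 1.070, -0.007 - 0.017),
    "other": (3.797 + 0.120, -0.007 - 0.003),
    "Trinitario": (3.797 + 0.026, -0.007 - 0.001),
}

# (ID, cocoa_percent, rating, bean_type_new) from "Raw Data Glimpse".
glimpse = [
    (1, 63, 3.75, "other"),
    (2, 70, 2.75, "other"),
    (3, 70, 3.00, "other"),
    (4, 70, 3.50, "other"),
    (5, 70, 3.50, "other"),
    (6, 70, 2.75, "Criollo"),
]


def fitted_rating(cocoa_percent, bean_type):
    """Rating predicted by the interaction model for one bar."""
    intercept, slope = LINES[bean_type]
    return intercept + slope * cocoa_percent


# residual = observed rating - fitted rating
residuals = {
    bar_id: rating - fitted_rating(cocoa, bean)
    for bar_id, cocoa, rating, bean in glimpse
}

for bar_id, resid in residuals.items():
    print(f"ID {bar_id}: residual = {resid:+.3f}")
```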

\(\bf{Hypotheses,~CI,~P-values,~and~Conclusions}\)

\(Slope~of~cocoa~Percent\)

  • \(H_{0}:\) cocoa percent slope = 0

  • \(H_{A}:\) cocoa percent slope ≠ 0

  • \(CI:\) [-0.020, 0.005]

Since the p-value (0.247) is greater than 0.05, we fail to reject the null hypothesis. The confidence interval includes 0. Therefore, we do not have sufficient evidence of a relationship between cocoa percentage and chocolate bar rating for Criollo beans.

\(Slope~Bump~for~Forastero\)

  • \(H_{0}:\) Bump over Criollo slope for Forastero = 0

  • \(H_{A}:\) Bump over Criollo slope for Forastero ≠ 0

  • \(CI:\) [-0.032, -0.002]

Since the p-value (0.025) is less than 0.05, we reject the null hypothesis. The confidence interval does not include 0. Therefore, the relationship between cocoa percentage and chocolate bar rating is significantly different (steeper) for Forastero beans than for Criollo beans.

\(Slope~Bump~for~Trinitario\)

  • \(H_{0}:\) Bump over Criollo slope for Trinitario = 0

  • \(H_{A}:\) Bump over Criollo slope for Trinitario ≠ 0

  • \(CI:\) [-0.015, 0.014]

Since the p-value (0.932) is greater than 0.05, we fail to reject the null hypothesis. The confidence interval includes 0. Therefore, we do not have evidence that the relationship between cocoa percentage and chocolate bar rating differs between Trinitario and Criollo beans.

Only Forastero has a significantly steeper (more negative) slope than Criollo. The other bean types do not show a statistically significant difference in slope compared to Criollo.
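
The decision rule used in the tests above (a slope term is significant at the 5% level when its 95% confidence interval excludes 0) can be applied mechanically to the slope-related rows of the regression table. The Python sketch below does so. The intervals are copied from the table, and the helper name `excludes_zero` is our own.

```python
# (conf_low, conf_high) for the slope-related terms, from the regression table.
slope_terms = {
    "cocoa_percent": (-0.020, 0.005),
    "cocoa_percent:bean_type_newForastero": (-0.032, -0.002),
    "cocoa_percent:bean_type_newother": (-0.017, 0.010),
    "cocoa_percent:bean_type_newTrinitario": (-0.015, 0.014),
}


def excludes_zero(conf_low, conf_high):
    """True when the interval lies entirely above or entirely below 0."""
    return conf_low > 0 or conf_high < 0


significant = [
    term for term, (low, high) in slope_terms.items() if excludes_zero(low, high)
]

print("terms whose 95% CI excludes 0:", significant)
```

Only the Forastero interaction term passes this check, which matches the p-values reported in the table.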


5 Conclusion

We set out to address whether chocolate bars with certain cocoa percentages and bean types have higher ratings than other chocolate bars. We found that as cocoa percentage increases (chocolate becomes darker), the chocolate bar rating decreases across all bean types. Based on the multiple regression model and the confidence intervals and p-values, it seems that Forastero shows a greater drop in chocolate bar rating than the rest of the bean types in the study. In addition, the bean types Criollo, other, and Trinitario seem to show similar decreases in rating as their cocoa percentage increases.

Therefore, we conclude that increasing cocoa percentage is associated with a decrease in ratings. Besides Forastero, there is minimal difference in ratings across bean types Criollo, Trinitario, and other.

One caveat of our analysis is that three of the levels in the categorical variable bean type are relative to baseline Criollo. Therefore, our analysis is always relative to Criollo and we cannot make absolute claims about chocolate bar ratings. Another limitation is that this is one person’s rating. Because taste in chocolate may differ among different people, we cannot generalize the rating into recommendations for other people.

If we were to collect more data, we could analyze the bean types in the “other” level to see if their ratings are related to cocoa percentage. Another future direction could be investigating, via experiment, whether a causal relationship exists between bean type, cocoa percentage, and chocolate bar rating. Because other factors can influence the ratings, we cannot make causal claims using the data we now have. In addition, it would be interesting to have another chocolate connoisseur taste the chocolate bars in the study to see how personal preference could influence ratings.

For a chocolate bar company, our analysis may be useful feedback about consumer tastes in chocolate bars, possibly allowing the company to settle on an ideal cocoa percentage range for each bean type. Based on this analysis, the company could conduct an experiment to determine whether cocoa percentage or bean type has a direct impact on chocolate bar ratings.


Supplementary Materials

Our Different Bean Types!
