1 Introduction

The research question we address with our multiple regression model is: What is the relationship between (a) a chocolate bar’s rating and (b) its bean type and (c) its cocoa percent? Here, (a) is the outcome variable, (b) is the categorical explanatory variable, and (c) is the numerical explanatory variable.

This means that we would like to find out how a chocolate bar's rating relates to the type of bean(s) it is made from and the percentage of cocoa it contains. In general terms, do people prefer chocolate that is darker and has a higher cocoa percent (more bitter), or do people prefer chocolate that is sweeter and has a lower cocoa percent (less bitter)?

This data set, called “Flavors of Cacao”, was imported from [Kaggle](https://www.kaggle.com/rtatman/chocolate-bar-ratings/data).

It contains variables of type “string” as well as variables of type “numeric”. These ratings were compiled by Brady Brelinski, the Founding Member of the Manhattan Chocolate Society.

One limitation is that the data set focuses on plain, dark chocolate; other types of chocolate, such as milk and white chocolate, are not included. Thus, we do not know how other types of chocolate would be rated compared to dark chocolate. Another limitation is that the data set has not been updated with recent information, so it may not reflect current ratings and cannot be relied on for current statistics.


2 Exploratory data analysis

| observation | beantype | cocoapercent | rating |
|---|---|---|---|
| 1 | Forastero (Arriba) | 0.55 | 2.75 |
| 2 | Forastero (Arriba) | 0.70 | 3.00 |
| 3 | Forastero | 0.75 | 2.75 |
| 4 | Forastero (Nacional) | 0.70 | 3.50 |
| 5 | Criollo, Trinitario | 0.70 | 3.50 |
| 6 | Forastero | 0.70 | 3.50 |

Each row in the data set is a single chocolate bar. Each point in the scatter plot represents the rating of a single chocolate bar, and its color shows the bar's bean type.


3 Multiple regression

| term | estimate | std_error | statistic | p_value | conf_low | conf_high |
|---|---|---|---|---|---|---|
| intercept | 4.892 | 0.303 | 16.162 | 0.000 | 4.296 | 5.488 |
| cocoapercent | -2.155 | 0.410 | -5.253 | 0.000 | -2.962 | -1.347 |
| beantypeCriollo, Trinitario | -0.041 | 0.110 | -0.376 | 0.707 | -0.257 | 0.175 |
| beantypeForastero | -0.239 | 0.093 | -2.568 | 0.011 | -0.422 | -0.056 |
| beantypeForastero (Arriba) | -0.485 | 0.111 | -4.352 | 0.000 | -0.704 | -0.265 |
| beantypeForastero (Nacional) | -0.049 | 0.103 | -0.481 | 0.631 | -0.252 | 0.153 |

Our multiple regression model is a parallel slopes model. We have chosen this model because, in the corresponding interaction model, the confidence intervals for all the interaction terms include 0. This is consistent with there being no interaction effect, so we have used the simpler parallel slopes model to avoid unnecessary complexity. The outcome variable (y) is “rating”, the numerical explanatory variable (x) is “cocoapercent” and the categorical explanatory variable (x’) is “beantype”.

3.1 Statistical interpretation

  1. Beantype Blend: \(\widehat{Rating} = 4.892 - 2.155 \times cocoapercent\)
  2. Beantype Criollo, Trinitario: \(\widehat{Rating} = (4.892 - 0.041) - 2.155 \times cocoapercent = 4.851 - 2.155 \times cocoapercent\)
  3. Beantype Forastero: \(\widehat{Rating} = (4.892 - 0.239) - 2.155 \times cocoapercent = 4.653 - 2.155 \times cocoapercent\)
  4. Beantype Forastero (Arriba): \(\widehat{Rating} = (4.892 - 0.485) - 2.155 \times cocoapercent = 4.407 - 2.155 \times cocoapercent\)
  5. Beantype Forastero (Nacional): \(\widehat{Rating} = (4.892 - 0.049) - 2.155 \times cocoapercent = 4.843 - 2.155 \times cocoapercent\)
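
The five equations above can be reproduced from the regression table with a few lines of Python. This is only a minimal sketch: the names `INTERCEPT`, `SLOPE`, `OFFSETS`, and `predict_rating` are hypothetical helpers introduced for illustration, and the coefficients are the rounded values from the table, so results may differ from the fitted model in the third decimal place.

```python
# Coefficients copied from the regression table (rounded to 3 decimals).
INTERCEPT = 4.892   # intercept for the baseline bean type "Blend"
SLOPE = -2.155      # common slope for cocoapercent (parallel slopes)

# Difference in intercept relative to "Blend" for each bean type.
OFFSETS = {
    "Blend": 0.0,
    "Criollo, Trinitario": -0.041,
    "Forastero": -0.239,
    "Forastero (Arriba)": -0.485,
    "Forastero (Nacional)": -0.049,
}


def predict_rating(beantype: str, cocoapercent: float) -> float:
    """Predicted rating under the parallel slopes model."""
    return INTERCEPT + OFFSETS[beantype] + SLOPE * cocoapercent


# Print the intercept of each bean type's line of best fit.
for beantype, offset in OFFSETS.items():
    print(f"{beantype}: rating_hat = {INTERCEPT + offset:.3f} "
          f"- {abs(SLOPE):.3f} * cocoapercent")

# Example prediction for a Forastero (Arriba) bar with cocoapercent 0.55.
print(f"{predict_rating('Forastero (Arriba)', 0.55):.3f}")
```

Storing the offsets in a dictionary keeps the baseline ("Blend", offset 0) explicit, which mirrors how the regression table reports every other bean type relative to it.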

In our regression model, the bean type “Blend” has been treated as the baseline since it comes first alphabetically.

  • Intercept = 4.892: It is the predicted rating of a chocolate bar of bean type “Blend” (the baseline) with cocoa percent 0.
  • cocoapercent = -2.155: It is the slope for cocoa percent, which in a parallel slopes model is shared by all bean types. It is the associated change in predicted rating for a one-unit increase in cocoapercent.
  • beantypeForastero = -0.239: It is the difference in intercept between this bean type and bean type “Blend” (the baseline).
  • beantypeCriollo, Trinitario = -0.041: It is the difference in intercept between this bean type and bean type “Blend” (the baseline).
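
Because cocoapercent is recorded as a proportion (for example, 0.70 for a 70% bar), a one-unit increase would span the whole range from 0% to 100% cocoa. As a worked example, rescaling the slope to a more realistic increase of 10 percentage points (0.10 units) gives:

\[ -2.155 \times 0.10 = -0.2155 \approx -0.22 \]

So, under this model, a bar with 10 percentage points more cocoa has a predicted rating about 0.22 lower, whatever its bean type.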

According to our exploratory data analysis visualizations, the downward-sloping lines of best fit show an associated decrease in ratings as the cocoa percent of a chocolate bar increases. However, the predicted rating at cocoa percent 0 differs by bean type. This relative difference is shown by the intercept offsets in the regression table. For example, the intercept of the line of best fit for bean type Criollo, Trinitario is 4.851, compared with 4.892 for bean type Blend. Since we are using a parallel slopes model and each line of best fit has the same slope, the predicted rating for a Criollo, Trinitario bar is 0.041 lower than that for a Blend bar at any given cocoa percent.

Limitations of our analysis:

  1. Our outcome variable, “Rating”, is subjective and can be influenced by people’s biases and surroundings.
  2. In the data set, some of the chocolate bar names did not have a bean type associated with them so we had to remove those chocolate bars from our analysis. Removing missing data makes our exploratory analysis less accurate and less reliable.

3.2 Non-statistical interpretation

The model suggests that chocolate bar ratings are higher for less bitter chocolates (chocolates with lower cocoa content). This is true for all five bean types we are analyzing in this model. The rate at which the ratings decrease is the same for each bean type, so at any given cocoa percent the predicted rating is highest for the bean type with the highest intercept and lowest for the bean type with the lowest intercept.
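
To illustrate why the ordering of bean types does not depend on cocoa percent, the sketch below compares two bean types at several cocoa levels and then ranks all five. The helper names (`OFFSETS`, `predicted`, `ranking`) and the chosen cocoa levels are assumptions made for illustration; the coefficients are the rounded values from the regression table.

```python
INTERCEPT = 4.892   # baseline intercept ("Blend"), from the regression table
SLOPE = -2.155      # common slope for cocoapercent

OFFSETS = {
    "Blend": 0.0,
    "Criollo, Trinitario": -0.041,
    "Forastero": -0.239,
    "Forastero (Arriba)": -0.485,
    "Forastero (Nacional)": -0.049,
}


def predicted(beantype, cocoapercent):
    """Predicted rating under the parallel slopes model."""
    return INTERCEPT + OFFSETS[beantype] + SLOPE * cocoapercent


# The gap between two bean types is the same at every cocoa level,
# because the slope term cancels when the two predictions are subtracted.
gaps = [
    predicted("Blend", c) - predicted("Forastero (Arriba)", c)
    for c in (0.55, 0.70, 0.85)   # illustrative cocoa levels
]
print([round(g, 3) for g in gaps])

# Ranking by intercept offset therefore gives the ranking at any cocoa level.
ranking = sorted(OFFSETS, key=OFFSETS.get, reverse=True)
print(ranking)
```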


4 Inference for multiple regression

| ID | rating | cocoapercent | beantype | rating_hat | residual |
|---|---|---|---|---|---|
| 1 | 2.75 | 0.55 | Forastero (Arriba) | 3.222 | -0.472 |
| 2 | 3.00 | 0.70 | Forastero (Arriba) | 2.899 | 0.101 |
| 3 | 2.75 | 0.75 | Forastero | 3.037 | -0.287 |
| 4 | 3.50 | 0.70 | Forastero (Nacional) | 3.334 | 0.166 |
| 5 | 3.50 | 0.70 | Criollo, Trinitario | 3.342 | 0.158 |
| 6 | 3.50 | 0.70 | Forastero | 3.145 | 0.355 |
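
Each residual in the table is the observed rating minus the fitted value (rating minus rating_hat). The short sketch below recomputes the residual column from the six rows shown; the variable names are hypothetical, and only the values printed in the table are used.

```python
# (ID, rating, rating_hat, reported residual), copied from the table above.
rows = [
    (1, 2.75, 3.222, -0.472),
    (2, 3.00, 2.899, 0.101),
    (3, 2.75, 3.037, -0.287),
    (4, 3.50, 3.334, 0.166),
    (5, 3.50, 3.342, 0.158),
    (6, 3.50, 3.145, 0.355),
]

recomputed = {}
for row_id, rating, rating_hat, reported in rows:
    # residual = observed value - fitted value
    residual = round(rating - rating_hat, 3)
    recomputed[row_id] = residual
    print(f"ID {row_id}: residual = {residual:+.3f} (table: {reported:+.3f})")
```

A negative residual means the model overpredicted the rating for that bar, and a positive residual means it underpredicted.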

Based on the pattern of the residuals in the scatterplot, we conclude that our model does NOT satisfy the conditions for statistical inference. We have nevertheless analyzed the p-values and confidence intervals below, although we would not normally do so, because inference from a model that violates these conditions may not be reliable.

We are 95% confident that the slope of rating over cocoa percent lies between -2.96 and -1.35. Since this interval does not contain 0, the estimated slope of -2.155 is statistically significantly different from 0.

We are 95% confident that the difference in intercept between bean type “Forastero” and the baseline “Blend”, that is, the difference in predicted rating at any given cocoa percent, lies between -0.422 and -0.056. Since this interval does not contain 0, the estimated difference of -0.239 is statistically significant. Since we are using a parallel slopes model, the predicted rating for a Forastero bar is 0.239 lower than the predicted rating for a Blend bar at every cocoa percent.
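
A 95% confidence interval that excludes 0 corresponds to a term that is significant at the 5% level. The sketch below applies that check to every row of the regression table and also recovers the multiplier implied by each interval (half-width divided by standard error). The helper names are hypothetical, and the inputs are the rounded table values, so the implied multipliers are only approximate.

```python
# (estimate, std_error, conf_low, conf_high), copied from the regression table.
TABLE = {
    "intercept": (4.892, 0.303, 4.296, 5.488),
    "cocoapercent": (-2.155, 0.410, -2.962, -1.347),
    "beantypeCriollo, Trinitario": (-0.041, 0.110, -0.257, 0.175),
    "beantypeForastero": (-0.239, 0.093, -0.422, -0.056),
    "beantypeForastero (Arriba)": (-0.485, 0.111, -0.704, -0.265),
    "beantypeForastero (Nacional)": (-0.049, 0.103, -0.252, 0.153),
}


def excludes_zero(conf_low, conf_high):
    """True when the interval lies entirely above or entirely below 0."""
    return conf_low > 0 or conf_high < 0


def implied_multiplier(std_error, conf_low, conf_high):
    """Half-width of the interval expressed in standard errors."""
    return (conf_high - conf_low) / (2 * std_error)


significant = {}
multipliers = {}
for term, (estimate, se, lo, hi) in TABLE.items():
    significant[term] = excludes_zero(lo, hi)
    multipliers[term] = implied_multiplier(se, lo, hi)
    print(f"{term}: [{lo}, {hi}] excludes 0: {significant[term]}, "
          f"multiplier ~ {multipliers[term]:.2f}")
```

The implied multipliers come out close to 2, as expected for 95% intervals of the form estimate ± (critical value) × standard error.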

Setting up the hypothesis test for the difference in intercept between bean type “Forastero” and the baseline “Blend”, denoted here by \(\beta_0\):

\[ \begin{aligned} & H_0: \beta_0 = 0\\ \text{vs } & H_A : \beta_0 \neq 0 \end{aligned} \]

Taking a significance level of alpha = 0.05, the p-value for the beantypeForastero term is 0.011. Since the p-value is less than alpha, we reject the null hypothesis. So we can say that the intercept for bean type “Forastero” differs significantly from the intercept for bean type “Blend”.

Setting up the hypothesis test for the slope of rating over cocoa percent, denoted here by \(\beta_1\):

\[ \begin{aligned} & H_0: \beta_1 = 0\\ \text{vs } & H_A : \beta_1 \neq 0 \end{aligned} \]

Taking a significance level of alpha = 0.05, the p-value for the slope of rating over cocoa percent, which is shared by every line of best fit in the parallel slopes model, is reported as 0.000 (that is, less than 0.001). Since the p-value is less than alpha, we reject the null hypothesis. So we can say that there is a statistically significant relationship between rating and cocoa percent, after accounting for bean type.

Note: Here, we take a significance level of alpha = 0.05. We chose this value because chocolate ratings are subjective; to account for this subjectivity, we use a relatively liberal threshold.
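
The decision rule used in these tests (reject the null hypothesis when the p-value is below alpha) can be sketched as follows. Note the assumption: this sketch converts the reported test statistics to two-sided p-values with a standard normal approximation, whereas the p-values in the regression table presumably come from a t distribution, so the results agree only approximately. The helper name `two_sided_p_normal` is hypothetical.

```python
import math

ALPHA = 0.05

# Test statistics copied from the "statistic" column of the regression table.
STATISTICS = {
    "cocoapercent": -5.253,
    "beantypeCriollo, Trinitario": -0.376,
    "beantypeForastero": -2.568,
    "beantypeForastero (Arriba)": -4.352,
    "beantypeForastero (Nacional)": -0.481,
}


def two_sided_p_normal(z):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))


p_values = {term: two_sided_p_normal(z) for term, z in STATISTICS.items()}

for term, p in p_values.items():
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{term}: p ~ {p:.3f} -> {decision}")
```

With these inputs, the slope and the Forastero and Forastero (Arriba) offsets fall below alpha, while the Criollo, Trinitario and Forastero (Nacional) offsets do not, which matches the pattern of p-values in the regression table.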


5 Conclusion

Based on our analysis, we can conclude that for all chocolate bars made with one of the five bean types we have included in our data, the ratings for these chocolate bars decrease as the chocolate becomes more bitter, or as the cocoa percent in the chocolate bar increases. From this, we can take away that people prefer sweeter over bitter chocolate for bars made from these five bean types.

Limitations and caveats:

  1. The residuals do not have a constant spread, so our model does not satisfy the conditions for statistical inference. One possible reason is that our analysis is missing some variable. A linear model may not be sufficient for this analysis. For this reason, the p-values and confidence intervals we have computed should not be relied upon.
  2. Some of the bean types were grouped together based on similar properties to yield five bean types. For this reason, we have not included all bean types that were originally included in the data set. We have chosen five representative bean types to analyse here.
  3. The intercepts in our analysis have no practical interpretation because no chocolate bar can have a cocoa percent of zero.

In the future, we can refine our analysis by including more bean types than the five included here. In addition, since our residual analysis shows that our linear model does not fit the data well, we would fit another, better-suited model to conduct a better analysis.


6 Citations and References


7 Supplementary Materials

This graph, imported from Kaggle, shows where the best cocoa beans are grown in the world.