The research question we will be addressing with our multiple regression model is: What is the relationship between (a) a chocolate bar’s rating and (b) its bean type and (c) its cocoa percent? Here, (a) is the outcome variable, (b) is the categorical explanatory/predictor variable, and (c) is the numerical explanatory/predictor variable.
This means that we would like to find out how a chocolate bar’s rating depends on the type of bean(s) it is made from and the percentage of cocoa it contains. In general terms, do people prefer chocolate that is darker and has a higher cocoa percent (more bitter), or chocolate that is sweeter and has a lower cocoa percent (less bitter)?
This data set, called “Flavors of Cacao”, was imported from [Kaggle](https://www.kaggle.com/rtatman/chocolate-bar-ratings/data).
It contains variables of type “string” as well as variables of type “numeric”. These ratings were compiled by Brady Brelinski, the Founding Member of the Manhattan Chocolate Society.
One limitation is that the data in this set focuses on plain, dark chocolate, while other types of chocolate, such as milk and white chocolate, have not been taken into consideration. Thus, we do not know how other types of chocolate would be rated compared to dark chocolate. Another limitation is that the data set has not been updated with recent information, so it may not reflect bars released or rated more recently and should not be relied on for current statistics.
| observation | beantype | cocoapercent | rating |
|---|---|---|---|
| 1 | Forastero (Arriba) | 0.55 | 2.75 |
| 2 | Forastero (Arriba) | 0.70 | 3.00 |
| 3 | Forastero | 0.75 | 2.75 |
| 4 | Forastero (Nacional) | 0.70 | 3.50 |
| 5 | Criollo, Trinitario | 0.70 | 3.50 |
| 6 | Forastero | 0.70 | 3.50 |
Each row in the data set is a single rated chocolate bar, recorded with its bean type, cocoa percent, and rating. Each point in the scatter plot represents the rating of a single chocolate bar, with its bean type shown using color.
| term | estimate | std_error | statistic | p_value | conf_low | conf_high |
|---|---|---|---|---|---|---|
| intercept | 4.892 | 0.303 | 16.162 | 0.000 | 4.296 | 5.488 |
| cocoapercent | -2.155 | 0.410 | -5.253 | 0.000 | -2.962 | -1.347 |
| beantypeCriollo, Trinitario | -0.041 | 0.110 | -0.376 | 0.707 | -0.257 | 0.175 |
| beantypeForastero | -0.239 | 0.093 | -2.568 | 0.011 | -0.422 | -0.056 |
| beantypeForastero (Arriba) | -0.485 | 0.111 | -4.352 | 0.000 | -0.704 | -0.265 |
| beantypeForastero (Nacional) | -0.049 | 0.103 | -0.481 | 0.631 | -0.252 | 0.153 |
Our multiple regression model is a parallel slopes model. We have chosen this model because the confidence intervals for all the interaction terms include 0. This suggests that we have no evidence of an interaction effect. So, to avoid adding unnecessary complexity to our model, we have used the parallel slopes model. The outcome variable (y) is “rating”, the numerical explanatory variable (x) is “cocoapercent”, and the categorical explanatory variable (x’) is “beantype”.
In our regression model, the bean type “Blend” has been treated as the baseline since it comes first alphabetically.
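Written out with the rounded estimates from the regression table, the fitted parallel slopes model looks roughly as follows. The indicator notation \(\mathbb{1}_{\text{type}}\) is our own shorthand (it equals 1 when the bar is of that bean type and 0 otherwise); for the baseline “Blend”, all four indicators are 0.

\[ \widehat{\text{rating}} = 4.892 - 2.155 \cdot \text{cocoapercent} - 0.041 \cdot \mathbb{1}_{\text{Criollo, Trinitario}} - 0.239 \cdot \mathbb{1}_{\text{Forastero}} - 0.485 \cdot \mathbb{1}_{\text{Forastero (Arriba)}} - 0.049 \cdot \mathbb{1}_{\text{Forastero (Nacional)}} \]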
According to our exploratory data analysis visualizations, we see through the downward sloping lines of best fit that there is an associated decrease in ratings as the cocoa percent for a chocolate bar increases. However, the predicted rating of a chocolate bar with cocoa percent 0 varies for each bean type. This relative difference is shown by the bean type offsets displayed in the regression table, each of which is added to the baseline intercept. For example, the intercept of the line of best fit for bean type Criollo, Trinitario is 4.892 - 0.041 = 4.851, compared to the intercept of the line of best fit for bean type Blend, which is 4.892. Since we are using a parallel slopes model and each line of best fit has the same slope, this indicates that the predicted ratings for chocolate bars of bean type Criollo, Trinitario are slightly lower than the predicted ratings for chocolate bars of bean type Blend at every cocoa percent.
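The intercept of each line of best fit can be recovered from the regression table by adding the bean type's offset to the baseline intercept. The following is a minimal Python sketch of that arithmetic, assuming the rounded estimates in the table; the names `INTERCEPT`, `OFFSETS`, and `intercept_for` are hypothetical and are not part of our original analysis.

```python
# Estimates copied from the regression table (rounded to three decimals).
INTERCEPT = 4.892  # intercept for the baseline bean type, Blend

# Offset of each bean type relative to the baseline (Blend has offset 0).
OFFSETS = {
    "Blend": 0.0,
    "Criollo, Trinitario": -0.041,
    "Forastero": -0.239,
    "Forastero (Arriba)": -0.485,
    "Forastero (Nacional)": -0.049,
}


def intercept_for(beantype: str) -> float:
    """Intercept of the fitted line for one bean type: baseline plus offset."""
    return INTERCEPT + OFFSETS[beantype]


intercepts = {bean: round(intercept_for(bean), 3) for bean in OFFSETS}

# Print bean types from highest to lowest intercept.
for bean, value in sorted(intercepts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{bean:22s} {value:.3f}")
```

Because the slope is shared, the ordering of these intercepts is also the ordering of the predicted ratings at any fixed cocoa percent.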
Limitations of our analysis:
The model suggests that chocolate bar ratings are higher for less bitter chocolates (chocolate with lower cocoa content). This is true for all five bean types we are analyzing in this model. The rate at which the chocolate bar ratings decrease is the same for each bean type, which means that, at any given cocoa percent, predicted ratings are highest for the bean type with the highest intercept (the predicted rating for a bar with no cocoa) and lowest for the bean type with the lowest intercept.
| ID | rating | cocoapercent | beantype | rating_hat | residual |
|---|---|---|---|---|---|
| 1 | 2.75 | 0.55 | Forastero (Arriba) | 3.222 | -0.472 |
| 2 | 3.00 | 0.70 | Forastero (Arriba) | 2.899 | 0.101 |
| 3 | 2.75 | 0.75 | Forastero | 3.037 | -0.287 |
| 4 | 3.50 | 0.70 | Forastero (Nacional) | 3.334 | 0.166 |
| 5 | 3.50 | 0.70 | Criollo, Trinitario | 3.342 | 0.158 |
| 6 | 3.50 | 0.70 | Forastero | 3.145 | 0.355 |
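The fitted values and residuals in this table can be reproduced approximately from the regression table. Below is a minimal Python sketch, assuming the rounded estimates shown there; the helper `predict` is a hypothetical name, and because the coefficients are rounded, the results may differ from the table in the third decimal place.

```python
# Estimates copied from the regression table (rounded to three decimals).
INTERCEPT = 4.892   # baseline bean type: Blend
SLOPE = -2.155      # shared slope for cocoapercent (parallel slopes)
OFFSETS = {
    "Blend": 0.0,
    "Criollo, Trinitario": -0.041,
    "Forastero": -0.239,
    "Forastero (Arriba)": -0.485,
    "Forastero (Nacional)": -0.049,
}


def predict(cocoapercent: float, beantype: str) -> float:
    """Predicted rating under the parallel slopes model."""
    return INTERCEPT + SLOPE * cocoapercent + OFFSETS[beantype]


# (ID, rating, cocoapercent, beantype) for the six rows shown in the table.
rows = [
    (1, 2.75, 0.55, "Forastero (Arriba)"),
    (2, 3.00, 0.70, "Forastero (Arriba)"),
    (3, 2.75, 0.75, "Forastero"),
    (4, 3.50, 0.70, "Forastero (Nacional)"),
    (5, 3.50, 0.70, "Criollo, Trinitario"),
    (6, 3.50, 0.70, "Forastero"),
]

fitted = {}
for obs_id, rating, cocoa, bean in rows:
    rating_hat = predict(cocoa, bean)
    residual = rating - rating_hat  # observed minus predicted
    fitted[obs_id] = (rating_hat, residual)
    print(f"{obs_id}  rating_hat={rating_hat:.3f}  residual={residual:+.3f}")
```

A negative residual means the bar was rated lower than the model predicts; a positive residual means it was rated higher.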
Based on the pattern of the residuals in the scatterplot, we conclude that our model does not satisfy the conditions for statistical inference. We have nevertheless analyzed the p-values and confidence intervals below, although we would not normally go on to conduct this analysis, because inference based on a model that violates these conditions cannot be fully trusted.
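We are 95% confident that the slope of rating over cocoa percent lies between -2.962 and -1.347. Since this interval lies entirely below 0, we have evidence that the slope is negative: for bars of the same bean type, a higher cocoa percent is associated with a lower rating. Because cocoapercent is recorded as a proportion, the estimate of -2.155 corresponds to a decrease of about 0.22 rating points for every additional 10 percentage points of cocoa.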
We are 95% confident that the difference in intercept between bean type “Forastero” and the baseline “Blend” lies between -0.422 and -0.056. Since this interval does not include 0, we have evidence that Forastero bars are rated lower than Blend bars with the same cocoa percent. Since we are using a parallel slopes model, this difference is the same at every cocoa percent: the predicted rating for a Forastero bar is 0.239 lower than the predicted rating for a Blend bar.
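As a rough check on these intervals, each one can be approximated as the estimate plus or minus a critical value times the standard error, and then tested for whether it contains 0. The Python sketch below assumes a normal critical value (about 1.96) in place of the t critical value that produced the table, so the reconstructed bounds only agree with the table to about two decimal places; `approx_ci` and `excludes_zero` are hypothetical helper names.

```python
from statistics import NormalDist

# (estimate, std_error, conf_low, conf_high) copied from the regression table.
TABLE = {
    "intercept": (4.892, 0.303, 4.296, 5.488),
    "cocoapercent": (-2.155, 0.410, -2.962, -1.347),
    "beantypeCriollo, Trinitario": (-0.041, 0.110, -0.257, 0.175),
    "beantypeForastero": (-0.239, 0.093, -0.422, -0.056),
    "beantypeForastero (Arriba)": (-0.485, 0.111, -0.704, -0.265),
    "beantypeForastero (Nacional)": (-0.049, 0.103, -0.252, 0.153),
}

# Assumption: normal critical value for a 95% interval instead of a t value.
Z = NormalDist().inv_cdf(0.975)


def approx_ci(estimate: float, std_error: float) -> tuple[float, float]:
    """Approximate 95% confidence interval: estimate +/- Z * std_error."""
    return (estimate - Z * std_error, estimate + Z * std_error)


def excludes_zero(low: float, high: float) -> bool:
    """True when the interval lies entirely above or entirely below 0."""
    return low > 0 or high < 0


results = {}
for term, (est, se, low, high) in TABLE.items():
    a_low, a_high = approx_ci(est, se)
    results[term] = (a_low, a_high, excludes_zero(low, high))
    print(f"{term:30s} approx=({a_low:.3f}, {a_high:.3f})  "
          f"table=({low:.3f}, {high:.3f})  excludes 0: {results[term][2]}")
```

An interval that excludes 0 is the relevant check here; the point estimate always falls inside its own confidence interval, so that fact alone carries no information.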
Setting up the hypothesis for the Forastero offset (the difference in intercept between “Forastero” and the baseline “Blend”),
\[ \begin{align} & H_0: \beta_{\text{Forastero}} = 0\\ \text{vs } & H_A : \beta_{\text{Forastero}} \neq 0 \end{align} \]
Taking an alpha significance value of 0.05, the p value for the Forastero offset is 0.011. Since the p value is less than alpha, we reject the null hypothesis. So we can say that, at the same cocoa percent, there is a statistically significant difference between the ratings of Forastero bars and Blend bars.
Setting up the hypothesis for the slope,
\[ \begin{align} & H_0: \beta_1 = 0\\ \text{vs } & H_A : \beta_1 \neq 0 \end{align} \]
Taking an alpha significance value of 0.05, the p value for the slope of rating over cocoa percent, which is shared by every line of best fit in the parallel slopes model, is reported as 0.000 (that is, less than 0.001). Since the p value is less than alpha, we reject the null hypothesis. So we can say that there is a statistically significant relationship between rating and cocoa percent after accounting for bean type.
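The test statistics and p values in the regression table can be approximated by dividing each estimate by its standard error and computing a two-sided tail probability. The Python sketch below assumes a normal approximation rather than the t distribution that produced the table, so its p values are close to, but not exactly, the reported ones; `two_sided_p` is a hypothetical helper name.

```python
from statistics import NormalDist

ALPHA = 0.05  # significance level used in our analysis

# (estimate, std_error) copied from the regression table.
TERMS = {
    "cocoapercent": (-2.155, 0.410),
    "beantypeCriollo, Trinitario": (-0.041, 0.110),
    "beantypeForastero": (-0.239, 0.093),
    "beantypeForastero (Arriba)": (-0.485, 0.111),
    "beantypeForastero (Nacional)": (-0.049, 0.103),
}


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p value for H0: coefficient = 0 (normal approximation)."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = {term: two_sided_p(est, se) for term, (est, se) in TERMS.items()}

for term, p in p_values.items():
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{term:30s} p ~ {p:.3f}  -> {decision}")
```

Under this approximation the slope and the Forastero and Forastero (Arriba) offsets fall below alpha, while the Criollo, Trinitario and Forastero (Nacional) offsets do not, which matches the pattern of p values in the table.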
Note: Here, we take an alpha significance value of 0.05. We chose this conventional value rather than a stricter one because chocolate ratings are completely subjective; to account for this subjectivity, we are taking a relatively liberal value.
Based on our analysis, we can conclude that for all chocolate bars made with one of the five bean types we have included in our data, the ratings for these chocolate bars decrease as the chocolate becomes more bitter, or as the cocoa percent in the chocolate bar increases. From this, we can take away that people prefer sweeter over bitter chocolate for bars made from these five bean types.
Limitations and caveats:
In the future, we can refine our analysis by including more bean types than just the five we have included here. In addition, since we know from our residual analysis that our linear model does not fit the data, we would run another, more fitting model to conduct a better analysis.
This is a graph imported from Kaggle that shows where the best cocoa beans in the world are grown.