data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggplot2)
summary(data)
##       ref       company_manufacturer company_location    review_date  
##  Min.   :   5   Length:2530          Length:2530        Min.   :2006  
##  1st Qu.: 802   Class :character     Class :character   1st Qu.:2012  
##  Median :1454   Mode  :character     Mode  :character   Median :2015  
##  Mean   :1430                                           Mean   :2014  
##  3rd Qu.:2079                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:2530            Length:2530                      Min.   :0.4200  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7164  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:2530        Length:2530                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.196  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000
head(data)
##    ref company_manufacturer company_location review_date country_of_bean_origin
## 1 2454                 5150           U.S.A.        2019               Tanzania
## 2 2458                 5150           U.S.A.        2019     Dominican Republic
## 3 2454                 5150           U.S.A.        2019             Madagascar
## 4 2542                 5150           U.S.A.        2021                   Fiji
## 5 2546                 5150           U.S.A.        2021              Venezuela
## 6 2546                 5150           U.S.A.        2021                 Uganda
##   specific_bean_origin_or_bar_name cocoa_percent ingredients
## 1            Kokoa Kamili, batch 1          0.76    3- B,S,C
## 2                  Zorzal, batch 1          0.76    3- B,S,C
## 3           Bejofo Estate, batch 1          0.76    3- B,S,C
## 4            Matasawalevu, batch 1          0.68    3- B,S,C
## 5            Sur del Lago, batch 1          0.72    3- B,S,C
## 6         Semuliki Forest, batch 1          0.80    3- B,S,C
##      most_memorable_characteristics rating
## 1         rich cocoa, fatty, bready   3.25
## 2            cocoa, vegetal, savory   3.50
## 3      cocoa, blackberry, full body   3.75
## 4               chewy, off, rubbery   3.00
## 5 fatty, earthy, moss, nutty,chalky   3.00
## 6 mildly bitter, basic cocoa, fatty   3.25

1) Rating vs cocoa_percent
We could investigate whether there’s a correlation between the cocoa percentage in chocolate and the rating given in the review. This analysis could provide insight into whether higher cocoa percentages generally lead to higher ratings, or if there’s an optimal percentage range for better reviews.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Filter relevant columns
relevant_data <- data %>%
  select(cocoa_percent, rating)

# Calculate correlation coefficient
correlation <- cor(relevant_data$cocoa_percent, relevant_data$rating)

# Print correlation coefficient
print(paste("Correlation coefficient:", correlation))
## [1] "Correlation coefficient: -0.146689595080347"

The correlation coefficient of approximately -0.147 suggests a weak negative correlation between cocoa percentage and rating. This means that as cocoa percentage increases, the rating tends to slightly decrease, but the relationship is not very strong.
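A point estimate alone does not say how precisely the correlation is known or how much variance it accounts for. The sketch below shows one way to get both with `cor.test()`. It runs on simulated stand-in data (the variable names mirror the dataset, but the values and the generating slope are assumptions made for illustration), so its numbers are not the chocolate results.

```r
# Simulated stand-in data, NOT the chocolate dataset (values are assumptions)
set.seed(1)
n <- 500
cocoa_percent <- runif(n, 0.42, 1.00)
rating <- 4 - 1.2 * cocoa_percent + rnorm(n, sd = 0.44)

# Pearson correlation with a significance test and a 95% confidence interval
ct <- cor.test(cocoa_percent, rating)
r <- unname(ct$estimate)

cat("r =", round(r, 3), "\n")
cat("95% CI:", round(ct$conf.int, 3), "\n")
# For a single predictor, r^2 equals the R-squared of the simple regression
cat("share of variance explained (r^2):", round(r^2, 3), "\n")
```

Applied to the coefficient reported above, the same squaring gives (-0.147)^2, or roughly 2% of the variance in rating.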

Visualization

library(ggplot2)

# Create a scatter plot
ggplot(data, aes(x = cocoa_percent, y = rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Add linear regression line
  labs(title = "Relationship between Cocoa Percentage and Rating",
       x = "Cocoa Percentage",
       y = "Rating")
## `geom_smooth()` using formula = 'y ~ x'

Data Point Distribution: The data points are widely scattered across the observed range of cocoa percentages and ratings. This is consistent with the weak correlation computed above and suggests there is no strong linear relationship between the two variables.

Regression Line:

The blue regression line in the plot has a very slight negative slope. This suggests a very weak tendency for higher cocoa percentage to be associated with slightly lower ratings. However, the spread of the data points around the line indicates that this is not a definitive trend, and there are many exceptions.

Overall, the data suggests no strong linear relationship between cocoa percentage and rating in this data set. While there might be a very slight negative association according to the regression line, the data points are scattered, and the strength of this association is likely weak.

2) Regional Variations in Chocolate Characteristics

We could analyze how chocolate characteristics vary based on the specific bean origin or country of origin. This could involve comparing the most memorable characteristics mentioned in reviews across different regions to identify any patterns or preferences.

library(dplyr)
library(tidyr)
library(ggplot2)

# Data Preparation
relevant_data <- data %>%
  select(country_of_bean_origin, most_memorable_characteristics)

# Characteristics Extraction (Assuming characteristics are separated by commas)
relevant_data$most_memorable_characteristics <- strsplit(as.character(relevant_data$most_memorable_characteristics), ",\\s*")

# Unnest the characteristics into separate rows
unnested_data <- unnest(relevant_data, most_memorable_characteristics)

# Aggregation by country of bean origin
aggregated_data <- unnested_data %>%
  group_by(country_of_bean_origin, most_memorable_characteristics) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(country_of_bean_origin, desc(count))
# Filter out less common characteristics 
aggregated_data <- aggregated_data %>%
  filter(count > 10)  # Adjust threshold as needed

# Reorder bars within each facet based on frequency
aggregated_data <- aggregated_data %>%
  group_by(country_of_bean_origin) %>%
  mutate(most_memorable_characteristics = reorder(most_memorable_characteristics, count))

# Visualization
ggplot(aggregated_data, aes(x = most_memorable_characteristics, y = count, fill = country_of_bean_origin)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ country_of_bean_origin, scales = "free_y", ncol = 2) +
  labs(title = "Most Memorable Characteristics of Chocolates by Country of Bean Origin",
       x = "Most Memorable Characteristics",
       y = "Frequency",
       fill = "Country of Bean Origin") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  

Dominant Characteristics by Country:

  • Ecuador: The most frequent characteristic associated with Ecuadorian chocolate is “fruity,” followed by “floral” and “acidic.” This suggests that Ecuadorian chocolates are often described as bright and tangy.

  • Dominican Republic: The most common characteristic for Dominican Republic chocolate is “nutty,” followed by “bourbon” and “earthy.” This indicates that chocolates from this origin tend to be described as having a nutty, earthy taste profile.

  • Madagascar: The dominant characteristic is “fruity,” followed by “smoke” and “sandy.” Similar to Ecuador, Malagasy chocolates seem to be recognized for their fruity notes, but also for complexity in flavor.

  • Peru: “Fruity” is also the most frequent characteristic for Peruvian chocolate, followed by “cocoa.” This suggests Peruvian chocolates are described as having prominent fruit flavors alongside a strong chocolate taste.
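The counts behind these observations come from splitting each review's comma-separated descriptors and tallying them per country. The base-R sketch below reproduces that split-and-count step on a small hypothetical table (the three rows are invented for illustration) so the logic can be checked by hand. Note that the pattern `",\\s*"` also handles entries with no space after the comma, such as "nutty,chalky" in the data preview.

```r
# Toy rows in the same shape as the two columns used above (hypothetical values)
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Ecuador"),
  most_memorable_characteristics = c("fruity, cocoa", "fruity,nutty", "floral, fruity"),
  stringsAsFactors = FALSE
)

# Split each review on commas, allowing optional whitespace after the comma
split_terms <- strsplit(toy$most_memorable_characteristics, ",\\s*")

# One row per (country, term): repeat each country once per extracted term
long <- data.frame(
  country = rep(toy$country_of_bean_origin, lengths(split_terms)),
  term = trimws(unlist(split_terms)),
  stringsAsFactors = FALSE
)

# Count how often each term occurs within each country
counts <- as.data.frame(table(long$country, long$term), stringsAsFactors = FALSE)
names(counts) <- c("country", "term", "count")
counts <- counts[counts$count > 0, ]
print(counts[order(counts$country, -counts$count), ])
```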

Hypothesis Testing

Null Hypothesis: There is no difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

Alternative Hypothesis: There is a difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

For this hypothesis, we will use a two-sample t-test to compare the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

Neyman-Pearson Framework:

  • Test: Two-sample t-test

  • Alpha level: 0.05

  • Type II error rate (beta): 0.2 (power = 0.8)
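An alpha level and a power target only pin down a sample size once a minimum effect size is chosen. As a rough sketch, `power.t.test()` from the stats package can compute the per-group sample size. The planning values below (a difference of 0.11 rating points and a standard deviation of 0.44) are assumptions picked for illustration, not values fixed by the analysis plan.

```r
# Assumed planning values (illustrative only)
assumed_delta <- 0.11  # smallest difference in mean rating worth detecting
assumed_sd <- 0.44     # assumed within-group standard deviation of rating

# Solve for n per group given alpha = 0.05 and power = 0.80
res <- power.t.test(delta = assumed_delta, sd = assumed_sd,
                    sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

n_per_group <- ceiling(res$n)
print(res)
cat("Required sample size per group:", n_per_group, "\n")
```

Under these assumptions the requirement is a few hundred bars per group, which the 2530 reviews in the dataset comfortably exceed.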

# Subset the data into two groups based on cocoa percentage
above_70 <- data[data$cocoa_percent > 0.7, ]
below_or_equal_70 <-data[data$cocoa_percent <= 0.7, ]

# Perform two-sample t-test
t_test_result <- t.test(above_70$rating, below_or_equal_70$rating)

# Print the result
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  above_70$rating and below_or_equal_70$rating
## t = -6.2712, df = 2257.1, p-value = 4.279e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14750859 -0.07723218
## sample estimates:
## mean of x mean of y 
##  3.132741  3.245112
  1. The test statistic (t) is -6.2712.

  2. The degrees of freedom (df) are approximately 2257.1.

  3. The p-value is very small (4.279e-10), indicating strong evidence against the null hypothesis.

  4. The 95% confidence interval for the difference in means ranges from -0.1475 to -0.0772.

  5. The sample mean rating for chocolate bars with a cocoa percentage above 70% is approximately 3.133.

  6. The sample mean rating for chocolate bars with a cocoa percentage below or equal to 70% is approximately 3.245.

Since the p-value is less than the chosen significance level (alpha = 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference in mean ratings between the two groups. The negative t-value shows the direction: bars above 70% cocoa are rated lower on average, by about 0.11 points (3.133 versus 3.245).
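The reported quantities are tied together by t = difference / standard error, and the confidence interval is the difference plus or minus a t quantile times that standard error. The short check below rebuilds the interval from the printed numbers. All inputs are copied from the test output above; only the variable names are introduced here.

```r
# Values copied from the t.test() output above
mean_above <- 3.132741
mean_below <- 3.245112
t_stat <- -6.2712
df <- 2257.1
ci_reported <- c(-0.14750859, -0.07723218)

# Estimated difference in means (above 70% minus 70% or below)
diff_means <- mean_above - mean_below

# Standard error implied by t = diff / SE
se_diff <- diff_means / t_stat

# Rebuild the 95% confidence interval from the difference and its standard error
half_width <- qt(0.975, df) * se_diff
ci_rebuilt <- diff_means + c(-1, 1) * half_width

cat("difference:", round(diff_means, 4), "\n")
cat("standard error:", round(se_diff, 4), "\n")
cat("rebuilt 95% CI:", round(ci_rebuilt, 4), "\n")
```

The rebuilt interval should agree with the reported one up to rounding of the printed t statistic.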

Visualization to illustrate the results of hypothesis

# Create a box plot with one color per group and mark each group's mean
ggplot(data = data, aes(x = factor(cocoa_percent > 0.7), y = rating, fill = factor(cocoa_percent > 0.7))) +
  geom_boxplot() +
  # Red diamonds mark the group means that the t-test compares
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red") +
  scale_x_discrete(labels = c("FALSE" = "70% or below", "TRUE" = "Above 70%")) +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightgreen"), guide = "none") +
  labs(x = "Cocoa Percentage", y = "Rating") +
  theme_minimal() +
  ggtitle("Comparison of Ratings for Chocolate Bars") +
  theme(plot.title = element_text(hjust = 0.5))

  1. The x-axis shows the two groups: cocoa percentage below or equal to 70%, and above 70%.

  2. The y-axis represents the rating of chocolate bars.

  3. The box plots show the distribution of ratings for each group.

  4. The red diamonds mark each group's mean rating. The 95% confidence interval for the difference in means (-0.1475 to -0.0772) is on the scale of a rating difference, not a rating, so it is reported in the test output rather than drawn on the rating axis.

  5. The box colored light blue represents chocolate bars with a cocoa percentage below or equal to 70%, and the box colored light green represents bars with a cocoa percentage above 70%.

Linear regression

# Build the linear regression model
model <- lm(rating ~ cocoa_percent, data = data)

summary(model)
## 
## Call:
## lm(formula = rating ~ cocoa_percent, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.21541 -0.23867  0.03459  0.28459  0.99393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.0295     0.1121  35.949  < 2e-16 ***
## cocoa_percent  -1.1630     0.1560  -7.456 1.22e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4406 on 2528 degrees of freedom
## Multiple R-squared:  0.02152,    Adjusted R-squared:  0.02113 
## F-statistic: 55.59 on 1 and 2528 DF,  p-value: 1.218e-13

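The intercept, 4.0295, is the predicted rating when cocoa_percent is 0. It has no practical interpretation here because the observed cocoa percentages range from 0.42 to 1.00, so 0 lies far outside the data.

The coefficient of cocoa_percent is -1.1630. Because cocoa_percent is stored as a proportion, a one-unit increase corresponds to going from 0% to 100% cocoa. A more useful reading is that a 10-percentage-point increase (0.10) is associated with a predicted rating about 0.116 lower.

Both the intercept and the slope are statistically significant (p < 0.001), so the data are inconsistent with a true slope of zero. The model nonetheless explains only about 2% of the variance in rating (R-squared = 0.0215), so cocoa percentage alone is a weak predictor.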
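To make the coefficients concrete, the arithmetic below plugs two cocoa percentages into the fitted line and relates R-squared back to the correlation computed earlier. The inputs are the rounded values printed in the model summary; the helper name `predict_rating` is introduced here only for illustration.

```r
# Coefficients copied (rounded) from summary(model) above
intercept <- 4.0295
slope <- -1.1630
r_squared <- 0.02152

# Hypothetical helper: fitted rating for a given cocoa proportion
predict_rating <- function(cocoa_percent) intercept + slope * cocoa_percent

pred_70 <- predict_rating(0.70)
pred_80 <- predict_rating(0.80)

cat("predicted rating at 70% cocoa:", round(pred_70, 4), "\n")
cat("predicted rating at 80% cocoa:", round(pred_80, 4), "\n")
cat("change per 10 percentage points:", round(pred_80 - pred_70, 4), "\n")

# With one predictor, the correlation is the signed square root of R-squared
implied_r <- sign(slope) * sqrt(r_squared)
cat("correlation implied by R-squared:", round(implied_r, 4), "\n")
```

The implied correlation agrees with the coefficient of about -0.147 reported in the first section, which is expected for a regression with a single predictor.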

Extended Linear Model

# Build the extended linear regression model
model_extended <- lm(rating ~ cocoa_percent  + review_date + ingredients, data = data)

#summary 
summary(model_extended)
## 
## Call:
## lm(formula = rating ~ cocoa_percent + review_date + ingredients, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03025 -0.27248  0.00066  0.27472  1.12161 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -3.425339   4.830579  -0.709 0.478332    
## cocoa_percent             -1.155747   0.160627  -7.195 8.20e-13 ***
## review_date                0.003535   0.002399   1.474 0.140741    
## ingredients1- B            0.414298   0.184247   2.249 0.024624 *  
## ingredients2- B,C          0.482479   0.429995   1.122 0.261946    
## ingredients2- B,S          0.362282   0.049684   7.292 4.08e-13 ***
## ingredients2- B,S*         0.099077   0.089983   1.101 0.270975    
## ingredients3- B,S*,C       0.059607   0.131780   0.452 0.651076    
## ingredients3- B,S*,Sa     -0.324134   0.428185  -0.757 0.449124    
## ingredients3- B,S,C        0.408212   0.048966   8.337  < 2e-16 ***
## ingredients3- B,S,L       -0.148310   0.157343  -0.943 0.345981    
## ingredients3- B,S,V        0.318557   0.250128   1.274 0.202933    
## ingredients4- B,S*,C,L     0.019729   0.304534   0.065 0.948351    
## ingredients4- B,S*,C,Sa    0.230198   0.106023   2.171 0.030009 *  
## ingredients4- B,S*,C,V     0.133851   0.167513   0.799 0.424338    
## ingredients4- B,S*,V,L     0.255664   0.250083   1.022 0.306729    
## ingredients4- B,S,C,L      0.335705   0.053063   6.327 2.96e-10 ***
## ingredients4- B,S,C,Sa     0.250620   0.196169   1.278 0.201520    
## ingredients4- B,S,C,V      0.112655   0.058501   1.926 0.054257 .  
## ingredients4- B,S,V,L     -0.018008   0.196335  -0.092 0.926928    
## ingredients5- B,S,C,L,Sa   0.071849   0.428461   0.168 0.866840    
## ingredients5- B,S,C,V,L    0.204217   0.056593   3.609 0.000314 ***
## ingredients5-B,S,C,V,Sa   -0.055265   0.179851  -0.307 0.758654    
## ingredients6-B,S,C,V,L,Sa  0.074751   0.218020   0.343 0.731732    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4257 on 2506 degrees of freedom
## Multiple R-squared:  0.09439,    Adjusted R-squared:  0.08608 
## F-statistic: 11.36 on 23 and 2506 DF,  p-value: < 2.2e-16
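One way to judge whether review_date and ingredients add explanatory value beyond cocoa_percent is a nested-model F-test. With the fitted objects available, `anova(model, model_extended)` would do this directly. The sketch below instead approximates the same statistic from the rounded residual standard errors and degrees of freedom printed in the two summaries, so its result is only as precise as those rounded inputs.

```r
# Values copied from the two model summaries above (rounded as printed)
rse_simple <- 0.4406; df_simple <- 2528   # rating ~ cocoa_percent
rse_ext <- 0.4257;    df_ext <- 2506      # extended model

# Residual sum of squares = (residual standard error)^2 * residual df
rss_simple <- rse_simple^2 * df_simple
rss_ext <- rse_ext^2 * df_ext

# F statistic for the extra terms (22 additional coefficients)
extra_terms <- df_simple - df_ext
f_stat <- ((rss_simple - rss_ext) / extra_terms) / (rss_ext / df_ext)
p_value <- pf(f_stat, extra_terms, df_ext, lower.tail = FALSE)

cat("extra coefficients:", extra_terms, "\n")
cat("approximate F:", round(f_stat, 2), "\n")
cat("p-value:", format.pval(p_value), "\n")
```

The approximate F statistic is well above what would be expected by chance, in line with the rise in adjusted R-squared from 0.021 to 0.086, although the extended model still leaves more than 90% of the variance unexplained.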

Model Diagnosis

gg_resfitted(model_extended) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

residual_plots <- gg_resX(model_extended)

gg_reshist(model_extended)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_qqplot(model_extended)

plot(cooks.distance(model_extended))

Conclusions and Practical scenario recommendations

Rating vs cocoa_percent

  • The analysis reveals a weak negative correlation between cocoa percentage and chocolate ratings, suggesting that higher cocoa percentages do not consistently lead to higher ratings in reviews.

  • Chocolate manufacturers should not solely rely on increasing cocoa percentage to improve ratings, as other factors may also influence consumer preferences.

Regional Variations in Chocolate Characteristics

  • Based on the visualization, “fruity” is the dominant characteristic for chocolates made with beans from Ecuador, Madagascar, and Peru, suggesting that reviewers consistently perceive fruity notes in chocolate from these origins.

  • Manufacturers sourcing beans from Ecuador, Madagascar, and Peru may consider highlighting these fruity notes in their chocolate products.

Hypothesis Testing

  • The statistical analysis indicates a significant difference in mean ratings between chocolate bars with cocoa percentages above and below 70%, with bars containing higher cocoa percentages receiving lower ratings on average.

  • Chocolate bars with cocoa percentages above and below 70% may cater to different consumer segments, and manufacturers should consider diversifying their product offerings to accommodate these preferences.

Linear regression

  • The linear regression analysis suggests that as cocoa percentage increases, the predicted chocolate rating decreases, with both the intercept and coefficient being statistically significant. The model explains only about 2% of the variance in rating (R-squared = 0.0215).

  • Manufacturers can expect slightly lower average ratings at higher cocoa percentages, which highlights the importance of balancing cocoa percentage with other flavor and texture factors.

Extended Linear Model

  • In conclusion, cocoa percentage and certain ingredient combinations appear to be significant predictors of the product rating, while the review date has a relatively weaker association. However, the model only explains a small portion of the variability in the rating, suggesting that other factors not included in the model may also play a role.

In practical scenarios, chocolate manufacturers should conduct thorough market research to understand consumer preferences in different regions and time periods. They should also consider diversifying their product offerings to cater to varying consumer preferences regarding cocoa percentage and flavor profiles. Additionally, ongoing experimentation and innovation are essential to meet evolving consumer tastes and preferences in the chocolate industry.