Final project

data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")

library(ggthemes)
library(ggrepel)

## Loading required package: ggplot2

library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggplot2)

summary(data)

##       ref       company_manufacturer company_location    review_date  
##  Min.   :   5   Length:2530          Length:2530        Min.   :2006  
##  1st Qu.: 802   Class :character     Class :character   1st Qu.:2012  
##  Median :1454   Mode  :character     Mode  :character   Median :2015  
##  Mean   :1430                                           Mean   :2014  
##  3rd Qu.:2079                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:2530            Length:2530                      Min.   :0.4200  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7164  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:2530        Length:2530                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.196  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000

head(data)

##    ref company_manufacturer company_location review_date country_of_bean_origin
## 1 2454                 5150           U.S.A.        2019               Tanzania
## 2 2458                 5150           U.S.A.        2019     Dominican Republic
## 3 2454                 5150           U.S.A.        2019             Madagascar
## 4 2542                 5150           U.S.A.        2021                   Fiji
## 5 2546                 5150           U.S.A.        2021              Venezuela
## 6 2546                 5150           U.S.A.        2021                 Uganda
##   specific_bean_origin_or_bar_name cocoa_percent ingredients
## 1            Kokoa Kamili, batch 1          0.76    3- B,S,C
## 2                  Zorzal, batch 1          0.76    3- B,S,C
## 3           Bejofo Estate, batch 1          0.76    3- B,S,C
## 4            Matasawalevu, batch 1          0.68    3- B,S,C
## 5            Sur del Lago, batch 1          0.72    3- B,S,C
## 6         Semuliki Forest, batch 1          0.80    3- B,S,C
##      most_memorable_characteristics rating
## 1         rich cocoa, fatty, bready   3.25
## 2            cocoa, vegetal, savory   3.50
## 3      cocoa, blackberry, full body   3.75
## 4               chewy, off, rubbery   3.00
## 5 fatty, earthy, moss, nutty,chalky   3.00
## 6 mildly bitter, basic cocoa, fatty   3.25

1) Rating vs cocoa_percent
We could investigate whether there’s a correlation between the cocoa percentage in chocolate and the rating given in the review. This analysis could provide insight into whether higher cocoa percentages generally lead to higher ratings, or if there’s an optimal percentage range for better reviews.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Filter relevant columns
relevant_data <- data %>%
  select(cocoa_percent, rating)

# Calculate correlation coefficient
correlation <- cor(relevant_data$cocoa_percent, relevant_data$rating)

# Print correlation coefficient
print(paste("Correlation coefficient:", correlation))

## [1] "Correlation coefficient: -0.146689595080347"

The correlation coefficient of approximately -0.147 indicates a weak negative correlation between cocoa percentage and rating: as cocoa percentage increases, the rating tends to decrease slightly. Squaring the coefficient gives about 0.02, so a straight-line relationship accounts for roughly 2% of the variation in rating.
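A point estimate alone does not show how precisely the correlation is known. As a minimal sketch, base R's cor.test() returns a confidence interval and a p-value alongside the coefficient. The data frame `toy` below is simulated purely for illustration (its column names mirror the dataset, its values do not); substituting `data` should work as long as both columns are numeric and free of missing values.

```r
# Simulated stand-in for the chocolate data (values are NOT from the real dataset)
set.seed(1)
toy <- data.frame(cocoa_percent = runif(200, min = 0.42, max = 1.00))
toy$rating <- 4 - 1.2 * toy$cocoa_percent + rnorm(200, sd = 0.45)

# Pearson correlation with a 95% confidence interval and a test of r = 0
ct <- cor.test(toy$cocoa_percent, toy$rating, method = "pearson")

r <- unname(ct$estimate)   # correlation coefficient
r_squared <- r^2           # share of variance captured by a straight line
ci <- ct$conf.int          # lower and upper bounds for r

print(round(c(r = r, r_squared = r_squared, lower = ci[1], upper = ci[2]), 3))
```

An interval that excludes zero but stays close to it would support reading the association as real but weak.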

Visualization

library(ggplot2)

# Create a scatter plot
ggplot(data, aes(x = cocoa_percent, y = rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Add linear regression line
  labs(title = "Relationship between Cocoa Percentage and Rating",
       x = "Cocoa Percentage",
       y = "Rating")

## `geom_smooth()` using formula = 'y ~ x'

Data Point Distribution: The data points in the scatter plot are widely scattered across the range of cocoa percentages and ratings, which indicates that any linear relationship between the two variables is weak.

Regression Line:

The blue regression line in the plot has a very slight negative slope. This suggests a very weak tendency for higher cocoa percentage to be associated with slightly lower ratings. However, the spread of the data points around the line indicates that this is not a definitive trend, and there are many exceptions.

Overall, the data suggests no strong linear relationship between cocoa percentage and rating in this data set. While there might be a very slight negative association according to the regression line, the data points are scattered, and the strength of this association is likely weak.

2) Regional Variations in Chocolate Characteristics

We could analyze how chocolate characteristics vary based on the specific bean origin or country of origin. This could involve comparing the most memorable characteristics mentioned in reviews across different regions to identify any patterns or preferences.

library(dplyr)
library(tidyr)
library(ggplot2)

# Data Preparation
relevant_data <- data %>%
  select(country_of_bean_origin, most_memorable_characteristics)

# Characteristics Extraction (Assuming characteristics are separated by commas)
relevant_data$most_memorable_characteristics <- strsplit(as.character(relevant_data$most_memorable_characteristics), ",\\s*")

# Unnest the characteristics into separate rows
unnested_data <- unnest(relevant_data, most_memorable_characteristics)

# Aggregation by country of bean origin
aggregated_data <- unnested_data %>%
  group_by(country_of_bean_origin, most_memorable_characteristics) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(country_of_bean_origin, desc(count))

# Filter out less common characteristics 
aggregated_data <- aggregated_data %>%
  filter(count > 10)  # Adjust threshold as needed

# Order characteristics by count; a factor has a single level order, so this
# sets one ordering shared by all facets rather than a separate one per facet
aggregated_data <- aggregated_data %>%
  mutate(most_memorable_characteristics = reorder(most_memorable_characteristics, count))

# Visualization
ggplot(aggregated_data, aes(x = most_memorable_characteristics, y = count, fill = country_of_bean_origin)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ country_of_bean_origin, scales = "free_y", ncol = 2) +
  labs(title = "Most Memorable Characteristics of Chocolates by Country of Bean Origin",
       x = "Most Memorable Characteristics",
       y = "Frequency",
       fill = "Country of Bean Origin") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Dominant Characteristics by Country:

Ecuador: The most frequent characteristic associated with Ecuadorian chocolate is “fruity,” followed by “floral” and “acidic,” which suggests that reviewers often describe these chocolates as bright and tangy.
Dominican Republic: The most common characteristic is “nutty,” followed by “bourbon” and “earthy,” pointing to a nutty, earthy taste profile.
Madagascar: The dominant characteristic is “fruity,” followed by “smoke” and “sandy.” As with Ecuador, fruity notes lead, but the other descriptors suggest a more varied flavor and texture profile.
Peru: “Fruity” is also the most frequent characteristic for Peruvian chocolate, followed by “cocoa,” which suggests prominent fruit flavors alongside a strong chocolate taste.
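Raw counts favor origins with many reviews, so a descriptor can look dominant simply because a country is reviewed often. One possible refinement, sketched here in base R, is to divide each count by the number of reviews from that country. The mini data frame `toy` is hypothetical and exists only to make the example runnable; a descriptor repeated within a single review would be counted twice under this approach.

```r
# Hypothetical mini-dataset; the real analysis would use `data` instead
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Peru", "Ecuador", "Ecuador"),
  most_memorable_characteristics = c("fruity, cocoa", "fruity", "nutty,earthy",
                                     "floral, fruity", "floral"),
  stringsAsFactors = FALSE
)

# Split each review's descriptors on commas, tolerating missing spaces
terms <- strsplit(toy$most_memorable_characteristics, ",\\s*")

# One row per (country, descriptor) pair
long <- data.frame(
  country = rep(toy$country_of_bean_origin, lengths(terms)),
  term = trimws(unlist(terms)),
  stringsAsFactors = FALSE
)

# Count mentions, then divide by the number of reviews from each country
counts <- table(long$country, long$term)
reviews_per_country <- table(toy$country_of_bean_origin)
share <- counts / as.vector(reviews_per_country[rownames(counts)])

print(round(share, 2))
```

Each cell of `share` is then the average number of mentions per review for that country, which makes origins with very different review counts comparable.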

3) Temporal Trends

We could examine how chocolate preferences and ratings have evolved over time by analyzing reviews from different years. This could involve identifying any trends or shifts in consumer preferences, such as changes in preferred cocoa percentages or flavor profiles.

library(dplyr)
library(ggplot2)

# Data Preparation
relevant_data <- data %>%
  select(review_date, cocoa_percent, rating)

# Convert review_date to numeric year
relevant_data$review_year <- as.numeric(as.character(relevant_data$review_date))

# Aggregation by review year
aggregated_data <- relevant_data %>%
  group_by(review_year) %>%
  summarise(mean_cocoa_percent = mean(cocoa_percent, na.rm = TRUE),
            mean_rating = mean(rating, na.rm = TRUE))

# Visualization (Line plot)
ggplot(aggregated_data, aes(x = review_year)) +
  geom_line(aes(y = mean_cocoa_percent, color = "Mean Cocoa Percentage")) +
  geom_line(aes(y = mean_rating, color = "Mean Rating")) +
  labs(title = "Temporal Trends in Chocolate Preferences and Ratings",
       x = "Review Year",
       y = "Mean Value",
       color = "Variable") +
  scale_color_manual(values = c("Mean Cocoa Percentage" = "blue", "Mean Rating" = "red")) +
  theme_minimal()

Cocoa Percentage Trend:

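The blue line representing the mean cocoa percentage fluctuates somewhat, but there is no clear upward or downward trend over the review years. Because cocoa percentage is stored as a proportion (around 0.7) and shares a y-axis with rating (around 3.2), its year-to-year changes are visually compressed, so small shifts are hard to see in this plot.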

Rating Trend:

The red line representing the mean rating also fluctuates, and there’s no clear positive or negative trend. It seems to be relatively flat or with minor variations around a central value.

Relationship Between Trends:

Given the absence of clear trends in both cocoa percentage and rating, it’s difficult to establish a definitive relationship between them.

Possible Explanations:

Consumer preferences for cocoa percentage might vary depending on other chocolate characteristics or individual taste. Some might prefer higher cocoa percentages for a more intense chocolate experience, while others might enjoy milkier chocolates regardless of cocoa content.
Chocolate manufacturers might be offering a wider variety of chocolates with different cocoa percentages and taste profiles to cater to diverse preferences. This could explain the fluctuations in both cocoa percentage and rating without a clear correlation.

Additional Considerations:

The reviews span 2006 to 2021. Reporting how many reviews fall in each year would make the analysis more robust, because a yearly mean based on few reviews is noisy.
It might be helpful to explore if there are subgroups within the data (e.g., by chocolate type, origin, or brand) that exhibit different trends. This could reveal more specific patterns in consumer preferences.
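The considerations above can be sketched in base R by reporting, for each year, the number of reviews and a standard error next to the mean. The data frame `toy` is simulated for illustration only; the column names `review_date` and `rating` match the dataset, while the summary column names are choices made for this example.

```r
# Simulated stand-in for the review data (values are illustrative only)
set.seed(2)
toy <- data.frame(
  review_date = sample(2006:2021, size = 400, replace = TRUE),
  rating = round(rnorm(400, mean = 3.2, sd = 0.45) * 4) / 4
)

# Per-year count, mean, and standard error of the mean rating
by_year <- aggregate(rating ~ review_date, data = toy,
                     FUN = function(x) c(n = length(x), mean = mean(x),
                                         se = sd(x) / sqrt(length(x))))
# aggregate() returns a matrix column; flatten it into ordinary columns
by_year <- do.call(data.frame, by_year)
names(by_year) <- c("review_year", "n", "mean_rating", "se")

# Approximate 95% interval around each yearly mean
by_year$lower <- by_year$mean_rating - 1.96 * by_year$se
by_year$upper <- by_year$mean_rating + 1.96 * by_year$se

print(head(by_year))
```

Years whose intervals overlap heavily give little evidence of a real change, which is one way to tell fluctuation from trend.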

Overall, the visualization suggests that there might not be a simple relationship between cocoa percentage and rating in this data set. Both factors seem to fluctuate independently, possibly reflecting the diverse preferences of chocolate consumers and the variety of chocolate products available.

Hypothesis Testing

Null Hypothesis: There is no difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.
Alternative Hypothesis: There is a difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

For this hypothesis, we will use a two-sample t-test to compare the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

Neyman-Pearson Framework:

Test: Two-sample t-test
Alpha level: 0.05
Type II error rate (beta): 0.2 (power = 0.8)
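To connect the chosen alpha and power to a sample size, a rough calculation with base R's power.t.test() may help. The effect size `delta` and standard deviation `sigma` below are assumptions picked only for illustration, not values taken from the analysis. Note also that power.t.test() assumes equal group sizes and equal variances, whereas t.test() defaults to the Welch test, so the result is a guide rather than an exact requirement.

```r
# Assumed inputs (hypothetical, chosen only to illustrate the calculation)
delta <- 0.10   # smallest difference in mean rating considered worth detecting
sigma <- 0.45   # assumed standard deviation of ratings within each group

# Solve for the per-group sample size at alpha = 0.05 and power = 0.80
pw <- power.t.test(delta = delta, sd = sigma, sig.level = 0.05,
                   power = 0.80, type = "two.sample",
                   alternative = "two.sided")

n_per_group <- ceiling(pw$n)
print(n_per_group)
```

If both groups in the real data exceed the resulting size, the test has at least the planned power for a difference of that magnitude under these assumptions.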

# Subset the data into two groups based on cocoa percentage
above_70 <- data[data$cocoa_percent > 0.7, ]
below_or_equal_70 <- data[data$cocoa_percent <= 0.7, ]

# Perform two-sample t-test
t_test_result <- t.test(above_70$rating, below_or_equal_70$rating)

# Print the result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  above_70$rating and below_or_equal_70$rating
## t = -6.2712, df = 2257.1, p-value = 4.279e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14750859 -0.07723218
## sample estimates:
## mean of x mean of y 
##  3.132741  3.245112

The test statistic (t) is -6.2712.
The degrees of freedom (df) are approximately 2257.1.
The p-value is very small (4.279e-10), indicating strong evidence against the null hypothesis.
The 95% confidence interval for the difference in means ranges from -0.1475 to -0.0772.
The sample mean rating for chocolate bars with a cocoa percentage above 70% is approximately 3.133.
The sample mean rating for chocolate bars with a cocoa percentage below or equal to 70% is approximately 3.245.

Since the p-value is less than the chosen significance level (alpha = 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference in mean ratings between the two groups. The negative t-value shows that bars above 70% cocoa have the lower mean rating. The difference is small in absolute terms: about 0.11 rating points, where observed ratings range from 1 to 4.
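With thousands of reviews, a very small p-value can accompany a modest difference, so it can be useful to report a standardized effect size next to the test. The sketch below computes Cohen's d with a pooled standard deviation. The helper `cohens_d()` and the simulated vectors `above_toy` and `below_toy` are hypothetical names and values introduced for this example, not part of the original analysis.

```r
# Simulated ratings for two groups (illustrative values, not the real data)
set.seed(3)
above_toy <- rnorm(300, mean = 3.13, sd = 0.45)
below_toy <- rnorm(300, mean = 3.25, sd = 0.45)

# Welch test: same call as in the analysis, applied to the toy vectors
tt <- t.test(above_toy, below_toy)

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

d <- cohens_d(above_toy, below_toy)
print(c(p_value = tt$p.value, cohens_d = d))
```

On the real data the call would be cohens_d(above_70$rating, below_or_equal_70$rating). By a common rule of thumb, an absolute d near 0.2 is read as a small effect.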

Visualization to illustrate the results of hypothesis

# Box plot of ratings by cocoa group (FALSE: at or below 70%, TRUE: above 70%)
ggplot(data = data, aes(x = factor(cocoa_percent > 0.7), y = rating, fill = factor(cocoa_percent > 0.7))) +
  geom_boxplot() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightgreen")) +
  labs(x = "Cocoa Percentage (Above 70%)", y = "Rating", fill = "Above 70%") +
  theme_minimal() +
  # The t-test confidence interval is on the difference scale (about -0.15 to -0.08),
  # so it is not drawn on the rating axis
  ggtitle("Comparison of Ratings for Chocolate Bars") +
  theme(plot.title = element_text(hjust = 0.5))

The x-axis represents whether the cocoa percentage is above 70% (TRUE) or below/equal to 70% (FALSE).
The y-axis represents the rating of chocolate bars.
The box plots show the distribution of ratings for each group.
The 95% confidence interval from the t-test (-0.1475 to -0.0772) describes the difference in means, not individual ratings, so it is reported in the text rather than drawn on the rating axis.
Boxes colored in light blue represent chocolate bars with a cocoa percentage below or equal to 70%, and boxes colored in light green represent bars with a cocoa percentage above 70%.

Linear regression

# Build the linear regression model
model <- lm(rating ~ cocoa_percent, data = data)

summary(model)

## 
## Call:
## lm(formula = rating ~ cocoa_percent, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.21541 -0.23867  0.03459  0.28459  0.99393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.0295     0.1121  35.949  < 2e-16 ***
## cocoa_percent  -1.1630     0.1560  -7.456 1.22e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4406 on 2528 degrees of freedom
## Multiple R-squared:  0.02152,    Adjusted R-squared:  0.02113 
## F-statistic: 55.59 on 1 and 2528 DF,  p-value: 1.218e-13

The intercept, 4.0295, is the predicted rating when cocoa_percent is 0. It has no useful interpretation here because the observed cocoa_percent values range from 0.42 to 1.00, so a value of 0 lies far outside the data.
The coefficient of cocoa_percent is -1.1630. Since cocoa_percent is stored as a proportion, one unit corresponds to 100 percentage points; a 10-percentage-point increase in cocoa is therefore associated with a drop of about 0.12 in predicted rating.
Both the intercept and the cocoa_percent coefficient are statistically significant, so it is unlikely that their true values are zero.
The multiple R-squared of 0.02152 shows that cocoa percentage alone explains only about 2% of the variation in rating.
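The fitted line can be checked by hand using the two coefficients reported in the summary(model) output. The sketch below plugs in two cocoa levels inside the observed range; the function name `predict_rating()` is introduced here for illustration. On the real model object, predict(model, newdata = data.frame(cocoa_percent = c(0.70, 0.80))) should give matching values up to rounding.

```r
# Coefficients copied from the summary(model) output
intercept <- 4.0295
slope <- -1.1630   # change in rating per 1.0 (100 percentage points) of cocoa

# Predicted rating for a given cocoa proportion under the fitted line
predict_rating <- function(cocoa_percent) intercept + slope * cocoa_percent

# Predictions inside the observed range (0.42 to 1.00)
pred_70 <- predict_rating(0.70)
pred_80 <- predict_rating(0.80)

# Effect of a 10-percentage-point increase
change_per_10_points <- slope * 0.10

print(round(c(pred_70 = pred_70, pred_80 = pred_80,
              change = change_per_10_points), 3))
```

The predictions come to roughly 3.22 at 70% cocoa and 3.10 at 80% cocoa, a gap that is small compared with the residual standard error of 0.4406.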

Extended Linear Model

# Build the extended linear regression model
model_extended <- lm(rating ~ cocoa_percent  + review_date + ingredients, data = data)

# Summary
summary(model_extended)

## 
## Call:
## lm(formula = rating ~ cocoa_percent + review_date + ingredients, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03025 -0.27248  0.00066  0.27472  1.12161 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -3.425339   4.830579  -0.709 0.478332    
## cocoa_percent             -1.155747   0.160627  -7.195 8.20e-13 ***
## review_date                0.003535   0.002399   1.474 0.140741    
## ingredients1- B            0.414298   0.184247   2.249 0.024624 *  
## ingredients2- B,C          0.482479   0.429995   1.122 0.261946    
## ingredients2- B,S          0.362282   0.049684   7.292 4.08e-13 ***
## ingredients2- B,S*         0.099077   0.089983   1.101 0.270975    
## ingredients3- B,S*,C       0.059607   0.131780   0.452 0.651076    
## ingredients3- B,S*,Sa     -0.324134   0.428185  -0.757 0.449124    
## ingredients3- B,S,C        0.408212   0.048966   8.337  < 2e-16 ***
## ingredients3- B,S,L       -0.148310   0.157343  -0.943 0.345981    
## ingredients3- B,S,V        0.318557   0.250128   1.274 0.202933    
## ingredients4- B,S*,C,L     0.019729   0.304534   0.065 0.948351    
## ingredients4- B,S*,C,Sa    0.230198   0.106023   2.171 0.030009 *  
## ingredients4- B,S*,C,V     0.133851   0.167513   0.799 0.424338    
## ingredients4- B,S*,V,L     0.255664   0.250083   1.022 0.306729    
## ingredients4- B,S,C,L      0.335705   0.053063   6.327 2.96e-10 ***
## ingredients4- B,S,C,Sa     0.250620   0.196169   1.278 0.201520    
## ingredients4- B,S,C,V      0.112655   0.058501   1.926 0.054257 .  
## ingredients4- B,S,V,L     -0.018008   0.196335  -0.092 0.926928    
## ingredients5- B,S,C,L,Sa   0.071849   0.428461   0.168 0.866840    
## ingredients5- B,S,C,V,L    0.204217   0.056593   3.609 0.000314 ***
## ingredients5-B,S,C,V,Sa   -0.055265   0.179851  -0.307 0.758654    
## ingredients6-B,S,C,V,L,Sa  0.074751   0.218020   0.343 0.731732    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4257 on 2506 degrees of freedom
## Multiple R-squared:  0.09439,    Adjusted R-squared:  0.08608 
## F-statistic: 11.36 on 23 and 2506 DF,  p-value: < 2.2e-16

Model Diagnosis

gg_resfitted(model_extended) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

residual_plots <- gg_resX(model_extended)

gg_reshist(model_extended)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_qqplot(model_extended)

plot(cooks.distance(model_extended))

Conclusions and Practical scenario recommendations

Rating vs cocoa_percent

The analysis reveals a weak negative correlation between cocoa percentage and chocolate ratings, suggesting that higher cocoa percentages do not consistently lead to higher ratings in reviews.
Chocolate manufacturers should not solely rely on increasing cocoa percentage to improve ratings, as other factors may also influence consumer preferences.

Regional Variations in Chocolate Characteristics
Based on the visualization, “fruity” appears to be a dominant characteristic of chocolates made with beans from Ecuador, Madagascar, and Peru. The descriptors come from reviewers, so they indicate how these chocolates taste rather than what consumers in those countries prefer.
Manufacturers sourcing beans from Ecuador, Madagascar, and Peru may consider highlighting fruity notes in products made from these origins.

Temporal Trends

Based on the visualization, neither mean cocoa percentage nor mean rating shows a clear upward or downward trend across the review years.
Because no stable trend is visible, consumer preferences for these factors may be varied; continuous monitoring and adaptation to changing consumer preferences are recommended.

Hypothesis Testing

The statistical analysis indicates a significant difference in mean ratings between chocolate bars with cocoa percentages above and below 70%, with bars containing higher cocoa percentages receiving lower ratings on average.
Chocolate bars with cocoa percentages above and below 70% may cater to different consumer segments, and manufacturers should consider diversifying their product offerings to accommodate these preferences.

Linear regression

The linear regression estimates a drop of about 0.12 rating points for each 10-percentage-point increase in cocoa content. Both the intercept and the slope are statistically significant, but the model explains only about 2% of the variance in ratings.
Manufacturers can expect slightly lower ratings on average as cocoa percentage rises, which highlights the importance of balancing cocoa percentage with other flavor and texture factors.

Extended Linear Model

In conclusion, cocoa percentage and certain ingredient combinations appear to be significant predictors of the product rating, while the review date is not statistically significant (p = 0.14). However, the model explains only a small portion of the variability in the rating (multiple R-squared of about 0.09), suggesting that other factors not included in the model may also play a role.
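Whether the added predictors improve on the cocoa-only model as a group can be checked with a partial F-test, since the two models are nested. The sketch below uses simulated data with hypothetical values and only three ingredient levels; on the real objects the equivalent call would be anova(model, model_extended).

```r
# Simulated stand-in data (illustrative only)
set.seed(4)
n <- 300
toy <- data.frame(
  cocoa_percent = runif(n, 0.42, 1.00),
  review_date = sample(2006:2021, n, replace = TRUE),
  ingredients = factor(sample(c("2- B,S", "3- B,S,C", "4- B,S,C,L"), n, replace = TRUE))
)
toy$rating <- 4 - 1.2 * toy$cocoa_percent +
  0.2 * (toy$ingredients == "3- B,S,C") + rnorm(n, sd = 0.4)

# Nested models: the second adds review_date and ingredients to the first
m_small <- lm(rating ~ cocoa_percent, data = toy)
m_large <- lm(rating ~ cocoa_percent + review_date + ingredients, data = toy)

# Partial F-test: do the added terms reduce residual variance beyond chance?
cmp <- anova(m_small, m_large)
print(cmp)

# Adjusted R-squared penalises the extra parameters
print(c(small = summary(m_small)$adj.r.squared,
        large = summary(m_large)$adj.r.squared))
```

A small p-value in the second row of the table would indicate that review date and ingredients jointly add explanatory power beyond cocoa percentage.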

In practical scenarios, chocolate manufacturers should conduct thorough market research to understand consumer preferences in different regions and time periods. They should also consider diversifying their product offerings to cater to varying consumer preferences regarding cocoa percentage and flavor profiles. Additionally, ongoing experimentation and innovation are essential to meet evolving consumer tastes and preferences in the chocolate industry.

2024-04-22