data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggplot2)
summary(data)
## ref company_manufacturer company_location review_date
## Min. : 5 Length:2530 Length:2530 Min. :2006
## 1st Qu.: 802 Class :character Class :character 1st Qu.:2012
## Median :1454 Mode :character Mode :character Median :2015
## Mean :1430 Mean :2014
## 3rd Qu.:2079 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:2530 Length:2530 Min. :0.4200
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7164
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:2530 Length:2530 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.196
## 3rd Qu.:3.500
## Max. :4.000
head(data)
## ref company_manufacturer company_location review_date country_of_bean_origin
## 1 2454 5150 U.S.A. 2019 Tanzania
## 2 2458 5150 U.S.A. 2019 Dominican Republic
## 3 2454 5150 U.S.A. 2019 Madagascar
## 4 2542 5150 U.S.A. 2021 Fiji
## 5 2546 5150 U.S.A. 2021 Venezuela
## 6 2546 5150 U.S.A. 2021 Uganda
## specific_bean_origin_or_bar_name cocoa_percent ingredients
## 1 Kokoa Kamili, batch 1 0.76 3- B,S,C
## 2 Zorzal, batch 1 0.76 3- B,S,C
## 3 Bejofo Estate, batch 1 0.76 3- B,S,C
## 4 Matasawalevu, batch 1 0.68 3- B,S,C
## 5 Sur del Lago, batch 1 0.72 3- B,S,C
## 6 Semuliki Forest, batch 1 0.80 3- B,S,C
## most_memorable_characteristics rating
## 1 rich cocoa, fatty, bready 3.25
## 2 cocoa, vegetal, savory 3.50
## 3 cocoa, blackberry, full body 3.75
## 4 chewy, off, rubbery 3.00
## 5 fatty, earthy, moss, nutty,chalky 3.00
## 6 mildly bitter, basic cocoa, fatty 3.25
1) Rating vs cocoa_percent
We could investigate whether there’s a correlation between the
cocoa percentage in chocolate and the rating given in the review. This
analysis could provide insight into whether higher cocoa percentages
generally lead to higher ratings, or if there’s an optimal percentage
range for better reviews.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Select the relevant columns
relevant_data <- data %>%
  select(cocoa_percent, rating)
# Calculate correlation coefficient
correlation <- cor(relevant_data$cocoa_percent, relevant_data$rating)
# Print correlation coefficient
print(paste("Correlation coefficient:", correlation))
## [1] "Correlation coefficient: -0.146689595080347"
The correlation coefficient of approximately -0.147 indicates a weak negative correlation between cocoa percentage and rating. As cocoa percentage increases, the rating tends to decrease slightly, but cocoa percentage accounts for only about 2% of the variance in rating (r² ≈ 0.022).
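A correlation coefficient on its own says nothing about uncertainty. The sketch below shows one way to obtain a significance test and a confidence interval with `cor.test()` from base R. It runs on a small simulated data frame (`toy`, hypothetical values) rather than the real file, so the numbers it prints are illustrative only.

```r
# Toy data standing in for the chocolate reviews (hypothetical values).
set.seed(1)
toy <- data.frame(cocoa_percent = runif(200, 0.42, 1.00))
toy$rating <- 4 - 1.2 * toy$cocoa_percent + rnorm(200, sd = 0.45)

# Pearson correlation with a significance test and confidence interval.
ct <- cor.test(toy$cocoa_percent, toy$rating)

r <- unname(ct$estimate)   # correlation coefficient
r_squared <- r^2           # share of variance captured by a linear fit
ci <- ct$conf.int          # 95% confidence interval for r

cat("r =", round(r, 3), " r^2 =", round(r_squared, 3), "\n")
cat("95% CI: [", round(ci[1], 3), ",", round(ci[2], 3), "]\n")
```

On the actual data, the same call would be `cor.test(relevant_data$cocoa_percent, relevant_data$rating)`.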
library(ggplot2)
# Create a scatter plot
ggplot(data, aes(x = cocoa_percent, y = rating)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Add linear regression line
labs(title = "Relationship between Cocoa Percentage and Rating",
x = "Cocoa Percentage",
y = "Rating")
## `geom_smooth()` using formula = 'y ~ x'
Data Point Distribution: The points in the scatter plot are widely scattered across the range of cocoa percentages and ratings rather than clustered along a line. This suggests that any linear relationship between the two variables is weak.
Regression Line:
The blue regression line in the plot has a very slight negative slope. This suggests a very weak tendency for higher cocoa percentage to be associated with slightly lower ratings. However, the spread of the data points around the line indicates that this is not a definitive trend, and there are many exceptions.
Overall, the data suggests no strong linear relationship between cocoa percentage and rating in this data set. While there might be a very slight negative association according to the regression line, the data points are scattered, and the strength of this association is likely weak.
We could analyze how chocolate characteristics vary based on the specific bean origin or country of origin. This could involve comparing the most memorable characteristics mentioned in reviews across different regions to identify any patterns or preferences.
library(dplyr)
library(tidyr)
library(ggplot2)
# Data Preparation
relevant_data <- data %>%
select(country_of_bean_origin, most_memorable_characteristics)
# Characteristics Extraction (Assuming characteristics are separated by commas)
relevant_data$most_memorable_characteristics <- strsplit(as.character(relevant_data$most_memorable_characteristics), ",\\s*")
# Unnest the characteristics into separate rows
unnested_data <- unnest(relevant_data, most_memorable_characteristics)
# Aggregation by country of bean origin
aggregated_data <- unnested_data %>%
group_by(country_of_bean_origin, most_memorable_characteristics) %>%
summarise(count = n(), .groups = "drop") %>%
arrange(country_of_bean_origin, desc(count))
# Filter out less common characteristics
aggregated_data <- aggregated_data %>%
filter(count > 10) # Adjust threshold as needed
# Order characteristics by frequency (factor levels are shared across facets, so the ordering within each facet is approximate)
aggregated_data <- aggregated_data %>%
group_by(country_of_bean_origin) %>%
mutate(most_memorable_characteristics = reorder(most_memorable_characteristics, count))
# Visualization
ggplot(aggregated_data, aes(x = most_memorable_characteristics, y = count, fill = country_of_bean_origin)) +
geom_bar(stat = "identity") +
facet_wrap(~ country_of_bean_origin, scales = "free_y", ncol = 2) +
labs(title = "Most Memorable Characteristics of Chocolates by Country of Bean Origin",
x = "Most Memorable Characteristics",
y = "Frequency",
fill = "Country of Bean Origin") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Dominant Characteristics by Country:
Ecuador: The most frequent characteristic associated with Ecuadorian chocolate is “fruity,” followed by “floral” and “acidic.” This suggests that Ecuadorian chocolates are often known for their bright and tangy flavors.
Dominican Republic: The most common characteristic for Dominican Republic chocolate is “nutty,” followed by “bourbon” and “earthy.” This indicates that chocolates from this origin are likely known for a nutty, earthy taste profile.
Madagascar: For Madagascar, the dominant characteristic is “fruity,” followed by “smoke” and “sandy.” Similar to Ecuador, Malagasy chocolates seem to be recognized for their fruity notes, but also for complexity in flavor.
Peru: “Fruity” is also the most frequent characteristic for Peruvian chocolate, followed by “cocoa.” This suggests Peruvian chocolates are known for prominent fruit flavors alongside a strong chocolate taste.
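The per-country rankings above are read off the bar chart. As a cross-check, the counts can also be tabulated directly. The following is a minimal base R sketch; the helper `top_characteristics()` and the `toy` data frame are hypothetical names introduced for illustration, and the rows are made up.

```r
# Toy reviews (hypothetical rows) with the same two columns as the real data.
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Peru", "Ecuador", "Ecuador"),
  most_memorable_characteristics = c(
    "fruity, cocoa", "fruity, nutty", "cocoa, fruity",
    "floral, fruity", "floral,earthy"
  ),
  stringsAsFactors = FALSE
)

# Return the n most frequent characteristics for one country.
top_characteristics <- function(df, country, n = 3) {
  rows <- df$country_of_bean_origin == country
  terms <- unlist(strsplit(df$most_memorable_characteristics[rows], ","))
  terms <- trimws(terms)          # drop stray spaces around each term
  terms <- terms[nzchar(terms)]   # drop empty strings
  head(sort(table(terms), decreasing = TRUE), n)
}

peru_top <- top_characteristics(toy, "Peru")
ecuador_top <- top_characteristics(toy, "Ecuador")
print(peru_top)
```

Calling `top_characteristics(data, "Ecuador")` on the real data frame would list the counts behind that country's panel.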
We could examine how chocolate preferences and ratings have evolved over time by analyzing reviews from different years. This could involve identifying any trends or shifts in consumer preferences, such as changes in preferred cocoa percentages or flavor profiles.
library(dplyr)
library(ggplot2)
# Data Preparation
relevant_data <- data %>%
  select(review_date, cocoa_percent, rating)
# Convert review_date to numeric year
relevant_data$review_year <- as.numeric(as.character(relevant_data$review_date))
# Aggregation by review year
aggregated_data <- relevant_data %>%
group_by(review_year) %>%
summarise(mean_cocoa_percent = mean(cocoa_percent, na.rm = TRUE),
mean_rating = mean(rating, na.rm = TRUE))
# Visualization (Line plot)
ggplot(aggregated_data, aes(x = review_year)) +
geom_line(aes(y = mean_cocoa_percent, color = "Mean Cocoa Percentage")) +
geom_line(aes(y = mean_rating, color = "Mean Rating")) +
labs(title = "Temporal Trends in Chocolate Preferences and Ratings",
x = "Review Year",
y = "Mean Value",
color = "Variable") +
scale_color_manual(values = c("Mean Cocoa Percentage" = "blue", "Mean Rating" = "red")) +
theme_minimal()
Cocoa Percentage Trend:
Rating Trend:
Relationship Between Trends:
Possible Explanations:
Consumer preferences for cocoa percentage might vary depending on other chocolate characteristics or individual taste. Some might prefer higher cocoa percentages for a more intense chocolate experience, while others might enjoy milkier chocolates regardless of cocoa content.
Chocolate manufacturers might be offering a wider variety of chocolates with different cocoa percentages and taste profiles to cater to diverse preferences. This could explain the fluctuations in both cocoa percentage and rating without a clear correlation.
Additional Considerations:
A longer range of review years and more data points per year would make this analysis more robust.
It might be helpful to explore if there are subgroups within the data (e.g., by chocolate type, origin, or brand) that exhibit different trends. This could reveal more specific patterns in consumer preferences.
Overall, the visualization suggests that there might not be a simple relationship between cocoa percentage and rating in this data set. Both factors seem to fluctuate independently, possibly reflecting the diverse preferences of chocolate consumers and the variety of chocolate products available.
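Mean cocoa percentage (around 0.7) and mean rating (around 3.2) sit on very different scales, so drawing both on one y-axis flattens each line. One option, sketched below in base R, is to standardize the yearly means before comparing them and to compute their correlation directly. The `toy` data frame holds simulated, hypothetical values; only the column names match the real data.

```r
# Toy data (hypothetical values) with the columns used in the trend analysis.
set.seed(42)
toy <- data.frame(
  review_date = sample(2006:2021, 400, replace = TRUE),
  cocoa_percent = round(runif(400, 0.55, 0.90), 2),
  rating = sample(seq(2, 4, by = 0.25), 400, replace = TRUE)
)

# Mean cocoa percentage and mean rating per review year.
yearly <- aggregate(cbind(cocoa_percent, rating) ~ review_date,
                    data = toy, FUN = mean)

# Standardize both series (z-scores) so they can share one axis.
yearly$cocoa_z <- as.numeric(scale(yearly$cocoa_percent))
yearly$rating_z <- as.numeric(scale(yearly$rating))

# Correlation between the two yearly series.
yearly_cor <- cor(yearly$cocoa_percent, yearly$rating)
print(head(yearly))
print(yearly_cor)
```

The `cocoa_z` and `rating_z` columns can then be plotted against `review_date` with two `geom_line()` layers, as in the plot above, without one series dwarfing the other.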
Null Hypothesis: There is no difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.
Alternative Hypothesis: There is a difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.
For this hypothesis, we will use a two-sample t-test to compare the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.
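Test: Two-sample t-test (Welch's version, which does not assume equal variances and is the default for t.test)
Alpha level: 0.05
Type II error rate (beta): 0.2 (power = 0.8)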
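The alpha and power settings determine how many observations are needed to detect a given difference. The sketch below uses `power.t.test()` from base R to show the calculation. The effect size and standard deviation are assumed planning values chosen for illustration, not figures from the original analysis, and the function assumes equal group sizes and equal variances, so the result is only an approximation for a Welch test.

```r
# Assumed planning values (hypothetical, not part of the original analysis):
# a difference of 0.1 rating points and a standard deviation of 0.45.
assumed_delta <- 0.1
assumed_sd <- 0.45

plan <- power.t.test(delta = assumed_delta, sd = assumed_sd,
                     sig.level = 0.05, power = 0.8,
                     type = "two.sample", alternative = "two.sided")

n_per_group <- ceiling(plan$n)  # round up to whole observations
print(n_per_group)
```

If the required size per group is well below the actual group sizes, the test has more than the targeted power for a difference of that magnitude.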
# Subset the data into two groups based on cocoa percentage
above_70 <- data[data$cocoa_percent > 0.7, ]
below_or_equal_70 <- data[data$cocoa_percent <= 0.7, ]
# Perform two-sample t-test
t_test_result <- t.test(above_70$rating, below_or_equal_70$rating)
# Print the result
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: above_70$rating and below_or_equal_70$rating
## t = -6.2712, df = 2257.1, p-value = 4.279e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.14750859 -0.07723218
## sample estimates:
## mean of x mean of y
## 3.132741 3.245112
The test statistic (t) is -6.2712.
The degrees of freedom (df) are approximately 2257.1.
The p-value is very small (4.279e-10), indicating strong evidence against the null hypothesis.
The 95% confidence interval for the difference in means ranges from -0.1475 to -0.0772.
The sample mean rating for chocolate bars with a cocoa percentage above 70% is approximately 3.133.
The sample mean rating for chocolate bars with a cocoa percentage below or equal to 70% is approximately 3.245.
Since the p-value is less than the chosen significance level (alpha = 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference in mean ratings between the two groups. The negative t-value reflects that the mean rating for bars above 70% cocoa is lower than that for bars at or below 70%, although the difference is small in absolute terms (about 0.11 rating points).
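With more than 2,500 reviews, even a small difference can produce a very small p-value, so it helps to report an effect size alongside the test. The sketch below computes Cohen's d with a pooled standard deviation. The helper `cohens_d()` is a hypothetical name introduced here, and the two rating vectors are made-up toy values.

```r
# Cohen's d: difference in means divided by the pooled standard deviation.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy ratings (hypothetical values) for the two cocoa groups.
above_toy <- c(3.00, 3.25, 2.75, 3.50, 3.00)
below_toy <- c(3.25, 3.50, 3.00, 3.75, 3.25)

d <- cohens_d(above_toy, below_toy)
print(d)
```

On the real data, the call would be `cohens_d(above_70$rating, below_or_equal_70$rating)`. By a common convention, values near 0.2 in absolute size are read as small effects.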
# Create a box plot of rating by cocoa group, colored by group
ggplot(data = data, aes(x = factor(cocoa_percent > 0.7), y = rating, fill = factor(cocoa_percent > 0.7))) +
  geom_boxplot() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightgreen")) +
  labs(x = "Cocoa Percentage Above 70%", y = "Rating", fill = "Above 70%") +
  theme_minimal() +
  ggtitle("Comparison of Ratings for Chocolate Bars") +
  theme(plot.title = element_text(hjust = 0.5))
The x-axis shows whether the cocoa percentage is above 70% (TRUE) or below/equal to 70% (FALSE).
The y-axis represents the rating of chocolate bars.
The box plots show the distribution of ratings for each group.
The 95% confidence interval from the t-test describes the difference in means (about -0.15 to -0.08 rating points), not a rating level, so it is not drawn on the rating axis.
Boxes colored in light blue represent chocolate bars with a cocoa percentage below or equal to 70%, and boxes colored in light green represent bars with a cocoa percentage above 70%.
# Build the linear regression model
model <- lm(rating ~ cocoa_percent, data = data)
summary(model)
##
## Call:
## lm(formula = rating ~ cocoa_percent, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.21541 -0.23867 0.03459 0.28459 0.99393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0295 0.1121 35.949 < 2e-16 ***
## cocoa_percent -1.1630 0.1560 -7.456 1.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4406 on 2528 degrees of freedom
## Multiple R-squared: 0.02152, Adjusted R-squared: 0.02113
## F-statistic: 55.59 on 1 and 2528 DF, p-value: 1.218e-13
The intercept, 4.0295, is the predicted rating when cocoa_percent is 0. It has no practical interpretation here because the observed cocoa percentages range from 0.42 to 1.00, so a value of 0 lies far outside the data.
The coefficient of cocoa_percent is -1.1630. Because cocoa_percent is stored as a proportion between 0 and 1, a one-unit increase corresponds to going from 0% to 100% cocoa; a more useful reading is that the predicted rating drops by about 0.116 for every 10 percentage point increase in cocoa.
Both the intercept and the cocoa_percent coefficient are statistically significant (p < 0.001), so the data are not consistent with either being zero.
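To make the coefficients concrete, the fitted line can be evaluated at two cocoa levels inside the observed range. The sketch below copies the intercept and slope from the `summary(model)` output; `predict_rating()` is a hypothetical helper name, and the result is the same arithmetic that `predict(model, newdata = ...)` would perform on the fitted object.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 4.0295
slope <- -1.1630

# Predicted rating for a given cocoa proportion (0 to 1 scale).
predict_rating <- function(cocoa_percent) {
  intercept + slope * cocoa_percent
}

pred_70 <- predict_rating(0.70)
pred_80 <- predict_rating(0.80)
change_per_10_points <- pred_80 - pred_70

# About 3.2154, 3.0991 and -0.1163
print(round(c(pred_70, pred_80, change_per_10_points), 4))
```

Moving from a 70% bar to an 80% bar therefore lowers the predicted rating by roughly a tenth of a point, which is small relative to the residual standard error of 0.44.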
# Build the extended linear regression model
model_extended <- lm(rating ~ cocoa_percent + review_date + ingredients, data = data)
# Summary of the extended model
summary(model_extended)
##
## Call:
## lm(formula = rating ~ cocoa_percent + review_date + ingredients,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03025 -0.27248 0.00066 0.27472 1.12161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.425339 4.830579 -0.709 0.478332
## cocoa_percent -1.155747 0.160627 -7.195 8.20e-13 ***
## review_date 0.003535 0.002399 1.474 0.140741
## ingredients1- B 0.414298 0.184247 2.249 0.024624 *
## ingredients2- B,C 0.482479 0.429995 1.122 0.261946
## ingredients2- B,S 0.362282 0.049684 7.292 4.08e-13 ***
## ingredients2- B,S* 0.099077 0.089983 1.101 0.270975
## ingredients3- B,S*,C 0.059607 0.131780 0.452 0.651076
## ingredients3- B,S*,Sa -0.324134 0.428185 -0.757 0.449124
## ingredients3- B,S,C 0.408212 0.048966 8.337 < 2e-16 ***
## ingredients3- B,S,L -0.148310 0.157343 -0.943 0.345981
## ingredients3- B,S,V 0.318557 0.250128 1.274 0.202933
## ingredients4- B,S*,C,L 0.019729 0.304534 0.065 0.948351
## ingredients4- B,S*,C,Sa 0.230198 0.106023 2.171 0.030009 *
## ingredients4- B,S*,C,V 0.133851 0.167513 0.799 0.424338
## ingredients4- B,S*,V,L 0.255664 0.250083 1.022 0.306729
## ingredients4- B,S,C,L 0.335705 0.053063 6.327 2.96e-10 ***
## ingredients4- B,S,C,Sa 0.250620 0.196169 1.278 0.201520
## ingredients4- B,S,C,V 0.112655 0.058501 1.926 0.054257 .
## ingredients4- B,S,V,L -0.018008 0.196335 -0.092 0.926928
## ingredients5- B,S,C,L,Sa 0.071849 0.428461 0.168 0.866840
## ingredients5- B,S,C,V,L 0.204217 0.056593 3.609 0.000314 ***
## ingredients5-B,S,C,V,Sa -0.055265 0.179851 -0.307 0.758654
## ingredients6-B,S,C,V,L,Sa 0.074751 0.218020 0.343 0.731732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4257 on 2506 degrees of freedom
## Multiple R-squared: 0.09439, Adjusted R-squared: 0.08608
## F-statistic: 11.36 on 23 and 2506 DF, p-value: < 2.2e-16
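The extended model contains the simple model as a special case, so the two can be compared with an F-test for nested models. The sketch below illustrates this with `anova()` on simulated data; the `toy` data frame and its values are hypothetical, and only three ingredient levels are used to keep the example short.

```r
# Toy data (hypothetical values) with the same variable names as the models above.
set.seed(7)
n <- 300
toy <- data.frame(
  cocoa_percent = runif(n, 0.5, 0.9),
  review_date = sample(2006:2021, n, replace = TRUE),
  ingredients = sample(c("2- B,S", "3- B,S,C", "4- B,S,C,L"), n, replace = TRUE)
)
toy$rating <- 4 - 1.2 * toy$cocoa_percent + rnorm(n, sd = 0.4)

# Nested models: the reduced model is a special case of the full model.
reduced <- lm(rating ~ cocoa_percent, data = toy)
full <- lm(rating ~ cocoa_percent + review_date + ingredients, data = toy)

# F-test for the joint contribution of the added terms.
comparison <- anova(reduced, full)
print(comparison)

# Adjusted R-squared penalizes the extra parameters.
adj_r2 <- c(reduced = summary(reduced)$adj.r.squared,
            full = summary(full)$adj.r.squared)
print(round(adj_r2, 4))
```

With the objects from this report, the equivalent call would be `anova(model, model_extended)`; both models were fit to the same 2,530 rows, which the comparison requires.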
Model Diagnosis
gg_resfitted(model_extended) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
residual_plots <- gg_resX(model_extended)
gg_reshist(model_extended)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_qqplot(model_extended)
plot(cooks.distance(model_extended))
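The Cook's distance plot is easier to act on when potentially influential observations are listed explicitly. The sketch below flags points above the common rule-of-thumb cutoff of 4/n. That cutoff is a convention rather than a formal test, and the `toy` data (hypothetical values, with one deliberately unusual row) stand in for the real model.

```r
# Toy data (hypothetical values) with one deliberately unusual observation.
set.seed(3)
toy <- data.frame(cocoa_percent = runif(100, 0.5, 0.9))
toy$rating <- 4 - 1.2 * toy$cocoa_percent + rnorm(100, sd = 0.3)
toy$cocoa_percent[100] <- 1.0   # far right on the x-axis ...
toy$rating[100] <- 1.0          # ... and far below the trend

fit <- lm(rating ~ cocoa_percent, data = toy)
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff of 4 / n (a convention, not a formal test).
cutoff <- 4 / nrow(toy)
flagged <- which(cd > cutoff)

print(flagged)
```

Applied to this report, `which(cooks.distance(model_extended) > 4 / nrow(data))` would return the row indices worth inspecting by hand.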
The analysis reveals a weak negative correlation between cocoa percentage and chocolate ratings, suggesting that higher cocoa percentages do not consistently lead to higher ratings in reviews.
Chocolate manufacturers should not solely rely on increasing cocoa percentage to improve ratings, as other factors may also influence consumer preferences.
Based on the visualization, “fruity” appears to be the dominant characteristic for chocolates made with beans from Ecuador, Madagascar, and Peru, potentially suggesting a preference for fruity notes in chocolates from these regions.
Manufacturers sourcing beans from regions like Ecuador, Madagascar, and Peru may consider incorporating fruity notes into their chocolate products to cater to potential regional preferences.
Based on the temporal visualization, mean cocoa percentage and mean rating show no clear joint trend across review years, suggesting that consumer preferences for these factors may be varied.
The lack of a clear trend between cocoa percentage and rating suggests that consumer preferences for these factors may vary over time. Continuous monitoring and adaptation to changing consumer preferences are recommended.
The statistical analysis indicates a significant difference in mean ratings between chocolate bars with cocoa percentages above and below 70%, with bars containing higher cocoa percentages receiving lower ratings on average.
Chocolate bars with cocoa percentages above and below 70% may cater to different consumer segments, and manufacturers should consider diversifying their product offerings to accommodate these preferences.
The linear regression analysis suggests that as cocoa percentage increases, the anticipated chocolate rating decreases, with both the intercept and coefficient being statistically significant.
Manufacturers can anticipate a decrease in chocolate ratings as cocoa percentage increases, highlighting the importance of balancing cocoa percentage with other flavor and texture factors.
In conclusion, cocoa percentage and certain ingredient combinations are statistically significant predictors of the product rating, while review date is not (p ≈ 0.14). However, the extended model explains only about 9% of the variability in the rating (R-squared = 0.094), suggesting that other factors not included in the model also play a role.
In practical scenarios, chocolate manufacturers should conduct thorough market research to understand consumer preferences in different regions and time periods. They should also consider diversifying their product offerings to cater to varying consumer preferences regarding cocoa percentage and flavor profiles. Additionally, ongoing experimentation and innovation are essential to meet evolving consumer tastes and preferences in the chocolate industry.