data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
data$most_memorable_characteristics_length <- nchar(data$most_memorable_characteristics)

# Pair 1
pair1 <- data[, c("rating", "cocoa_percent")]

# Pair 2
pair2 <- data[, c("review_date", "most_memorable_characteristics_length")]


print("Pair 1: rating vs. cocoa_percent")
## [1] "Pair 1: rating vs. cocoa_percent"
print(head(pair1))
##   rating cocoa_percent
## 1   3.25          0.76
## 2   3.50          0.76
## 3   3.75          0.76
## 4   3.00          0.68
## 5   3.00          0.72
## 6   3.25          0.80
print("Pair 2: review_date vs. most_memorable_characteristics_length")
## [1] "Pair 2: review_date vs. most_memorable_characteristics_length"
print(head(pair2))
##   review_date most_memorable_characteristics_length
## 1        2019                                    25
## 2        2019                                    22
## 3        2019                                    28
## 4        2021                                    19
## 5        2021                                    33
## 6        2021                                    33

Plot a visualization for each relationship

library(ggplot2)

ggplot(pair1, aes(x = cocoa_percent, y = rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Cocoa Percent and Rating",
       x = "Cocoa Percent",
       y = "Rating") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(pair2, aes(x = review_date, y = most_memorable_characteristics_length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Review Date and Memorable Characteristics Length",
       x = "Review Date",
       y = "Memorable Characteristics Length") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot suggests that chocolate bars with higher cocoa percentages tend to receive slightly lower ratings, indicating a weak negative linear relationship between cocoa percent and rating. This trend could reflect taste preferences or differences in the quality of high-cocoa chocolates.

In contrast, there doesn’t seem to be a clear linear relationship between the review date and the length of memorable characteristics in chocolate reviews. This suggests that the length of memorable characteristics in reviews hasn’t changed significantly over time. Further investigation could explore other factors, like the chocolate’s origin or brand, that might influence the length of memorable characteristics.

Correlation Coefficient

# Calculate correlation coefficients
cor_pair1 <- cor(pair1)
cor_pair2 <- cor(pair2)

print("Correlation coefficient for rating vs. cocoa_percent:")
## [1] "Correlation coefficient for rating vs. cocoa_percent:"
print(cor_pair1)
##                   rating cocoa_percent
## rating         1.0000000    -0.1466896
## cocoa_percent -0.1466896     1.0000000
print("Correlation coefficient for review_date vs. most_memorable_characteristics_length:")
## [1] "Correlation coefficient for review_date vs. most_memorable_characteristics_length:"
print(cor_pair2)
##                                       review_date
## review_date                            1.00000000
## most_memorable_characteristics_length  0.05670439
##                                       most_memorable_characteristics_length
## review_date                                                      0.05670439
## most_memorable_characteristics_length                            1.00000000

The correlation coefficient of about -0.15 is consistent with the scatter plot: it indicates a weak negative linear relationship between cocoa percent and rating. As cocoa percent increases, chocolate bars tend to receive slightly lower ratings, but cocoa percent accounts for only about 2% of the variation in rating (r² ≈ 0.02).

On the other hand, the correlation coefficient of about 0.06 indicates a very weak positive linear relationship between review date and the length of memorable characteristics. This aligns with the scatter plot’s indication that there is little to no relationship between these variables over time.
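A correlation coefficient on its own does not say whether the relationship is distinguishable from zero. One way to check this is `cor.test()`, which reports a p-value and a 95% confidence interval for the Pearson correlation. The sketch below is only an illustration: it uses synthetic stand-in data (the `demo` data frame and its coefficients are made up for this example), not the chocolate dataset, so its numbers say nothing about the real ratings.

```r
# Synthetic stand-in data: NOT the chocolate dataset analysed above.
set.seed(42)
n <- 500
cocoa_percent <- runif(n, min = 0.55, max = 1.00)

# Build a rating with a deliberately weak negative dependence on cocoa_percent
rating <- 3.6 - 0.6 * cocoa_percent + rnorm(n, sd = 0.45)
demo <- data.frame(rating = rating, cocoa_percent = cocoa_percent)

# cor() on two vectors returns a single number rather than a 2 x 2 matrix
r <- cor(demo$rating, demo$cocoa_percent)

# cor.test() adds a p-value and a 95% confidence interval for the correlation
ct <- cor.test(demo$rating, demo$cocoa_percent)

print(r)
print(ct$conf.int)
print(ct$p.value)
```

With the real data, the same call would presumably be `cor.test(pair1$rating, pair1$cocoa_percent)`. If the resulting interval excludes zero, the weak correlation is still unlikely to be due to chance alone.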

Confidence Intervals

rating_ci <- t.test(pair1$rating)$conf.int

print("Confidence Interval for Rating:")
## [1] "Confidence Interval for Rating:"
print(rating_ci)
## [1] 3.178983 3.213705
## attr(,"conf.level")
## [1] 0.95


Based on our analysis, we are 95% confident that the average rating of chocolate bars in the population falls between 3.18 and 3.21. The interval is narrow, so the sample mean gives a precise estimate of where the true average rating lies.
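To make the interval less of a black box, the sketch below recomputes a 95% confidence interval for a mean by hand, as the sample mean plus or minus the t critical value times the standard error, and compares it with `t.test()`. The `ratings` vector is synthetic and purely illustrative; it is not the chocolate data, so the printed bounds will differ from those above.

```r
# Synthetic stand-in ratings: NOT the chocolate dataset analysed above.
set.seed(1)
ratings <- round(rnorm(200, mean = 3.2, sd = 0.45) * 4) / 4  # quarter-point steps

n      <- length(ratings)
xbar   <- mean(ratings)               # sample mean
se     <- sd(ratings) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)       # two-sided 95% critical value

# Manual interval: mean +/- t * standard error
manual_ci <- c(xbar - t_crit * se, xbar + t_crit * se)

# Interval reported by t.test(), stripped of its conf.level attribute
builtin_ci <- as.numeric(t.test(ratings)$conf.int)

print(manual_ci)
print(builtin_ci)
```

The two intervals should agree, which shows that the width depends on the sample standard deviation and shrinks with the square root of the sample size.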

Significance and Further Investigation

These findings could help chocolate makers and sellers understand what consumers like and how to improve their product offerings. To dig deeper, we could examine other factors, such as price, brand image, or packaging, that might affect ratings. Analyzing the reviews more closely, for example the sentiment behind the most memorable characteristics, could give richer insight into consumer preferences and inform product development and marketing decisions.