---
title: "data dive week 2"
output: html_document
---

Load your dataset

data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")

Numeric Summary for Column 1

summary_col1 <- summary(data$cocoa_percent) 
print(summary_col1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4200  0.7000  0.7000  0.7164  0.7400  1.0000

Numeric Summary for Column 2

summary_col2 <- summary(data$rating)
print(summary_col2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.250   3.196   3.500   4.000

Categorical Summary for Column 3

unique_values_col3 <- unique(data$ingredients)
count_values_col3 <- table(data$ingredients)
print(unique_values_col3)

##  [1] "3- B,S,C"       "4- B,S,C,L"     "2- B,S"         "4- B,S,C,V"    
##  [5] "5- B,S,C,V,L"   "6-B,S,C,V,L,Sa" "5-B,S,C,V,Sa"   ""              
##  [9] "4- B,S,V,L"     "2- B,S*"        "1- B"           "3- B,S*,C"     
## [13] "3- B,S,L"       "3- B,S,V"       "4- B,S*,C,L"    "4- B,S*,C,Sa"  
## [17] "3- B,S*,Sa"     "4- B,S,C,Sa"    "4- B,S*,V,L"    "2- B,C"        
## [21] "4- B,S*,C,V"    "5- B,S,C,L,Sa"

print(count_values_col3)

## 
##                          1- B         2- B,C         2- B,S        2- B,S* 
##             87              6              1            718             31 
##      3- B,S*,C     3- B,S*,Sa       3- B,S,C       3- B,S,L       3- B,S,V 
##             12              1            999              8              3 
##    4- B,S*,C,L   4- B,S*,C,Sa    4- B,S*,C,V    4- B,S*,V,L     4- B,S,C,L 
##              2             20              7              3            286 
##    4- B,S,C,Sa     4- B,S,C,V     4- B,S,V,L  5- B,S,C,L,Sa   5- B,S,C,V,L 
##              5            141              5              1            184 
##   5-B,S,C,V,Sa 6-B,S,C,V,L,Sa 
##              6              4

Print the summaries

cat("Categorical Summary for ingredients:\n")

## Categorical Summary for ingredients:

# Convert the table itself so each ingredient value stays paired with its own count
print(as.data.frame(count_values_col3))

##              Var1 Freq
## 1                   87
## 2            1- B    6
## 3          2- B,C    1
## 4          2- B,S  718
## 5         2- B,S*   31
## 6       3- B,S*,C   12
## 7      3- B,S*,Sa    1
## 8        3- B,S,C  999
## 9        3- B,S,L    8
## 10       3- B,S,V    3
## 11    4- B,S*,C,L    2
## 12   4- B,S*,C,Sa   20
## 13    4- B,S*,C,V    7
## 14    4- B,S*,V,L    3
## 15     4- B,S,C,L  286
## 16    4- B,S,C,Sa    5
## 17     4- B,S,C,V  141
## 18     4- B,S,V,L    5
## 19  5- B,S,C,L,Sa    1
## 20   5- B,S,C,V,L  184
## 21   5-B,S,C,V,Sa    6
## 22 6-B,S,C,V,L,Sa    4

Hypothesis

1. Given the data documentation and column summaries, can we identify particular attributes of highly rated chocolate bars? How do these attributes differ across the countries where the beans are grown?
2. Are there any observable trends in cocoa percentage over the years? Is there a correlation between cocoa percentage and the rating of chocolate bars?
3. How do the country of bean origin and the company’s location interact with the distribution of ratings?

Aggregation for question 2

data$review_date <- as.factor(data$review_date)

# Aggregate mean rating and cocoa percentage for each year
agg_data <- aggregate(cbind(rating, cocoa_percent) ~ review_date, data = data, FUN = mean)
print(agg_data)

##    review_date   rating cocoa_percent
## 1         2006 3.004032     0.7043548
## 2         2007 3.102740     0.7208219
## 3         2008 3.000000     0.7267391
## 4         2009 3.073171     0.7044309
## 5         2010 3.152273     0.7081364
## 6         2011 3.257669     0.7096933
## 7         2012 3.180412     0.7155155
## 8         2013 3.196721     0.7227869
## 9         2014 3.189271     0.7225304
## 10        2015 3.244718     0.7202113
## 11        2016 3.228111     0.7177419
## 12        2017 3.361905     0.7151429
## 13        2018 3.191886     0.7123026
## 14        2019 3.134715     0.7197150
## 15        2020 3.256173     0.7074074
## 16        2021 3.320000     0.7176000

library(ggplot2)
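
The insight on rating trends refers to a line plot of the yearly averages. A minimal sketch of such a plot follows, using the yearly mean ratings copied from the aggregation output above. The names `yearly`, `year`, and `trend_plot` are illustrative and not part of the original analysis.

```r
library(ggplot2)

# Yearly mean ratings copied from the aggregation output above
yearly <- data.frame(
  review_date = factor(2006:2021),
  rating = c(3.004032, 3.102740, 3.000000, 3.073171, 3.152273, 3.257669,
             3.180412, 3.196721, 3.189271, 3.244718, 3.228111, 3.361905,
             3.191886, 3.134715, 3.256173, 3.320000)
)

# review_date is a factor, so convert it back to a number for a continuous x-axis
yearly$year <- as.numeric(as.character(yearly$review_date))

trend_plot <- ggplot(yearly, aes(x = year, y = rating)) +
  geom_line() +
  geom_point() +
  labs(title = "Mean Rating by Review Year",
       x = "Review Year",
       y = "Mean Rating")
print(trend_plot)
```

Converting through `as.character()` first matters: calling `as.numeric()` directly on a factor returns the internal level codes (1, 2, 3, ...) rather than the years.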

Visualization summary

Distribution Visualization:

Box plot of ‘rating’ by ‘country_of_bean_origin’:

ggplot(data, aes(x = country_of_bean_origin, y = rating, fill = country_of_bean_origin)) +
  geom_boxplot() +
  labs(title = "Box Plot of Rating by Country of Bean Origin",
       x = "Country of Bean Origin",
       y = "Rating",
       fill = "Country of Bean Origin")
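
The insight for this visualization discusses a histogram of ‘cocoa_percent’. A minimal sketch of such a histogram is given here on a small toy data frame; the values in `toy` are illustrative and not taken from the CSV, and the bin width of 0.05 is an arbitrary choice. With the real data, `toy` would be replaced by `data`.

```r
library(ggplot2)

# Toy stand-in for the real dataset (illustrative values only)
toy <- data.frame(
  cocoa_percent = c(0.55, 0.65, 0.70, 0.70, 0.70, 0.72, 0.72, 0.75, 0.80, 1.00)
)

# Histogram of cocoa percentage; binwidth is an assumption, tune it to the data
hist_plot <- ggplot(toy, aes(x = cocoa_percent)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(title = "Histogram of Cocoa Percentage",
       x = "Cocoa Percentage",
       y = "Number of Bars")
print(hist_plot)
```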

Trends and Correlations Visualization:

Scatter plot for the correlation between ‘cocoa_percent’ and ‘rating’:

ggplot(data, aes(x = cocoa_percent, y = rating, color = country_of_bean_origin)) +
  geom_point() +
  labs(title = "Scatter Plot of Cocoa Percentage vs. Rating",
       x = "Cocoa Percentage",
       y = "Rating",
       color = "Country of Bean Origin")

Categorical variables interacting with continuous variables

To show how a categorical variable interacts with continuous variables through a separate visual channel, the scatter plot below maps ‘cocoa_percent’ to the x-axis, ‘rating’ to the y-axis, and the categories of the ‘ingredients’ column to color. It uses the ‘ggplot2’ package.

# Load the ggplot2 library
library(ggplot2)

# Scatter plot with color representing different categories
ggplot(data, aes(x = cocoa_percent, y = rating, color = ingredients)) +
  geom_point() +
  labs(title = "Scatter Plot of Cocoa Percentage vs. Rating by Ingredients",
       x = "Cocoa Percentage",
       y = "Rating",
       color = "Ingredients") +
  theme_minimal()

Insights for the above tasks

Numeric Summary for Column 1

Insight - The summary statistics show how cocoa percentages are distributed in the sample. Values range from 0.42 to 1.00, and the median (0.70) is close to the mean (0.7164), which suggests a roughly symmetric distribution. The first quartile equals the median (0.70) and the third quartile is 0.74, so about half of the bars fall between 0.70 and 0.74.

Significance - Understanding the central tendency and spread of cocoa percentages helps establish the typical range of cocoa content and identify possible outliers.

Further Question

Are there any particular chocolate bars that have a disproportionately high or low cocoa content?
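
One way to start on this question is the conventional 1.5 × IQR rule, sketched here using the quartiles from the ‘cocoa_percent’ summary printed earlier. This is a rule of thumb rather than a formal test; `flag_outliers` is a hypothetical helper, and `example_values` are illustrative (only 0.42 and 1.00 are the reported minimum and maximum).

```r
# Quartiles taken from the cocoa_percent summary printed earlier
q1 <- 0.70
q3 <- 0.74
iqr <- q3 - q1

# Conventional 1.5 * IQR fences (a rule of thumb, not a formal test)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Hypothetical helper: TRUE for values outside the fences
flag_outliers <- function(x) x < lower_fence | x > upper_fence

# Illustrative values; with the real data this would be data$cocoa_percent
example_values <- c(0.42, 0.66, 0.70, 0.74, 0.85, 1.00)
print(data.frame(cocoa_percent = example_values,
                 outlier = flag_outliers(example_values)))
```

With these quartiles the fences are about 0.64 and 0.80, so both the reported minimum (0.42) and maximum (1.00) would be flagged. On the real data, `data[flag_outliers(data$cocoa_percent), ]` would list the corresponding bars.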

Numeric Summary for Column 2

Insight - The summary statistics give a quick overview of the rating distribution. Ratings range from 1.00 to 4.00, and the mean (3.196) is close to, though slightly below, the median (3.25), which suggests a roughly symmetric distribution. About half of the ratings fall between 3.00 and 3.50.

Significance - Understanding the rating distribution shows how well the chocolate bars in the dataset are received overall, and it describes the central tendency and variability of the ratings.

Further Question

Do any particular chocolate bars have very high or very low ratings?

Categorical Summary for Column 3

Insight - The ‘ingredients’ column contains 22 distinct values, including an empty string that appears 87 times. Each value describes an ingredient combination, and the counts show how often each combination appears in the dataset.

Significance - Understanding the distribution of ingredient combinations shows how varied the chocolate bars are. It helps identify the typical combinations and the less common formulations.

Further Question

Which ingredient composition appears most frequently in the dataset?

Are there any ingredient combinations that are linked to better or worse ratings?
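
Both questions can be approached with `table()` and `aggregate()`. The sketch below runs on a toy data frame whose ratings are made up for illustration; with the real data, `toy` would be replaced by `data`. The names `combo_counts`, `combo_means`, and `n` are illustrative.

```r
# Toy stand-in for the real dataset (ratings are illustrative only)
toy <- data.frame(
  ingredients = c("3- B,S,C", "3- B,S,C", "3- B,S,C", "2- B,S", "2- B,S", "4- B,S,C,L"),
  rating = c(3.50, 3.25, 3.75, 3.00, 3.50, 2.75)
)

# Frequency of each combination, most common first
combo_counts <- sort(table(toy$ingredients), decreasing = TRUE)
most_common <- names(combo_counts)[1]
print(most_common)

# Mean rating per combination
combo_means <- aggregate(rating ~ ingredients, data = toy, FUN = mean)

# Attach the group size so a mean based on very few bars is easy to spot
combo_means$n <- as.vector(combo_counts[as.character(combo_means$ingredients)])
print(combo_means[order(-combo_means$rating), ])
```

Keeping the count next to each mean is a deliberate choice: several combinations in the real table appear only once or twice, so their average ratings say little on their own.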

Distribution Visualization:

Insight - The histogram of cocoa percentage indicates that most chocolate bars in the sample have cocoa percentages concentrated in a narrow range, which is consistent with the quartiles of 0.70 and 0.74 in the numeric summary.

Significance - Understanding the variation in cocoa percentages shows how diverse the chocolate products are. It helps locate typical ranges and possible outliers that could affect taste profiles and customer preferences.

Further Question

Which cocoa percentage is most common in the dataset?

Does the cocoa percentage have any outliers, and if so, how do they affect the overall distribution?

Trends and Correlations Visualization:

Insight - The average rating varies from year to year: in the aggregation output, the yearly mean ranges from 3.00 (2008) to about 3.36 (2017). The scatter plot suggests a moderately positive association between cocoa percentage and rating.

Significance - Patterns in mean ratings offer insight into possible shifts in customer preferences over time. If ratings and cocoa percentage are positively associated, higher cocoa percentages would, in general, correspond to higher ratings.
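
A scatter plot alone does not quantify the strength of an association. A minimal sketch of how it could be checked with `cor()` and `cor.test()` follows; the toy values are made up for illustration, so the resulting coefficients say nothing about the real dataset. With the real data, the two columns would be `data$cocoa_percent` and `data$rating`.

```r
# Toy stand-in for the real dataset (illustrative values only)
toy <- data.frame(
  cocoa_percent = c(0.60, 0.65, 0.70, 0.70, 0.72, 0.75, 0.80, 0.85, 1.00),
  rating = c(3.00, 3.25, 3.50, 3.25, 3.75, 3.50, 3.25, 3.00, 2.00)
)

# Pearson correlation measures linear association
pearson_r <- cor(toy$cocoa_percent, toy$rating)

# Spearman correlation is rank-based, which can suit ratings given in fixed steps
spearman_r <- cor(toy$cocoa_percent, toy$rating, method = "spearman")

# cor.test() adds a confidence interval and p-value for the Pearson coefficient
pearson_test <- cor.test(toy$cocoa_percent, toy$rating)

print(c(pearson = pearson_r, spearman = spearman_r))
print(pearson_test$conf.int)
```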

Further Question

What factors could be causing the average ratings to fluctuate over time?

Are there certain cocoa percentages that are consistently rated higher?
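
One possible approach to the second question is to group cocoa percentages into bands with `cut()` and compare the mean rating per band. The sketch below uses toy values and arbitrary band edges, both of which are assumptions for illustration; `cocoa_band` and `band_means` are illustrative names.

```r
# Toy stand-in for the real dataset (illustrative values only)
toy <- data.frame(
  cocoa_percent = c(0.55, 0.65, 0.70, 0.70, 0.72, 0.75, 0.80, 0.85, 1.00),
  rating = c(2.75, 3.25, 3.50, 3.25, 3.75, 3.50, 3.25, 3.00, 2.00)
)

# Band edges are an arbitrary choice; intervals are closed on the right
toy$cocoa_band <- cut(toy$cocoa_percent,
                      breaks = c(0.4, 0.6, 0.7, 0.8, 1.0),
                      include.lowest = TRUE)

# Mean rating within each cocoa band
band_means <- aggregate(rating ~ cocoa_band, data = toy, FUN = mean)
print(band_means)
```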

These insights and questions can guide further investigation and analysis, leading to a more thorough understanding of the dataset and possibly informing decisions about product development, promotion, and quality improvement in the chocolate industry.
