---
title: "data dive week 2"
output: html_document
---
data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
summary_col1 <- summary(data$cocoa_percent)
print(summary_col1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4200 0.7000 0.7000 0.7164 0.7400 1.0000
summary_col2 <- summary(data$rating)
print(summary_col2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.250 3.196 3.500 4.000
unique_values_col3 <- unique(data$ingredients)
count_values_col3 <- table(data$ingredients)
print(unique_values_col3)
## [1] "3- B,S,C" "4- B,S,C,L" "2- B,S" "4- B,S,C,V"
## [5] "5- B,S,C,V,L" "6-B,S,C,V,L,Sa" "5-B,S,C,V,Sa" ""
## [9] "4- B,S,V,L" "2- B,S*" "1- B" "3- B,S*,C"
## [13] "3- B,S,L" "3- B,S,V" "4- B,S*,C,L" "4- B,S*,C,Sa"
## [17] "3- B,S*,Sa" "4- B,S,C,Sa" "4- B,S*,V,L" "2- B,C"
## [21] "4- B,S*,C,V" "5- B,S,C,L,Sa"
print(count_values_col3)
##
##                          1- B         2- B,C         2- B,S        2- B,S*
##             87              6              1            718             31
##      3- B,S*,C     3- B,S*,Sa       3- B,S,C       3- B,S,L       3- B,S,V
##             12              1            999              8              3
##    4- B,S*,C,L   4- B,S*,C,Sa    4- B,S*,C,V    4- B,S*,V,L     4- B,S,C,L
##              2             20              7              3            286
##    4- B,S,C,Sa     4- B,S,C,V     4- B,S,V,L  5- B,S,C,L,Sa   5- B,S,C,V,L
##              5            141              5              1            184
##   5-B,S,C,V,Sa 6-B,S,C,V,L,Sa
##              6              4
cat("Categorical Summary for ingredients:\n")
## Categorical Summary for ingredients:
# Convert the frequency table itself to a data frame so each value stays paired with its own count
# (unique() and table() return values in different orders, so combining them misaligns the rows)
print(setNames(as.data.frame(count_values_col3), c("Value", "Count")))
##             Value Count
## 1                    87
## 2            1- B     6
## 3          2- B,C     1
## 4          2- B,S   718
## 5         2- B,S*    31
## 6       3- B,S*,C    12
## 7      3- B,S*,Sa     1
## 8        3- B,S,C   999
## 9        3- B,S,L     8
## 10       3- B,S,V     3
## 11    4- B,S*,C,L     2
## 12   4- B,S*,C,Sa    20
## 13    4- B,S*,C,V     7
## 14    4- B,S*,V,L     3
## 15     4- B,S,C,L   286
## 16    4- B,S,C,Sa     5
## 17     4- B,S,C,V   141
## 18     4- B,S,V,L     5
## 19  5- B,S,C,L,Sa     1
## 20   5- B,S,C,V,L   184
## 21   5-B,S,C,V,Sa     6
## 22 6-B,S,C,V,L,Sa     4
Given the data documentation and column summaries, which attributes of highly rated chocolate bars can we identify, and how do those attributes differ across the countries where the beans are grown?
Are there any observable trends in cocoa percentage over the years? Additionally, is there a correlation between cocoa percentage and the rating of chocolate bars?
How do the country of bean origin and the company’s location interact with the distribution of ratings?
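The third question could be approached by averaging ratings for each pairing of bean origin and company location. The sketch below runs on a small invented data frame, not the real dataset; `company_location` is an assumed column name (it does not appear in the code in this report), and every value in `toy` is made up for illustration.

```r
# Toy stand-in for the real dataset; all rows are invented for illustration.
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Ghana", "Ghana", "Peru"),
  company_location       = c("U.S.A.", "France", "U.S.A.", "U.S.A.", "U.S.A."),  # assumed column name
  rating                 = c(3.50, 3.00, 2.75, 3.25, 3.00)
)

# Mean rating for each origin/location pairing
pair_means <- aggregate(rating ~ country_of_bean_origin + company_location,
                        data = toy, FUN = mean)

# Number of bars behind each mean, so thinly populated pairings are easy to spot
pair_counts <- aggregate(rating ~ country_of_bean_origin + company_location,
                         data = toy, FUN = length)
names(pair_counts)[3] <- "n_bars"

# Join the two summaries on the shared grouping columns
pair_summary <- merge(pair_means, pair_counts)
print(pair_summary)
```

Keeping the count next to the mean matters here because many origin/location pairings are likely to contain only a handful of bars, and a mean over one or two bars says little.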
data$review_date <- as.factor(data$review_date)
# Aggregate mean rating and cocoa percentage for each year
agg_data <- aggregate(cbind(rating, cocoa_percent) ~ review_date, data = data, FUN = mean)
print(agg_data)
## review_date rating cocoa_percent
## 1 2006 3.004032 0.7043548
## 2 2007 3.102740 0.7208219
## 3 2008 3.000000 0.7267391
## 4 2009 3.073171 0.7044309
## 5 2010 3.152273 0.7081364
## 6 2011 3.257669 0.7096933
## 7 2012 3.180412 0.7155155
## 8 2013 3.196721 0.7227869
## 9 2014 3.189271 0.7225304
## 10 2015 3.244718 0.7202113
## 11 2016 3.228111 0.7177419
## 12 2017 3.361905 0.7151429
## 13 2018 3.191886 0.7123026
## 14 2019 3.134715 0.7197150
## 15 2020 3.256173 0.7074074
## 16 2021 3.320000 0.7176000
library(ggplot2)
ggplot(data, aes(x = country_of_bean_origin, y = rating, fill = country_of_bean_origin)) +
  geom_boxplot() +
  labs(title = "Box Plot of Rating by Country of Bean Origin",
       x = "Country of Bean Origin",
       y = "Rating",
       fill = "Country of Bean Origin")
ggplot(data, aes(x = cocoa_percent, y = rating, color = country_of_bean_origin)) +
  geom_point() +
  labs(title = "Scatter Plot of Cocoa Percentage vs. Rating",
       x = "Cocoa Percentage",
       y = "Rating",
       color = "Country of Bean Origin")
To show how a categorical variable interacts with continuous variables through different visual channels, let’s create a scatter plot in which color encodes the categories of the ‘ingredients’ column, the x-axis shows ‘cocoa_percent’, and the y-axis shows ‘rating’. We’ll use the ‘ggplot2’ package in R for this visualization.
# Load the ggplot2 library
library(ggplot2)
# Scatter plot with color representing different categories
ggplot(data, aes(x = cocoa_percent, y = rating, color = ingredients)) +
  geom_point() +
  labs(title = "Scatter Plot of Cocoa Percentage vs. Rating by Ingredients",
       x = "Cocoa Percentage",
       y = "Rating",
       color = "Ingredients") +
  theme_minimal()
Insight - The summary statistics for cocoa_percent give a brief picture of how cocoa content is distributed in the sample. The median (0.70) and mean (0.7164) are close, which suggests a roughly symmetric distribution. Values range from 0.42 to 1.00, and the middle half of the bars falls between 0.70 and 0.74.
Significance - Understanding the summary statistics of cocoa percentage is the first step in analysing the central tendency and variability of the dataset. It helps establish the usual range of cocoa percentages and flag possible outliers.
Further Question
Are there any particular chocolate bars that have a disproportionately high or low cocoa content?
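One way to approach this question is Tukey's 1.5 x IQR rule, sketched below on a small invented data frame (the `toy` rows and the `bar` identifier column are hypothetical, not taken from the dataset). Applied to the quartiles printed earlier (0.70 and 0.74), the same rule would put the fences at 0.64 and 0.80, so the observed minimum (0.42) and maximum (1.00) would both be flagged.

```r
# Toy cocoa percentages; invented values, not taken from the dataset.
toy <- data.frame(
  bar           = c("A", "B", "C", "D", "E", "F", "G"),  # hypothetical identifier
  cocoa_percent = c(0.42, 0.70, 0.70, 0.72, 0.74, 0.75, 1.00)
)

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q     <- unname(quantile(toy$cocoa_percent, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
lower <- q[1] - fence
upper <- q[2] + fence

# Rows whose cocoa percentage falls outside the fences
outliers <- toy[toy$cocoa_percent < lower | toy$cocoa_percent > upper, ]
print(outliers)
```

The 1.5 multiplier is a convention rather than a property of the data, so the flagged rows are best read as candidates for a closer look.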
Insight - The summary statistics for rating give a quick overview of the rating distribution. The mean (3.196) and median (3.25) are close, which suggests a roughly symmetric distribution. Ratings range from 1.0 to 4.0, with the middle half between 3.0 and 3.5.
Significance - Understanding the rating distribution is needed to gauge how well the chocolate bars in the dataset are received overall. It shows the central tendency and variability of the ratings.
Further Question
Do any particular chocolate bars have very high or very low ratings?
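A simple starting point is to tabulate how many bars sit at each rating level and then pull out the rows at the two extremes. The sketch below does this on invented data; the `bar` column is a hypothetical identifier and none of the values come from the dataset.

```r
# Toy ratings; invented values for illustration only.
toy <- data.frame(
  bar    = c("A", "B", "C", "D", "E", "F"),  # hypothetical identifier
  rating = c(1.00, 3.00, 3.25, 3.25, 3.50, 4.00)
)

# How many bars received each rating value
rating_counts <- table(toy$rating)
print(rating_counts)

# Bars at the lowest and highest observed rating (ties are all kept)
lowest  <- toy[toy$rating == min(toy$rating), ]
highest <- toy[toy$rating == max(toy$rating), ]
print(lowest)
print(highest)
```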
Insight - The ‘ingredients’ column contains 22 distinct values, each describing an ingredient composition. One of them is an empty string, which accounts for 87 bars and presumably marks missing ingredient information. The counts show how often each composition appears in the dataset.
Significance - Understanding the distribution of ingredient compositions is needed to gauge the diversity of chocolate bars. It helps identify common ingredient combinations and variations in formulation.
Further Question
Which ingredient composition appears most frequently in the dataset?
Are there any ingredient combinations that are linked to better or worse ratings?
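Both questions can be explored with base R, as in the sketch below. It uses an invented `toy` data frame rather than the real data, so the numbers it prints are illustrative only. In the counts printed earlier, the most frequent composition is "3- B,S,C" with 999 bars, followed by "2- B,S" with 718.

```r
# Toy data; ingredient strings follow the dataset's format, ratings are invented.
toy <- data.frame(
  ingredients = c("3- B,S,C", "3- B,S,C", "2- B,S", "3- B,S,C", "2- B,S", "4- B,S,C,L"),
  rating      = c(3.50, 3.25, 3.00, 3.75, 3.50, 2.75)
)

# Most frequent ingredient composition
counts      <- table(toy$ingredients)
most_common <- names(counts)[which.max(counts)]
print(most_common)

# Mean rating per composition, sorted from highest to lowest
mean_by_ingredients <- aggregate(rating ~ ingredients, data = toy, FUN = mean)
mean_by_ingredients <- mean_by_ingredients[order(-mean_by_ingredients$rating), ]
print(mean_by_ingredients)
```

On the real data, compositions with only a few bars (several have counts of 1 to 3) would give unstable means, so it may be worth filtering by count before comparing.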
Insight - The histogram of cocoa percentage indicates that most chocolate bars in the sample fall within a narrow range; the quartiles reported earlier place the middle half between 0.70 and 0.74.
Significance - Understanding the variation in cocoa percentage is needed to gauge the diversity of chocolate products. It helps locate typical ranges and possible outliers that could affect taste profiles and customer preferences.
Further Question
Which cocoa percentage is most prevalent across the dataset?
Does the cocoa percentage contain any outliers, and if so, how do they affect the overall distribution?
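The code for the histogram is not shown in this report, so the following is only a sketch of how such a plot and the most common value might be produced. It assumes ggplot2 is installed and runs on an invented `toy` data frame; the 0.05 bin width is an arbitrary choice.

```r
library(ggplot2)

# Toy cocoa percentages; invented values for illustration only.
toy <- data.frame(
  cocoa_percent = c(0.60, 0.70, 0.70, 0.70, 0.72, 0.75, 0.75, 0.85)
)

# Most frequent cocoa percentage (the mode of the observed values)
cocoa_counts      <- table(toy$cocoa_percent)
most_common_cocoa <- as.numeric(names(cocoa_counts)[which.max(cocoa_counts)])
print(most_common_cocoa)

# Histogram of cocoa percentage; binwidth = 0.05 is an assumption, not the report's setting
cocoa_hist <- ggplot(toy, aes(x = cocoa_percent)) +
  geom_histogram(binwidth = 0.05, fill = "grey60", color = "white") +
  labs(title = "Distribution of Cocoa Percentage",
       x = "Cocoa Percentage",
       y = "Number of Bars") +
  theme_minimal()
print(cocoa_hist)
```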
Insight - The line plot of average rating by year shows variation over time; in the yearly aggregates above, the mean rating ranges from 3.00 (2008) to 3.36 (2017), while mean cocoa percentage stays between roughly 0.70 and 0.73. The scatter plot also suggests a moderately positive association between rating and cocoa content.
Significance - Patterns in mean ratings offer insight into possible shifts in consumer preferences over time. If ratings and cocoa percentage are positively associated, higher cocoa percentages would tend to correspond to higher ratings.
Further Question
What variables could be causing the average ratings to fluctuate over time?
Are there certain percentages of cocoa that are routinely rated higher?
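The line plot mentioned above is not included in this report's code, and the strength of the association is not quantified. The sketch below shows one way to do both, assuming ggplot2 is installed and using an invented `toy` data frame. Because `review_date` is converted to a factor earlier in the report, the plot sets `group = 1` so that `geom_line()` connects the years.

```r
library(ggplot2)

# Toy data; all values are invented for illustration.
toy <- data.frame(
  review_date   = c(2006, 2006, 2007, 2007, 2008, 2008),
  cocoa_percent = c(0.70, 0.75, 0.70, 0.80, 0.65, 0.72),
  rating        = c(3.00, 3.25, 3.50, 2.75, 3.25, 3.50)
)
toy$review_date <- as.factor(toy$review_date)

# Mean rating per year, mirroring the aggregate() call used in the report
yearly <- aggregate(rating ~ review_date, data = toy, FUN = mean)

# Line plot of the yearly means; group = 1 is needed because the x-axis is a factor
trend_plot <- ggplot(yearly, aes(x = review_date, y = rating, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Rating by Review Year",
       x = "Review Year",
       y = "Average Rating")
print(trend_plot)

# Pearson correlation between cocoa percentage and rating
r <- cor(toy$cocoa_percent, toy$rating)
print(r)
```

Running `cor()` on the real columns would give a number to set beside the visual impression from the scatter plot, which is easy to misjudge when many points overlap.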
These insights and questions can guide further investigation and analysis, enabling a more thorough understanding of the dataset and possibly informing decisions about product development, marketing, and quality improvement in the chocolate industry.