data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
set.seed(123)
# Calculate the size of each subsample (50% of the data)
subsample_size <- round(nrow(data) * 0.5)
# Create five random samples with replacement
df_1 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_2 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_3 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_4 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_5 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
head(df_1)
## ref company_manufacturer company_location review_date
## 2463 1916 Wm U.S.A. 2016
## 2511 749 Zotter Austria 2011
## 2227 2012 Stone Grindz U.S.A. 2017
## 526 1772 Chocolate Alchemist-Philly U.S.A. 2016
## 195 1474 Bahen & Co. Australia 2015
## 1842 141 Pierre Marcolini Belgium 2007
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## 2463 Ghana Ghana, 2013, batch 129 0.75
## 2511 Congo Congo 0.65
## 2227 Bolivia Wild Bolivia, batch 260 0.70
## 526 Peru Tumbes, "Zarumilla" 0.90
## 195 Papua New Guinea Papua New Guinea 0.70
## 1842 Madagascar Sambirano, Ambanja, Madagascar 0.72
## ingredients most_memorable_characteristics rating
## 2463 3- B,S,C strong malt, choco pudding 3.75
## 2511 4- B,S*,C,Sa dairy, salt, caramel 3.00
## 2227 3- B,S,C fatty, rubber, off notes 2.50
## 526 2- B,S* sticky, bitter, molasses, tart 2.50
## 195 2- B,S smoke, ham, papaya 3.50
## 1842 5- B,S,C,V,L tangy, floral, spicy, cocoa 4.00
head(df_2)
## ref company_manufacturer company_location review_date
## 1290 2064 Krak Amsterdam 2018
## 837 32 El Rey Venezuela 2006
## 183 1780 Askinosie U.S.A. 2016
## 2513 749 Zotter Austria 2011
## 1658 1896 Mutari U.S.A. 2016
## 1154 1550 hexx U.S.A. 2015
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## 1290 Madagascar Mava Sa Ferme D'ottange 0.70
## 837 Venezuela Carenero Superior, Gran Saman 0.70
## 183 Tanzania Mababa 0.68
## 2513 India Kerala State 0.65
## 1658 Tanzania Kokoa Kamili, batch 1 SRB 0.68
## 1154 Peru Peru 0.70
## ingredients most_memorable_characteristics rating
## 1290 2- B,S roasty, nutty 3.00
## 837 5- B,S,C,V,L gritty, chalky, earthy, sour 2.75
## 183 3- B,S,C creamy, rich, blueberry 3.75
## 2513 4- B,S*,C,Sa creamy, masculine, earthy 3.50
## 1658 2- B,S dry, earthy, atypical 2.75
## 1154 2- B,S* palm, spicy, flour 3.25
head(df_3)
## ref company_manufacturer company_location review_date
## 680 2178 Dandelion U.S.A. 2018
## 1019 899 Friis Holm (Bonnat) Denmark 2012
## 2375 1327 Urzi Italy 2014
## 1412 2692 Luisa Abram Brazil 2021
## 972 899 Fresco U.S.A. 2012
## 1005 2126 Friis Holm Denmark 2018
## country_of_bean_origin specific_bean_origin_or_bar_name
## 680 India Anamalai, 2017 harvest
## 1019 Nicaragua Chuno, double turned, Xoco
## 2375 Venezuela Sur del Lago, Merida
## 1412 Brazil Wild Jurua (limited edition for Caputo's)
## 972 Papua New Guinea Markham Valley, #221, DR, MC
## 1005 Nicaragua La Dalia, Lazy Growers Blend
## cocoa_percent ingredients most_memorable_characteristics rating
## 680 0.70 2- B,S ripe orange, citrus 3.50
## 1019 0.70 3- B,S,C oily, fatty, bold olive 3.25
## 2375 0.65 4- B,S,C,V intense, nutty, mild rubber 3.25
## 1412 0.70 3- B,S,C mild sweet honey, rum, off 3.25
## 972 0.69 2- B,S smoke, fruit, sour, hammy 3.50
## 1005 0.70 3- B,S,C woody, grassy, spicy 3.25
head(df_4)
## ref company_manufacturer company_location review_date
## 1689 1022 Night Owl U.S.A. 2013
## 821 1626 Durci U.S.A. 2015
## 358 1323 Burnt Fork Bend U.S.A. 2014
## 1175 1466 Holy Cacao Israel 2015
## 1582 117 Michel Cluizel France 2007
## 2017 2660 Ruket Italy 2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## 1689 Ecuador Ecuador 0.75
## 821 Peru Maranon, Joya Rara 0.70
## 358 Ecuador Ecuador, Bob Bar 0.60
## 1175 Ecuador Camino Verde P., Balao, Guayas 0.70
## 1582 Venezuela Carenero Superior, Concepcion 0.66
## 2017 Tanzania Kokoa Kamili. Lot 73T 0.72
## ingredients most_memorable_characteristics rating
## 1689 3- B,S,C single note, spicy 3.5
## 821 3- B,S,C sandy, orange, butterscotch 3.5
## 358 3- B,S,C sweet, sour dairy 2.5
## 1175 4- B,S,C,L floral, roasty, coffee 3.5
## 1582 4- B,S,C,V cocoa, orange 3.5
## 2017 2- B,S clean, pure cocoa, cherry, gritty 3.5
head(df_5)
## ref company_manufacturer company_location review_date
## 345 1514 Brazen U.S.A. 2015
## 1567 1494 Mesocacao Honduras 2015
## 234 2246 Benns Singapore 2018
## 372 502 Cacao Atlanta U.S.A. 2010
## 1666 1399 Naive Lithuania 2014
## 382 1391 Cacao de Origen Venezuela 2014
## country_of_bean_origin specific_bean_origin_or_bar_name
## 345 Dominican Republic Elvesia P.
## 1567 El Salvador El Salvador
## 234 Malaysia Sungai Ruan, Koh Farm
## 372 Dominican Republic Dominican Republic w/ nibs
## 1666 Venezuela Chuao, lot 0077
## 382 Venezuela Agua Fria; Sucre region, H. La Trinidad
## cocoa_percent ingredients most_memorable_characteristics rating
## 345 0.80 3- B,S,C bitter, dark berry, fatty 3.25
## 1567 0.70 3- B,S,C dairy, caramel 3.00
## 234 0.72 4- B,S,C,L off aroma, smokey, off note 3.00
## 372 0.75 2- B,S creamy, burnt wood 2.75
## 1666 0.75 4- B,S,C,L coarse, sweet, minty 2.75
## 382 0.75 2- B,S intense, earthy, fuel 2.50
Insight
Examining smaller groups within your data (subsamples) reveals how results might change when drawing different random samples from the entire population.
Significance
Recognizing this variability is vital for confidently applying your findings to the whole population. It helps judge the consistency and trustworthiness of conclusions based on a single sample.
Further Questions
Do patterns hold steady across different subsamples, or do results differ considerably?
How much would conclusions change if we drew different random subsamples?
What factors explain the observed variation between subsamples?
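One way to put a number on this variability is to compute the same statistic in every subsample and look at its spread. The sketch below is illustrative only: `toy` is a made-up stand-in for the chocolate data, and `toy_subsamples` and `sub_means` are hypothetical names, not objects from the analysis above.

```r
# Toy stand-in for the chocolate data; the ratings are made up for illustration.
set.seed(123)
toy <- data.frame(rating = sample(seq(1, 4, by = 0.25), 200, replace = TRUE))

# Draw five 50% subsamples with replacement, mirroring the approach above.
n_sub <- round(nrow(toy) * 0.5)
toy_subsamples <- lapply(1:5, function(i) {
  toy[sample(seq_len(nrow(toy)), n_sub, replace = TRUE), , drop = FALSE]
})

# One mean rating per subsample, then the spread of those means.
sub_means <- sapply(toy_subsamples, function(s) mean(s$rating))
print(round(sub_means, 3))
cat("Range of means:", round(diff(range(sub_means)), 3), "\n")
cat("SD of means:   ", round(sd(sub_means), 3), "\n")
```

A small range and standard deviation relative to the rating scale would suggest that the statistic is stable across subsamples.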
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
subsamples <- list(df_1, df_2, df_3, df_4, df_5)
# Calculate summary statistics for each subsample.
# Base summary() ignores dplyr grouping, so the result is one table per
# subsample (not per country); a group_by() before it would have no effect.
summary_stats <- lapply(subsamples, summary)
for (i in seq_along(summary_stats)) {
  cat("\nSummary Stats for Subsample", i, ":\n")
  print(summary_stats[[i]])
}
##
## Summary Stats for Subsample 1 :
## ref company_manufacturer company_location review_date
## Min. : 15 Length:1265 Length:1265 Min. :2006
## 1st Qu.: 833 Class :character Class :character 1st Qu.:2012
## Median :1466 Mode :character Mode :character Median :2015
## Mean :1460 Mean :2015
## 3rd Qu.:2134 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:1265 Length:1265 Min. :0.4200
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7155
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:1265 Length:1265 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.176
## 3rd Qu.:3.500
## Max. :4.000
##
## Summary Stats for Subsample 2 :
## ref company_manufacturer company_location review_date
## Min. : 15 Length:1265 Length:1265 Min. :2006
## 1st Qu.: 781 Class :character Class :character 1st Qu.:2011
## Median :1407 Mode :character Mode :character Median :2014
## Mean :1392 Mean :2014
## 3rd Qu.:2036 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:1265 Length:1265 Min. :0.5300
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7171
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:1265 Length:1265 Min. :1.00
## Class :character Class :character 1st Qu.:3.00
## Mode :character Mode :character Median :3.25
## Mean :3.18
## 3rd Qu.:3.50
## Max. :4.00
##
## Summary Stats for Subsample 3 :
## ref company_manufacturer company_location review_date
## Min. : 5 Length:1265 Length:1265 Min. :2006
## 1st Qu.: 785 Class :character Class :character 1st Qu.:2011
## Median :1454 Mode :character Mode :character Median :2015
## Mean :1428 Mean :2014
## 3rd Qu.:2084 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:1265 Length:1265 Min. :0.530
## Class :character Class :character 1st Qu.:0.700
## Mode :character Mode :character Median :0.700
## Mean :0.718
## 3rd Qu.:0.750
## Max. :1.000
## ingredients most_memorable_characteristics rating
## Length:1265 Length:1265 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.175
## 3rd Qu.:3.500
## Max. :4.000
##
## Summary Stats for Subsample 4 :
## ref company_manufacturer company_location review_date
## Min. : 15 Length:1265 Length:1265 Min. :2006
## 1st Qu.: 781 Class :character Class :character 1st Qu.:2011
## Median :1403 Mode :character Mode :character Median :2014
## Mean :1391 Mean :2014
## 3rd Qu.:2056 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:1265 Length:1265 Min. :0.4200
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7156
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:1265 Length:1265 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.187
## 3rd Qu.:3.500
## Max. :4.000
##
## Summary Stats for Subsample 5 :
## ref company_manufacturer company_location review_date
## Min. : 15 Length:1265 Length:1265 Min. :2006
## 1st Qu.: 793 Class :character Class :character 1st Qu.:2012
## Median :1450 Mode :character Mode :character Median :2015
## Mean :1429 Mean :2014
## 3rd Qu.:2100 3rd Qu.:2018
## Max. :2712 Max. :2021
## country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## Length:1265 Length:1265 Min. :0.4600
## Class :character Class :character 1st Qu.:0.7000
## Mode :character Mode :character Median :0.7000
## Mean :0.7131
## 3rd Qu.:0.7400
## Max. :1.0000
## ingredients most_memorable_characteristics rating
## Length:1265 Length:1265 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :3.250
## Mean :3.184
## 3rd Qu.:3.500
## Max. :4.000
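Base `summary()` does not use dplyr groups, so the tables above describe each subsample as a whole. If per-country statistics are wanted, one option is `summarise()` after `group_by()`. The sketch below assumes dplyr is installed and runs on a small made-up data frame (`toy`) that borrows the dataset's column names; `by_country` is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for one subsample; the values are made up for illustration.
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Ghana", "Ghana", "Ghana"),
  cocoa_percent = c(0.70, 0.72, 0.75, 0.70, 0.68),
  rating = c(3.50, 3.00, 3.25, 2.75, 3.75)
)

# summarise() respects the grouping and returns one row per country.
by_country <- toy %>%
  group_by(country_of_bean_origin) %>%
  summarise(
    n = n(),
    mean_rating = mean(rating),
    mean_cocoa = mean(cocoa_percent),
    .groups = "drop"
  )
print(by_country)
```

The same pipeline could be passed to `lapply()` over the five subsamples to compare countries across them.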
# Number of Monte Carlo simulations
num_simulations <- 1000
# Monte Carlo simulation for mean ratings
monte_carlo_means <- replicate(num_simulations, {
  random_subsample <- data[sample(seq_len(nrow(data)), subsample_size, replace = TRUE), ]
  mean(random_subsample$rating)
})
# Plot the distribution of Monte Carlo means
hist(monte_carlo_means, main = "Monte Carlo Simulation of Mean Ratings", xlab = "Mean Rating")
# Flag simulated means that fall outside the boxplot whiskers
# (more than 1.5 times the hinge spread beyond the hinges)
anomalies_ratings <- boxplot.stats(monte_carlo_means)$out
cat("Anomalies in Mean Ratings:\n")
## Anomalies in Mean Ratings:
print(anomalies_ratings)
## [1] 3.161265 3.159684 3.161462
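With 1000 draws, a handful of simulated means beyond the boxplot whiskers is expected by chance (under a normal distribution roughly 0.7% of values fall outside them), so a few flagged values need not signal a problem. A complementary summary is the spread of the simulated means themselves. The sketch below uses made-up ratings (`toy_ratings`) rather than the chocolate data, and `sim_means`, `sim_se` and `sim_interval` are hypothetical names.

```r
# Made-up ratings standing in for data$rating.
set.seed(123)
toy_ratings <- sample(seq(1, 4, by = 0.25), 500, replace = TRUE)
n_sub <- round(length(toy_ratings) * 0.5)

# Mean rating of 1000 subsamples drawn with replacement.
sim_means <- replicate(1000, mean(sample(toy_ratings, n_sub, replace = TRUE)))

# Standard deviation of the simulated means and a 95% percentile interval.
sim_se <- sd(sim_means)
sim_interval <- quantile(sim_means, probs = c(0.025, 0.975))
cat("SD of simulated means:", round(sim_se, 4), "\n")
print(round(sim_interval, 3))
```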
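# Look for rows whose rating equals one of the flagged simulation means.
# The flagged values are averages over subsamples (e.g. 3.161265), not
# individual ratings, so this filter is expected to return zero rows.
outliers_df <- data %>%
  filter(rating %in% anomalies_ratings)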
cat("\nRows with Anomalous Ratings in the Original Dataset:\n")
##
## Rows with Anomalous Ratings in the Original Dataset:
print(outliers_df)
## [1] ref company_manufacturer
## [3] company_location review_date
## [5] country_of_bean_origin specific_bean_origin_or_bar_name
## [7] cocoa_percent ingredients
## [9] most_memorable_characteristics rating
## <0 rows> (or 0-length row.names)
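The flagged values are means of simulated subsamples, so no individual rating can equal them and the filter above returns zero rows. To look for unusual ratings in the data itself, one option is to apply the same boxplot rule directly to the rating column. The sketch below uses a small made-up data frame (`toy`); `rating_outliers` and `outlier_rows` are hypothetical names.

```r
# Toy stand-in for the chocolate data; the values are made up for illustration.
toy <- data.frame(
  ref = 1:8,
  rating = c(3.00, 3.25, 3.50, 3.25, 3.00, 3.50, 3.25, 1.00)
)

# Ratings outside the boxplot whiskers of the rating column itself.
rating_outliers <- boxplot.stats(toy$rating)$out

# Rows of the data that carry one of those ratings.
outlier_rows <- toy[toy$rating %in% rating_outliers, ]
print(outlier_rows)
```

Here the matching is safe because the flagged values are taken from the rating column itself rather than computed from it.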
# Identify anomalies in categorical variables (e.g., ingredients)
anomalies_ingredients <- data %>%
  group_by(ingredients) %>%
  summarize(count = n()) %>%
  filter(count <= 10)
cat("\nPotential Anomalies in Ingredients (Occurring <= 10 times):\n")
##
## Potential Anomalies in Ingredients (Occurring <= 10 times):
print(anomalies_ingredients)
## # A tibble: 13 × 2
## ingredients count
## <chr> <int>
## 1 1- B 6
## 2 2- B,C 1
## 3 3- B,S*,Sa 1
## 4 3- B,S,L 8
## 5 3- B,S,V 3
## 6 4- B,S*,C,L 2
## 7 4- B,S*,C,V 7
## 8 4- B,S*,V,L 3
## 9 4- B,S,C,Sa 5
## 10 4- B,S,V,L 5
## 11 5- B,S,C,L,Sa 1
## 12 5-B,S,C,V,Sa 6
## 13 6-B,S,C,V,L,Sa 4
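Some labels in the table are written without a space after the dash (for example `5-B,S,C,V,Sa`), while most have one. Whether the same ingredient list also appears in both spellings in this dataset has not been checked here; if it does, the two spellings would be counted as separate categories. The sketch below shows one way to normalise labels before counting, using made-up labels (`toy_ingredients`) and base R only; `normalized`, `counts` and `rare` are hypothetical names.

```r
# Made-up labels; the third and fourth differ only by a space.
toy_ingredients <- c("3- B,S,C", "3- B,S,C", "5- B,S,C,V,Sa", "5-B,S,C,V,Sa", "2- B,S")

# Remove all whitespace so that spelling variants collapse to one label.
normalized <- gsub("\\s+", "", toy_ingredients)

# Count each label, most frequent first, then keep the rare ones.
counts <- sort(table(normalized), decreasing = TRUE)
rare <- counts[counts <= 1]
print(counts)
print(rare)
```

In the real data the threshold of 10 used above would replace the toy threshold of 1.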
Insights from the summary statistics
All five subsamples span review years 2006 to 2021, so each one covers the full period of the dataset and gives a long-term view of the chocolates reviewed.
The mean cocoa content is stable across subsamples, ranging from about 71.3% to 71.8%, and the median is 70% in every subsample. The typical bar in this dataset is therefore a moderately dark chocolate.
The variety of bean origins, bar names, ingredients, and memorable characteristics highlights the rich diversity of chocolate products within the dataset. This allows for deeper exploration of how these factors influence perception.
In every subsample the first and third quartiles of the rating are 3.0 and 3.5, so at least half of the ratings fall within the “recommended” range of 3.0 to 3.5. This indicates generally positive sentiment towards the chocolates reviewed, with room for differentiation based on specific characteristics.
Significance
The mean rating is nearly identical across the five subsamples (3.175 to 3.187), despite variations in the specific rows drawn. This indicates that the mean rating is stable under resampling and is not driven by a small number of reviews, which provides a solid foundation for further analysis of general trends and patterns. Because every subsample is drawn from the same data, this consistency reflects sampling stability; it does not by itself show that the ratings are free of reviewer bias.
Identifying consistent patterns in ingredients across subsamples is a valuable clue to understanding consumer preferences. By analyzing which ingredients frequently appear with higher ratings, you can identify specific characteristics that resonate with consumers. This information is crucial for chocolate makers as it helps them formulate products that better cater to consumer desires. It also helps consumers make informed choices when selecting chocolate based on their own preferences.
Further Questions
What are the most common ingredients in chocolates across all subsamples?
Are there specific ingredients that consistently result in higher or lower ratings?
Are there any noticeable temporal trends in ratings or ingredient preferences over the years?
Do certain countries consistently produce chocolates with higher ratings?
What are the distinctive features of chocolates from different countries?
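Two of these questions (ratings by country and ratings over time) can be approached with a grouped mean. The sketch below is a starting point only: it uses base R `aggregate()` on a small made-up data frame (`toy`) that borrows the dataset's column names, and `by_country` and `by_year` are hypothetical names.

```r
# Toy stand-in for the chocolate data; the values are made up for illustration.
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Ghana", "Ghana", "Peru", "Ghana"),
  review_date = c(2012, 2012, 2012, 2018, 2018, 2018),
  rating = c(3.00, 3.50, 2.75, 3.25, 3.75, 3.25)
)

# Mean rating per country of bean origin, highest first.
by_country <- aggregate(rating ~ country_of_bean_origin, data = toy, FUN = mean)
by_country <- by_country[order(-by_country$rating), ]
print(by_country)

# Mean rating per review year, to look for a temporal trend.
by_year <- aggregate(rating ~ review_date, data = toy, FUN = mean)
print(by_year)
```

With the real data, countries or years with only a few reviews would give noisy means, so it may be worth reporting the group sizes alongside them.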