data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")
set.seed(123) 

# Calculate the size of each subsample (50% of the data)
subsample_size <- round(nrow(data) * 0.5)

# Create five random samples with replacement
df_1 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_2 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_3 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_4 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
df_5 <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
head(df_1)
##       ref       company_manufacturer company_location review_date
## 2463 1916                         Wm           U.S.A.        2016
## 2511  749                     Zotter          Austria        2011
## 2227 2012               Stone Grindz           U.S.A.        2017
## 526  1772 Chocolate Alchemist-Philly           U.S.A.        2016
## 195  1474                Bahen & Co.        Australia        2015
## 1842  141           Pierre Marcolini          Belgium        2007
##      country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## 2463                  Ghana           Ghana, 2013, batch 129          0.75
## 2511                  Congo                            Congo          0.65
## 2227                Bolivia          Wild Bolivia, batch 260          0.70
## 526                    Peru              Tumbes, "Zarumilla"          0.90
## 195        Papua New Guinea                 Papua New Guinea          0.70
## 1842             Madagascar   Sambirano, Ambanja, Madagascar          0.72
##       ingredients most_memorable_characteristics rating
## 2463     3- B,S,C     strong malt, choco pudding   3.75
## 2511 4- B,S*,C,Sa           dairy, salt, caramel   3.00
## 2227     3- B,S,C       fatty, rubber, off notes   2.50
## 526       2- B,S* sticky, bitter, molasses, tart   2.50
## 195        2- B,S             smoke, ham, papaya   3.50
## 1842 5- B,S,C,V,L    tangy, floral, spicy, cocoa   4.00
head(df_2)
##       ref company_manufacturer company_location review_date
## 1290 2064                 Krak        Amsterdam        2018
## 837    32               El Rey        Venezuela        2006
## 183  1780            Askinosie           U.S.A.        2016
## 2513  749               Zotter          Austria        2011
## 1658 1896               Mutari           U.S.A.        2016
## 1154 1550                 hexx           U.S.A.        2015
##      country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## 1290             Madagascar          Mava Sa Ferme D'ottange          0.70
## 837               Venezuela    Carenero Superior, Gran Saman          0.70
## 183                Tanzania                           Mababa          0.68
## 2513                  India                     Kerala State          0.65
## 1658               Tanzania        Kokoa Kamili, batch 1 SRB          0.68
## 1154                   Peru                             Peru          0.70
##       ingredients most_memorable_characteristics rating
## 1290       2- B,S                  roasty, nutty   3.00
## 837  5- B,S,C,V,L   gritty, chalky, earthy, sour   2.75
## 183      3- B,S,C        creamy, rich, blueberry   3.75
## 2513 4- B,S*,C,Sa      creamy, masculine, earthy   3.50
## 1658       2- B,S          dry, earthy, atypical   2.75
## 1154      2- B,S*             palm, spicy, flour   3.25
head(df_3)
##       ref company_manufacturer company_location review_date
## 680  2178            Dandelion           U.S.A.        2018
## 1019  899  Friis Holm (Bonnat)          Denmark        2012
## 2375 1327                 Urzi            Italy        2014
## 1412 2692          Luisa Abram           Brazil        2021
## 972   899               Fresco           U.S.A.        2012
## 1005 2126           Friis Holm          Denmark        2018
##      country_of_bean_origin          specific_bean_origin_or_bar_name
## 680                   India                    Anamalai, 2017 harvest
## 1019              Nicaragua                Chuno, double turned, Xoco
## 2375              Venezuela                      Sur del Lago, Merida
## 1412                 Brazil Wild Jurua (limited edition for Caputo's)
## 972        Papua New Guinea              Markham Valley, #221, DR, MC
## 1005              Nicaragua              La Dalia, Lazy Growers Blend
##      cocoa_percent ingredients most_memorable_characteristics rating
## 680           0.70      2- B,S            ripe orange, citrus   3.50
## 1019          0.70    3- B,S,C        oily, fatty, bold olive   3.25
## 2375          0.65  4- B,S,C,V    intense, nutty, mild rubber   3.25
## 1412          0.70    3- B,S,C     mild sweet honey, rum, off   3.25
## 972           0.69      2- B,S      smoke, fruit, sour, hammy   3.50
## 1005          0.70    3- B,S,C           woody, grassy, spicy   3.25
head(df_4)
##       ref company_manufacturer company_location review_date
## 1689 1022            Night Owl           U.S.A.        2013
## 821  1626                Durci           U.S.A.        2015
## 358  1323      Burnt Fork Bend           U.S.A.        2014
## 1175 1466           Holy Cacao           Israel        2015
## 1582  117       Michel Cluizel           France        2007
## 2017 2660                Ruket            Italy        2021
##      country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
## 1689                Ecuador                          Ecuador          0.75
## 821                    Peru               Maranon, Joya Rara          0.70
## 358                 Ecuador                 Ecuador, Bob Bar          0.60
## 1175                Ecuador   Camino Verde P., Balao, Guayas          0.70
## 1582              Venezuela    Carenero Superior, Concepcion          0.66
## 2017               Tanzania            Kokoa Kamili. Lot 73T          0.72
##      ingredients    most_memorable_characteristics rating
## 1689    3- B,S,C                single note, spicy    3.5
## 821     3- B,S,C       sandy, orange, butterscotch    3.5
## 358     3- B,S,C                 sweet, sour dairy    2.5
## 1175  4- B,S,C,L            floral, roasty, coffee    3.5
## 1582  4- B,S,C,V                     cocoa, orange    3.5
## 2017      2- B,S clean, pure cocoa, cherry, gritty    3.5
head(df_5)
##       ref company_manufacturer company_location review_date
## 345  1514               Brazen           U.S.A.        2015
## 1567 1494            Mesocacao         Honduras        2015
## 234  2246                Benns        Singapore        2018
## 372   502        Cacao Atlanta           U.S.A.        2010
## 1666 1399                Naive        Lithuania        2014
## 382  1391      Cacao de Origen        Venezuela        2014
##      country_of_bean_origin        specific_bean_origin_or_bar_name
## 345      Dominican Republic                              Elvesia P.
## 1567            El Salvador                             El Salvador
## 234                Malaysia                   Sungai Ruan, Koh Farm
## 372      Dominican Republic              Dominican Republic w/ nibs
## 1666              Venezuela                         Chuao, lot 0077
## 382               Venezuela Agua Fria; Sucre region, H. La Trinidad
##      cocoa_percent ingredients most_memorable_characteristics rating
## 345           0.80    3- B,S,C      bitter, dark berry, fatty   3.25
## 1567          0.70    3- B,S,C                 dairy, caramel   3.00
## 234           0.72  4- B,S,C,L    off aroma, smokey, off note   3.00
## 372           0.75      2- B,S             creamy, burnt wood   2.75
## 1666          0.75  4- B,S,C,L           coarse, sweet, minty   2.75
## 382           0.75      2- B,S          intense, earthy, fuel   2.50

Insight

Examining smaller groups within your data (subsamples) reveals how results might change when drawing different random samples from the entire population.

Significance

Recognizing this variability is vital for confidently applying your findings to the whole population. It helps judge the consistency and trustworthiness of conclusions based on a single sample.

Further Questions

Do patterns hold steady across different subsamples, or do results differ considerably? How much would conclusions change if we picked different random data sets? What factors explain the observed variations between subsamples?
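
One way to put numbers on these questions is to compute the same statistic in every subsample and look at its spread. The sketch below is a minimal illustration, not the analysis used in this report: it builds a small synthetic data frame (`toy`, with a made-up `rating` column) in place of the chocolate data, draws five 50% subsamples with replacement, and reports the mean rating of each along with the range and standard deviation of those means.

```r
# Synthetic stand-in for the real data (assumption: only a numeric `rating` column is needed)
set.seed(123)
toy <- data.frame(rating = sample(seq(1, 4, by = 0.25), 200, replace = TRUE))

subsample_size <- round(nrow(toy) * 0.5)

# Draw five subsamples with replacement
toy_subsamples <- lapply(1:5, function(i) {
  toy[sample(seq_len(nrow(toy)), subsample_size, replace = TRUE), , drop = FALSE]
})

# Same statistic in every subsample
subsample_means <- vapply(toy_subsamples, function(s) mean(s$rating), numeric(1))

# Spread of the statistic across subsamples
mean_range <- diff(range(subsample_means))
mean_sd <- sd(subsample_means)

print(round(subsample_means, 3))
cat("Range of subsample means:", round(mean_range, 3), "\n")
cat("SD of subsample means:", round(mean_sd, 3), "\n")
```

A small range relative to the rating scale would suggest that conclusions about the mean do not depend much on which random subsample was drawn.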

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
subsamples <- list(df_1, df_2, df_3, df_4, df_5)

# Calculate summary statistics for each subsample.
# summary() ignores dplyr grouping, so it is applied to each subsample directly.
summary_stats <- lapply(subsamples, summary)

for (i in seq_along(summary_stats)) {
  cat("\nSummary Stats for Subsample", i, ":\n")
  print(summary_stats[[i]])
}
## 
## Summary Stats for Subsample 1 :
##       ref       company_manufacturer company_location    review_date  
##  Min.   :  15   Length:1265          Length:1265        Min.   :2006  
##  1st Qu.: 833   Class :character     Class :character   1st Qu.:2012  
##  Median :1466   Mode  :character     Mode  :character   Median :2015  
##  Mean   :1460                                           Mean   :2015  
##  3rd Qu.:2134                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:1265            Length:1265                      Min.   :0.4200  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7155  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:1265        Length:1265                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.176  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000  
## 
## Summary Stats for Subsample 2 :
##       ref       company_manufacturer company_location    review_date  
##  Min.   :  15   Length:1265          Length:1265        Min.   :2006  
##  1st Qu.: 781   Class :character     Class :character   1st Qu.:2011  
##  Median :1407   Mode  :character     Mode  :character   Median :2014  
##  Mean   :1392                                           Mean   :2014  
##  3rd Qu.:2036                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:1265            Length:1265                      Min.   :0.5300  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7171  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating    
##  Length:1265        Length:1265                    Min.   :1.00  
##  Class :character   Class :character               1st Qu.:3.00  
##  Mode  :character   Mode  :character               Median :3.25  
##                                                    Mean   :3.18  
##                                                    3rd Qu.:3.50  
##                                                    Max.   :4.00  
## 
## Summary Stats for Subsample 3 :
##       ref       company_manufacturer company_location    review_date  
##  Min.   :   5   Length:1265          Length:1265        Min.   :2006  
##  1st Qu.: 785   Class :character     Class :character   1st Qu.:2011  
##  Median :1454   Mode  :character     Mode  :character   Median :2015  
##  Mean   :1428                                           Mean   :2014  
##  3rd Qu.:2084                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent  
##  Length:1265            Length:1265                      Min.   :0.530  
##  Class :character       Class :character                 1st Qu.:0.700  
##  Mode  :character       Mode  :character                 Median :0.700  
##                                                          Mean   :0.718  
##                                                          3rd Qu.:0.750  
##                                                          Max.   :1.000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:1265        Length:1265                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.175  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000  
## 
## Summary Stats for Subsample 4 :
##       ref       company_manufacturer company_location    review_date  
##  Min.   :  15   Length:1265          Length:1265        Min.   :2006  
##  1st Qu.: 781   Class :character     Class :character   1st Qu.:2011  
##  Median :1403   Mode  :character     Mode  :character   Median :2014  
##  Mean   :1391                                           Mean   :2014  
##  3rd Qu.:2056                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:1265            Length:1265                      Min.   :0.4200  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7156  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:1265        Length:1265                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.187  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000  
## 
## Summary Stats for Subsample 5 :
##       ref       company_manufacturer company_location    review_date  
##  Min.   :  15   Length:1265          Length:1265        Min.   :2006  
##  1st Qu.: 793   Class :character     Class :character   1st Qu.:2012  
##  Median :1450   Mode  :character     Mode  :character   Median :2015  
##  Mean   :1429                                           Mean   :2014  
##  3rd Qu.:2100                                           3rd Qu.:2018  
##  Max.   :2712                                           Max.   :2021  
##  country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent   
##  Length:1265            Length:1265                      Min.   :0.4600  
##  Class :character       Class :character                 1st Qu.:0.7000  
##  Mode  :character       Mode  :character                 Median :0.7000  
##                                                          Mean   :0.7131  
##                                                          3rd Qu.:0.7400  
##                                                          Max.   :1.0000  
##  ingredients        most_memorable_characteristics     rating     
##  Length:1265        Length:1265                    Min.   :1.000  
##  Class :character   Class :character               1st Qu.:3.000  
##  Mode  :character   Mode  :character               Median :3.250  
##                                                    Mean   :3.184  
##                                                    3rd Qu.:3.500  
##                                                    Max.   :4.000
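
`summary()` does not take a grouping into account, so each table above summarises a whole subsample rather than individual countries. If a per-country breakdown is what is wanted, something like the following could work. It is a sketch on a made-up data frame (`toy`, reusing the column names `country_of_bean_origin` and `rating` from the dataset) and uses base R's `aggregate()` so that it runs without extra packages; with dplyr, `group_by()` followed by `summarise()` is the usual equivalent.

```r
# Made-up rows for illustration; only the column names come from the dataset
toy <- data.frame(
  country_of_bean_origin = c("Peru", "Peru", "Ghana", "Ghana", "Ghana", "India"),
  rating = c(3.00, 3.50, 2.75, 3.25, 3.00, 3.50)
)

# Mean rating per country
country_means <- aggregate(rating ~ country_of_bean_origin, data = toy, FUN = mean)
names(country_means)[2] <- "mean_rating"

# Number of bars per country, so small groups can be read with caution
country_counts <- aggregate(rating ~ country_of_bean_origin, data = toy, FUN = length)
names(country_counts)[2] <- "n"

country_summary <- merge(country_means, country_counts, by = "country_of_bean_origin")
print(country_summary)
```

Keeping the group size `n` next to each mean helps avoid over-reading countries that appear only a handful of times.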
# Number of Monte Carlo simulations
num_simulations <- 1000

# Monte Carlo simulation for mean ratings
monte_carlo_means <- replicate(num_simulations, {
  random_subsample <- data[sample(1:nrow(data), subsample_size, replace = TRUE), ]
  mean(random_subsample$rating)
})

# Plot the distribution of Monte Carlo means
hist(monte_carlo_means, main = "Monte Carlo Simulation of Mean Ratings", xlab = "Mean Rating")

# Flag simulated means outside the boxplot whiskers (outlying subsample means, not individual ratings)
anomalies_ratings <- boxplot.stats(monte_carlo_means)$out

cat("Anomalies in Mean Ratings:\n")
## Anomalies in Mean Ratings:
print(anomalies_ratings)
## [1] 3.161265 3.159684 3.161462
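
The three flagged values are means of resampled data sets that fall outside the boxplot whiskers of the simulated distribution, so they describe sampling variability rather than individual chocolate bars. A common way to summarise such a distribution is with its standard deviation and a percentile interval. The sketch below assumes a synthetic `ratings` vector in place of `data$rating` and resamples at full size, which is the usual bootstrap convention; the simulation in this report uses 50% subsamples, which gives a wider spread.

```r
set.seed(123)
# Synthetic ratings (assumption); replace with data$rating for the real analysis
ratings <- sample(seq(1, 4, by = 0.25), 500, replace = TRUE)

num_simulations <- 1000

# Mean of each resample drawn with replacement
boot_means <- replicate(num_simulations, {
  mean(sample(ratings, length(ratings), replace = TRUE))
})

# Spread of the resampled means estimates the standard error of the mean
boot_se <- sd(boot_means)

# 95% percentile interval for the mean rating
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

cat("Bootstrap standard error:", round(boot_se, 4), "\n")
print(round(boot_ci, 3))
```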
# Compare individual ratings with the flagged subsample means
# (no matches are expected, since a mean rarely equals a single rating)
outliers_df <- data %>%
  filter(rating %in% anomalies_ratings)

cat("\nRows with Anomalous Ratings in the Original Dataset:\n")
## 
## Rows with Anomalous Ratings in the Original Dataset:
print(outliers_df)
##  [1] ref                              company_manufacturer            
##  [3] company_location                 review_date                     
##  [5] country_of_bean_origin           specific_bean_origin_or_bar_name
##  [7] cocoa_percent                    ingredients                     
##  [9] most_memorable_characteristics   rating                          
## <0 rows> (or 0-length row.names)
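
Because the flagged values are averages over many bars, matching them against the `rating` column cannot locate unusual bars. To look for unusual ratings among the bars themselves, the boxplot rule could be applied to the rating column directly. A minimal sketch, using a made-up `toy` data frame rather than the real data:

```r
# Made-up data: most ratings near 3, with two unusually low values
toy <- data.frame(
  bar = paste0("bar_", 1:10),
  rating = c(3.00, 3.25, 3.50, 3.25, 3.00, 3.50, 3.25, 1.00, 3.75, 1.50)
)

# Values more than 1.5 * IQR beyond the lower or upper hinge
rating_outliers <- boxplot.stats(toy$rating)$out

# Rows of the data whose rating is one of those outlying values
outlier_rows <- toy[toy$rating %in% rating_outliers, ]
print(outlier_rows)
```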
# Identify anomalies in categorical variables (e.g., ingredients)
anomalies_ingredients <- data %>%
  group_by(ingredients) %>%
  summarize(count = n()) %>%
  filter(count <= 10)  

cat("\nPotential Anomalies in Ingredients (Occurring <= 10 times):\n")
## 
## Potential Anomalies in Ingredients (Occurring <= 10 times):
print(anomalies_ingredients)
## # A tibble: 13 × 2
##    ingredients    count
##    <chr>          <int>
##  1 1- B               6
##  2 2- B,C             1
##  3 3- B,S*,Sa         1
##  4 3- B,S,L           8
##  5 3- B,S,V           3
##  6 4- B,S*,C,L        2
##  7 4- B,S*,C,V        7
##  8 4- B,S*,V,L        3
##  9 4- B,S,C,Sa        5
## 10 4- B,S,V,L         5
## 11 5- B,S,C,L,Sa      1
## 12 5-B,S,C,V,Sa       6
## 13 6-B,S,C,V,L,Sa     4
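
Some of the rare labels in this table differ from the others in spacing: `5-B,S,C,V,Sa` and `6-B,S,C,V,L,Sa` have no space after the hyphen, whereas the remaining labels do. Part of the apparent rarity may therefore come from inconsistent formatting. Whether that is the case in the full data is an assumption to verify; the sketch below only shows how labels could be normalised before counting, using a made-up vector.

```r
# Made-up labels mixing the two spellings seen in the table above
labels <- c("3- B,S,C", "3-B,S,C", "2- B,S", "5-B,S,C,V,Sa", "5- B,S,C,V,Sa", "2- B,S")

# Force exactly one space after the leading "<count>-"
normalised <- sub("^([0-9]+)-[[:space:]]*", "\\1- ", trimws(labels))

# Counts before and after normalisation
print(table(labels))
print(table(normalised))
```

If counts change after normalisation on the real data, the `count <= 10` filter would be better applied to the cleaned labels.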

Insights from the summary statistics

All five subsamples span review years 2006 to 2021, so each covers the same 16-year review period.

The mean cocoa content is stable across subsamples, ranging from 71.3% to 71.8%, and the median is 70% in every subsample. The reviewed bars are therefore concentrated on moderately dark chocolate.

The variety of bean origins, bar names, ingredients, and memorable characteristics highlights the rich diversity of chocolate products within the dataset. This allows for deeper exploration of how these factors influence perception.

In all five subsamples the first and third quartiles of the rating are 3.0 and 3.5, so at least half of the ratings fall within the “recommended” range of 3.0 to 3.5. This indicates generally positive sentiment towards the chocolates reviewed, with room for differentiation based on specific characteristics.

Significance

The mean rating varies only between 3.175 and 3.187 across the five subsamples, which indicates that the estimate of the average rating is stable under resampling. Because every subsample is drawn from the same dataset, this stability reflects low sampling variability rather than the absence of reviewer bias. It does, however, provide a solid foundation for further analysis: conclusions about general trends and patterns are unlikely to be driven by one particular random draw.

Identifying consistent patterns in ingredients across subsamples is a valuable clue to understanding consumer preferences. By analyzing which ingredients frequently appear with higher ratings, you can identify specific characteristics that resonate with consumers. This information is crucial for chocolate makers as it helps them formulate products that better cater to consumer desires. It also helps consumers make informed choices when selecting chocolate based on their own preferences.
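
One way to relate ingredients to ratings is to split each ingredient label into its component codes and compare the mean rating of bars with and without a given code. The sketch below assumes the label format seen in the output (`<count>- B,S,C`), a made-up `toy` data frame, and a hypothetical helper `has_ingredient()`. A difference in means here is descriptive only and does not show that an ingredient causes a higher or lower rating.

```r
# Made-up rows; the label format follows the `ingredients` column shown earlier
toy <- data.frame(
  ingredients = c("2- B,S", "3- B,S,C", "4- B,S,C,V", "3- B,S,C", "2- B,S", "5- B,S,C,V,L"),
  rating = c(3.50, 3.25, 2.75, 3.50, 3.00, 2.50)
)

# Drop the leading "<count>-" and split the rest into ingredient codes
codes <- strsplit(sub("^[0-9]+-[[:space:]]*", "", toy$ingredients), ",", fixed = TRUE)

# Hypothetical helper: TRUE for rows whose label contains the given code
has_ingredient <- function(code) {
  vapply(codes, function(x) code %in% trimws(x), logical(1))
}

# Mean rating with and without the code "V"
with_v <- mean(toy$rating[has_ingredient("V")])
without_v <- mean(toy$rating[!has_ingredient("V")])

cat("Mean rating with V:", with_v, "\n")
cat("Mean rating without V:", without_v, "\n")
```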

Further Questions

What are the most common ingredients in chocolates across all subsamples?

Are there specific ingredients that consistently result in higher or lower ratings?

Are there any noticeable temporal trends in ratings or ingredient preferences over the years?

Do certain countries consistently produce chocolates with higher ratings?

What are the distinctive features of chocolates from different countries?
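
The question about temporal trends could be approached by averaging ratings within each review year and then checking whether the yearly means drift. A minimal sketch, assuming a made-up `toy` data frame that reuses the `review_date` and `rating` column names; the linear fit is only a rough first look, not a formal test.

```r
# Made-up rows for illustration; only the column names come from the dataset
toy <- data.frame(
  review_date = c(2006, 2006, 2010, 2010, 2015, 2015, 2021, 2021),
  rating = c(2.75, 3.00, 3.00, 3.25, 3.25, 3.25, 3.50, 3.25)
)

# Mean rating per review year
yearly <- aggregate(rating ~ review_date, data = toy, FUN = mean)
names(yearly)[2] <- "mean_rating"
print(yearly)

# Slope of a straight-line fit: estimated change in rating per year
trend <- lm(rating ~ review_date, data = toy)
slope <- unname(coef(trend)["review_date"])
cat("Estimated change in rating per year:", round(slope, 4), "\n")
```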