library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(htmltools)
packageVersion("htmltools")
## [1] '0.5.8.1'
library(readxl)
library(tibble)
library(ggplot2)
# Load in data
coffee <- read_csv("~/Downloads/coffee_analysis.csv")
## Rows: 2095 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, roaster, roast, loc_country, origin_1, origin_2, review_date...
## dbl (2): 100g_USD, rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show column types
spec(coffee) # double = numeric
## cols(
## name = col_character(),
## roaster = col_character(),
## roast = col_character(),
## loc_country = col_character(),
## origin_1 = col_character(),
## origin_2 = col_character(),
## `100g_USD` = col_double(),
## rating = col_double(),
## review_date = col_character(),
## desc_1 = col_character(),
## desc_2 = col_character(),
## desc_3 = col_character()
## )
Converting the categorical variables (roast, loc_country, origin_1) to factors will make statistical analysis and visualization far easier. Factors are simply another way to represent categorical variables and will make my life easier.
# Convert key categorical columns to factors
coffee <- coffee %>%
mutate(
roast = as.factor(roast),
loc_country = as.factor(loc_country),
origin_1 = as.factor(origin_1)
)
# Using group_by to group roast by price&rating
# The 'roast_summary' DataFrame will provide insights into the price, rating, and number of reviews (n) each type of coffee roast has received
roast_summary <- coffee %>%
group_by(roast) %>%
summarise(
avg_price = mean(`100g_USD`, na.rm = TRUE),
avg_rating = mean(rating, na.rm = TRUE),
n = n()
) %>%
arrange(n)
roast_summary
## # A tibble: 6 × 4
## roast avg_price avg_rating n
## <fct> <dbl> <dbl> <int>
## 1 Dark 4.46 88.2 5
## 2 <NA> 42.2 93.5 15
## 3 Medium-Dark 9.51 91.9 39
## 4 Medium 7.06 92.3 259
## 5 Light 11.4 93.5 287
## 6 Medium-Light 8.99 93.2 1490
The ‘n’ represents the total number of observations by each row. For Dark roast, n=5, which is the lowest number of observations by row, meaning it will have the lowest probability of being selected if we were to randomly select a row from the dataset. In this dataset, dark roast coffee has the average lowest price, rating, and number of reviews. These insights tell us that dark roasts are far less popular and enjoyed than other coffee roasts.
# Finding the roast with the lowest probability to be chosen
roast_summary %>% slice(1)
## # A tibble: 1 × 4
## roast avg_price avg_rating n
## <fct> <dbl> <dbl> <int>
## 1 Dark 4.46 88.2 5
The ‘roast_summary’ below shows the difference in random probability to be chosen if we were to select one row from the dataset, highlighting that ‘Dark’ roast is far less likely to be selected than the other coffee roasts. ‘Medium-Light’ coffee roasts have by far the greatest chance of being selected, as it has received far and away the most reviews in this dataset.
roast_summary %>%
mutate(probability = n / sum(n))
## # A tibble: 6 × 5
## roast avg_price avg_rating n probability
## <fct> <dbl> <dbl> <int> <dbl>
## 1 Dark 4.46 88.2 5 0.00239
## 2 <NA> 42.2 93.5 15 0.00716
## 3 Medium-Dark 9.51 91.9 39 0.0186
## 4 Medium 7.06 92.3 259 0.124
## 5 Light 11.4 93.5 287 0.137
## 6 Medium-Light 8.99 93.2 1490 0.711
ggplot(roast_summary, aes(x = roast, y = avg_rating)) +
geom_col() +
labs(title = "Average Rating by Roast Level",
x = "Roast",
y = "Average Rating")
Conclusion - Medium-Light coffee has received far and away the most reviews in this coffee reviews dataset. Additionally, Light to Medium-Light coffee are the two most popular coffee roasts, being the only blends that averaged ratings over 93. Testable Hypothesis - Coffee roasts with fewer reviews are not because consumers do not like dark roast coffee, but because it is more rare to find.
I am aiming to analyze which regions/countries grow the best coffee in the world. I will gain insights through finding with places of origin receive the highest ratings.
origin_summary <- coffee %>%
group_by(origin_1) %>%
summarise(
avg_rating = mean(rating, na.rm = TRUE),
count = n() # Create a count column to filter by
) %>%
filter(count >= 10) %>% # Only keep groups with at least 10 occurrences
ungroup()
origin_summary
## # A tibble: 40 × 3
## origin_1 avg_rating count
## <fct> <dbl> <int>
## 1 Acatenango Growing Region 93.5 14
## 2 Aceh Province 93.8 12
## 3 Agaro Gera 93.7 10
## 4 Alishan 93.5 10
## 5 Antigua 90.5 11
## 6 Antigua Growing Region 91.9 25
## 7 Bench-Maji Zone 94.1 11
## 8 Boquete Growing Region 94.6 39
## 9 Brazil 91.8 27
## 10 Caicedonia 94.3 11
## # ℹ 30 more rows
I chose to filter out places of origin that received less than 10 reviews to eliminate higher variance due to small sample sizes.
origin_summary %>% slice(1)
## # A tibble: 1 × 3
## origin_1 avg_rating count
## <fct> <dbl> <int>
## 1 Acatenango Growing Region 93.5 14
# Count replaces 'n' in the mutate function since I created a count column to filter by to only keep places of origin with at least 10 reviews in the 'group_by' function; 'count=n()'
origin_summary %>%
mutate(probability = count / sum(count))
## # A tibble: 40 × 4
## origin_1 avg_rating count probability
## <fct> <dbl> <int> <dbl>
## 1 Acatenango Growing Region 93.5 14 0.0139
## 2 Aceh Province 93.8 12 0.0119
## 3 Agaro Gera 93.7 10 0.00990
## 4 Alishan 93.5 10 0.00990
## 5 Antigua 90.5 11 0.0109
## 6 Antigua Growing Region 91.9 25 0.0248
## 7 Bench-Maji Zone 94.1 11 0.0109
## 8 Boquete Growing Region 94.6 39 0.0386
## 9 Brazil 91.8 27 0.0267
## 10 Caicedonia 94.3 11 0.0109
## # ℹ 30 more rows
ggplot(origin_summary %>% filter(count >= 15),
aes(x = reorder(origin_1, avg_rating), y = avg_rating)) +
geom_col() +
coord_flip() +
labs(title = "Average Rating by Origin (min 15 coffees)",
x = "Origin",
y = "Average Rating")
Conclusion - Certain countries of origin that stand out, receiving excellent reviews for coffee grown in their homeland include Costa Rica, Columbia, Guatemala, and El Salvador. These countries located in Central and South America all received an average rating of at least 92/100 for coffee grown in their country, highlighting Central and South America as one of the best coffee growing regions in the world!
Testable Hypothesis - There is a weak correlation between the average rating being better as a result of more reviews for a coffee.
I am creating categorical price ranges based on coffee ratings.
# Using cut() to break coffee prices per 100g_USD into 4 price bins
coffee <- coffee %>%
mutate(
price_bucket = cut(
`100g_USD`,
breaks = c(-Inf, 4.99, 9.99, 19.99, Inf),
labels = c("$0-5", "$5-10", "$10-20", "$20<="),
right = TRUE
)
)
# Grouped summary by custom buckets
price_summary <- coffee %>%
group_by(price_bucket) %>%
summarise(
avg_rating = mean(rating, na.rm = TRUE),
avg_price = mean(`100g_USD`, na.rm = TRUE),
n = n()
) %>%
arrange(n)
price_summary
## # A tibble: 4 × 4
## price_bucket avg_rating avg_price n
## <fct> <dbl> <dbl> <int>
## 1 $20<= 94.3 37.2 177
## 2 $10-20 93.5 14.0 258
## 3 $0-5 92.4 4.12 586
## 4 $5-10 93.2 6.43 1074
price_summary %>% slice(1)
## # A tibble: 1 × 4
## price_bucket avg_rating avg_price n
## <fct> <dbl> <dbl> <int>
## 1 $20<= 94.3 37.2 177
The correlation between higher average coffee price and higher average rating makes a ton of sense, as well as the indication that higher price is a result of rarer coffee grounds, increasing their value as well.
# Visualizing average coffee ratings by price per 100g USD
ggplot(price_summary,
aes(x = price_bucket, y = avg_rating, fill = price_bucket)) +
geom_col() +
scale_fill_manual(values = c(
"$0-5" = "lightgray",
"$5-10" = "skyblue",
"$10-20" = "lightgreen",
"$20<=" = "green"
)) +
labs(
title = "Average Rating by Custom Price Bucket",
x = "Price Bucket (USD per 100g)",
y = "Average Rating",
fill = "Price Bucket"
)
Conclusion - Extremely high prices for coffee worldwide that are priced 20 USD or higher per 100g are rather rare, as it seems that most coffee is reasonably priced, with 1,074 out of 2,095 total coffees in the dataset being priced in between 5-10 USD per 100g. With that said, the higher priced coffees do yield the highest average rating of a 94.3/100, indicating a that the highest priced coffees return great value and quality. Testable Hypothesis - Coffees that fall under the highest price bucket average ratings that are significantly higher than coffees priced in the lowest price bucket, reflecting the incredible quality, value, and rarity that create the steep price of certain coffees.
Building data frame with all combinations for two categorical variables.
# Find all existing combinations for 'roast' and 'loc_country'
# Cleaning data by creating coffee_clean to remove NA roasts
coffee_clean <- coffee %>%
filter(!is.na(roast))
# Count 'roast' X 'loc_country
# Adding additional columns for visualization purposes
roast_country_counts <- coffee_clean %>%
count(loc_country, roast) %>%
arrange(loc_country, roast) %>% # Order by country, then roast
group_by(loc_country) %>%
mutate(
dominant_roast = roast[which.max(n)],
is_dominant = roast == dominant_roast
) %>%
ungroup()
roast_country_counts
## # A tibble: 41 × 5
## loc_country roast n dominant_roast is_dominant
## <fct> <fct> <int> <fct> <lgl>
## 1 Australia Medium-Light 2 Medium-Light TRUE
## 2 Belgium Medium 1 Medium TRUE
## 3 Canada Light 1 Medium-Light FALSE
## 4 Canada Medium 7 Medium-Light FALSE
## 5 Canada Medium-Light 23 Medium-Light TRUE
## 6 China Light 2 Light TRUE
## 7 China Medium-Light 2 Light FALSE
## 8 England Medium-Light 2 Medium-Light TRUE
## 9 Guatemala Light 5 Medium-Light FALSE
## 10 Guatemala Medium 5 Medium-Light FALSE
## # ℹ 31 more rows
all_combos <- expand.grid(
roast = unique(coffee_clean$roast),
loc_country = unique(coffee_clean$loc_country)
)
combo_check <- all_combos %>%
left_join(roast_country_counts, by = c("roast", "loc_country")) %>%
replace_na(list(n = 0))
combo_check
## roast loc_country n dominant_roast is_dominant
## 1 Medium-Light Hong Kong 16 Medium-Light TRUE
## 2 Medium Hong Kong 2 Medium-Light FALSE
## 3 Light Hong Kong 2 Medium-Light FALSE
## 4 Medium-Dark Hong Kong 0 <NA> NA
## 5 Dark Hong Kong 0 <NA> NA
## 6 Medium-Light United States 995 Medium-Light TRUE
## 7 Medium United States 151 Medium-Light FALSE
## 8 Light United States 166 Medium-Light FALSE
## 9 Medium-Dark United States 15 Medium-Light FALSE
## 10 Dark United States 4 Medium-Light FALSE
## 11 Medium-Light Canada 23 Medium-Light TRUE
## 12 Medium Canada 7 Medium-Light FALSE
## 13 Light Canada 1 Medium-Light FALSE
## 14 Medium-Dark Canada 0 <NA> NA
## 15 Dark Canada 0 <NA> NA
## 16 Medium-Light Taiwan 355 Medium-Light TRUE
## 17 Medium Taiwan 73 Medium-Light FALSE
## 18 Light Taiwan 99 Medium-Light FALSE
## 19 Medium-Dark Taiwan 21 Medium-Light FALSE
## 20 Dark Taiwan 1 Medium-Light FALSE
## 21 Medium-Light Hawai'i 66 Medium-Light TRUE
## 22 Medium Hawai'i 18 Medium-Light FALSE
## 23 Light Hawai'i 2 Medium-Light FALSE
## 24 Medium-Dark Hawai'i 1 Medium-Light FALSE
## 25 Dark Hawai'i 0 <NA> NA
## 26 Medium-Light Australia 2 Medium-Light TRUE
## 27 Medium Australia 0 <NA> NA
## 28 Light Australia 0 <NA> NA
## 29 Medium-Dark Australia 0 <NA> NA
## 30 Dark Australia 0 <NA> NA
## 31 Medium-Light Uganda 0 <NA> NA
## 32 Medium Uganda 1 Medium TRUE
## 33 Light Uganda 0 <NA> NA
## 34 Medium-Dark Uganda 0 <NA> NA
## 35 Dark Uganda 0 <NA> NA
## 36 Medium-Light Mexico 1 Medium FALSE
## 37 Medium Mexico 1 Medium TRUE
## 38 Light Mexico 0 <NA> NA
## 39 Medium-Dark Mexico 0 <NA> NA
## 40 Dark Mexico 0 <NA> NA
## 41 Medium-Light Belgium 0 <NA> NA
## 42 Medium Belgium 1 Medium TRUE
## 43 Light Belgium 0 <NA> NA
## 44 Medium-Dark Belgium 0 <NA> NA
## 45 Dark Belgium 0 <NA> NA
## 46 Medium-Light United States And Floyd 1 Medium-Light TRUE
## 47 Medium United States And Floyd 0 <NA> NA
## 48 Light United States And Floyd 0 <NA> NA
## 49 Medium-Dark United States And Floyd 0 <NA> NA
## 50 Dark United States And Floyd 0 <NA> NA
## 51 Medium-Light Guatemala 18 Medium-Light TRUE
## 52 Medium Guatemala 5 Medium-Light FALSE
## 53 Light Guatemala 5 Medium-Light FALSE
## 54 Medium-Dark Guatemala 1 Medium-Light FALSE
## 55 Dark Guatemala 0 <NA> NA
## 56 Medium-Light Japan 2 Light FALSE
## 57 Medium Japan 0 <NA> NA
## 58 Light Japan 9 Light TRUE
## 59 Medium-Dark Japan 1 Light FALSE
## 60 Dark Japan 0 <NA> NA
## 61 Medium-Light Peru 2 Medium-Light TRUE
## 62 Medium Peru 0 <NA> NA
## 63 Light Peru 0 <NA> NA
## 64 Medium-Dark Peru 0 <NA> NA
## 65 Dark Peru 0 <NA> NA
## 66 Medium-Light Honduras 1 Medium-Light TRUE
## 67 Medium Honduras 0 <NA> NA
## 68 Light Honduras 0 <NA> NA
## 69 Medium-Dark Honduras 0 <NA> NA
## 70 Dark Honduras 0 <NA> NA
## 71 Medium-Light England 2 Medium-Light TRUE
## 72 Medium England 0 <NA> NA
## 73 Light England 0 <NA> NA
## 74 Medium-Dark England 0 <NA> NA
## 75 Dark England 0 <NA> NA
## 76 Medium-Light China 2 Light FALSE
## 77 Medium China 0 <NA> NA
## 78 Light China 2 Light TRUE
## 79 Medium-Dark China 0 <NA> NA
## 80 Dark China 0 <NA> NA
## 81 Medium-Light Kenya 1 Medium-Light TRUE
## 82 Medium Kenya 0 <NA> NA
## 83 Light Kenya 0 <NA> NA
## 84 Medium-Dark Kenya 0 <NA> NA
## 85 Dark Kenya 0 <NA> NA
## 86 Medium-Light Malaysia 3 Medium-Light TRUE
## 87 Medium Malaysia 0 <NA> NA
## 88 Light Malaysia 0 <NA> NA
## 89 Medium-Dark Malaysia 0 <NA> NA
## 90 Dark Malaysia 0 <NA> NA
## 91 Medium-Light New Taiwan 0 <NA> NA
## 92 Medium New Taiwan 0 <NA> NA
## 93 Light New Taiwan 1 Light TRUE
## 94 Medium-Dark New Taiwan 0 <NA> NA
## 95 Dark New Taiwan 0 <NA> NA
missing_combos <- combo_check %>% filter(n == 0)
missing_combos
## roast loc_country n dominant_roast is_dominant
## 1 Medium-Dark Hong Kong 0 <NA> NA
## 2 Dark Hong Kong 0 <NA> NA
## 3 Medium-Dark Canada 0 <NA> NA
## 4 Dark Canada 0 <NA> NA
## 5 Dark Hawai'i 0 <NA> NA
## 6 Medium Australia 0 <NA> NA
## 7 Light Australia 0 <NA> NA
## 8 Medium-Dark Australia 0 <NA> NA
## 9 Dark Australia 0 <NA> NA
## 10 Medium-Light Uganda 0 <NA> NA
## 11 Light Uganda 0 <NA> NA
## 12 Medium-Dark Uganda 0 <NA> NA
## 13 Dark Uganda 0 <NA> NA
## 14 Light Mexico 0 <NA> NA
## 15 Medium-Dark Mexico 0 <NA> NA
## 16 Dark Mexico 0 <NA> NA
## 17 Medium-Light Belgium 0 <NA> NA
## 18 Light Belgium 0 <NA> NA
## 19 Medium-Dark Belgium 0 <NA> NA
## 20 Dark Belgium 0 <NA> NA
## 21 Medium United States And Floyd 0 <NA> NA
## 22 Light United States And Floyd 0 <NA> NA
## 23 Medium-Dark United States And Floyd 0 <NA> NA
## 24 Dark United States And Floyd 0 <NA> NA
## 25 Dark Guatemala 0 <NA> NA
## 26 Medium Japan 0 <NA> NA
## 27 Dark Japan 0 <NA> NA
## 28 Medium Peru 0 <NA> NA
## 29 Light Peru 0 <NA> NA
## 30 Medium-Dark Peru 0 <NA> NA
## 31 Dark Peru 0 <NA> NA
## 32 Medium Honduras 0 <NA> NA
## 33 Light Honduras 0 <NA> NA
## 34 Medium-Dark Honduras 0 <NA> NA
## 35 Dark Honduras 0 <NA> NA
## 36 Medium England 0 <NA> NA
## 37 Light England 0 <NA> NA
## 38 Medium-Dark England 0 <NA> NA
## 39 Dark England 0 <NA> NA
## 40 Medium China 0 <NA> NA
## 41 Medium-Dark China 0 <NA> NA
## 42 Dark China 0 <NA> NA
## 43 Medium Kenya 0 <NA> NA
## 44 Light Kenya 0 <NA> NA
## 45 Medium-Dark Kenya 0 <NA> NA
## 46 Dark Kenya 0 <NA> NA
## 47 Medium Malaysia 0 <NA> NA
## 48 Light Malaysia 0 <NA> NA
## 49 Medium-Dark Malaysia 0 <NA> NA
## 50 Dark Malaysia 0 <NA> NA
## 51 Medium-Light New Taiwan 0 <NA> NA
## 52 Medium New Taiwan 0 <NA> NA
## 53 Medium-Dark New Taiwan 0 <NA> NA
## 54 Dark New Taiwan 0 <NA> NA
Analyzing most and least common Roast X Country combinations
roast_country_counts %>% slice_max(n, n = 5)
## # A tibble: 5 × 5
## loc_country roast n dominant_roast is_dominant
## <fct> <fct> <int> <fct> <lgl>
## 1 United States Medium-Light 995 Medium-Light TRUE
## 2 Taiwan Medium-Light 355 Medium-Light TRUE
## 3 United States Light 166 Medium-Light FALSE
## 4 United States Medium 151 Medium-Light FALSE
## 5 Taiwan Light 99 Medium-Light FALSE
roast_country_counts %>% slice_min(n, n = 5)
## # A tibble: 13 × 5
## loc_country roast n dominant_roast is_dominant
## <fct> <fct> <int> <fct> <lgl>
## 1 Belgium Medium 1 Medium TRUE
## 2 Canada Light 1 Medium-Light FALSE
## 3 Guatemala Medium-Dark 1 Medium-Light FALSE
## 4 Hawai'i Medium-Dark 1 Medium-Light FALSE
## 5 Honduras Medium-Light 1 Medium-Light TRUE
## 6 Japan Medium-Dark 1 Light FALSE
## 7 Kenya Medium-Light 1 Medium-Light TRUE
## 8 Mexico Medium 1 Medium TRUE
## 9 Mexico Medium-Light 1 Medium FALSE
## 10 New Taiwan Light 1 Light TRUE
## 11 Taiwan Dark 1 Medium-Light FALSE
## 12 Uganda Medium 1 Medium TRUE
## 13 United States And Floyd Medium-Light 1 Medium-Light TRUE
It is evident that the United States and Taiwan have received the most amount of coffee reviews, and that medium-light roast is far and away the most popular roast that consumers drink. The number of observations that combine the United States with medium-light roasts and Taiwan with medium-light roasts makes up roughly half of the total observations in the entire data set!
Coffee X Loc_Country Visualization
ggplot(roast_country_counts,
aes(x = roast,
y = loc_country,
size = n,
color = dominant_roast)) +
geom_point(alpha = 0.7) +
labs(
title = "Roast × Roaster Country",
x = "Roast",
y = "Roaster Country",
size = "Count",
color = "Most Common Roast"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
This visualizations displays type of roast and roaster country
combinations, highlighting the most common roast for each country.
Medium-dark and dark roasts both did not make the visual as the most
common roast for any country at all, while medium-light roast dominated
this chart, as light blue dots are prevalent. The United States and
Taiwan are the leading coffee roasters in the world based on this
dataset, as they dominate the number of observations in this data set
combining roast and location country, which is represented by their dot
sizes, with those two countries having the largest dots. While the
United States and Taiwan dominate across all roasts in roasting coffee,
many other countries specialize in one or two roast levels, with
medium-light receiving worldwide popularity.