Load in Coffee Data & necessary libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(htmltools)
packageVersion("htmltools")
## [1] '0.5.8.1'
library(readxl)
library(tibble)
library(ggplot2)
library(tidytext)
coffee <- read_csv("~/Downloads/coffee_analysis.csv")
## Rows: 2095 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, roaster, roast, loc_country, origin_1, origin_2, review_date...
## dbl (2): 100g_USD, rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
coffee_clean <- read_csv("coffee_clean.csv")
## Rows: 2080 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, roaster, roast, loc_country, origin_1, origin_2, review_date...
## dbl (2): 100g_USD, rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
‘100g_USD’ - price per 100g of coffee beans in US dollars. Without having the documentation behind what ‘100g_USD’ actually means, there is a decent amount of ambiguity resulting from this column name. While ‘100g_USD’ is representative of coffee price in my data set, it is important to understand that it is not the price that consumers are buying one cup of coffee from roasters at, but rather how much roasters buy 100g of coffee beans from suppliers. They likely chose to encode the data the way they did to accurately represent the price of coffee beans in the data set, rather than observers seeing a column such as ‘price’ and interpreting the value as the price that customers pay for coffee at a coffee shop.
‘loc_country’ - location of the coffee roaster. This is another variable that is not fully clear without reading the documentation of the data set. Without documentation, I could interpret ‘loc_country’ as location country where beans are grown, where they are shipped from, or where they are roasted and sold. ‘loc_country’ represents the country/region where the beans are roasted, but without the proper documentation, there would be a level of ambiguity. They likely chose to encode the location of the roaster as they did due to how they chose to encode other variables, as ‘origin_1’ and ‘origin_2’ represent the initial origins of the coffee beans and ‘roaster’ identifies the name of the roaster place so using context clues by looking at other columns, it could be inferred that ‘loc_country’ likely represents the location country of the roaster.
‘desc_2’ and ‘desc_3’ are the two columns from the ‘coffee_reviews’ data set that are still somewhat unclear after reading the documentation because they are undocumented. ‘desc_1’ is documented as the text of the review, and while ‘desc_2’ and ‘desc_3’ are assumed that they are also text of the review, it is unclear why there are 3 separate review text columns when it seems that a singular column would be adequate. Perhaps, each text description column is descriptive of a different element about coffee, yet reading through the columns, I do not deem a clear theme among the text in the different text review columns. As a result, these two elements are still somewhat unclear even after reading the ‘coffee_reviews’ documentation.
coffee_clean <- coffee_clean %>%
mutate(
desc_2_len = nchar(as.character(desc_2)),
desc_3_len = nchar(as.character(desc_3))
)
ggplot(coffee_clean, aes(x = desc_2_len, y = desc_3_len)) +
geom_point(alpha = 0.6, color = "blue") +
labs(
title = "desc_2 vs desc_3 Text Lengths",
x = "Length of desc_2",
y = "Length of desc_3"
) +
theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Visual 1 Analysis: It appears that the text length of
‘desc_2’ is much more substantive in length than the reviews found in
‘desc_3’. This may be be a strength of both text review columns in a
different way. If I am looking for a quick insight that is comparable
among various coffees, ‘desc_3’ may be more effective in providing key
look up words often found in coffee reviews, that could be used to
distinguish between different types of coffee. In turn, if I wanted to
learn more about a very specific type of coffee, ‘desc_3’ reviews seem
far more likely to provide a vey in depth analysis with great substance
and detail.
# Clean and tokenize desc_2
desc2_words <- coffee_clean %>%
select(desc_2) %>%
mutate(desc_2 = as.character(desc_2)) %>%
unnest_tokens(word, desc_2) %>%
filter(!word %in% stop_words$word) %>% # remove common English stop words
count(word, sort = TRUE) %>%
slice_head(n = 15)
# Plot
ggplot(desc2_words, aes(x = reorder(word, n), y = n)) +
geom_col(fill = "blue") +
coord_flip() +
labs(
title = "Top Words Appearing in desc_2",
x = "Word",
y = "Frequency"
) +
theme_minimal()
Visual 2 Analysis: I created this word appearance
visualization which brings awareness to the potential risk/downside of
‘desc_2’ that I mentioned above regarding using this column to compare
coffees. While a detailed review is great when wanting to gain
information on a specific coffee, it will be hard to compare coffees
based on reviews from ‘desc_2’ because it will be highly difficult to
retrieve key words to compare and contrast coffees. As seen in the
visual above, the the most frequent words are not actual coffee
descriptors such as the kind of roast, flavor pallette, or origin, but
rather very general words such as ‘coffee’, ‘produced’, and ‘beans’.
‘desc_3’ may be slightly easier to retrieve certain key words that could
be gathered to compare different coffees but neither of these coffee
text reviews will make it easy to compare coffees based on the length of
each individual text review.
Categorical Columns: ‘roast’ & ‘loc_country’
# Check explicit missing rows for 'roast' & 'loc_country'
coffee[is.na(coffee$roast), ]
## # A tibble: 15 × 12
## name roaster roast loc_country origin_1 origin_2 `100g_USD` rating
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Cold Brew Barefo… <NA> United Sta… Guatema… Not Dis… 0.12 90
## 2 Esmeralda Esta… Differ… <NA> England Boquete… Western… 111. 94
## 3 Esmeralda Esta… Differ… <NA> England Boquete… Western… 111. 94
## 4 Asfaw Maru Eth… Collag… <NA> United Sta… Yirgach… Souther… 1.32 93
## 5 Ethiopia Nano … Bonfir… <NA> United Sta… Jimma Z… Oromia … 1.34 94
## 6 Brazil Ipanema… GK Cof… <NA> Taiwan Ipanema Brazil 4.27 94
## 7 Esmeralda Esta… Differ… <NA> England Boquete… Western… 100 94
## 8 Brazil Ipanema… GK Cof… <NA> Taiwan Ipanema Brazil 4 94
## 9 Esmeralda Esta… Differ… <NA> England Boquete… Western… 111. 94
## 10 Ethiopia Guji … Taster… <NA> Taiwan Guji Zo… Oromia … 3.17 93
## 11 Yemen Lot 106 Port o… <NA> United Sta… Al Haym… Sana’A … 39.7 96
## 12 Yemen Al Wadi Port o… <NA> United Sta… Al Wadi… Sana’A … 28.2 94
## 13 Ethiopia Guji … Taster… <NA> Taiwan Guji Zo… Oromia … 3.17 93
## 14 Taster’s Blend… Taster… <NA> Taiwan Colombia Guatema… 3.17 92
## 15 Esmeralda Esta… Differ… <NA> England Boquete… Western… 111. 94
## # ℹ 4 more variables: review_date <chr>, desc_1 <chr>, desc_2 <chr>,
## # desc_3 <chr>
coffee[is.na(coffee$loc_country), ]
## # A tibble: 0 × 12
## # ℹ 12 variables: name <chr>, roaster <chr>, roast <chr>, loc_country <chr>,
## # origin_1 <chr>, origin_2 <chr>, 100g_USD <dbl>, rating <dbl>,
## # review_date <chr>, desc_1 <chr>, desc_2 <chr>, desc_3 <chr>
There are 15 explicit missing rows in ‘roast’ as the value that is returned is ‘NA’ while there are no explicit missing rows in ‘loc_country’. In the ‘coffee_clean’ dataframe, I filtered out the ‘NA’ values in the ‘roast’ column.
There are no implicit missing rows possible for ‘loc_country’ as there are no explicit missing rows to begin with. Additionally, in ‘roast’ there are no implicit missing rows despite 15 explicit missing rows, as the rest of the values appear such as ‘name’ and ‘roaster’ despite ‘NA’ values appearing in the ‘roast’ column. Despite the explicit missing rows, it is good news that there is data is not missing from the entire row. ## 3. Empty Groups
coffee %>%
count(roast, drop = FALSE) # drop = FALSE keeps empty factor levels
## # A tibble: 6 × 3
## roast drop n
## <chr> <lgl> <int>
## 1 Dark FALSE 5
## 2 Light FALSE 287
## 3 Medium FALSE 259
## 4 Medium-Dark FALSE 39
## 5 Medium-Light FALSE 1490
## 6 <NA> FALSE 15
coffee %>%
count(loc_country, drop = FALSE) # drop = FALSE keeps empty factor levels
## # A tibble: 19 × 3
## loc_country drop n
## <chr> <lgl> <int>
## 1 Australia FALSE 2
## 2 Belgium FALSE 1
## 3 Canada FALSE 31
## 4 China FALSE 4
## 5 England FALSE 7
## 6 Guatemala FALSE 29
## 7 Hawai'i FALSE 87
## 8 Honduras FALSE 1
## 9 Hong Kong FALSE 20
## 10 Japan FALSE 12
## 11 Kenya FALSE 1
## 12 Malaysia FALSE 3
## 13 Mexico FALSE 2
## 14 New Taiwan FALSE 1
## 15 Peru FALSE 2
## 16 Taiwan FALSE 554
## 17 Uganda FALSE 1
## 18 United States FALSE 1336
## 19 United States And Floyd FALSE 1
There are no empty groups evident for either of these categorical variables. Despite a very low amount of reviews for dark coffee roasts (n = 5) and many countries that only appear once in the data set, to qualify as an empty group, n = 0 would neet to be met, which it is not for these two categorical columns, as all of these rows have at least one observation where n = 1.
Find outlier(s) for ‘100g_USD’
# Computing outlier thresholds by finding upper and lower bounds
price_stats <- coffee_clean %>%
summarise(
Q1 = quantile(`100g_USD`, 0.25, na.rm = TRUE),
Q3 = quantile(`100g_USD`, 0.75, na.rm = TRUE)
) %>%
mutate(
IQR = Q3 - Q1,
lower_bound = Q1 - 1.5 * IQR,
upper_bound = Q3 + 1.5 * IQR
)
price_stats
## # A tibble: 1 × 5
## Q1 Q3 IQR lower_bound upper_bound
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.93 8.77 3.84 -0.83 14.5
# Identify outliers in dataset
coffee_clean <- coffee_clean %>%
mutate(
is_outlier = `100g_USD` < price_stats$lower_bound |
`100g_USD` > price_stats$upper_bound
)
library(ggplot2)
ggplot(coffee_clean, aes(x = "", y = `100g_USD`, color = is_outlier)) +
geom_boxplot(outlier.shape = NA) + # hide default outliers
geom_jitter(width = 0.1, alpha = 0.6, size = 2) +
scale_color_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick")) +
labs(
title = "Boxplot of 100g_USD with Outliers Highlighted",
x = "",
y = "Price (100g USD)",
color = "Outlier?"
) +
theme_minimal()
Outlier Identification Method: The continuous column I
focused on to identify outliers is ‘100g_USD’. From working with this
data set for a few weeks now, I observed a very wide range of coffee
prices per 100g in USD, which is why I wanted to investigate outliers.
As we discussed in class, defining outliers can be subjective but I
chose to define outliers as price observations that occur above the
upper bound resulting from inner quartile range calculations. The price
range among coffees is truly as outstanding as it is remarkable, but I
could define outliers as the coffees priced over 50 USD which could be
roughly 15 of the 2000 plus observations, although I chose to stick to
the statistical definition of outliers most commonly used, which results
in many more outliers, although the upper bound falls at 14.53 USD, so
technically, any price above the upper bound is considered an outlier,
as I defined here.