Load in Coffee Data & necessary libraries

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(htmltools)
packageVersion("htmltools")
## [1] '0.5.8.1'
library(readxl)
library(tibble)
library(ggplot2)
library(tidytext)

coffee <- read_csv("~/Downloads/coffee_analysis.csv")
## Rows: 2095 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, roaster, roast, loc_country, origin_1, origin_2, review_date...
## dbl  (2): 100g_USD, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
coffee_clean <- read_csv("coffee_clean.csv")
## Rows: 2080 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, roaster, roast, loc_country, origin_1, origin_2, review_date...
## dbl  (2): 100g_USD, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Unclear Columns

  1. ‘100g_USD’ - price per 100g of coffee beans in US dollars. Without having the documentation behind what ‘100g_USD’ actually means, there is a decent amount of ambiguity resulting from this column name. While ‘100g_USD’ is representative of coffee price in my data set, it is important to understand that it is not the price that consumers are buying one cup of coffee from roasters at, but rather how much roasters buy 100g of coffee beans from suppliers. They likely chose to encode the data the way they did to accurately represent the price of coffee beans in the data set, rather than observers seeing a column such as ‘price’ and interpreting the value as the price that customers pay for coffee at a coffee shop.

  2. ‘loc_country’ - location of the coffee roaster. This is another variable that is not fully clear without reading the documentation of the data set. Without documentation, I could interpret ‘loc_country’ as location country where beans are grown, where they are shipped from, or where they are roasted and sold. ‘loc_country’ represents the country/region where the beans are roasted, but without the proper documentation, there would be a level of ambiguity. They likely chose to encode the location of the roaster as they did due to how they chose to encode other variables, as ‘origin_1’ and ‘origin_2’ represent the initial origins of the coffee beans and ‘roaster’ identifies the name of the roaster place so using context clues by looking at other columns, it could be inferred that ‘loc_country’ likely represents the location country of the roaster.

2. Unclear Element

‘desc_2’ and ‘desc_3’ are the two columns from the ‘coffee_reviews’ data set that are still somewhat unclear after reading the documentation because they are undocumented. ‘desc_1’ is documented as the text of the review, and while ‘desc_2’ and ‘desc_3’ are assumed that they are also text of the review, it is unclear why there are 3 separate review text columns when it seems that a singular column would be adequate. Perhaps, each text description column is descriptive of a different element about coffee, yet reading through the columns, I do not deem a clear theme among the text in the different text review columns. As a result, these two elements are still somewhat unclear even after reading the ‘coffee_reviews’ documentation.

3. Build Visualizations

Visual 1: Comparing texts lengths between ‘desc_2’ and ‘desc_3’

coffee_clean <- coffee_clean %>%
  mutate(
    desc_2_len = nchar(as.character(desc_2)),
    desc_3_len = nchar(as.character(desc_3))
  )

ggplot(coffee_clean, aes(x = desc_2_len, y = desc_3_len)) +
  geom_point(alpha = 0.6, color = "blue") +
  labs(
    title = "desc_2 vs desc_3 Text Lengths",
    x = "Length of desc_2",
    y = "Length of desc_3"
  ) +
  theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Visual 1 Analysis: It appears that the text length of ‘desc_2’ is much more substantive in length than the reviews found in ‘desc_3’. This may be be a strength of both text review columns in a different way. If I am looking for a quick insight that is comparable among various coffees, ‘desc_3’ may be more effective in providing key look up words often found in coffee reviews, that could be used to distinguish between different types of coffee. In turn, if I wanted to learn more about a very specific type of coffee, ‘desc_3’ reviews seem far more likely to provide a vey in depth analysis with great substance and detail.

Visual 2: Gain insight into most frequent word appearances in ‘desc_2’

# Clean and tokenize desc_2
desc2_words <- coffee_clean %>%
  select(desc_2) %>%
  mutate(desc_2 = as.character(desc_2)) %>%
  unnest_tokens(word, desc_2) %>%
  filter(!word %in% stop_words$word) %>%   # remove common English stop words
  count(word, sort = TRUE) %>%
  slice_head(n = 15)                       

# Plot
ggplot(desc2_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(
    title = "Top Words Appearing in desc_2",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()

Visual 2 Analysis: I created this word appearance visualization which brings awareness to the potential risk/downside of ‘desc_2’ that I mentioned above regarding using this column to compare coffees. While a detailed review is great when wanting to gain information on a specific coffee, it will be hard to compare coffees based on reviews from ‘desc_2’ because it will be highly difficult to retrieve key words to compare and contrast coffees. As seen in the visual above, the the most frequent words are not actual coffee descriptors such as the kind of roast, flavor pallette, or origin, but rather very general words such as ‘coffee’, ‘produced’, and ‘beans’. ‘desc_3’ may be slightly easier to retrieve certain key words that could be gathered to compare different coffees but neither of these coffee text reviews will make it easy to compare coffees based on the length of each individual text review.

4. Categorical Column Analysis

Categorical Columns: ‘roast’ & ‘loc_country’

1. Explict Missing Rows

# Check explicit missing rows for 'roast' & 'loc_country'
coffee[is.na(coffee$roast), ]
## # A tibble: 15 × 12
##    name            roaster roast loc_country origin_1 origin_2 `100g_USD` rating
##    <chr>           <chr>   <chr> <chr>       <chr>    <chr>         <dbl>  <dbl>
##  1 Cold Brew       Barefo… <NA>  United Sta… Guatema… Not Dis…       0.12     90
##  2 Esmeralda Esta… Differ… <NA>  England     Boquete… Western…     111.       94
##  3 Esmeralda Esta… Differ… <NA>  England     Boquete… Western…     111.       94
##  4 Asfaw Maru Eth… Collag… <NA>  United Sta… Yirgach… Souther…       1.32     93
##  5 Ethiopia Nano … Bonfir… <NA>  United Sta… Jimma Z… Oromia …       1.34     94
##  6 Brazil Ipanema… GK Cof… <NA>  Taiwan      Ipanema  Brazil         4.27     94
##  7 Esmeralda Esta… Differ… <NA>  England     Boquete… Western…     100        94
##  8 Brazil Ipanema… GK Cof… <NA>  Taiwan      Ipanema  Brazil         4        94
##  9 Esmeralda Esta… Differ… <NA>  England     Boquete… Western…     111.       94
## 10 Ethiopia Guji … Taster… <NA>  Taiwan      Guji Zo… Oromia …       3.17     93
## 11 Yemen Lot 106   Port o… <NA>  United Sta… Al Haym… Sana’A …      39.7      96
## 12 Yemen Al Wadi   Port o… <NA>  United Sta… Al Wadi… Sana’A …      28.2      94
## 13 Ethiopia Guji … Taster… <NA>  Taiwan      Guji Zo… Oromia …       3.17     93
## 14 Taster’s Blend… Taster… <NA>  Taiwan      Colombia Guatema…       3.17     92
## 15 Esmeralda Esta… Differ… <NA>  England     Boquete… Western…     111.       94
## # ℹ 4 more variables: review_date <chr>, desc_1 <chr>, desc_2 <chr>,
## #   desc_3 <chr>
coffee[is.na(coffee$loc_country), ]
## # A tibble: 0 × 12
## # ℹ 12 variables: name <chr>, roaster <chr>, roast <chr>, loc_country <chr>,
## #   origin_1 <chr>, origin_2 <chr>, 100g_USD <dbl>, rating <dbl>,
## #   review_date <chr>, desc_1 <chr>, desc_2 <chr>, desc_3 <chr>

There are 15 explicit missing rows in ‘roast’ as the value that is returned is ‘NA’ while there are no explicit missing rows in ‘loc_country’. In the ‘coffee_clean’ dataframe, I filtered out the ‘NA’ values in the ‘roast’ column.

2. Implicit Missing Rows

There are no implicit missing rows possible for ‘loc_country’ as there are no explicit missing rows to begin with. Additionally, in ‘roast’ there are no implicit missing rows despite 15 explicit missing rows, as the rest of the values appear such as ‘name’ and ‘roaster’ despite ‘NA’ values appearing in the ‘roast’ column. Despite the explicit missing rows, it is good news that there is data is not missing from the entire row. ## 3. Empty Groups

coffee %>%
  count(roast, drop = FALSE)   # drop = FALSE keeps empty factor levels
## # A tibble: 6 × 3
##   roast        drop      n
##   <chr>        <lgl> <int>
## 1 Dark         FALSE     5
## 2 Light        FALSE   287
## 3 Medium       FALSE   259
## 4 Medium-Dark  FALSE    39
## 5 Medium-Light FALSE  1490
## 6 <NA>         FALSE    15
coffee %>%
  count(loc_country, drop = FALSE)   # drop = FALSE keeps empty factor levels
## # A tibble: 19 × 3
##    loc_country             drop      n
##    <chr>                   <lgl> <int>
##  1 Australia               FALSE     2
##  2 Belgium                 FALSE     1
##  3 Canada                  FALSE    31
##  4 China                   FALSE     4
##  5 England                 FALSE     7
##  6 Guatemala               FALSE    29
##  7 Hawai'i                 FALSE    87
##  8 Honduras                FALSE     1
##  9 Hong Kong               FALSE    20
## 10 Japan                   FALSE    12
## 11 Kenya                   FALSE     1
## 12 Malaysia                FALSE     3
## 13 Mexico                  FALSE     2
## 14 New Taiwan              FALSE     1
## 15 Peru                    FALSE     2
## 16 Taiwan                  FALSE   554
## 17 Uganda                  FALSE     1
## 18 United States           FALSE  1336
## 19 United States And Floyd FALSE     1

There are no empty groups evident for either of these categorical variables. Despite a very low amount of reviews for dark coffee roasts (n = 5) and many countries that only appear once in the data set, to qualify as an empty group, n = 0 would neet to be met, which it is not for these two categorical columns, as all of these rows have at least one observation where n = 1.

5. Continuous Column Outlier

Find outlier(s) for ‘100g_USD’

# Computing outlier thresholds by finding upper and lower bounds
price_stats <- coffee_clean %>%
  summarise(
    Q1 = quantile(`100g_USD`, 0.25, na.rm = TRUE),
    Q3 = quantile(`100g_USD`, 0.75, na.rm = TRUE)
  ) %>%
  mutate(
    IQR = Q3 - Q1,
    lower_bound = Q1 - 1.5 * IQR,
    upper_bound = Q3 + 1.5 * IQR
  )

price_stats
## # A tibble: 1 × 5
##      Q1    Q3   IQR lower_bound upper_bound
##   <dbl> <dbl> <dbl>       <dbl>       <dbl>
## 1  4.93  8.77  3.84       -0.83        14.5
# Identify outliers in dataset
coffee_clean <- coffee_clean %>%
  mutate(
    is_outlier = `100g_USD` < price_stats$lower_bound |
                 `100g_USD` > price_stats$upper_bound
  )
library(ggplot2)

ggplot(coffee_clean, aes(x = "", y = `100g_USD`, color = is_outlier)) +
  geom_boxplot(outlier.shape = NA) +   # hide default outliers
  geom_jitter(width = 0.1, alpha = 0.6, size = 2) +
  scale_color_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick")) +
  labs(
    title = "Boxplot of 100g_USD with Outliers Highlighted",
    x = "",
    y = "Price (100g USD)",
    color = "Outlier?"
  ) +
  theme_minimal()

Outlier Identification Method: The continuous column I focused on to identify outliers is ‘100g_USD’. From working with this data set for a few weeks now, I observed a very wide range of coffee prices per 100g in USD, which is why I wanted to investigate outliers. As we discussed in class, defining outliers can be subjective but I chose to define outliers as price observations that occur above the upper bound resulting from inner quartile range calculations. The price range among coffees is truly as outstanding as it is remarkable, but I could define outliers as the coffees priced over 50 USD which could be roughly 15 of the 2000 plus observations, although I chose to stick to the statistical definition of outliers most commonly used, which results in many more outliers, although the upper bound falls at 14.53 USD, so technically, any price above the upper bound is considered an outlier, as I defined here.