coffee

Author

Angel Alexandria Porter

#Data are available for both Arabica and Robusta coffee beans from multiple countries, with professional ratings for individual attributes assigned on a 0 to 10 scale. The dataset includes scores for attributes such as fragrance (aroma), flavor, aftertaste, acidity, body, balance, and sweetness.

#This dataset focuses on characteristics of coffee samples and how these features relate to overall quality. Alongside the attribute scores there is an overall (total) score, which reflects the perceived quality of each coffee. Additional variables, such as origin and processing method, describe where the coffee is produced and how it is prepared.

#In this analysis, I focus on the relationship between flavor and aroma, with flavor on the x-axis and aroma on the y-axis. I explore whether higher flavor scores are associated with higher aroma scores, which would indicate a positive relationship between these two sensory attributes and show whether improvements in flavor tend to coincide with improvements in aroma.

#I created a scatterplot with a fitted linear trend line, using flavor as the independent variable (x-axis) and aroma as the dependent variable (y-axis).

#Source: https://corgis-edu.github.io/corgis/csv/coffee/

library(ggplot2) 
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.5     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#Load the data

coffee <- read.csv("coffee.csv")
# Alternative for interactive sessions only: pick the file from a dialog.
# coffee <- read.csv(file.choose())

#Open the data set in a viewer tab (only useful in an interactive session).

if (interactive()) View(coffee)

#Preview the first few rows

head(coffee)

  Location.Country               Location.Region Location.Altitude.Min
1    United States                          kona                     0
2           Brazil sul de minas - carmo de minas                    12
3           Brazil sul de minas - carmo de minas                    12
4         Ethiopia                        sidamo                     0
5         Ethiopia                        sidamo                     0
6    United States                          kona                     0
  Location.Altitude.Max Location.Altitude.Average Year
1                     0                         0 2010
2                    12                        12 2010
3                    12                        12 2010
4                     0                         0 2010
5                     0                         0 2010
6                     0                         0 2010
                        Data.Owner Data.Type.Species Data.Type.Variety
1 kona pacific farmers cooperative           Arabica               nan
2         jacques pereira carneiro           Arabica    Yellow Bourbon
3         jacques pereira carneiro           Arabica    Yellow Bourbon
4      ethiopia commodity exchange           Arabica               nan
5      ethiopia commodity exchange           Arabica               nan
6 kona pacific farmers cooperative           Arabica               nan
  Data.Type.Processing.method Data.Production.Number.of.bags
1                         nan                             25
2                         nan                            300
3                         nan                            300
4                         nan                            360
5                         nan                            300
6                         nan                             12
  Data.Production.Bag.weight Data.Scores.Aroma Data.Scores.Flavor
1                    45.3592              8.25               8.42
2                    60.0000              8.17               7.92
3                    60.0000              8.42               7.92
4                     6.0000              7.67               8.00
5                     6.0000              7.58               7.83
6                    60.0000              7.50               7.92
  Data.Scores.Aftertaste Data.Scores.Acidity Data.Scores.Body
1                   8.08                7.75             7.67
2                   7.92                7.75             8.33
3                   8.00                7.75             7.92
4                   7.83                8.00             7.92
5                   7.58                8.00             7.83
6                   7.42                7.67             7.83
  Data.Scores.Balance Data.Scores.Uniformity Data.Scores.Sweetness
1                7.83                     10                    10
2                8.00                     10                    10
3                8.00                     10                    10
4                7.83                     10                    10
5                7.50                     10                    10
6                7.58                     10                    10
  Data.Scores.Moisture Data.Scores.Total Data.Color
1                 0.00             86.25    Unknown
2                 0.08             86.17    Unknown
3                 0.01             86.17    Unknown
4                 0.00             85.08    Unknown
5                 0.10             83.83    Unknown
6                 0.01             83.42    Unknown

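#Count missing values (NA) across the whole data frame. The "nan" and "Unknown" strings visible in the head() output are ordinary text, so is.na() does not count them.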

sum(is.na(coffee))

[1] 0

#Count NAs per column

colSums(is.na(coffee))

              Location.Country                Location.Region 
                             0                              0 
         Location.Altitude.Min          Location.Altitude.Max 
                             0                              0 
     Location.Altitude.Average                           Year 
                             0                              0 
                    Data.Owner              Data.Type.Species 
                             0                              0 
             Data.Type.Variety    Data.Type.Processing.method 
                             0                              0 
Data.Production.Number.of.bags     Data.Production.Bag.weight 
                             0                              0 
             Data.Scores.Aroma             Data.Scores.Flavor 
                             0                              0 
        Data.Scores.Aftertaste            Data.Scores.Acidity 
                             0                              0 
              Data.Scores.Body            Data.Scores.Balance 
                             0                              0 
        Data.Scores.Uniformity          Data.Scores.Sweetness 
                             0                              0 
          Data.Scores.Moisture              Data.Scores.Total 
                             0                              0 
                    Data.Color 
                             0
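
The zero counts above only cover true NA values. The head() output shows "nan" in the variety and processing columns, "Unknown" in the color column, and altitudes of 0, any of which may be placeholders for missing data. The sketch below shows one way such placeholders could be recoded before counting. It uses a small made-up data frame named toy, and the choice of which strings to treat as missing is an assumption, not something the dataset documentation confirms.

```r
# Hypothetical toy data that mimics the placeholder strings seen in head(coffee).
toy <- data.frame(
  Data.Type.Variety = c("nan", "Yellow Bourbon", "nan"),
  Data.Color = c("Unknown", "Green", "Unknown"),
  Location.Altitude.Average = c(0, 12, 1350),
  stringsAsFactors = FALSE
)

# is.na() finds nothing, because the placeholders are plain text.
na_before <- sum(is.na(toy))

# Assumption: these strings stand for missing values.
placeholders <- c("nan", "Unknown")

# Recode the placeholders to NA in character columns only.
toy_clean <- toy
is_chr <- vapply(toy_clean, is.character, logical(1))
toy_clean[is_chr] <- lapply(toy_clean[is_chr], function(col) {
  col[col %in% placeholders] <- NA
  col
})

# Count missing values per column after recoding.
na_after <- colSums(is.na(toy_clean))
na_after
```

For the real file, a similar effect might be achieved at load time with the na.strings argument of read.csv(), for example na.strings = c("NA", "nan"). Whether an altitude of 0 should also count as missing is a separate judgment call.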

#Check the structure

str(coffee)

'data.frame':   989 obs. of  23 variables:
 $ Location.Country              : chr  "United States" "Brazil" "Brazil" "Ethiopia" ...
 $ Location.Region               : chr  "kona" "sul de minas - carmo de minas" "sul de minas - carmo de minas" "sidamo" ...
 $ Location.Altitude.Min         : int  0 12 12 0 0 0 1300 0 0 640 ...
 $ Location.Altitude.Max         : int  0 12 12 0 0 0 1400 0 0 1400 ...
 $ Location.Altitude.Average     : int  0 12 12 0 0 0 1350 0 0 1020 ...
 $ Year                          : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ Data.Owner                    : chr  "kona pacific farmers cooperative" "jacques pereira carneiro" "jacques pereira carneiro" "ethiopia commodity exchange" ...
 $ Data.Type.Species             : chr  "Arabica" "Arabica" "Arabica" "Arabica" ...
 $ Data.Type.Variety             : chr  "nan" "Yellow Bourbon" "Yellow Bourbon" "nan" ...
 $ Data.Type.Processing.method   : chr  "nan" "nan" "nan" "nan" ...
 $ Data.Production.Number.of.bags: int  25 300 300 360 300 12 10 360 300 85 ...
 $ Data.Production.Bag.weight    : num  45.4 60 60 6 6 ...
 $ Data.Scores.Aroma             : num  8.25 8.17 8.42 7.67 7.58 7.5 7.67 7.25 7.42 6.92 ...
 $ Data.Scores.Flavor            : num  8.42 7.92 7.92 8 7.83 7.92 7.58 7.25 7.42 6.75 ...
 $ Data.Scores.Aftertaste        : num  8.08 7.92 8 7.83 7.58 7.42 7.5 7.25 7.5 7.08 ...
 $ Data.Scores.Acidity           : num  7.75 7.75 7.75 8 8 7.67 7.58 7.33 7.92 7.17 ...
 $ Data.Scores.Body              : num  7.67 8.33 7.92 7.92 7.83 7.83 7.67 7.5 7.75 7.33 ...
 $ Data.Scores.Balance           : num  7.83 8 8 7.83 7.5 7.58 7.58 8 7.58 6.67 ...
 $ Data.Scores.Uniformity        : num  10 10 10 10 10 10 10 9.33 8.67 10 ...
 $ Data.Scores.Sweetness         : num  10 10 10 10 10 10 10 10 8.67 8.67 ...
 $ Data.Scores.Moisture          : num  0 0.08 0.01 0 0.1 0.01 0 0.1 0.05 0.08 ...
 $ Data.Scores.Total             : num  86.2 86.2 86.2 85.1 83.8 ...
 $ Data.Color                    : chr  "Unknown" "Unknown" "Unknown" "Unknown" ...

#This creates an empty graph that is primed to display the coffee data. Since we haven’t told ggplot() how to visualize the data yet, the plot is empty for now.

ggplot(data = coffee)

#Next, we need to tell ggplot() how variables in the dataset are mapped to visual properties (aesthetics) of the plot. For now, we map only Data.Scores.Flavor to the x aesthetic and Data.Scores.Aroma to the y aesthetic.

ggplot(data = coffee, mapping = aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma))

#Represent the observations from our data frame on the plot. The function geom_point() adds a layer of points, which creates a scatterplot.

ggplot(data = coffee, mapping = aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)) +
  geom_point()

#Ask whether other variables might explain or change the nature of this apparent relationship. Does the relationship between Data.Scores.Flavor and Data.Scores.Aroma differ by location (country)? Graph a scatterplot colored by country.

ggplot(data = coffee, mapping = aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma, color = Location.Country)) +
  geom_point()

#Extract the four countries with the most observations (highest count) and plot a bar graph.

top_counts <- coffee %>%
  count(Location.Country, sort = TRUE) %>%
  slice(1:4)

ggplot(top_counts, aes(x = reorder(Location.Country, n), y = n, fill = Location.Country)) +
  # Create bars representing the number of samples per country
  geom_col() +
  # Manually assign colors to each country for consistency and clarity
  scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73", "#CC79A7")) +
  # Apply a clean minimal theme for better readability
  theme_minimal(base_size = 13) +
  # Add labels and titles to the plot
  labs(
    title = "Top 4 Countries by Number of Coffee Samples",        # Main title
    subtitle = "Countries with the most samples in the dataset",  # Subtitle
    x = "Country",                            # x-axis label
    y = "Number of Coffee Samples",           # y-axis label
    fill = "Country",                         # Legend title
    caption = "Data source: Coffee dataset"   # Caption
  )

Graph a scatterplot based on flavor score and aroma in the top four countries (with the most observations/highest count).

top_countries <- coffee %>%
  count(Location.Country, sort = TRUE) %>%
  slice(1:4) %>%   # Count observations per country and keep top 4
  pull(Location.Country)

coffee_filtered <- coffee %>%
  filter(Location.Country %in% top_countries)
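
The count-then-filter step above can be checked on a tiny example. The sketch below is a base R equivalent run on a made-up data frame named toy; the country labels A to D and their counts are hypothetical. It keeps the two most frequent labels rather than four, only to keep the example short.

```r
# Hypothetical toy data: A appears 5 times, B 3 times, C twice, D once.
toy <- data.frame(
  Location.Country = c(rep("A", 5), rep("B", 3), rep("C", 2), "D"),
  stringsAsFactors = FALSE
)

# Count rows per country and sort from most to least frequent.
country_counts <- sort(table(toy$Location.Country), decreasing = TRUE)

# Keep the names of the two most frequent countries.
top_two <- names(country_counts)[1:2]

# Keep only the rows that belong to those countries.
toy_filtered <- toy[toy$Location.Country %in% top_two, , drop = FALSE]
nrow(toy_filtered)
```

Taking the first few rows of a sorted count is a simple rule. If two countries were tied at the cutoff, which one is kept would depend on sort order, so ties are worth checking in the real counts.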

ggplot(coffee_filtered, aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)) +
  geom_point(
    aes(color = Location.Country),
    position = position_jitter(width = 0.1, height = 0.1),
    size = 2,
    alpha = 0.4
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  scale_color_manual(values = c("#D55E00", "#0072B2", "#009E73", "#CC79A7")) +
  expand_limits(x = 0) +
  scale_x_continuous(breaks = seq(0, 10, by = 0.5)) +
  scale_y_continuous(breaks = seq(0, 10, by = 0.5)) +
  theme_light(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
  labs(
    title = "Relationship Between Flavor and Aroma Scores",
    subtitle = "Coffee samples from the four countries with the most samples",
    x = "Flavor Score",
    y = "Aroma Score",
    color = "Country",
    caption = "Data source: Coffee dataset"
  )

# Fit a simple linear regression model to predict aroma from flavor. The model uses coffee_filtered (the top four countries only); that subset has no missing values, so every row in it is used.
simple_model <- lm(Data.Scores.Aroma ~ Data.Scores.Flavor, data = coffee_filtered)
#View model summary 
summary(simple_model)


Call:
lm(formula = Data.Scores.Aroma ~ Data.Scores.Flavor, data = coffee_filtered)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.62211 -0.10116 -0.00615  0.12729  0.54729 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         3.10013    0.20492   15.13   <2e-16 ***
Data.Scores.Flavor  0.59380    0.02733   21.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2105 on 560 degrees of freedom
Multiple R-squared:  0.4574,    Adjusted R-squared:  0.4564 
F-statistic: 472.1 on 1 and 560 DF,  p-value: < 2.2e-16
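
The numbers in this summary are linked by a few standard formulas, and recomputing them is a quick sanity check. The sketch below copies the printed values by hand (the variable names are my own) and rebuilds the t values, the F statistic, and the R-squared values. Because the inputs are rounded, the results agree with the printout only approximately.

```r
# Values copied from the printed summary(simple_model) output.
intercept    <- 3.10013
intercept_se <- 0.20492
slope        <- 0.59380
slope_se     <- 0.02733
f_stat       <- 472.1
df_resid     <- 560
r_squared_reported <- 0.4574

# t value = estimate / standard error.
t_intercept <- intercept / intercept_se  # about 15.13
t_slope     <- slope / slope_se          # about 21.73

# With a single predictor, the F statistic equals the squared slope t value.
f_from_t <- t_slope^2                    # about 472.1

# R-squared can be recovered from F and the residual degrees of freedom.
r_squared_from_f <- f_stat / (f_stat + df_resid)  # about 0.4574

# Sample size: residual df plus the two estimated coefficients.
n <- df_resid + 2                        # 562 samples

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared_reported) * (n - 1) / df_resid  # about 0.4564
```

The recovered sample size of 562 confirms that the model was fit on the filtered subset rather than on all 989 rows of the full data frame.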

# Diagnostic plots: the four core diagnostics cover linearity, normality of residuals, homoscedasticity, and influential points.
par(mfrow = c(2, 2)) 
plot(simple_model)

#Faceting with facet_wrap() splits a plot into subplots, each displaying one subset of the data based on a categorical variable (here, the four countries with the highest number of coffee samples in the dataset).

#Identify the top 4 countries with the most coffee samples

top_countries <- coffee %>%
  count(Location.Country, sort = TRUE) %>%
  slice(1:4) %>%
  pull(Location.Country)

Filter the dataset to keep only those top 4 countries

coffee_fine <- coffee %>% filter(Location.Country %in% top_countries)

Create the plot and save it as p2

p2 <- ggplot(coffee_fine, aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)) +
  geom_point(
    aes(color = Location.Country),
    position = position_jitter(width = 0.05, height = 0.05),
    size = 2,
    alpha = 0.4
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  scale_color_manual(values = c("#D55E00", "#0072B2", "#009E73", "#CC79A7")) +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  facet_wrap(~ Location.Country) +
  theme_bw(base_size = 13) +
  labs(
    title = "Relationship Between Flavor and Aroma Scores by Country",
    subtitle = "Each panel represents a different country",
    x = "Flavor Score",
    y = "Aroma Score",
    color = "Country",
    caption = "Data source: Coffee dataset"
  )

Create p3 by modifying p2

p3 <- p2 + xlim(0, 10) + ylim(0, 10) # Fix both axes to the 0 to 10 range; xlim() and ylim() replace the x and y scales (and their breaks) set in p2

Scale for x is already present.
Adding another scale for x, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

Display the final plot

p3

#The essay should describe:

#a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).

#The null hypothesis: There is no relationship between flavor and aroma.

#The alternative hypothesis: There is a relationship between flavor and aroma.

#From the model coefficients:

#Intercept = 3.10013

#Slope = 0.59380

#Aroma = 3.100 + 0.594 × Flavor

#A positive association exists between flavor and aroma scores in the four countries analyzed. On average, a one-unit increase in flavor score corresponds to an increase of approximately 0.594 units in aroma score. The p-value for the flavor variable is less than 2 × 10⁻¹⁶, far below the 0.05 threshold, indicating that flavor is a statistically significant predictor of aroma. In other words, the sample data provide enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

#The adjusted R² value of 0.4564 means that approximately 45.6% of the variation in aroma scores is explained by flavor scores. This indicates a moderate relationship: flavor accounts for a substantial portion of the variability in aroma, but additional factors also contribute.

#The median residual is approximately zero (−0.00615), indicating that the residuals are centered around zero. The residuals are not symmetric, however: they range from −2.62211 to 0.54729, so at least one sample has an aroma score far below what the model predicts. The diagnostic plots are the place to check whether such points are outliers or influential.
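
As a worked example of the fitted equation, the sketch below plugs a flavor score into it by hand. The helper name predict_aroma is my own, and the coefficients are copied from the printed model summary. The observed value comes from row 4 of the head(coffee) output (Flavor 8.00, Aroma 7.67); that sample is from Ethiopia, and the output shown does not confirm whether Ethiopia is among the four countries used to fit the model, so treat it purely as an illustration.

```r
# Coefficients copied from the summary(simple_model) output.
intercept <- 3.10013
slope     <- 0.59380

# Hypothetical helper: predicted aroma for a given flavor score.
predict_aroma <- function(flavor) {
  intercept + slope * flavor
}

# Predicted aroma for a flavor score of 8.00.
predicted <- predict_aroma(8.00)   # 3.10013 + 0.59380 * 8 = 7.85053

# Residual = observed - predicted, using Aroma = 7.67 from head(coffee), row 4.
observed <- 7.67
residual <- observed - predicted   # about -0.18

# A one-unit increase in flavor changes the prediction by exactly the slope.
one_unit_change <- predict_aroma(9.00) - predict_aroma(8.00)
```

A residual of about −0.18 is within roughly one residual standard error (0.2105) of the line, so this sample would sit close to the fitted trend.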

#b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.

#The scatterplot shows the relationship between flavor scores (x-axis) and aroma scores (y-axis) for coffee samples from the four countries with the most samples in the dataset. Each point represents a coffee sample, and colors distinguish the country of origin. A linear regression line is included to show the overall trend between the two variables.

#c. Anything that you might have shown that you could not get to work or that you wished you could have included

#Yes. I wish I had shown a multiple regression that includes the aftertaste score alongside the flavor score as predictors of aroma, which could further improve the regression model. Based on the four countries with the highest number of coffee samples, I determined how to compute the p-values, R² values, and diagnostic results for each country individually.
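
A minimal sketch of both ideas follows: one regression per country, and a multiple regression with aftertaste as a second predictor. It runs on simulated data in a made-up data frame named toy with two hypothetical countries, A and B, so the numbers it produces say nothing about the real coffee data. It uses base R only; on the real data, coffee_filtered would take the place of toy.

```r
# Simulated stand-in for the filtered coffee data (values are NOT real).
set.seed(1)
toy <- data.frame(
  Location.Country = rep(c("A", "B"), each = 30),
  Data.Scores.Flavor = runif(60, min = 6.5, max = 8.5)
)
toy$Data.Scores.Aftertaste <- toy$Data.Scores.Flavor + rnorm(60, sd = 0.2)
toy$Data.Scores.Aroma <- 3 + 0.6 * toy$Data.Scores.Flavor + rnorm(60, sd = 0.2)

# Fit one simple regression per country and collect slope, p-value, and R-squared.
per_country <- lapply(split(toy, toy$Location.Country), function(d) {
  fit <- summary(lm(Data.Scores.Aroma ~ Data.Scores.Flavor, data = d))
  c(
    slope     = fit$coefficients["Data.Scores.Flavor", "Estimate"],
    p_value   = fit$coefficients["Data.Scores.Flavor", "Pr(>|t|)"],
    r_squared = fit$r.squared
  )
})

# Combine the per-country results into one table (one row per country).
per_country_table <- do.call(rbind, per_country)
per_country_table

# Multiple regression: aroma predicted from flavor and aftertaste together.
multi_model <- lm(
  Data.Scores.Aroma ~ Data.Scores.Flavor + Data.Scores.Aftertaste,
  data = toy
)
summary(multi_model)
```

Diagnostics for each country could be produced the same way, by calling plot() on each fitted lm object inside the loop. Flavor and aftertaste are likely to be strongly correlated in the real data, so the multiple regression coefficients would need to be interpreted with that in mind.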