coffee

Author

Angel Alexandria Porter

# Data are available for both Arabica and Robusta coffee beans from multiple countries, with professional ratings assigned on a 0 to 100 scale. The dataset includes scores for attributes such as acidity, sweetness, fragrance, and balance.
# This dataset focuses on characteristics of coffee samples and how these features relate to overall quality. The variables include measures such as aroma, flavor, aftertaste, acidity, body, and balance, along with an overall score, which reflects the perceived quality of each coffee. Additional variables, such as origin and processing method, describe where the coffee is produced and how it is prepared.

# In this analysis, I focus specifically on the relationship between flavor and aroma, using flavor as the x-axis and aroma as the y-axis. I explore whether higher flavor scores are associated with higher aroma scores. This helps determine whether there is a positive relationship between these two important sensory attributes and whether improvements in flavor tend to coincide with improvements in aroma.
# I created a simple linear plot with flavor as the independent variable (x-axis) and aroma as the dependent variable (y-axis).
# Source: https://corgis-edu.github.io/corgis/csv/coffee/

library(ggplot2)
library(tidyverse)

Load the data

coffee <- read.csv("coffee.csv")
# Alternatively, choose the file interactively:
# coffee <- read.csv(file.choose())

# Open a tab that displays the selected data set
View(coffee)

# Check out the first few lines
head(coffee)

# Count missing values to decide whether any need handling
sum(is.na(coffee))      # total number of NAs
colSums(is.na(coffee))  # NAs per column

# Check the structure
str(coffee)
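If the NA counts above had been non-zero, one hedged way to handle them would be to keep only the rows where both modelling variables are present. The sketch below uses a small made-up data frame (`toy`, an assumption for illustration, not the real data) with the same two column names, together with base R's `complete.cases()`.

```r
# Toy data frame standing in for `coffee` (values are invented for illustration)
toy <- data.frame(
  Data.Scores.Flavor = c(7.5, NA, 8.0, 7.2),
  Data.Scores.Aroma  = c(7.6, 7.4, NA, 7.3)
)

# Keep only rows where both modelling variables are present
keep <- complete.cases(toy[, c("Data.Scores.Flavor", "Data.Scores.Aroma")])
toy_clean <- toy[keep, ]

nrow(toy_clean)        # rows remaining after dropping incomplete cases
sum(is.na(toy_clean))  # no NAs are left in the cleaned data
```

Restricting the check to the two score columns avoids discarding samples that are only missing unrelated fields.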

# Create an empty graph that is primed to display the coffee data. Since we haven't told ggplot how to visualize it yet, the plot is empty for now.
ggplot(data = coffee)

# Tell ggplot() how variables in the dataset are mapped to visual properties (aesthetics) of the plot. For now, map only Data.Scores.Flavor to the x aesthetic and Data.Scores.Aroma to the y aesthetic.
ggplot(
  data = coffee,
  mapping = aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)
)

# Represent the observations from the data frame on the plot. The function geom_point() adds a layer of points, which creates a scatterplot.
ggplot(
  data = coffee,
  mapping = aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)
) +
  geom_point()

# Ask whether other variables may explain or change the nature of this apparent relationship. Does the relationship between Data.Scores.Flavor and Data.Scores.Aroma differ by location (country)? Graph a scatterplot colored by country.
ggplot(
  data = coffee,
  mapping = aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma, color = Location.Country)
) +
  geom_point()

## Extract the top four countries (those with the most observations) and plot a bar graph.
top_counts <- coffee %>%
  count(Location.Country, sort = TRUE) %>%
  slice(1:4)

ggplot(top_counts, aes(x = reorder(Location.Country, n), y = n, fill = Location.Country)) +
  # Create bars representing the number of samples per country
  geom_col() +
  # Manually assign colors to each country for consistency and clarity
  scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73", "#CC79A7")) +
  # Apply a clean minimal theme for better readability
  theme_minimal(base_size = 13) +
  # Add labels and titles to the plot
  labs(
    title = "Top 4 Coffee Producing Countries in Dataset",  # Main title
    x = "Country",                                          # x-axis name
    y = "Number of Coffee Samples",                         # y-axis name
    fill = "Country",                                       # Legend title
    caption = "Data source: Coffee dataset",                # Caption
    subtitle = "Coffee from top producing countries"        # Subtitle
  )

Graph a scatterplot based on flavor score and aroma in the top four countries (with the most observations/highest count).

top_countries <- coffee %>%
  count(Location.Country, sort = TRUE) %>%  # Count observations per country and sort descending
  slice(1:4) %>%
  pull(Location.Country)

coffee_filtered <- coffee %>% filter(Location.Country %in% top_countries)

ggplot(coffee_filtered, aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)) +
  geom_point(
    aes(color = Location.Country),
    position = position_jitter(width = 0.1, height = 0.1),
    size = 2,
    alpha = 0.4
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  scale_color_manual(values = c("#D55E00", "#0072B2", "#009E73", "red")) +
  # Ensure x-axis starts at 0
  expand_limits(x = 0) +
  scale_x_continuous(breaks = seq(0, 10, by = 0.5)) +
  scale_y_continuous(breaks = seq(0, 10, by = 0.5)) +
  theme_light(base_size = 13) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
  labs(
    title = "Relationship Between Flavor and Aroma Scores",
    subtitle = "Coffee samples from top producing countries",
    x = "Flavor Score",
    y = "Aroma Score",
    color = "Country",
    caption = "Data source: Coffee dataset"
  )

Fit a simple linear regression model to predict aroma from flavor (there are no missing values, so all observations are used).

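# Note: the original call used `coffee_arabia`, which is never created in this script.
# `coffee` (all observations) is assumed here.
simple_model <- lm(Data.Scores.Aroma ~ Data.Scores.Flavor, data = coffee)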

# View model summary
summary(simple_model)

Diagnostic plots

# Core diagnostics (cover linearity, homoscedasticity, normality, and influence)
par(mfrow = c(2, 2))
plot(simple_model)

## Faceting with facet_wrap(), which splits a plot into subplots that each display one subset of the data based on a categorical variable (here, the four countries with the highest number of coffee samples in the dataset).

# Identify the top 4 countries with the most coffee samples
top_countries <- coffee %>%
  count(Location.Country, sort = TRUE) %>%
  slice(1:4) %>%
  pull(Location.Country)

Filter the dataset to keep only those top 4 countries

coffee_fine <- coffee %>% filter(Location.Country %in% top_countries)

Create the base plot and save it as p2

p2 <- ggplot(coffee_fine, aes(x = Data.Scores.Flavor, y = Data.Scores.Aroma)) +
  geom_point(
    aes(color = Location.Country),
    position = position_jitter(width = 0.05, height = 0.05),
    size = 2,
    alpha = 0.4
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  scale_color_manual(values = c("#D55E00", "#0072B2", "#009E73", "#CC79A7")) +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  scale_y_continuous(breaks = seq(0, 10, by = 1)) +
  facet_wrap(~ Location.Country) +
  theme_bw(base_size = 13) +
  labs(
    title = "Relationship Between Flavor and Aroma Scores by Country",
    subtitle = "Each panel represents a different country",
    x = "Flavor Score",
    y = "Aroma Score",
    color = "Country",
    caption = "Data source: Coffee dataset"
  )

Create p3 by modifying p2

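# Show both axes from 0 to 10. coord_cartesian() is used instead of xlim()/ylim()
# so that the axis scales (and breaks) already defined in p2 are not replaced.
p3 <- p2 + coord_cartesian(xlim = c(0, 10), ylim = c(0, 10))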

Display the final plot

p3

# The essay should describe:
# a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).
# The null hypothesis: There is no relationship between flavor and aroma.
# The alternative hypothesis: There is a relationship between flavor and aroma.
# From the model coefficients:

# Intercept = 3.10013
# Slope = 0.59380
# Aroma = 3.100 + 0.594 × Flavor
# A positive association exists between flavor and aroma scores. On average, a one-unit increase in flavor score corresponds to an increase of approximately 0.594 units in aroma score.
# The p-value for the flavor variable is less than 2 × 10⁻¹⁶, far below the 0.05 threshold, indicating that flavor is a statistically significant predictor of aroma. In other words, the sample data provide enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
# The adjusted R² value of 0.4564 means that approximately 45.6% of the variation in aroma scores is explained by flavor scores. This indicates a moderate relationship: flavor accounts for a substantial portion of the variability in aroma, but additional factors also contribute.
# The median residual is approximately zero (−0.00615), so the residuals are centered near zero, which is consistent with a model that shows no obvious systematic bias.
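As a worked illustration of the fitted equation, the sketch below plugs a flavor score into the reported intercept and slope. The helper `predict_aroma()` and the example flavor values are hypothetical; only the two coefficients come from the model summary.

```r
# Coefficients as reported in the model summary above
intercept <- 3.10013
slope     <- 0.59380

# Hypothetical helper: predicted aroma for a given flavor score
predict_aroma <- function(flavor) intercept + slope * flavor

predict_aroma(7.5)                       # 3.10013 + 0.59380 * 7.5 = 7.55363
predict_aroma(8.5) - predict_aroma(7.5)  # a one-unit change equals the slope, 0.5938
```

With the fitted model object available, `predict(simple_model, newdata = data.frame(Data.Scores.Flavor = 7.5))` would give the same kind of prediction without retyping the coefficients.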

# b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.
# The scatterplot shows the relationship between flavor scores (x-axis) and aroma scores (y-axis) for coffee samples from the four countries with the most samples. Each point represents a coffee sample, and colors distinguish the country of origin. A linear regression line is included to show the overall trend between the two variables.

# c. Anything that you might have shown that you could not get to work or that you wished you could have included.
# I wish I had shown a multiple regression that includes aftertaste alongside the flavor and aroma scores, which could further improve on the simple regression model. For the four countries with the highest number of coffee samples, I also worked out how to compute the p-values, R² values, and diagnostic results for each country individually.
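One possible way to get per-country p-values and R² values is to fit the same simple regression within each country. The sketch below is a minimal illustration on simulated data (`sim`, with made-up country labels A to D and invented scores), not the coffee dataset; only the column names mirror the ones used in this script.

```r
set.seed(1)  # simulated data only; not the coffee dataset

# Simulated stand-in with the same column names as the real data
sim <- data.frame(
  Location.Country   = rep(c("A", "B", "C", "D"), each = 30),
  Data.Scores.Flavor = runif(120, 6, 9)
)
sim$Data.Scores.Aroma <- 3 + 0.6 * sim$Data.Scores.Flavor + rnorm(120, sd = 0.3)

# Fit one simple regression per country and collect the key statistics
per_country <- do.call(rbind, lapply(split(sim, sim$Location.Country), function(d) {
  s <- summary(lm(Data.Scores.Aroma ~ Data.Scores.Flavor, data = d))
  data.frame(
    country   = d$Location.Country[1],
    slope     = s$coefficients["Data.Scores.Flavor", "Estimate"],
    p_value   = s$coefficients["Data.Scores.Flavor", "Pr(>|t|)"],
    r_squared = s$r.squared
  )
}))

per_country
```

Swapping `sim` for `coffee_fine` would apply the same steps to the real top-four subset. For the multiple regression, a call such as `lm(Data.Scores.Aroma ~ Data.Scores.Flavor + Data.Scores.Aftertaste, data = coffee)` might work, assuming the aftertaste column is named `Data.Scores.Aftertaste` (that name does not appear in this script and should be checked with `str(coffee)`).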