The data was downloaded from the following URL: https://www.kaggle.com/datasets/fatihb/coffee-quality-data-cqi
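# Packages used below: dplyr (filter, group_by, summarise, %>%), stringr (str_split), ggplot2 (plots)
library(dplyr)
library(stringr)
library(ggplot2)
Coffee_Data <- read.csv("Coffee_dataset.csv", header = TRUE)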
First, convert the bag weight to a number by stripping the "kg" unit.
Coffee_Data$Bag.Weight <- gsub("kg", "", Coffee_Data$Bag.Weight)
Coffee_Data$Bag.Weight <- as.integer(Coffee_Data$Bag.Weight)
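A minimal sketch of the same clean-up on a few made-up values (not taken from the dataset), with a quick check that nothing failed to convert. The variable names `bag_weight` and `cleaned` are only for illustration.

```r
# Made-up example values, not taken from the dataset
bag_weight <- c("60 kg", "30 kg", "1 kg")

# Strip the unit, then convert; surrounding whitespace is ignored by as.integer
cleaned <- as.integer(gsub("kg", "", bag_weight))
cleaned

# Count entries that failed to convert; anything above 0 deserves a closer look
sum(is.na(cleaned))
```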
Next, replace each altitude range (for example "1200-1400") with its mean, using lapply.
Altitude <- str_split(Coffee_Data$Altitude, "-")
# sapply simplifies the result to a numeric vector, so the column is not stored as a list
Altitude <- sapply(lapply(Altitude, as.integer), mean)
## Warning in lapply(Altitude, as.integer): NAs introduced by coercion
## Warning in lapply(Altitude, as.integer): NAs introduced by coercion
Coffee_Data$Altitude <- Altitude
Two data points that used a different separator for the range were automatically turned into NA.
Now that all the data has been converted, we can start the analysis. Which region in Brazil has the highest average overall rating?
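The two NA values could probably be recovered by splitting on more than one separator. The following is a minimal sketch in base R, not part of the original analysis: the helper name `parse_altitude` and the extra separators ("~" and " A ") are assumptions, so check which separators actually occur in the data before relying on it.

```r
# Hypothetical helper: average an altitude range written with one of several
# separators. The separator set ("-", "~", " A ") is an assumption.
parse_altitude <- function(x) {
  parts <- strsplit(as.character(x), "\\s*[-~]\\s*|\\s+[Aa]\\s+")
  # Anything that still cannot be parsed becomes NA instead of raising a warning
  sapply(parts, function(p) mean(suppressWarnings(as.numeric(p))))
}

# Made-up example values, not taken from the dataset
parse_altitude(c("1200-1400", "1500 ~ 1700", "1800 A 2000", "1650"))
```

A single value without a separator is returned unchanged, and an unparseable entry still ends up as NA.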
brazilian_coffee <- filter(Coffee_Data, Country.of.Origin == "Brazil")
aggregate(brazilian_coffee$Overall, by = list(brazilian_coffee$Region), FUN = mean)
##                      Group.1    x
## 1       Alta Mogiana-Ibiraci 7.21
## 2        Campo das Vertentes 7.83
## 3 MANTIQUEIRA / SUL DE MINAS 7.33
## 4        Mantiquira de minas 7.42
## 5               Minas Gerais 6.67
## 6           Região Vulcânica 8.00
## 7               Sul de Minas 7.36
brazilian_coffee <- brazilian_coffee %>%
  group_by(Region) %>%
  summarise(mean = mean(Overall), Highest = max(Overall), n = n())
show(brazilian_coffee)
## # A tibble: 7 × 4
##   Region                      mean Highest     n
##   <chr>                      <dbl>   <dbl> <int>
## 1 Alta Mogiana-Ibiraci        7.21    7.25     2
## 2 Campo das Vertentes         7.83    7.83     1
## 3 MANTIQUEIRA / SUL DE MINAS  7.33    7.33     1
## 4 Mantiquira de minas         7.42    7.42     1
## 5 Minas Gerais                6.67    6.67     1
## 6 Região Vulcânica            8       8        1
## 7 Sul de Minas                7.36    7.67     3
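According to the summary table above, Região Vulcânica has the highest mean overall rating (8.00), although that figure rests on a single sample. As a small sketch, the ranking can be made explicit by sorting on the mean; the data frame `regions` below simply re-enters the values printed in the table.

```r
# Values copied from the summary table above
regions <- data.frame(
  Region = c("Alta Mogiana-Ibiraci", "Campo das Vertentes",
             "MANTIQUEIRA / SUL DE MINAS", "Mantiquira de minas",
             "Minas Gerais", "Região Vulcânica", "Sul de Minas"),
  mean = c(7.21, 7.83, 7.33, 7.42, 6.67, 8.00, 7.36),
  n = c(2, 1, 1, 1, 1, 1, 3)
)

# Sort from highest to lowest mean and keep n visible to judge reliability
ranked <- regions[order(regions$mean, decreasing = TRUE), ]
head(ranked, 3)
```

With only one to three samples per region, the ranking is indicative at best.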
Which country produces the most coffee?
Coffee_Data$Total <- Coffee_Data$Number.of.Bags * Coffee_Data$Bag.Weight
country_production <- Coffee_Data %>%
  group_by(Country.of.Origin) %>%
  filter(Total < 1000000) %>% # drop implausibly large totals (an unlikely amount of coffee from Ethiopia)
  summarise(sum = sum(Total))
par(mar=c(12, 4, 4, 4))
barplot(country_production$sum / 1000 ~ country_production$Country.of.Origin,
        las = 2, cex.names = 1,
        xlab = "", ylab = "Coffee produced in tons",
        main = "Coffee produced per country")
For big sellers, a well-balanced coffee is important. How do the different attributes compare across the 10 largest coffee-producing countries?
country_atributes <- Coffee_Data %>%
  group_by(Country.of.Origin) %>%
  filter(Total < 1000000) %>%
  summarise(sum = sum(Total), aroma = mean(Aroma), body = mean(Body), acidity = mean(Acidity))
top10 <- country_atributes[order(country_atributes$sum, decreasing = TRUE), ][1:10, ]
# Convert to long format for ggplot: one row per country and attribute
top10 <- data.frame(country = rep(c(top10$Country.of.Origin), times = 3),
                    atribute = rep(c("aroma", "body", "acidity"), each = 10),
                    score = c(top10$aroma, top10$body, top10$acidity))
ggplot(top10, aes(fill = atribute, y = score, x = country)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Coffee attribute scores in the 10 largest producing countries",
       x = "",
       y = "Score on scale 1-10") +
  coord_cartesian(ylim = c(6, 9)) +
  scale_y_continuous(breaks = seq(6, 10, by = 0.5))
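The manual conversion to long format above hard-codes the number of countries (10) and attributes (3). Here is a sketch of the same reshaping in base R that derives both from the data instead. The data frame `wide` and its scores are made up for illustration; the column name `atribute` is kept to match the plotting code.

```r
# Made-up scores for three hypothetical countries
wide <- data.frame(
  country = c("A", "B", "C"),
  aroma   = c(7.5, 7.8, 7.6),
  body    = c(7.4, 7.7, 7.5),
  acidity = c(7.6, 7.9, 7.4)
)

attrs <- c("aroma", "body", "acidity")

# One row per country and attribute; unlist() stacks the columns in attrs order
long <- data.frame(
  country  = rep(wide$country, times = length(attrs)),
  atribute = rep(attrs, each = nrow(wide)),
  score    = unlist(wide[attrs], use.names = FALSE)
)
long
```

Because the lengths come from `nrow(wide)` and `length(attrs)`, the reshaping still lines up if the number of countries or attributes changes.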
If you want a balanced cup of coffee, you can go with Brazil or Tanzania. In Brazil the overall rating is lower, but more coffee is produced. And how much do these qualities vary worldwide? I used a violin plot instead of a boxplot because I thought it looked cool and wanted to try it; the code is almost the same as for a boxplot.
data <- data.frame(
  # Repeat each attribute name once per row of the dataset
  Atribute = rep(c("Aroma", "Flavor", "Aftertaste", "Acidity", "Body", "Balance", "Overall"), each = nrow(Coffee_Data)),
  score = c(Coffee_Data$Aroma,
            Coffee_Data$Flavor,
            Coffee_Data$Aftertaste,
            Coffee_Data$Acidity,
            Coffee_Data$Body,
            Coffee_Data$Balance,
            Coffee_Data$Overall)
)
ggplot(data, aes(x = Atribute, y = score)) +
  geom_violin()
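The spread visible in a violin plot can also be put into numbers, for instance with the standard deviation per attribute. The sketch below uses made-up scores (not taken from the dataset) and only two attributes; on the real data the same `sapply` call could be applied to the seven attribute columns of `Coffee_Data`.

```r
# Made-up scores, not taken from the dataset
scores <- data.frame(
  Aroma = c(7.0, 7.5, 8.0, 8.5),
  Body  = c(7.5, 7.6, 7.7, 7.8)
)

# Standard deviation per column, largest spread first
spread <- sapply(scores, sd)
sort(spread, decreasing = TRUE)
```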
All attributes seem to be similarly distributed; Body is the least spread out and Aroma the most. Now a processing method is needed. Which method is used most?
data <- Coffee_Data %>%
  count(Processing.Method) %>%
  mutate(group = ifelse(n > 10, Processing.Method, 'Other')) %>%
  group_by(group) %>%
  summarise(total = sum(n))
# Base graphics calls cannot be chained with "+", so the title is passed via main
pie(data$total, labels = data$group,
    main = "Most used coffee processing methods")
Washed / Wet is the most used processing method.
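The grouping step, which lumps every method with 10 or fewer samples into "Other", can be sketched in base R as well. The counts below are made up for illustration and do not reflect the dataset.

```r
# Made-up counts, not taken from the dataset
methods <- c(rep("Washed / Wet", 12), rep("Natural / Dry", 11),
             rep("Pulped natural / honey", 3), "Anaerobic")

counts <- table(methods)

# Keep the name of frequent methods, relabel the rest as "Other"
group <- as.vector(ifelse(counts > 10, names(counts), "Other"))

# Sum the counts within each group
totals <- tapply(as.vector(counts), group, sum)
totals
```

The threshold of 10 is the one used in the analysis; a different cut-off changes which methods end up in "Other".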
And then there is storing the beans. How does the moisture level affect the coffee rating?
plot(Moisture.Percentage ~ Total.Cup.Points, data = Coffee_Data,
     ylab = "Bean moisture %",
     xlab = "Total points scored",
     main = "Effect of moisture on cup rating")
# Fit a linear trend and overlay it on the scatter plot
model <- lm(Moisture.Percentage ~ Total.Cup.Points, data = Coffee_Data)
abline(model, col = "red")
As long as the moisture level is kept between 8% and 14%, it does not seem to affect the bean quality.
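This visual impression could be checked numerically by looking at the slope of a linear fit and at the correlation. The sketch below uses made-up values (not taken from the dataset) and, as an assumption that differs from the plot above, treats the rating as the response, since the question is how moisture affects the rating.

```r
# Made-up values, not taken from the dataset
toy <- data.frame(
  moisture = c(9.5, 10.2, 11.0, 11.8, 12.4, 10.9),
  points   = c(83.0, 84.5, 82.7, 84.1, 83.3, 83.9)
)

# Rating as response, moisture as predictor
fit <- lm(points ~ moisture, data = toy)

# Slope estimate and its p-value
coef(summary(fit))["moisture", c("Estimate", "Pr(>|t|)")]

# Pearson correlation between the two variables
cor(toy$moisture, toy$points)
```

A slope close to zero with a large p-value, together with a correlation near zero, would support the conclusion that moisture in this range has little influence on the rating.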