Final Project (DATA-110)

Introduction

How is the location of the country producing coffee related to the total score of the coffee? The dataset I selected contains data about coffee from all over the world from 2010 to 2018. It has 989 observations and 23 variables, which gives me plenty to work with for this project. The opening question is what I mainly explore throughout this project using various coding techniques. I primarily use two variables, location_country and data_scores_total, and I use year, data_production_number_of_bags, data_production_bag_weight, and data_color to support my visualizations and model. I found the dataset on the CORGIS Dataset Project website, which was linked in the datasets section on Blackboard. I chose this topic because I was interested in how coffee differs from country to country, and because I personally drink coffee every morning for a quick boost. It will be interesting to observe the differences and see where coffee thrives the most. CORGIS Dataset Project Link: https://corgis-edu.github.io/corgis/csv/coffee/.

Background Research

Coffee is one of the world’s most valuable goods, and it supports the livelihoods of more than 125 million people around the world. Research on the coffee business has shown that factors such as altitude, bean variety, growing conditions, and sustainability practices can affect coffee quality and market pricing (Giovannucci & Koekoek, 2003). Studies also suggest that specialty coffee production has grown quickly due to increasing consumer demand for higher-quality coffee (Giovannucci & Koekoek, 2003). Additionally, environmental researchers have found that coffee production can have large environmental impacts, including water use and carbon emissions, which makes sustainable sourcing important (Usva et al., 2020). The CORGIS Coffee Dataset includes professional coffee ratings and various characteristics of each coffee, allowing statistical analysis of how environmental and production factors may influence coffee quality (Donald, 2022).

References

Donald, S. (2022). Coffee CSV File. CORGIS Dataset Project. https://corgis-edu.github.io/corgis/csv/coffee/

Giovannucci, D., & Koekoek, F. (2003). The state of sustainable coffee: A study of twelve major markets. International Institute for Sustainable Development. https://www.researchgate.net/publication/239616226_The_State_of_Sustainable_Coffee_A_Study_of_Twelve_Major_Markets

Usva, K., Sinkko, T., Silvenius, F., Riipi, I., & Heusala, H. (2020). Carbon and water footprint of coffee consumed in Finland—Life cycle assessment. The International Journal of Life Cycle Assessment, 25, 1976–1990. https://link.springer.com/article/10.1007/s11367-020-01799-5

Loading Libraries, Dataset, & Setting Working Directory

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)

setwd("~/Documents/EC/Spring 2026/DATA 110/Project FInal") # Setting Working Directory

coffee <- read_csv("coffee.csv") # Loading Dataset

## Rows: 989 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
## dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(coffee)

## spc_tbl_ [989 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Location.Country              : chr [1:989] "United States" "Brazil" "Brazil" "Ethiopia" ...
##  $ Location.Region               : chr [1:989] "kona" "sul de minas - carmo de minas" "sul de minas - carmo de minas" "sidamo" ...
##  $ Location.Altitude.Min         : num [1:989] 0 12 12 0 0 0 1300 0 0 640 ...
##  $ Location.Altitude.Max         : num [1:989] 0 12 12 0 0 0 1400 0 0 1400 ...
##  $ Location.Altitude.Average     : num [1:989] 0 12 12 0 0 0 1350 0 0 1020 ...
##  $ Year                          : num [1:989] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Data.Owner                    : chr [1:989] "kona pacific farmers cooperative" "jacques pereira carneiro" "jacques pereira carneiro" "ethiopia commodity exchange" ...
##  $ Data.Type.Species             : chr [1:989] "Arabica" "Arabica" "Arabica" "Arabica" ...
##  $ Data.Type.Variety             : chr [1:989] "nan" "Yellow Bourbon" "Yellow Bourbon" "nan" ...
##  $ Data.Type.Processing method   : chr [1:989] "nan" "nan" "nan" "nan" ...
##  $ Data.Production.Number of bags: num [1:989] 25 300 300 360 300 12 10 360 300 85 ...
##  $ Data.Production.Bag weight    : num [1:989] 45.4 60 60 6 6 ...
##  $ Data.Scores.Aroma             : num [1:989] 8.25 8.17 8.42 7.67 7.58 7.5 7.67 7.25 7.42 6.92 ...
##  $ Data.Scores.Flavor            : num [1:989] 8.42 7.92 7.92 8 7.83 7.92 7.58 7.25 7.42 6.75 ...
##  $ Data.Scores.Aftertaste        : num [1:989] 8.08 7.92 8 7.83 7.58 7.42 7.5 7.25 7.5 7.08 ...
##  $ Data.Scores.Acidity           : num [1:989] 7.75 7.75 7.75 8 8 7.67 7.58 7.33 7.92 7.17 ...
##  $ Data.Scores.Body              : num [1:989] 7.67 8.33 7.92 7.92 7.83 7.83 7.67 7.5 7.75 7.33 ...
##  $ Data.Scores.Balance           : num [1:989] 7.83 8 8 7.83 7.5 7.58 7.58 8 7.58 6.67 ...
##  $ Data.Scores.Uniformity        : num [1:989] 10 10 10 10 10 10 10 9.33 8.67 10 ...
##  $ Data.Scores.Sweetness         : num [1:989] 10 10 10 10 10 10 10 10 8.67 8.67 ...
##  $ Data.Scores.Moisture          : num [1:989] 0 0.08 0.01 0 0.1 0.01 0 0.1 0.05 0.08 ...
##  $ Data.Scores.Total             : num [1:989] 86.2 86.2 86.2 85.1 83.8 ...
##  $ Data.Color                    : chr [1:989] "Unknown" "Unknown" "Unknown" "Unknown" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Location.Country = col_character(),
##   ..   Location.Region = col_character(),
##   ..   Location.Altitude.Min = col_double(),
##   ..   Location.Altitude.Max = col_double(),
##   ..   Location.Altitude.Average = col_double(),
##   ..   Year = col_double(),
##   ..   Data.Owner = col_character(),
##   ..   Data.Type.Species = col_character(),
##   ..   Data.Type.Variety = col_character(),
##   ..   `Data.Type.Processing method` = col_character(),
##   ..   `Data.Production.Number of bags` = col_double(),
##   ..   `Data.Production.Bag weight` = col_double(),
##   ..   Data.Scores.Aroma = col_double(),
##   ..   Data.Scores.Flavor = col_double(),
##   ..   Data.Scores.Aftertaste = col_double(),
##   ..   Data.Scores.Acidity = col_double(),
##   ..   Data.Scores.Body = col_double(),
##   ..   Data.Scores.Balance = col_double(),
##   ..   Data.Scores.Uniformity = col_double(),
##   ..   Data.Scores.Sweetness = col_double(),
##   ..   Data.Scores.Moisture = col_double(),
##   ..   Data.Scores.Total = col_double(),
##   ..   Data.Color = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(coffee)

## # A tibble: 6 × 23
##   Location.Country Location.Region   Location.Altitude.Min Location.Altitude.Max
##   <chr>            <chr>                             <dbl>                 <dbl>
## 1 United States    kona                                  0                     0
## 2 Brazil           sul de minas - c…                    12                    12
## 3 Brazil           sul de minas - c…                    12                    12
## 4 Ethiopia         sidamo                                0                     0
## 5 Ethiopia         sidamo                                0                     0
## 6 United States    kona                                  0                     0
## # ℹ 19 more variables: Location.Altitude.Average <dbl>, Year <dbl>,
## #   Data.Owner <chr>, Data.Type.Species <chr>, Data.Type.Variety <chr>,
## #   `Data.Type.Processing method` <chr>,
## #   `Data.Production.Number of bags` <dbl>, `Data.Production.Bag weight` <dbl>,
## #   Data.Scores.Aroma <dbl>, Data.Scores.Flavor <dbl>,
## #   Data.Scores.Aftertaste <dbl>, Data.Scores.Acidity <dbl>,
## #   Data.Scores.Body <dbl>, Data.Scores.Balance <dbl>, …

Cleaning

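names(coffee) <- gsub("[(). \\-]", "_", names(coffee)) # Replacing parentheses, periods, spaces, and hyphens with underscores
names(coffee) <- gsub("_$", "", names(coffee)) # Removing any trailing underscore
names(coffee) <- tolower(names(coffee)) # Converting names to lowercase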

head(coffee)

## # A tibble: 6 × 23
##   location_country location_region   location_altitude_min location_altitude_max
##   <chr>            <chr>                             <dbl>                 <dbl>
## 1 United States    kona                                  0                     0
## 2 Brazil           sul de minas - c…                    12                    12
## 3 Brazil           sul de minas - c…                    12                    12
## 4 Ethiopia         sidamo                                0                     0
## 5 Ethiopia         sidamo                                0                     0
## 6 United States    kona                                  0                     0
## # ℹ 19 more variables: location_altitude_average <dbl>, year <dbl>,
## #   data_owner <chr>, data_type_species <chr>, data_type_variety <chr>,
## #   data_type_processing_method <chr>, data_production_number_of_bags <dbl>,
## #   data_production_bag_weight <dbl>, data_scores_aroma <dbl>,
## #   data_scores_flavor <dbl>, data_scores_aftertaste <dbl>,
## #   data_scores_acidity <dbl>, data_scores_body <dbl>,
## #   data_scores_balance <dbl>, data_scores_uniformity <dbl>, …
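
To see what the three renaming steps do, the sketch below applies them to a few of the original column names taken from the `str(coffee)` output. The vectors `raw_names` and `clean_names` are only an illustrative sample, not the full set of 23 names.

```r
# A sample of the original column names (illustrative subset only)
raw_names <- c("Location.Country",
               "Data.Type.Processing method",
               "Data.Production.Number of bags",
               "Data.Scores.Total")

clean_names <- gsub("[(). \\-]", "_", raw_names) # periods and spaces become underscores
clean_names <- gsub("_$", "", clean_names)       # drop a trailing underscore, if any
clean_names <- tolower(clean_names)              # convert to lowercase

print(clean_names)
```

Each cleaned name matches the lowercase, underscore-separated names shown in the second `head(coffee)` output.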

Selecting, Renaming, & Filtering Out NAs

coffee2 <- coffee |>
  select(location_country, year, data_production_number_of_bags,
         data_production_bag_weight, data_scores_total, data_color) |> # Selecting Variables
  drop_na() |> # Filtering out rows with an NA in any selected variable
  rename( # Renaming Variables
    country = location_country,
    number_of_bags = data_production_number_of_bags,
    bag_weight = data_production_bag_weight,
    color = data_color
  )
head(coffee2)

## # A tibble: 6 × 6
##   country        year number_of_bags bag_weight data_scores_total color  
##   <chr>         <dbl>          <dbl>      <dbl>             <dbl> <chr>  
## 1 United States  2010             25       45.4              86.2 Unknown
## 2 Brazil         2010            300       60                86.2 Unknown
## 3 Brazil         2010            300       60                86.2 Unknown
## 4 Ethiopia       2010            360        6                85.1 Unknown
## 5 Ethiopia       2010            300        6                83.8 Unknown
## 6 United States  2010             12       60                83.4 Unknown
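
Note that NA filtering only removes true `NA` values. The `str(coffee)` output shows text placeholders such as "nan" and "Unknown", which R treats as ordinary strings, so they are kept. Below is a minimal sketch of how such placeholders could be counted, or converted to `NA` if you wanted to treat them as missing. The data frame `toy` is a hypothetical hand-typed stand-in (its last row, with a real `NA`, is invented for illustration), and whether "Unknown" should count as missing is an assumption, not something the dataset documentation confirms.

```r
# Hypothetical stand-in for a few rows of coffee2
toy <- data.frame(
  country = c("United States", "Brazil", "Ethiopia", "Brazil"),
  color   = c("Unknown", "Unknown", "Green", NA),
  stringsAsFactors = FALSE
)

# is.na() flags only the real NA, not the "Unknown" placeholder
n_true_na <- sum(is.na(toy$color))

# Count the placeholder strings separately
n_unknown <- sum(toy$color == "Unknown", na.rm = TRUE)

# Optionally recode placeholders to NA (assumes they mean "missing")
toy$color_clean <- ifelse(toy$color %in% c("Unknown", "nan"), NA, toy$color)
n_na_after <- sum(is.na(toy$color_clean))

print(c(true_na = n_true_na, unknown = n_unknown, na_after_recoding = n_na_after))
```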

Statistical Analysis

Multiple Linear Regression

multiple_model <- lm(data_scores_total ~ country + year + number_of_bags + bag_weight + color, data = coffee2)
summary(multiple_model)

## 
## Call:
## lm(formula = data_scores_total ~ country + year + number_of_bags + 
##     bag_weight + color, data = coffee2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.007  -0.877   0.403   1.485   7.684 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          4.021e+01  1.670e+02   0.241 0.809822    
## countryBurundi                      -9.862e-01  2.648e+00  -0.372 0.709615    
## countryChina                         1.992e-02  1.062e+00   0.019 0.985045    
## countryColombia                      4.089e-01  5.790e-01   0.706 0.480265    
## countryCosta Rica                   -3.244e-01  7.410e-01  -0.438 0.661643    
## countryCote d?Ivoire                -4.330e+00  3.732e+00  -1.160 0.246267    
## countryEcuador                      -3.376e+00  2.205e+00  -1.531 0.126144    
## countryEl Salvador                   8.210e-01  1.587e+00   0.517 0.604935    
## countryEthiopia                      2.608e+00  8.624e-01   3.024 0.002563 ** 
## countryGuatemala                    -8.873e-01  5.641e-01  -1.573 0.116077    
## countryHaiti                        -5.974e+00  1.580e+00  -3.780 0.000167 ***
## countryHonduras                     -3.644e+00  7.371e-01  -4.945 9.02e-07 ***
## countryIndia                        -1.763e+00  1.096e+00  -1.609 0.107908    
## countryIndonesia                    -8.138e-01  9.977e-01  -0.816 0.414881    
## countryKenya                         1.291e+00  9.423e-01   1.370 0.170948    
## countryLaos                         -1.574e+00  3.738e+00  -0.421 0.673693    
## countryMalawi                       -1.452e+00  1.217e+00  -1.193 0.233118    
## countryMexico                       -1.875e+00  5.569e-01  -3.366 0.000793 ***
## countryMyanmar                      -2.221e+00  1.406e+00  -1.580 0.114366    
## countryNicaragua                    -2.931e+00  1.009e+00  -2.904 0.003767 ** 
## countryPanama                        6.689e-01  1.900e+00   0.352 0.724929    
## countryPapua New Guinea              2.182e+00  3.727e+00   0.585 0.558412    
## countryPeru                         -5.060e-01  1.323e+00  -0.383 0.702147    
## countryPhilippines                  -2.125e+00  1.718e+00  -1.237 0.216565    
## countryRwanda                       -6.397e-01  3.723e+00  -0.172 0.863631    
## countryTaiwan                        9.114e-01  3.736e+00   0.244 0.807299    
## countryTanzania, United Republic Of -4.840e-01  7.849e-01  -0.617 0.537667    
## countryThailand                     -6.361e-01  9.681e-01  -0.657 0.511307    
## countryUganda                        6.713e-01  7.789e-01   0.862 0.389004    
## countryUnited States                -1.338e+00  6.820e-01  -1.962 0.050100 .  
## countryVietnam                      -1.549e+00  1.412e+00  -1.097 0.272892    
## countryZambia                       -1.390e+00  3.724e+00  -0.373 0.709040    
## year                                 2.152e-02  8.298e-02   0.259 0.795418    
## number_of_bags                      -1.141e-03  1.182e-03  -0.965 0.334820    
## bag_weight                          -2.294e-06  7.308e-05  -0.031 0.974969    
## colorBluish-Green                    6.544e-02  6.335e-01   0.103 0.917748    
## colorGreen                          -6.531e-01  5.356e-01  -1.219 0.223036    
## colorNone                           -1.419e+00  7.726e-01  -1.836 0.066637 .  
## colorUnknown                        -2.291e-01  6.307e-01  -0.363 0.716544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.673 on 950 degrees of freedom
## Multiple R-squared:  0.1292, Adjusted R-squared:  0.09442 
## F-statistic: 3.711 on 38 and 950 DF,  p-value: 1.287e-12

Interpretation

Y = 40.21 + 0.02152(year) − 0.00114(number_of_bags) − 0.00000229(bag_weight) + country values + color values

I created a multiple linear regression model to predict the total score of a coffee using the variables country, year, number_of_bags, bag_weight, and color. Most predictors were not statistically significant, but the ones worth noting were all levels of the country variable: Ethiopia, Haiti, Honduras, Mexico, and Nicaragua. We can tell they are significant from the stars to the right of their p-values. Each country coefficient is a comparison with the reference country, Brazil, which does not appear in the coefficient table. Using the p-values and estimates, we can see that coffees from Ethiopia scored significantly higher than the reference (about 2.61 points), while coffees from Haiti (about −5.97), Honduras (about −3.64), Nicaragua (about −2.93), and Mexico (about −1.88) scored significantly lower. The adjusted R-squared value is 0.0944, meaning the model explains only about 9.44% of the variation in total scores after adjusting for the number of predictors. However, the model as a whole is still statistically significant, as its p-value is 1.287e-12, showing that the variables together provide some useful information for predicting total scores.
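
As a worked example of reading the regression equation, the sketch below plugs in the fourth row shown in `head(coffee2)` (Ethiopia, 2010, 360 bags, bag weight 6, color Unknown) using the coefficients from the summary. The variable names are made up for this illustration, and because the printed coefficients are rounded, the result is only approximate.

```r
# Coefficients copied from summary(multiple_model), rounded as printed
intercept        <- 40.21
b_year           <- 0.02152
b_number_of_bags <- -0.001141
b_bag_weight     <- -0.000002294
b_ethiopia       <- 2.608    # countryEthiopia
b_color_unknown  <- -0.2291  # colorUnknown

# Predicted total score for an Ethiopian coffee from 2010,
# 360 bags, bag weight 6, color "Unknown"
predicted <- intercept +
  b_ethiopia +
  b_year * 2010 +
  b_number_of_bags * 360 +
  b_bag_weight * 6 +
  b_color_unknown

print(round(predicted, 1))
```

This gives a predicted score of roughly 85.4, compared with the observed 85.1 for that row. In practice, `predict(multiple_model, newdata = ...)` would do the same calculation with the unrounded coefficients.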

Diagnostic Plots

par(mfrow = c(2, 2)) # Arranging the four diagnostic plots in a 2 x 2 grid
plot(multiple_model)
par(mfrow = c(1, 1)) # Resetting to one plot per page

## Warning: not plotting observations with leverage one:
##   45, 72, 644, 774, 792, 917

Reflection

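Residuals vs Fitted: The points are mostly scattered around 0, but there are some outliers far from the red line, which is consistent with the minimum residual of −79.007 in the model summary.
Q-Q Residuals: The points follow a line that is more horizontal than diagonal, which suggests the residuals depart from a normal distribution, likely because of a few extreme outliers.
Scale-Location: The points are not evenly spread, although the red line is horizontal for the most part.
Residuals vs Leverage: The points do not fall beyond the Cook’s distance lines. Six observations with a leverage of one were not plotted, as noted in the warning above.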
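
The warning about observations with leverage one can be reproduced on a small scale. The following is a minimal sketch with made-up data (`toy`, `fit`, and the country labels "A", "B", "C" are hypothetical). When a category has only a single observation, the model fits that point exactly, so its hat value (leverage) is 1 and it cannot be drawn on the Residuals vs Leverage panel. This may be what is happening with countries that appear only once in the dataset, though that would need to be checked against the actual rows.

```r
# Made-up scores; country "C" appears only once
toy <- data.frame(
  score   = c(82, 84, 83, 86, 85, 87, 80),
  country = c("A", "A", "A", "B", "B", "B", "C")
)

fit <- lm(score ~ country, data = toy)

# Leverage (hat value) of each observation
leverage <- hatvalues(fit)
print(round(leverage, 3))

# Rows whose leverage is (numerically) equal to 1
high_leverage_rows <- unname(which(leverage > 1 - 1e-8))
print(high_leverage_rows)
```

In the real analysis, `coffee2[c(45, 72, 644, 774, 792, 917), ]` would show the six rows named in the warning.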

Description of Tableau Visualizations

Link: https://public.tableau.com/views/FinalProjectDashboardDATA-110/Dashboard1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

For my first visualization, I created a map chart using the variables country and data_scores_total. This visualization is an interactive map in which countries with data appear in different shades of blue and countries not listed in the dataset appear in white. When you hover over a country with your cursor, you can see the name of the country and its total data scores for the years 2010-2018. There is an adjustable year slider to change the time frame, allowing you to focus on a single year or on multiple years at once. For the smaller countries, a zoom-in feature is available so you can read their values.

For my next visualization, I created a bar graph using the variables country, number_of_bags, and bag_weight. The bars are arranged in descending order, from the largest number of bags to the smallest. When you hover over a specific bar, you can also see the bag weight for that country. This graph also changes based on the setting of the year slider. It shows that a higher bag weight does not always go along with a higher number of bags produced.
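
The ordering in this bar graph corresponds to totaling the bags per country and sorting the totals. A rough base R equivalent is sketched below, using only the six rows printed by `head(coffee2)` as a sample, so the totals are not the dashboard’s values. It assumes the dashboard sums number_of_bags within each country, and the names `sample_rows` and `bags_by_country` are made up for this illustration.

```r
# The six rows shown by head(coffee2), typed in by hand
sample_rows <- data.frame(
  country        = c("United States", "Brazil", "Brazil",
                     "Ethiopia", "Ethiopia", "United States"),
  number_of_bags = c(25, 300, 300, 360, 300, 12)
)

# Sum the bags within each country
bags_by_country <- aggregate(number_of_bags ~ country,
                             data = sample_rows, FUN = sum)

# Sort from the largest total to the smallest
bags_by_country <- bags_by_country[order(-bags_by_country$number_of_bags), ]
print(bags_by_country)
```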

For my last visualization, I created a bubble chart that shows the various coffee bean colors in relation to the total data scores, using the variables color and data_scores_total. I wanted to see which color is associated with a higher total data score. The legend shows three colors, and the bigger the circle for a color, the higher its total data score. This visualization also changes as you adjust the year slider in the legend.

Conclusion of Visualizations & Future Directions

I described these visualizations only briefly, since all three are interactive and can be explored in Tableau. What surprised me as I observed them was that the United States did not rank as high as I thought it would. I personally love the coffee we have here, and I drink at least a cup a day, so this surprised me a lot. It was also surprising to see Guatemala at the top of the bar graph, as it produces the most bags of any country in this dataset. We can also observe that Mexico has the highest total data scores of any country in this dataset. It was interesting to see how many countries were either excluded or did not have coffee data in this dataset. A pattern that occurred almost all of the time was that green coffee beans had the highest total data scores. Using all of the analyses and visualizations I created, we can conclude that location is related to the total score of coffee around the world, although the low adjusted R-squared suggests that factors outside this model also matter. In the future, this dataset could be improved by gathering data from a larger variety of countries to allow a more complete analysis. Lastly, more recent data from the 2020s would make future analyses more relevant.