Coffee Ratings: Univariate & Bivariate Analysis

In this project, I performed univariate and bivariate analysis on different variables from coffee ratings dataset which I imported from tidy tuesday.

Loading packages

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(skimr)
library(RColorBrewer)

Loading Data

The data set contains different variables, all related to different characteristics of coffee.

coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   total_cup_points = col_double(),
##   number_of_bags = col_double(),
##   aroma = col_double(),
##   flavor = col_double(),
##   aftertaste = col_double(),
##   acidity = col_double(),
##   body = col_double(),
##   balance = col_double(),
##   uniformity = col_double(),
##   clean_cup = col_double(),
##   sweetness = col_double(),
##   cupper_points = col_double(),
##   moisture = col_double(),
##   category_one_defects = col_double(),
##   quakers = col_double(),
##   category_two_defects = col_double(),
##   altitude_low_meters = col_double(),
##   altitude_high_meters = col_double(),
##   altitude_mean_meters = col_double()
## )
## i Use `spec()` for the full column specifications.

Selecting Variables

coffee<- coffee_ratings %>% select(total_cup_points,
species,
country_of_origin,
number_of_bags,
variety,
processing_method,
flavor,
uniformity,
category_one_defects
) %>%view()

Univariate Analysis

First variable I am using for univariate analysis is country_of_origin which is the country where the coffee is from.

Categorical variable: country_of_origin

Using the glimpse function

glimpse(coffee$country_of_origin)

##  chr [1:1339] "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" "Ethiopia" ...

We can see that country is a categorical variable with 1339 values

Bar chart

Mexico is the country where most of the coffee comes from.

ggplot(coffee)+
  geom_bar(aes(country_of_origin))+
  labs(title="Count of Countries",x="Count",y="Country")+
  coord_flip()+
  theme_minimal()

Proportion table

I have shown the distribution of countries in this variable by proportion which also confirms that Mexico indeed has the biggest proportion out of all countries, followed by Colombia and Guatemala.

sort(round(proportions(table(coffee$country_of_origin)),2),decreasing=TRUE)

## 
##                       Mexico                     Colombia 
##                         0.18                         0.14 
##                    Guatemala                       Brazil 
##                         0.14                         0.10 
##                       Taiwan       United States (Hawaii) 
##                         0.06                         0.05 
##                   Costa Rica                     Honduras 
##                         0.04                         0.04 
##                     Ethiopia Tanzania, United Republic Of 
##                         0.03                         0.03 
##                       Uganda                  El Salvador 
##                         0.03                         0.02 
##                        Kenya                    Nicaragua 
##                         0.02                         0.02 
##                     Thailand                        China 
##                         0.02                         0.01 
##                        India                    Indonesia 
##                         0.01                         0.01 
##                       Malawi                      Myanmar 
##                         0.01                         0.01 
##                         Peru                United States 
##                         0.01                         0.01 
##                      Vietnam                      Burundi 
##                         0.01                         0.00 
##                Cote d?Ivoire                      Ecuador 
##                         0.00                         0.00 
##                        Haiti                        Japan 
##                         0.00                         0.00 
##                         Laos                    Mauritius 
##                         0.00                         0.00 
##                       Panama             Papua New Guinea 
##                         0.00                         0.00 
##                  Philippines                       Rwanda 
##                         0.00                         0.00 
##  United States (Puerto Rico)                       Zambia 
##                         0.00                         0.00

No unusual behavior has been detected in this variable and it works as expected.
We can use the skimr package to check for empty values and we can see that are 36 unique countries and only one NA value.

skim(coffee$country_of_origin)

Data summary
Name	coffee$country_of_origin
Number of rows	1339
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	1	1	4	28	0	36	0

Continuous/Numerical variable: flavor

The score given for flavor account for the intensity, quality and complexity of its combined taste and aroma, experienced when the coffee is slurped into the mouth vigorously so as to involve the entire palate in the evaluation.
Using the glimpse function

glimpse(coffee$country_of_origin)

##  chr [1:1339] "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" "Ethiopia" ...

We can see that country is a numerical variable with 1339 values.

A quick summary of the statistics using skimr package

skim(coffee$flavor)

Data summary
Name	coffee$flavor
Number of rows	1339
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	7.52	0.4	0	7.33	7.58	7.75	8.83	▁▁▁▁▇

There are no missing values for this variable and we can plot a histogram to check the distribution of values.

ggplot(coffee)+
  geom_histogram(aes(flavor),color='red',fill='skyblue')+
  labs(title="Distribution of flavor")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is highly right-skewed. By increasing the number of bins, we can see that distribution is also unimodal. There is only one peak and there are also some outliers, some values that have 0 value.

ggplot(coffee)+
  geom_histogram(aes(flavor),bins=100,color='red',fill='skyblue')+
  labs(title="Distribution of flavor with 100 bins")+
  theme_minimal()

Handling outliers

We can filter out the rows that have 0 values

coffee %>% filter(flavor==0)

## # A tibble: 1 x 9
##   total_cup_points species country_of_origin number_of_bags variety
##              <dbl> <chr>   <chr>                      <dbl> <chr>  
## 1                0 Arabica Honduras                     275 Caturra
## # ... with 4 more variables: processing_method <chr>, flavor <dbl>,
## #   uniformity <dbl>, category_one_defects <dbl>

This row has 0 value for flavor and some other variables are also zero so we can assume it an entry error.We can remove this row by using

coffee<-coffee[coffee$flavor!=0,]

Now again plotting the histogram after removing the outliers and increasing the number of bins to 35, we have an approximate symmetric distribution

ggplot(coffee)+
  geom_histogram(aes(flavor),bins=35,color='red',fill='skyblue')+
  labs(title="Distribution of flavor")+
  theme_minimal()

Since we already have a distribution that is pretty close to normal distribution, we don’t need to apply a transformation function such as log etc.

Central tendency

Since flavor variable is not extremely skewed, mean is a very good measure of central tendency

mean(coffee$flavor)

## [1] 7.526046

Plotting the mean

p+geom_vline(aes(xintercept=mean(coffee$flavor)),
            color="blue", linetype="dashed", size=1)

## Warning: Use of `coffee$flavor` is discouraged. Use `flavor` instead.

Spread using Standard Deviation

sd(coffee$flavor)

## [1] 0.3413824

We can see that most of our data values can fit within 3 standard deviations of the mean.

Bivariate Analysis

Two numerical variables: Uniformity and total_cup_points

Creating a scatter plot between uniformity and the total cup points.
total_cup_points variable contains a a rating out of 100 for the coffee beans.

ggplot(coffee,aes(uniformity,total_cup_points))+
  geom_point(shape=21)+
  labs(title="Correlation between uniformity & cup points",y="cup points")+
  theme_minimal()

We can see that as uniformity increases, total cup points also increase.

Adding a best fit line to the scatter plot, it appears that uniformity has a linear positive relationship with total cup points, meaning the higher the uniformity of flavor, the higher the coffee is rated out of a 100.

ggplot(coffee,aes(uniformity,total_cup_points))+
  geom_point(,shape=21,alpha=1/3)+
  geom_smooth(method='lm', se=FALSE)+
  labs(title="Correlation between uniformity & cup points",y="cup points")+
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

We can calculate the correlation coefficient to know the strength of this relationship

cor(coffee$uniformity,coffee$total_cup_points)

## [1] 0.5147862

It is a fairly strong relationship with 0.5 correlation coefficient.

One Numerical and one Catgorical variables: processing_method and number_of_bags

Making a bar plot for processing_method, we can see there are ‘NA’ values,

ggplot(coffee)+
  geom_bar(aes(processing_method))+
  labs(x='Method')+
  coord_flip()

Removing NA values

We can remove them by

coffee2<-coffee %>% filter(processing_method!='NA')

We can make a density graph to show the relationship between the number of bags tested and the processing method used for them. We have five processing methods.

ggplot(coffee2,aes( x = number_of_bags,fill=processing_method) ) +
  geom_density(alpha = 0.4)+
  labs(title='Bags distribution by Processing Method',x="No. of bags",fill="Method")+
  scale_fill_brewer(palette="Paired")+
  theme_light()

The density plot tells us that the ‘Other’,‘Pulped natural/honey’ and ‘Natural/Dry’ Methods have the most wide range of number of bags while the most frequent methods used were ‘Washed/Wet’ and ‘Semi-washed/Semi-pulped’ as it is evident by the two peaks.

We can further observe this relationship by calculating mean number of bags per processing method, which gives us

coffee2%>% group_by(processing_method)%>%
  summarize(avgbags=round(mean(number_of_bags),2))

## # A tibble: 5 x 2
##   processing_method         avgbags
##   <chr>                       <dbl>
## 1 Natural / Dry               157. 
## 2 Other                       129. 
## 3 Pulped natural / honey      167. 
## 4 Semi-washed / Semi-pulped    94.7
## 5 Washed / Wet                157.

So expanding our observation, on average pulped natural/honey method was used the most and in the density chart we can see that while this method doesn’t have apparent peaks, it has mini peaks that are more spread out meaning it was used more frequently and for more number of bags, hence the higher average.
While Semi Washed/Semi-pulped’ method does have peaks,it is not spread out.

The End