In this project, I performed univariate and bivariate analysis on different variables from coffee ratings dataset which I imported from tidy tuesday.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(skimr)
library(RColorBrewer)
The data set contains different variables, all related to different characteristics of coffee.
coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## total_cup_points = col_double(),
## number_of_bags = col_double(),
## aroma = col_double(),
## flavor = col_double(),
## aftertaste = col_double(),
## acidity = col_double(),
## body = col_double(),
## balance = col_double(),
## uniformity = col_double(),
## clean_cup = col_double(),
## sweetness = col_double(),
## cupper_points = col_double(),
## moisture = col_double(),
## category_one_defects = col_double(),
## quakers = col_double(),
## category_two_defects = col_double(),
## altitude_low_meters = col_double(),
## altitude_high_meters = col_double(),
## altitude_mean_meters = col_double()
## )
## i Use `spec()` for the full column specifications.
coffee<- coffee_ratings %>% select(total_cup_points,
species,
country_of_origin,
number_of_bags,
variety,
processing_method,
flavor,
uniformity,
category_one_defects
) %>%view()
First variable I am using for univariate analysis is country_of_origin which is the country where the coffee is from.
Using the glimpse function
glimpse(coffee$country_of_origin)
## chr [1:1339] "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" "Ethiopia" ...
We can see that country is a categorical variable with 1339 values
Mexico is the country where most of the coffee comes from.
ggplot(coffee)+
geom_bar(aes(country_of_origin))+
labs(title="Count of Countries",x="Count",y="Country")+
coord_flip()+
theme_minimal()
I have shown the distribution of countries in this variable by proportion which also confirms that Mexico indeed has the biggest proportion out of all countries, followed by Colombia and Guatemala.
sort(round(proportions(table(coffee$country_of_origin)),2),decreasing=TRUE)
##
## Mexico Colombia
## 0.18 0.14
## Guatemala Brazil
## 0.14 0.10
## Taiwan United States (Hawaii)
## 0.06 0.05
## Costa Rica Honduras
## 0.04 0.04
## Ethiopia Tanzania, United Republic Of
## 0.03 0.03
## Uganda El Salvador
## 0.03 0.02
## Kenya Nicaragua
## 0.02 0.02
## Thailand China
## 0.02 0.01
## India Indonesia
## 0.01 0.01
## Malawi Myanmar
## 0.01 0.01
## Peru United States
## 0.01 0.01
## Vietnam Burundi
## 0.01 0.00
## Cote d?Ivoire Ecuador
## 0.00 0.00
## Haiti Japan
## 0.00 0.00
## Laos Mauritius
## 0.00 0.00
## Panama Papua New Guinea
## 0.00 0.00
## Philippines Rwanda
## 0.00 0.00
## United States (Puerto Rico) Zambia
## 0.00 0.00
No unusual behavior has been detected in this variable and it works as expected.
We can use the skimr package to check for empty values and we can see that are 36 unique countries and only one NA value.
skim(coffee$country_of_origin)
Name | coffee$country_of_origin |
Number of rows | 1339 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 1 | 1 | 4 | 28 | 0 | 36 | 0 |
The score given for flavor account for the intensity, quality and complexity of its combined taste and aroma, experienced when the coffee is slurped into the mouth vigorously so as to involve the entire palate in the evaluation.
Using the glimpse function
glimpse(coffee$country_of_origin)
## chr [1:1339] "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" "Ethiopia" ...
We can see that country is a numerical variable with 1339 values.
A quick summary of the statistics using skimr package
skim(coffee$flavor)
Name | coffee$flavor |
Number of rows | 1339 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 0 | 1 | 7.52 | 0.4 | 0 | 7.33 | 7.58 | 7.75 | 8.83 | ▁▁▁▁▇ |
There are no missing values for this variable and we can plot a histogram to check the distribution of values.
ggplot(coffee)+
geom_histogram(aes(flavor),color='red',fill='skyblue')+
labs(title="Distribution of flavor")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution is highly right-skewed. By increasing the number of bins, we can see that distribution is also unimodal. There is only one peak and there are also some outliers, some values that have 0 value.
ggplot(coffee)+
geom_histogram(aes(flavor),bins=100,color='red',fill='skyblue')+
labs(title="Distribution of flavor with 100 bins")+
theme_minimal()
We can filter out the rows that have 0 values
coffee %>% filter(flavor==0)
## # A tibble: 1 x 9
## total_cup_points species country_of_origin number_of_bags variety
## <dbl> <chr> <chr> <dbl> <chr>
## 1 0 Arabica Honduras 275 Caturra
## # ... with 4 more variables: processing_method <chr>, flavor <dbl>,
## # uniformity <dbl>, category_one_defects <dbl>
This row has 0 value for flavor and some other variables are also zero so we can assume it an entry error.We can remove this row by using
coffee<-coffee[coffee$flavor!=0,]
Now again plotting the histogram after removing the outliers and increasing the number of bins to 35, we have an approximate symmetric distribution
ggplot(coffee)+
geom_histogram(aes(flavor),bins=35,color='red',fill='skyblue')+
labs(title="Distribution of flavor")+
theme_minimal()
Since we already have a distribution that is pretty close to normal distribution, we don’t need to apply a transformation function such as log etc.
Since flavor variable is not extremely skewed, mean is a very good measure of central tendency
mean(coffee$flavor)
## [1] 7.526046
p+geom_vline(aes(xintercept=mean(coffee$flavor)),
color="blue", linetype="dashed", size=1)
## Warning: Use of `coffee$flavor` is discouraged. Use `flavor` instead.
sd(coffee$flavor)
## [1] 0.3413824
We can see that most of our data values can fit within 3 standard deviations of the mean.
Creating a scatter plot between uniformity and the total cup points.
total_cup_points variable contains a a rating out of 100 for the coffee beans.
ggplot(coffee,aes(uniformity,total_cup_points))+
geom_point(shape=21)+
labs(title="Correlation between uniformity & cup points",y="cup points")+
theme_minimal()
We can see that as uniformity increases, total cup points also increase.
Adding a best fit line to the scatter plot, it appears that uniformity has a linear positive relationship with total cup points, meaning the higher the uniformity of flavor, the higher the coffee is rated out of a 100.
ggplot(coffee,aes(uniformity,total_cup_points))+
geom_point(,shape=21,alpha=1/3)+
geom_smooth(method='lm', se=FALSE)+
labs(title="Correlation between uniformity & cup points",y="cup points")+
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
We can calculate the correlation coefficient to know the strength of this relationship
cor(coffee$uniformity,coffee$total_cup_points)
## [1] 0.5147862
It is a fairly strong relationship with 0.5 correlation coefficient.
Making a bar plot for processing_method, we can see there are ‘NA’ values,
ggplot(coffee)+
geom_bar(aes(processing_method))+
labs(x='Method')+
coord_flip()
We can remove them by
coffee2<-coffee %>% filter(processing_method!='NA')
We can make a density graph to show the relationship between the number of bags tested and the processing method used for them. We have five processing methods.
ggplot(coffee2,aes( x = number_of_bags,fill=processing_method) ) +
geom_density(alpha = 0.4)+
labs(title='Bags distribution by Processing Method',x="No. of bags",fill="Method")+
scale_fill_brewer(palette="Paired")+
theme_light()
The density plot tells us that the ‘Other’,‘Pulped natural/honey’ and ‘Natural/Dry’ Methods have the most wide range of number of bags while the most frequent methods used were ‘Washed/Wet’ and ‘Semi-washed/Semi-pulped’ as it is evident by the two peaks.
We can further observe this relationship by calculating mean number of bags per processing method, which gives us
coffee2%>% group_by(processing_method)%>%
summarize(avgbags=round(mean(number_of_bags),2))
## # A tibble: 5 x 2
## processing_method avgbags
## <chr> <dbl>
## 1 Natural / Dry 157.
## 2 Other 129.
## 3 Pulped natural / honey 167.
## 4 Semi-washed / Semi-pulped 94.7
## 5 Washed / Wet 157.
So expanding our observation, on average pulped natural/honey method was used the most and in the density chart we can see that while this method doesn’t have apparent peaks, it has mini peaks that are more spread out meaning it was used more frequently and for more number of bags, hence the higher average.