| total_cup_points | country_of_origin | region | species | grading_date | aroma | flavor | aftertaste | acidity | body | balance | uniformity | clean_cup | sweetness | cupper_points | moisture | expiration | altitude_low_meters | altitude_high_meters | altitude_mean_meters | number_of_bags | bag_weight | owner_1 | variety | processing_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90.58 | Ethiopia | guji-hambela | Arabica | April 4th, 2015 | 8.67 | 8.83 | 8.67 | 8.75 | 8.50 | 8.42 | 10 | 10 | 10 | 8.75 | 0.12 | April 3rd, 2016 | 1950 | 2200 | 2075 | 300 | 60 kg | metad plc | NA | Washed / Wet |
| 89.92 | Ethiopia | guji-hambela | Arabica | April 4th, 2015 | 8.75 | 8.67 | 8.50 | 8.58 | 8.42 | 8.42 | 10 | 10 | 10 | 8.58 | 0.12 | April 3rd, 2016 | 1950 | 2200 | 2075 | 300 | 60 kg | metad plc | Other | Washed / Wet |
| 89.75 | Guatemala | NA | Arabica | May 31st, 2010 | 8.42 | 8.50 | 8.42 | 8.42 | 8.33 | 8.42 | 10 | 10 | 10 | 9.25 | 0.00 | May 31st, 2011 | 1600 | 1800 | 1700 | 5 | 1 | Grounds for Health Admin | Bourbon | NA |
| 89.00 | Ethiopia | oromia | Arabica | March 26th, 2015 | 8.17 | 8.58 | 8.42 | 8.42 | 8.50 | 8.25 | 10 | 10 | 10 | 8.67 | 0.11 | March 25th, 2016 | 1800 | 2200 | 2000 | 320 | 60 kg | Yidnekachew Dabessa | NA | Natural / Dry |
| 88.83 | Ethiopia | guji-hambela | Arabica | April 4th, 2015 | 8.25 | 8.50 | 8.25 | 8.50 | 8.42 | 8.33 | 10 | 10 | 10 | 8.58 | 0.12 | April 3rd, 2016 | 1950 | 2200 | 2075 | 300 | 60 kg | metad plc | Other | Washed / Wet |
| 88.83 | Brazil | NA | Arabica | September 3rd, 2013 | 8.58 | 8.42 | 8.42 | 8.50 | 8.25 | 8.33 | 10 | 10 | 10 | 8.33 | 0.11 | September 3rd, 2014 | NA | NA | NA | 100 | 30 kg | Ji-Ae Ahn | NA | Natural / Dry |
DataDiscoveryCoffee
COFFEE RANKINGS
We want to start out by looking at the data.
There are many columns that would be interesting to test, but we’ll start with the relationship between country of origin and the overall score. The data set covers many nations, so I picked the 10 that have at least 30 different coffees each (the ANOVA below has 9 d.f. for country, which corresponds to 10 groups).
ONE WAY ANOVA
- \(H_0\): The mean cup score is the same for each country. \(H_A\): At least one country has a different mean.
- Each country has at least 30 coffees, and we are comparing a numeric variable (cup score) across the levels of a categorical variable (country), so we can use a one way ANOVA for this data.
Df Sum Sq Mean Sq F value Pr(>F)
country_of_origin 9 1231 136.78 22.66 <2e-16 ***
Residuals 1053 6356 6.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The test statistic is 22.66 and comes from an F distribution with 9 and 1053 d.f. when \(H_0\) is true.
- The p-value is less than \(2 \times 10^{-16}\), well below the 0.05 threshold for significance.
- We reject \(H_0\) and conclude that the mean cup score differs for at least one country in the group.
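To make the F statistic above less of a black box, here is a minimal sketch of how a one way ANOVA statistic is built from between-group and within-group sums of squares. The output above appears to be an R ANOVA summary; the original analysis code is not shown, so this is standard-library Python written only for illustration. The function name `one_way_anova` and the `toy` scores are hypothetical. The last line recomputes F from the rounded sums of squares printed in the table.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict mapping group -> list of scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    # Between-group sum of squares: how far each group mean sits from the grand mean,
    # weighted by the group size.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
    # Within-group (residual) sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy scores, NOT taken from the coffee data.
toy = {
    "A": [82.0, 83.0, 84.0],
    "B": [85.0, 86.0, 87.0],
    "C": [80.0, 81.0, 82.0],
}
f_toy, df1, df2 = one_way_anova(toy)

# F recomputed from the rounded "Sum Sq" and "Df" columns of the country table above.
f_table = (1231 / 9) / (6356 / 1053)
```

Dividing each sum of squares by its degrees of freedom gives the "Mean Sq" column, and the ratio of the two mean squares gives the "F value" column, so `f_table` should land close to the reported 22.66.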
TWO WAY ANOVA
- \(H_0\): The mean cup score is the same for each country and for each uniformity bucket. \(H_A\): At least one country or uniformity bucket has a different mean cup score.
- \(H_{0,2}\): There is no interaction between country and uniformity score. \(H_{A,2}\): There is an interaction between country and uniformity score.
- Uniformity score is a numeric variable, but grouping it into buckets makes it categorical, so we can use a two way ANOVA.
Df Sum Sq Mean Sq F value Pr(>F)
uniformity_bucket 3 2180 726.6 171.74 <2e-16 ***
country_of_origin 9 965 107.2 25.35 <2e-16 ***
Residuals 1050 4442 4.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The test statistics are 171.74 for uniformity bucket and 25.35 for country of origin. When \(H_0\) is true, they come from F distributions with 3 and 1050 d.f. and with 9 and 1050 d.f., respectively.
- Both p-values are less than \(2 \times 10^{-16}\), well below the 0.05 threshold for significance.
- We conclude that the mean cup score differs for at least one country and for at least one uniformity bucket.
Check For Interaction
As we can see, the lines are not parallel, which suggests some type of interaction between the country of origin and the uniformity score.
Df Sum Sq Mean Sq F value Pr(>F)
uniformity_bucket 3 2180 726.6 181.930 < 2e-16 ***
country_of_origin 9 965 107.2 26.851 < 2e-16 ***
uniformity_bucket:country_of_origin 20 329 16.4 4.114 4.16e-09 ***
Residuals 1030 4113 4.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The test statistic for interaction is 4.114 and comes from an F distribution with 20 and 1030 d.f. when \(H_0\) is true.
- The p-value for interaction is \(4.16 \times 10^{-9}\), below the 0.05 threshold for significance.
- We conclude that there is an interaction between country of origin and uniformity score, which supports what we saw in the graph.
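The idea behind reading an interaction off a plot can be sketched numerically: if there were no interaction, the gap in mean score between two countries would be about the same in every uniformity bucket. The helper names `cell_means` and `country_gap_by_bucket`, the bucket labels, the country names, and the scores below are all hypothetical and chosen only to illustrate the check; this is not the original analysis code.

```python
from collections import defaultdict
from statistics import mean


def cell_means(rows):
    """rows: iterable of (bucket, country, score). Returns {(bucket, country): mean score}."""
    cells = defaultdict(list)
    for bucket, country, score in rows:
        cells[(bucket, country)].append(score)
    return {key: mean(values) for key, values in cells.items()}


def country_gap_by_bucket(means, country_a, country_b):
    """Mean score of country_a minus country_b within each bucket that contains both."""
    buckets = sorted({bucket for bucket, _ in means})
    return {
        bucket: means[(bucket, country_a)] - means[(bucket, country_b)]
        for bucket in buckets
        if (bucket, country_a) in means and (bucket, country_b) in means
    }


# Hypothetical rows, NOT taken from the coffee data.
rows = [
    ("low", "X", 80.0), ("low", "X", 82.0),
    ("low", "Y", 78.0), ("low", "Y", 80.0),
    ("high", "X", 84.0), ("high", "X", 86.0),
    ("high", "Y", 85.0), ("high", "Y", 87.0),
]
means = cell_means(rows)
gaps = country_gap_by_bucket(means, "X", "Y")
```

In this toy data country X leads in the "low" bucket and trails in the "high" bucket, so the gaps differ by bucket and the two lines in an interaction plot would cross. Bucket and country combinations with no observations are simply skipped.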
One Way ANOVA Test for Processing Method
- \(H_0\): The mean cup score is the same for each type of processing method. \(H_A\): At least one method has a different mean.
- We have at least 30 observations for each processing method, so we can use the ANOVA test.
Df Sum Sq Mean Sq F value Pr(>F)
processing_method 4 58 14.468 1.978 0.0956 .
Residuals 1164 8514 7.314
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The test statistic is 1.978 and comes from an F distribution with 4 and 1164 d.f. when \(H_0\) is true.
- The p-value is 0.0956, which is above the 0.05 threshold for significance.
- We cannot conclude that there is a difference between the mean cup score for the processing methods.
ANOVA For Altitude
- \(H_0\): The mean cup score is the same for every altitude range. \(H_A\): At least one altitude range has a different mean.
| Altitude range | Number of coffees |
|---|---|
| 0 to 500 m | 62 |
| 500 to 1000 m | 163 |
| 1000 to 1500 m | 525 |
| 1500 to 2000 m | 326 |
| 2000 m+ | 32 |
- There are at least 30 coffees in each category, so we can use ANOVA.
Df Sum Sq Mean Sq F value Pr(>F)
altitude_bucket 4 482 120.39 18.57 8.74e-15 ***
Residuals 1103 7149 6.48
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The test statistic is 18.57 and comes from an F distribution with 4 and 1103 d.f. when \(H_0\) is true.
- The p-value is \(8.74 \times 10^{-15}\), below the 0.05 threshold for significance.
- We conclude that the mean cup score differs for at least one altitude range.
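The altitude ranges in the frequency table can be produced with a small bucketing helper. The sketch below makes two assumptions that the text does not state: that the bucketed column is `altitude_mean_meters`, and that each range includes its lower edge and excludes its upper edge. The function name `altitude_bucket` is hypothetical. The sample values are the `altitude_mean_meters` entries of the six rows shown in the data preview, with `None` standing in for `NA`.

```python
from bisect import bisect_right
from collections import Counter

# Range labels as they appear in the frequency table.
LABELS = ["0 to 500 m", "500 to 1000 m", "1000 to 1500 m", "1500 to 2000 m", "2000 m+"]
# Assumed cut points: a value equal to an edge falls into the higher range.
EDGES = [500, 1000, 1500, 2000]


def altitude_bucket(mean_meters):
    """Map an altitude in meters to its range label, or None when the altitude is missing."""
    if mean_meters is None:
        return None
    return LABELS[bisect_right(EDGES, mean_meters)]


# altitude_mean_meters for the six preview rows (the Brazil row is NA).
sample = [2075, 2075, 1700, 2000, 2075, None]
counts = Counter(b for b in map(altitude_bucket, sample) if b is not None)
```

Rows with a missing altitude are dropped before counting, which is one plausible reason the altitude ANOVA uses fewer observations than the processing method ANOVA.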
What if we wanted to know which altitudes made the best coffee?