Coffee Plants

Is there a linear relationship between the elevation at which coffee is grown and its cup score?

So for this model we need at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Let's start with the easy one: the dichotomous term. In this data set, Species is either Arabica or Robusta. For the quadratic term, I want to weight the rate of defects in the beans heavily, since defects are an indicator of other problems, so Category One defects enter the model squared.

Finally, the dichotomous vs. quantitative interaction combines the processing method (washed or not, coded below as moisture_tech) with Category Two defects, which are a result of the processing method.
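The three required terms can also be written together with their main effects. The block below is a minimal sketch of that specification, not the model fitted in this report: the data frame `sim` and the object `sketch` are hypothetical, and the data are simulated purely so the block runs on its own, with column names that mirror the ones used here. Writing `moisture_tech * Category.Two.Defects` expands to both main effects plus their product, and pairing `Category.One.Defects` with `I(Category.One.Defects^2)` keeps the linear term alongside the quadratic one.

```r
# Simulated stand-in data (assumption: this is NOT the coffee data set)
set.seed(1)
n <- 200
sim <- data.frame(
  altitude_mean_meters = runif(n, 800, 2500),
  Species = factor(sample(c("Arabica", "Robusta"), n,
                          replace = TRUE, prob = c(0.9, 0.1))),
  Category.One.Defects = rpois(n, 1),
  Category.Two.Defects = rpois(n, 4),
  moisture_tech = rbinom(n, 1, 0.6)
)
# Made-up response so that lm() has something to fit
sim$Total.Cup.Points <- 80 + 0.002 * sim$altitude_mean_meters -
  0.1 * sim$Category.Two.Defects + rnorm(n, sd = 2)

# Quadratic term with its linear term, and an interaction with main effects
sketch <- lm(
  Total.Cup.Points ~ altitude_mean_meters + Species +
    Category.One.Defects + I(Category.One.Defects^2) +
    moisture_tech * Category.Two.Defects,
  data = sim
)
names(coef(sketch))
```

With the main effects included, the interaction coefficient reads as the difference in the Category Two slope between washed and other coffees, rather than the slope for washed coffees alone.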

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
coffee_data <- read.csv("coffee_graded_data.csv", header = TRUE)
coffee_data_filtered <- coffee_data %>%
  filter(Total.Cup.Points >=70, altitude_mean_meters >= 800, altitude_mean_meters <= 2500)
coffee_data_filtered$moisture_tech <- ifelse(coffee_data_filtered$Processing.Method == 'Washed / Wet', 1, 0)
# Build a linear model for Total Cup Points from altitude, species,
# squared Category One defects, and washed-process x Category Two defects

model <- lm(Total.Cup.Points ~ altitude_mean_meters + Species + I(Category.One.Defects^2) + I((moisture_tech)*(Category.Two.Defects)) + Species, data = coffee_data_filtered)



# Evaluate the quality of the model
summary(model)
## 
## Call:
## lm(formula = Total.Cup.Points ~ altitude_mean_meters + Species + 
##     I(Category.One.Defects^2) + I((moisture_tech) * (Category.Two.Defects)) + 
##     Species, data = coffee_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6778  -1.0587   0.1573   1.2595   7.6406 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 79.5129849  0.3327347 238.968
## altitude_mean_meters                         0.0023138  0.0002307  10.027
## SpeciesRobusta                              -0.4362125  0.5731803  -0.761
## I(Category.One.Defects^2)                   -0.0045810  0.0018750  -2.443
## I((moisture_tech) * (Category.Two.Defects)) -0.1518084  0.0146648 -10.352
##                                             Pr(>|t|)    
## (Intercept)                                   <2e-16 ***
## altitude_mean_meters                          <2e-16 ***
## SpeciesRobusta                                0.4468    
## I(Category.One.Defects^2)                     0.0147 *  
## I((moisture_tech) * (Category.Two.Defects))   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.121 on 972 degrees of freedom
## Multiple R-squared:  0.1937, Adjusted R-squared:  0.1904 
## F-statistic: 58.37 on 4 and 972 DF,  p-value: < 2.2e-16
# Perform residual analysis
par(mfrow = c(2, 2))
plot(model, which = 1:4)

Practically, we look at four things to judge whether a regression is reasonable.

Independence: this is taken as a given property of the data set.

Linearity: the relationship is assumed to be linear here; the Residuals vs Fitted panel is where any curvature would show up.

Constant variance: loosely supported by the roughly uniform spread of the residuals.

Normality of residuals: holds across a reasonable range, as the Normal Q-Q panel above shows.
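The visual checks above can be paired with rough numeric ones. The sketch below assumes a hypothetical helper, `check_residuals`, and runs it on simulated data rather than on the coffee model. It uses the Shapiro-Wilk test from base R for normality and, for constant variance, a simple auxiliary regression of squared residuals on fitted values. That second check is in the spirit of a Breusch-Pagan test but is not the exact test.

```r
# Hypothetical helper: numeric companions to the diagnostic plots
check_residuals <- function(fit) {
  res <- residuals(fit)
  fitted_vals <- fitted(fit)
  # Shapiro-Wilk normality test (base R; accepts 3 to 5000 values)
  normality_p <- shapiro.test(res)$p.value
  # Rough variance check: do squared residuals trend with fitted values?
  aux <- lm(I(res^2) ~ fitted_vals)
  variance_p <- unname(summary(aux)$coefficients["fitted_vals", "Pr(>|t|)"])
  c(normality_p = normality_p, variance_p = variance_p)
}

# Simulated example (assumption: not the coffee data)
set.seed(2)
x <- runif(300, 800, 2500)
y <- 80 + 0.002 * x + rnorm(300, sd = 2)
demo_fit <- lm(y ~ x)
check_residuals(demo_fit)
```

Small p-values would point away from normality or constant variance. With close to a thousand rows, even mild departures can come out significant, so the plots remain the main evidence.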

Cook’s distance flags a few influential observations that merit additional attention. These could stem from any number of factors, since coffee bean processing varies a great deal between batches.

Looking at the Normal Q-Q plot, the left tail deviates more than the right, indicating more very low scores than a normal distribution would predict. The residual summary agrees: the minimum is -12.68 while the maximum is only 7.64.
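To list which rows stand out rather than reading them off the plot, Cook's distance can be compared against a cutoff. The sketch below assumes a hypothetical helper, `flag_influential`, and uses the common 4 / n rule of thumb, which is a convention rather than a formal test. The data are simulated, with one extreme low score planted in row 1.

```r
# Hypothetical helper: rows whose Cook's distance exceeds 4 / n
flag_influential <- function(fit) {
  d <- cooks.distance(fit)
  which(d > 4 / length(d))
}

# Simulated example (assumption: not the coffee data)
set.seed(3)
x <- runif(100, 800, 2500)
y <- 80 + 0.002 * x + rnorm(100, sd = 2)
y[1] <- y[1] - 40          # plant one extreme low score in row 1
demo_fit <- lm(y ~ x)
flagged <- flag_influential(demo_fit)
flagged
```

Flagged rows are candidates for a closer look at the underlying records, not automatic deletions.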

Honestly, this data set is not the best fit for the exercise, and I am mostly forcing the required terms onto it. The adjusted R-squared of 0.19 means the model explains only about 19% of the variation in cup score. A better candidate would have more variables, preferably with more variance in the numeric ones.

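Delving into the coefficients: the intercept of 79.51 is the predicted score for an Arabica coffee at 0 meters with no defects, which is an extrapolation since the data were filtered to 800 to 2500 m. Each additional meter of altitude adds about 0.0023 points, or roughly 2.3 points per 1,000 m (p < 2e-16). Robusta scores about 0.44 points lower than Arabica, but with a standard error of 0.57 and p = 0.45 this difference is not statistically significant. The squared Category One defects term is -0.0046 (p = 0.015), so the penalty grows with the square of the defect count. The moisture_tech by Category Two defects term is -0.152 (p < 2e-16): each additional Category Two defect costs a washed coffee about 0.15 points. Because this product enters without separate main effects, the model assumes Category Two defects have no effect on coffees processed any other way. The defect coefficients look small only because they are per-defect effects; their low standard errors are what make them statistically distinguishable from zero.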
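The arithmetic behind these interpretations can be reproduced from the printed table. The sketch below copies the estimates and standard errors from the summary() output above; the short names (`altitude`, `robusta`, `cat_one_sq`, `washed_x_cat_two`) are labels introduced here for convenience. The intervals are the usual estimate plus or minus t times the standard error, so they are approximate only to the rounding of the printed values.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(altitude = 0.0023138, robusta = -0.4362125,
         cat_one_sq = -0.0045810, washed_x_cat_two = -0.1518084)
se  <- c(altitude = 0.0002307, robusta = 0.5731803,
         cat_one_sq = 0.0018750, washed_x_cat_two = 0.0146648)
df_resid <- 972  # residual degrees of freedom from the same output

# Approximate 95% confidence intervals: estimate +/- t * SE
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# Effect of 1000 m of altitude, in cup points
per_1000_m <- unname(est["altitude"] * 1000)

# Penalty implied by the squared term for 1, 5 and 10 Category One defects
cat_one_penalty <- unname(est["cat_one_sq"]) * c(1, 5, 10)^2

ci
per_1000_m
cat_one_penalty
```

The Robusta interval spans zero, which matches its large p-value, while the altitude and washed by Category Two intervals stay on one side of zero.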