Introduction

The dataset coffee-origin.txt contains data obtained from coffee samples by liquid chromatography–mass spectrometry (LC-MS), an analytical chemistry technique. The aim of the study is to investigate the chemical differences between single-origin coffees from Brazil, Colombia, Ethiopia, Java and Kenya, and to associate these with the levels of six different tastes (citrus, berry, floral, nuts, chocolate and spice) assigned by expert coffee tasters on a scale of 1 to 5. Information on each observation, together with row names in column 1, is given in the file origin-info.txt. In each case there are 3 repeats of the analysis on the same sample (labelled A, B and C in column 4) for 3 different samples (labelled 1, 2 or 3 in column 3) of 5 single-origin coffees (labelled B, C, E, J and K in column 2), giving 45 observations in total. The levels for each taste are given in columns 5-10 of this file.

PCA Analysis

Responding to the first question involved loading both data sets. Table 1 shows the first six rows of the origin-info data set.

Table 1: Preview of Origin-Info data
name origin replicate sample citrus berry floral nuts chocolate spice
B1_1 Brazil 1 A 1 2 1 4 4 1
B1_2 Brazil 1 B 1 2 1 4 4 1
B1_3 Brazil 1 C 1 2 1 4 4 1
B2_1 Brazil 2 A 1 2 1 4 4 1
B2_2 Brazil 2 B 1 2 1 4 4 1
B2_3 Brazil 2 C 1 2 1 4 4 1
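
The loading step, in outline (the full commands are in the appendix; the paths assume the files sit in a "Data Sets" directory):

data <- read.table("Data Sets/coffee-origin.txt")
info <- read.table("Data Sets/origin-info.txt", header = TRUE)
head(info)   # the preview shown in Table 1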

Exploration of the info data set shows that rows 1:9 represent Brazil, rows 10:18 Colombia, rows 19:27 Ethiopia, rows 28:36 Java and rows 37:45 Kenya. This information is used for colouring the scatter plot of the first two principal components. Figure 1 shows the scatter plot of the first and second principal components, coloured by origin.
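
Rather than hard-coding the row ranges, the colours can also be derived from the origin column itself; a minimal sketch, assuming pca_result = prcomp(data) as in the appendix:

grp <- factor(info$origin)
plot(pca_result$x[, 1], pca_result$x[, 2],
     xlab = "PC1", ylab = "PC2", col = as.integer(grp))
legend("topleft", legend = levels(grp),
       col = seq_along(levels(grp)), pch = 1)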

Figure 1: Plot of PC1 and PC2, coloured by origin

PLSR Analysis

Prior to performing the partial least squares regression (PLSR), the row names were added to the data, and the response variables were combined with the independent variables as matrices in a data frame. The PLSR was fitted on the scaled data with the default kernel algorithm, using up to ten components and cross-validation. Figure 2 shows the plot of the root mean squared error of prediction (RMSEP) for this initial model.
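
In outline (full commands in the appendix):

library(pls)
set.seed(100)   # make the cross-validation reproducible
coffee.df <- data.frame(X = I(as.matrix(data)), Y = I(as.matrix(info[, 5:10])))
model1 <- plsr(Y ~ ., ncomp = 10, data = coffee.df, scale = TRUE, validation = "CV")
plot(RMSEP(model1), legendpos = "topright")   # Figure 2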

Figure 2: Plot of RMSEP for the first PLS model
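
The cross-validated errors behind Figure 2 can also be read off numerically; a short sketch using the RMSEP object returned by the pls package:

rmsep <- RMSEP(model1)
# val is an estimate x response x (components + 1) array; the "CV" row holds
# the cross-validated errors, with column 1 the intercept-only model
cv <- rmsep$val["CV", , ]
apply(cv, 1, which.min) - 1   # best number of components for each response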

From Figure 2, the optimal number of components is four, so the PLSR was refitted using four components. Figure 3 shows the bi-plot of the X scores and Y scores.

Figure 3: Bi-plot of X and Y scores

From Figure 3, the correlation between the X scores and Y scores appears similar across all origins. Figure 4 shows the bi-plot of the Y scores and Y loadings.

Figure 4: Bi-plot of Y scores and loadings

From Figure 4, Ethiopian coffee is associated with floral and berry tastes, Kenyan with citrus, Brazilian with nuts and chocolate, and Javan with spice, while Colombian coffee does not associate strongly with any of the tastes. The summary of the refitted four-component model is shown below.

## Data:    X dimension: 45 18345 
##  Y dimension: 45 6
## Fit method: kernelpls
## Number of components considered: 4
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## 
## Response: citrus 
##        (Intercept)  1 comps  2 comps  3 comps  4 comps
## CV           1.503    1.060    1.128   0.2362   0.2315
## adjCV        1.503    1.056    1.159   0.2266   0.2228
## 
## Response: berry 
##        (Intercept)  1 comps  2 comps  3 comps  4 comps
## CV           1.043   0.7934   0.8108   0.2949   0.2228
## adjCV        1.043   0.7910   0.8284   0.2907   0.2152
## 
## Response: floral 
##        (Intercept)  1 comps  2 comps  3 comps  4 comps
## CV           1.193   0.9021   0.8047   0.5119   0.2483
## adjCV        1.193   0.8997   0.8024   0.5079   0.2373
## 
## Response: nuts 
##        (Intercept)  1 comps  2 comps  3 comps  4 comps
## CV           1.193   0.7833   0.5329   0.3469   0.1336
## adjCV        1.193   0.7807   0.4978   0.3411   0.1287
## 
## Response: chocolate 
##        (Intercept)  1 comps  2 comps  3 comps  4 comps
## CV           1.227   0.8944   0.5042   0.1398   0.1223
## adjCV        1.227   0.8917   0.4501   0.1332   0.1143
## 
## Response: spice 
##        (Intercept)  1 comps  2 comps  3 comps  4 comps
## CV          0.8182   0.6375   0.2514  0.08283  0.08171
## adjCV       0.8182   0.6361   0.1740  0.07871  0.07851
## 
## TRAINING: % variance explained
##            1 comps  2 comps  3 comps  4 comps
## X            36.37    47.99    56.29    61.10
## citrus       56.58    56.66    99.13    99.18
## berry        47.33    51.97    95.41    98.50
## floral       46.88    64.62    86.26    99.05
## nuts         61.13    89.67    93.20    99.42
## chocolate    51.63    93.07    99.36    99.66
## spice        41.06    99.45    99.51    99.52

Unlike the PLSR validation output above, a PCA summary contains no results for an intercept. The underlying difference is that PCA decomposes the predictor matrix X alone, whereas PLSR chooses its components to predict the responses as well, combining the dimension reduction with a regression of Y on the scores. The next step weighted the squared loadings of each of the four components by its cross-validated RMSEP and summed them, giving a combined factor loading for each variable; the 40 variables with the largest combined loadings were retained.
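
Concretely, writing p_jk for the loading of variable j on component k, each variable receives the combined score top_j = w1*p_j1^2 + w2*p_j2^2 + w3*p_j3^2 + w4*p_j4^2, where the weights w_k are the cross-validated RMSEP values from the summary above, averaged over the six responses. A sketch mirroring the appendix:

w <- c(0.845, 0.672, 0.269, 0.173)   # mean CV RMSEP of components 1-4
top <- as.matrix(model$loadings[, 1:4])^2 %*% w
topvarnums <- order(top, decreasing = TRUE)[1:40]   # 40 largest combined loadings
datasmall <- data[, topvarnums]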

Multiple linear regression using the PLSR results

The first step in the linear regression involved combining the reduced data set (datasmall) with the response variables. The sketch below shows the pattern for one response; the other tastes are handled identically.
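
A minimal sketch for the chocolate response (the other five tastes follow the same pattern, dropping the corresponding taste columns):

datasmall <- cbind(datasmall, info[, 5:10])   # append the six taste scores
fit <- lm(chocolate ~ ., data = datasmall[, -c(41:44, 46)])   # 40 predictors + chocolate
fit <- step(fit, direction = "backward", trace = 0)   # backward selection by AIC
summary(fit)$coefficients   # the estimates reported in Table 2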

Chocolate response variable

Table 2 shows the regression estimates using chocolate as the response variable. Backward selection was performed using the step function.

Table 2: Regression estimates for chocolate response
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.417 1.083 5.923 0.0001
V3979 -0.004 0.001 -3.114 0.0089
V8683 0.000 0.000 -3.878 0.0022
V3091 -0.003 0.001 -1.735 0.1083
V3925 0.001 0.001 1.740 0.1074
V787 -0.001 0.000 -2.916 0.0129
V16177 -0.001 0.000 -3.107 0.0091
V871 0.000 0.000 3.234 0.0072
V4567 0.001 0.001 2.312 0.0393
V1063 0.004 0.001 3.643 0.0034
V1783 -0.001 0.000 -4.646 0.0006
V25 0.001 0.000 1.879 0.0847
V3433 -0.002 0.001 -3.046 0.0102
V7693 0.003 0.001 4.555 0.0007
V12535 0.000 0.000 -1.804 0.0964
V14425 0.002 0.001 1.264 0.2301
V7699 -0.003 0.001 -5.876 0.0001
V3811 -0.002 0.000 -4.057 0.0016
V8335 0.000 0.000 1.426 0.1793
V12553 0.000 0.000 -0.920 0.3755
V14419 0.000 0.000 -1.887 0.0836
V3913 -0.001 0.000 -1.494 0.1609
V2749 0.002 0.001 2.137 0.0538
V8857 0.007 0.002 4.517 0.0007
V8869 0.003 0.001 3.575 0.0038
V8371 -0.007 0.001 -4.704 0.0005
V8767 0.001 0.000 3.026 0.0105
V1327 -0.001 0.000 -1.349 0.2022
V4561 0.001 0.000 9.036 0.0000
V2005 0.002 0.001 2.725 0.0185
V1591 0.005 0.002 2.721 0.0186
V1999 -0.002 0.000 -3.777 0.0026
V8557 -0.002 0.001 -3.064 0.0098

Figure 5 shows the predicted chocolate scores from the backward-selected model built on the forty PLSR-selected variables.

Figure 5: Plot of the predicted chocolate scores

Nuts response variable

Table 3 shows the regression estimates using nuts as the response variable. Backward selection was performed using the step function.

Table 3: Regression estimates for nuts response
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.687 1.420 6.118 0.0000
V3979 -0.005 0.001 -4.110 0.0011
V8683 0.000 0.000 -4.494 0.0005
V787 -0.001 0.000 -2.945 0.0107
V16177 -0.001 0.000 -2.602 0.0209
V8341 0.001 0.001 2.111 0.0533
V8917 -0.002 0.001 -2.000 0.0653
V871 0.000 0.000 3.271 0.0056
V8329 0.003 0.001 2.571 0.0222
V4567 0.002 0.001 3.260 0.0057
V1063 0.004 0.001 3.613 0.0028
V25 0.001 0.000 1.945 0.0721
V3433 -0.003 0.001 -3.193 0.0065
V7693 0.002 0.001 2.743 0.0159
V14425 0.003 0.001 2.338 0.0348
V7699 -0.004 0.001 -5.279 0.0001
V9643 -0.001 0.001 -1.028 0.3214
V3811 -0.001 0.000 -3.387 0.0044
V7825 -0.001 0.001 -1.167 0.2627
V14419 -0.001 0.000 -4.654 0.0004
V3913 -0.002 0.001 -2.924 0.0111
V2749 0.002 0.001 2.612 0.0205
V8857 0.003 0.001 2.795 0.0143
V8869 0.003 0.001 4.625 0.0004
V8371 -0.009 0.002 -4.973 0.0002
V9049 0.000 0.000 1.348 0.1991
V1327 -0.002 0.001 -3.597 0.0029
V4561 0.001 0.000 4.704 0.0003
V1591 0.003 0.001 2.380 0.0321
V1999 -0.001 0.000 -2.535 0.0238
V8557 -0.001 0.000 -1.245 0.2335

Figure 6 shows the predicted nuts scores from the corresponding backward-selected model.

Figure 6: Plot of the predicted nuts scores

Citrus response variable

Table 4 shows the regression estimates using citrus as the response variable. Backward selection was performed using the step function.

Table 4: Regression estimates for citrus response
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.362 1.716 -0.211 0.8369
V3979 0.007 0.002 2.987 0.0124
V8683 0.000 0.000 2.540 0.0275
V3925 -0.002 0.001 -2.062 0.0637
V787 0.002 0.000 3.668 0.0037
V16177 0.001 0.001 1.938 0.0788
V8917 0.001 0.001 0.979 0.3485
V871 -0.001 0.000 -3.024 0.0116
V4567 -0.002 0.001 -2.497 0.0297
V1063 -0.006 0.002 -3.101 0.0101
V1783 0.002 0.000 4.350 0.0012
V25 -0.001 0.001 -1.743 0.1092
V3433 0.001 0.001 1.093 0.2976
V7693 -0.005 0.001 -4.524 0.0009
V12535 0.001 0.000 2.690 0.0210
V14425 -0.003 0.002 -1.812 0.0974
V7699 0.005 0.001 5.188 0.0003
V3811 0.003 0.001 4.322 0.0012
V8335 -0.001 0.000 -2.221 0.0483
V12553 0.000 0.000 1.352 0.2035
V14419 0.000 0.000 -0.908 0.3836
V3913 0.001 0.001 1.225 0.2461
V2749 -0.003 0.001 -2.311 0.0412
V8857 -0.011 0.003 -3.689 0.0036
V2395 -0.001 0.001 -0.850 0.4137
V8869 -0.005 0.001 -3.517 0.0048
V8371 0.011 0.002 4.353 0.0012
V8767 -0.002 0.001 -2.668 0.0219
V1327 0.001 0.001 1.635 0.1303
V4561 -0.002 0.000 -8.262 0.0000
V2005 -0.002 0.001 -2.010 0.0696
V1591 -0.008 0.003 -2.454 0.0320
V1999 0.002 0.001 3.344 0.0066
V8557 0.003 0.001 2.743 0.0191

Figure 7 shows the predicted citrus scores from the corresponding backward-selected model.

Figure 7: Plot of the predicted citrus scores

Floral response variable

Table 5 shows the regression estimates using floral as the response variable. Backward selection was performed using the step function.

Table 5: Regression estimates for floral response
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.294 1.323 -5.512 0.0000
V3979 0.002 0.001 2.670 0.0151
V8683 0.000 0.000 1.179 0.2528
V3091 0.004 0.002 1.875 0.0763
V3925 0.001 0.000 2.320 0.0316
V787 0.001 0.000 2.414 0.0260
V16177 0.001 0.000 4.965 0.0001
V8341 -0.001 0.001 -1.098 0.2859
V8329 -0.002 0.001 -3.054 0.0065
V4567 0.000 0.000 -0.956 0.3512
V1063 -0.001 0.001 -1.445 0.1646
V1783 -0.001 0.000 -5.125 0.0001
V3433 0.002 0.001 3.574 0.0020
V12535 -0.001 0.000 -3.366 0.0032
V14425 -0.002 0.001 -1.611 0.1237
V7699 0.002 0.000 5.398 0.0000
V9643 0.002 0.001 2.234 0.0377
V8335 0.000 0.000 1.443 0.1653
V12553 0.000 0.000 -1.172 0.2558
V7825 0.002 0.001 2.358 0.0292
V14419 0.002 0.000 6.390 0.0000
V3913 0.001 0.000 2.344 0.0301
V8869 -0.001 0.001 -1.888 0.0743
V8371 0.003 0.001 3.161 0.0051
V8767 0.001 0.001 2.794 0.0116
V1327 0.001 0.001 1.545 0.1388

Figure 8 shows the predicted floral scores from the corresponding backward-selected model.

Figure 8: Plot of the predicted floral scores

Berry response variable

Table 6 shows the regression estimates using berry as the response variable. Backward selection was performed using the step function.

Table 6: Regression estimates for berry response
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.115 1.237 -5.754 0.0000
V3979 0.004 0.001 4.829 0.0001
V3091 0.004 0.002 2.356 0.0300
V787 0.001 0.000 2.127 0.0475
V16177 0.002 0.000 5.917 0.0000
V871 0.000 0.000 -2.604 0.0179
V8329 -0.001 0.001 -1.035 0.3143
V4567 -0.001 0.000 -2.833 0.0110
V1063 -0.002 0.001 -2.835 0.0110
V1783 -0.001 0.000 -2.514 0.0217
V3433 0.001 0.001 2.627 0.0171
V12535 0.000 0.000 -2.251 0.0372
V14425 -0.002 0.001 -1.314 0.2054
V7699 0.002 0.000 4.451 0.0003
V9643 0.003 0.001 3.373 0.0034
V3811 0.001 0.000 2.273 0.0355
V7825 0.002 0.001 2.177 0.0431
V14419 0.001 0.000 7.846 0.0000
V3913 0.001 0.000 4.329 0.0004
V2749 -0.001 0.001 -2.424 0.0261
V8857 -0.003 0.001 -3.073 0.0066
V8869 -0.001 0.001 -1.398 0.1790
V8371 0.003 0.001 3.896 0.0011
V2005 -0.001 0.000 -2.661 0.0159
V1591 -0.003 0.001 -2.303 0.0334
V1999 0.001 0.000 3.647 0.0018
V8557 0.001 0.000 4.713 0.0002

Figure 9 shows the predicted berry scores from the corresponding backward-selected model.

Figure 9: Plot of the predicted berry scores

Spice response variable

Table 7 shows the regression estimates using spice as the response variable. Backward selection was performed using the step function.

Table 7: Regression estimates for spice response
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.227 0.449 13.867 0.0000
V3979 -0.002 0.000 -5.839 0.0001
V3091 -0.001 0.001 -1.852 0.0869
V3925 0.000 0.000 2.052 0.0609
V787 -0.001 0.000 -5.084 0.0002
V16177 -0.001 0.000 -10.493 0.0000
V7453 0.000 0.000 2.348 0.0353
V871 0.000 0.000 4.937 0.0003
V8329 0.001 0.000 1.536 0.1486
V4567 0.001 0.000 3.869 0.0019
V1063 0.002 0.000 6.022 0.0000
V1783 0.000 0.000 -2.719 0.0176
V3433 0.000 0.000 -1.715 0.1100
V7693 0.001 0.000 5.061 0.0002
V14425 0.002 0.001 2.815 0.0146
V7699 -0.002 0.000 -9.149 0.0000
V9643 -0.001 0.000 -3.935 0.0017
V3811 -0.001 0.000 -8.213 0.0000
V12553 0.000 0.000 -1.153 0.2696
V7825 -0.001 0.000 -3.114 0.0082
V3913 0.000 0.000 -1.585 0.1371
V2749 0.001 0.000 5.448 0.0001
V8857 0.003 0.000 7.780 0.0000
V2395 0.001 0.000 5.248 0.0002
V8869 0.002 0.000 5.818 0.0001
V8371 -0.003 0.000 -5.688 0.0001
V9049 0.000 0.000 -4.283 0.0009
V4561 0.001 0.000 7.198 0.0000
V2005 0.001 0.000 5.957 0.0000
V1591 0.002 0.001 3.251 0.0063
V1999 -0.001 0.000 -7.576 0.0000
V8557 -0.002 0.000 -7.488 0.0000

Figure 10 shows the predicted spice scores from the corresponding backward-selected model.

Figure 10: Plot of the predicted spice scores

A better model for each regression could be obtained by eliminating the variables with insignificant coefficients. However, the models might be flawed, since the response variables are ordinal scores on a scale of 1 to 5. Also, the assumptions of linear regression were not checked, so some of them might not be satisfied by the fitted models.
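
One way to respect the ordinal scale would be a proportional-odds model. Below is a minimal illustrative sketch using polr from the MASS package, fitted to a handful of the stronger predictors from Table 2; this is not part of the analysis above, and with only 45 observations the full 40-variable model would be unstable:

library(MASS)
ord.df <- datasmall[, -c(41:44, 46)]
ord.df$chocolate <- factor(ord.df$chocolate, ordered = TRUE)   # treat 1-5 scores as ordered
ofit <- polr(chocolate ~ V4561 + V7699 + V8371, data = ord.df)
summary(ofit)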

Appendix: R-commands

# Load the coffee-origin.txt
data = read.table("Data Sets/coffee-origin.txt")

# Load the information data
info = read.table("Data Sets/origin-info.txt", header = TRUE)

# Load library Knitr (making report level tables)
library(knitr)

# Preview the info data
kable(head(info), caption = "Preview of Origin-Info data")


########### Question 1 ###########

# Perform PC analysis
pca_result = prcomp(data) 

# Scatter plot of PC1 and PC2, coloured by origin (rows 1:9 are Brazil,
# 10:18 Colombia, 19:27 Ethiopia, 28:36 Java, 37:45 Kenya)
plot(pca_result$x[,1], pca_result$x[,2], xlab = "PC1", ylab = "PC2",
     col = "green")
points(pca_result$x[10:18,1], pca_result$x[10:18,2], col = "blue")
points(pca_result$x[19:27,1], pca_result$x[19:27,2], col = "red")
points(pca_result$x[28:36,1], pca_result$x[28:36,2], col = "black")
points(pca_result$x[37:45,1], pca_result$x[37:45,2], col = "orange")
# Add a legend
legend(x = -140, y = 55,  
       legend = c("Brazil", "Colombia", "Ethiopia", "Java", "Kenya"), 
       col = c("green", "blue", "red", "black", "orange"), 
       pch = 1, box.col = NA)


########### Question 2&3 ###########

#add row names
row.names(data) = info$name

# Combine the data and response
coffee.df = data.frame(X = I(as.matrix(data)), Y = I(as.matrix(info[,5:10])))

# Load the library pls
library(pls)

#make the analysis reproducible
set.seed(100)

#fit PLSR model
model1 <- plsr(Y~., ncomp = 10, data = coffee.df, scale = TRUE, validation="CV")



# RMSEP score for the first model
plot(RMSEP(model1), legendpos = "topright")

#make the analysis reproducible
set.seed(200)

#fit PLSR model
model <- plsr(Y~., ncomp = 4, 
               data = coffee.df, scale = TRUE, validation="CV")
# Bi-plot of scores
biplot(model, comps = 1:2, which = "scores", cex = 0.6, main = "")


# bi-plot of the Y scores and Y loadings
biplot(model, comps = 1:2, which = "y", cex = 0.6, 
       main = "")

# check the summary
summary(model)

# Component weights: mean CV RMSEP across the six responses for
# components 1-4, taken from the validation summary above
C1 = mean(c(1.0600, 0.7934, 0.9021, 0.7833, 0.8944, 0.6375))
C2 = mean(c(1.1280, 0.8108, 0.8047, 0.5329, 0.5042, 0.2514))
C3 = mean(c(0.23620, 0.29490, 0.51190, 0.34690, 0.13980, 0.08283))
C4 = mean(c(0.23150, 0.22280, 0.24830, 0.13360, 0.12230, 0.08171))

# Combine the squared loadings of the four components, weighted by CV RMSEP
top = C1*model$loadings[,1]^2 +
  C2*model$loadings[,2]^2 +
  C3*model$loadings[,3]^2 +
  C4*model$loadings[,4]^2

# Reduce the dataset to the 40 variables with the largest combined loadings
topvarnums = order(top, decreasing = TRUE)[1:40]
datasmall = data[,topvarnums]

# Combine the datasmall and the response variables
datasmall = cbind(datasmall, info[,5:10])

# fit the lm using chocolate as response
model = lm(chocolate~., data = datasmall[,-c(41:44, 46)])

# Reduce the number of variables using step function
model <- step(model, direction = "backward", trace=0, k=2)
summary1 <- summary(model)
kable(summary1$coefficients, 
      caption = "Regression estimates for chocolate response", 
      digits = c(rep(3,3), 4))

# Plot the fitted values, coloured by the observed chocolate score
plot(model$fitted.values, col = datasmall[,45], pch = 20)

legend("topright", legend = unique(datasmall[,45]), 
       pch = 20, col = unique(datasmall[,45]), 
       cex = 0.6) 

# fit the lm using nuts as response
model = lm(nuts~., data = datasmall[,-c(41, 42, 43, 45:46)])

# Reduce the number of variables using step function
model <- step(model, direction = "backward", trace=0, k=2)
summary1 <- summary(model)
kable(summary1$coefficients, 
      caption = "Regression estimates for nuts response", 
      digits = c(rep(3,3), 4))

# Plot the fitted values, coloured by the observed nuts score
plot(model$fitted.values, col = datasmall[,44], pch = 20)

legend("topright", legend = unique(datasmall[,44]), 
       pch = 20, col = unique(datasmall[,44]), 
       cex = 0.6) 

# fit the lm using citrus as response
model = lm(citrus~., data = datasmall[,-c(42:46)])

# Reduce the number of variables using step function
model <- step(model, direction = "backward", trace=0, k=2)
summary1 <- summary(model)
kable(summary1$coefficients, 
      caption = "Regression estimates for citrus response", 
      digits = c(rep(3,3), 4))

# Plot the fitted values, coloured by the observed citrus score
plot(model$fitted.values, col = datasmall[,41], pch = 20)

legend("topright", legend = unique(datasmall[,41]), 
       pch = 20, col = unique(datasmall[,41]), 
       cex = 0.6) 

# fit the lm using floral as response
model = lm(floral~., data = datasmall[,-c(41, 42, 44:46)])

# Reduce the number of variables using step function
model <- step(model, direction = "backward", trace=0, k=2)
summary1 <- summary(model)
kable(summary1$coefficients, 
      caption = "Regression estimates for floral response", 
      digits = c(rep(3,3), 4))

# Plot the fitted values, coloured by the observed floral score
plot(model$fitted.values, col = datasmall[,43], pch = 20)

legend("topright", legend = unique(datasmall[,43]), 
       pch = 20, col = unique(datasmall[,43]), 
       cex = 0.6) 

# fit the lm using berry as response
model = lm(berry~., data = datasmall[,-c(41, 43:46)])

# Reduce the number of variables using step function
model <- step(model, direction = "backward", trace=0, k=2)
summary1 <- summary(model)
kable(summary1$coefficients, 
      caption = "Regression estimates for berry response", 
      digits = c(rep(3,3), 4))

# Plot the fitted values, coloured by the observed berry score
plot(model$fitted.values, col = datasmall[,42], pch = 20)

legend("topright", legend = unique(datasmall[,42]), 
       pch = 20, col = unique(datasmall[,42]), 
       cex = 0.6) 

# fit the lm using spice as response
model = lm(spice~., data = datasmall[,-c(41:45)])

# Reduce the number of variables using step function
model <- step(model, direction = "backward", trace=0, k=2)
summary1 <- summary(model)
kable(summary1$coefficients, 
      caption = "Regression estimates for spice response", 
      digits = c(rep(3,3), 4))

# Plot the fitted values, coloured by the observed spice score
plot(model$fitted.values, col = datasmall[,46], pch = 20)

legend("topright", legend = unique(datasmall[,46]), 
       pch = 20, col = unique(datasmall[,46]), 
       cex = 0.6)