We conducted an analysis of the nutritional facts of the Starbucks Drink Menu. The data set was pulled from the online repository kaggle.com, and the following report gives a brief breakdown of the data using various regression and classification models. The uncleaned data set is composed of 18 variables with 242 observations. The two variables of interest to us in this analysis are sugar content (measured in grams) and beverage category (nine possible classifications). For our linear regression, we regress sugar content on various independent variables to determine which nutritional characteristics have the largest effect on the amount of sugar in an item from the Starbucks Drink Menu. For our classification models, we use various algorithms to try to predict which beverage category a drink is placed into based on its nutritional aspects.
Below are the libraries used in this analysis, as well as the function set.seed, which makes the results reproducible throughout the project.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(e1071)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
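As a quick illustration of what set.seed does, the toy sketch below (made-up values, not part of the analysis) resets the seed before two random draws and checks that they match. The names first_draw and second_draw are purely illustrative.

```r
# Toy illustration: the same seed yields the same pseudo-random draw.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE: the two draws match
```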
We need to import the raw data set into our R environment so that we can clean and preprocess it before partitioning the data.
SBUX_Data<-read.csv('/Users/samizaia/Downloads/starbucks-menu (1)/starbucks_drinkMenu_expanded_revised.csv')
head(SBUX_Data)
## Beverage_category Beverage Beverage_prep Calories
## 1 Coffee Brewed Coffee Short 3
## 2 Coffee Brewed Coffee Tall 4
## 3 Coffee Brewed Coffee Grande 5
## 4 Coffee Brewed Coffee Venti 5
## 5 Classic Espresso Drinks Caffè Latte Short Nonfat Milk 70
## 6 Classic Espresso Drinks Caffè Latte 2% Milk 100
## Total.Fat..g. Trans.Fat..g. Saturated.Fat..g. Sodium..mg.
## 1 0.1 0.0 0.0 0
## 2 0.1 0.0 0.0 0
## 3 0.1 0.0 0.0 0
## 4 0.1 0.0 0.0 0
## 5 0.1 0.1 0.0 5
## 6 3.5 2.0 0.1 15
## Total.Carbohydrates..g. Cholesterol..mg. Dietary.Fibre..g. Sugars..g.
## 1 5 0 0 0
## 2 10 0 0 0
## 3 10 0 0 0
## 4 10 0 0 0
## 5 75 10 0 9
## 6 85 10 0 9
## Protein..g. Vitamin.A....DV. Vitamin.C....DV. Calcium....DV. Iron....DV.
## 1 0.3 0.0 0 0.00 0
## 2 0.5 0.0 0 0.00 0
## 3 1.0 0.0 0 0.00 0
## 4 1.0 0.0 0 0.02 0
## 5 6.0 0.1 0 0.20 0
## 6 6.0 0.1 0 0.20 0
## Caffeine..mg.
## 1 175
## 2 260
## 3 330
## 4 410
## 5 75
## 6 75
dim(SBUX_Data)
## [1] 242 18
In order to manipulate the data using various algorithms, we must first delete unnecessary variables. Furthermore, for the sake of efficiency, we rename the variable headers to make it easier to call specific variables in our code.
###Deletion###
SBUX_Data$Beverage<-NULL
SBUX_Data$Beverage_prep<-NULL
dim(SBUX_Data)
## [1] 242 16
###Given that we only needed one 'factor' variable for our analysis we decided to delete the above two variables.
###Renaming Headers###
SBUX_Data$Total_Fat<-SBUX_Data$Total.Fat..g.
SBUX_Data$Trans_Fat<-SBUX_Data$Trans.Fat..g.
SBUX_Data$Saturated_Fat<-SBUX_Data$Saturated.Fat..g.
SBUX_Data$Sodium<-SBUX_Data$Sodium..mg.
SBUX_Data$Carbs<-SBUX_Data$Total.Carbohydrates..g.
SBUX_Data$Cholesterol<-SBUX_Data$Cholesterol..mg.
SBUX_Data$Fibre<-SBUX_Data$Dietary.Fibre..g.
SBUX_Data$Sugars<-SBUX_Data$Sugars..g.
SBUX_Data$Protein<-SBUX_Data$Protein..g.
SBUX_Data$Vitamin_A<-SBUX_Data$Vitamin.A....DV.
SBUX_Data$Vitamin_C<-SBUX_Data$Vitamin.C....DV.
SBUX_Data$Calcium<-SBUX_Data$Calcium....DV.
SBUX_Data$Iron<-SBUX_Data$Iron....DV.
SBUX_Data$Caffeine<-SBUX_Data$Caffeine..mg.
dim(SBUX_Data)
## [1] 242 30
###Renaming the headers duplicated the variable so now we have to delete the columns with the old headers.
###Deleting all the original columns that we renamed above
SBUX_Data$Total.Fat..g.<-NULL
SBUX_Data$Trans.Fat..g.<-NULL
SBUX_Data$Saturated.Fat..g.<-NULL
SBUX_Data$Sodium..mg.<-NULL
SBUX_Data$Total.Carbohydrates..g.<-NULL
SBUX_Data$Cholesterol..mg.<-NULL
SBUX_Data$Dietary.Fibre..g.<-NULL
SBUX_Data$Sugars..g.<-NULL
SBUX_Data$Protein..g.<-NULL
SBUX_Data$Vitamin.A....DV.<-NULL
SBUX_Data$Vitamin.C....DV.<-NULL
SBUX_Data$Calcium....DV.<-NULL
SBUX_Data$Iron....DV.<-NULL
SBUX_Data$Caffeine..mg.<-NULL
dim(SBUX_Data)
## [1] 242 16
###Confirming our new data set
head(SBUX_Data)
## Beverage_category Calories Total_Fat Trans_Fat Saturated_Fat
## 1 Coffee 3 0.1 0.0 0.0
## 2 Coffee 4 0.1 0.0 0.0
## 3 Coffee 5 0.1 0.0 0.0
## 4 Coffee 5 0.1 0.0 0.0
## 5 Classic Espresso Drinks 70 0.1 0.1 0.0
## 6 Classic Espresso Drinks 100 3.5 2.0 0.1
## Sodium Carbs Cholesterol Fibre Sugars Protein Vitamin_A Vitamin_C
## 1 0 5 0 0 0 0.3 0.0 0
## 2 0 10 0 0 0 0.5 0.0 0
## 3 0 10 0 0 0 1.0 0.0 0
## 4 0 10 0 0 0 1.0 0.0 0
## 5 5 75 10 0 9 6.0 0.1 0
## 6 15 85 10 0 9 6.0 0.1 0
## Calcium Iron Caffeine
## 1 0.00 0 175
## 2 0.00 0 260
## 3 0.00 0 330
## 4 0.02 0 410
## 5 0.20 0 75
## 6 0.20 0 75
###Compared to the original data set, we now have a cleaned set ready to be partitioned.
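The copy-then-delete renaming above could also be done in a single step by assigning to names(). The sketch below shows the idea on a small made-up data frame. The objects toy and new_names are hypothetical and only two headers are renamed, so this is an illustration rather than a drop-in replacement for the code above.

```r
# Toy data frame mimicking the raw header style produced by read.csv.
toy <- data.frame(
  Beverage_category = c("Coffee", "Smoothies"),
  Total.Fat..g.     = c(0.1, 2.0),
  Sugars..g.        = c(0, 30)
)

# Lookup vector: names are the old headers, values are the new ones.
new_names <- c(
  "Total.Fat..g." = "Total_Fat",
  "Sugars..g."    = "Sugars"
)

# Rename in place; columns not listed in new_names keep their header.
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])

names(toy)
```

Because the columns are renamed rather than copied, the column count never changes and no second deletion pass is needed.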
Below we partition the cleaned data set into training, validation, and test sets so that the models we build can be tuned and validated on data they were not fit to.
###Creates a matrix of row indices for a random 70% sample, stratified on the Sugars column
Row_Train<-createDataPartition(SBUX_Data$Sugars, p=.70, list = FALSE)
###Creating the Training Set
Train_SBUX<-SBUX_Data[Row_Train,]
###Stores Row-Train in a training set
dim(Train_SBUX)
## [1] 171 16
###Creating the Holdout set (30% of the cleaned data set)
Holdout_SBUX<-SBUX_Data[-Row_Train,]
dim(Holdout_SBUX)
## [1] 71 16
###Partitioning the 30% data from the holdout data set to create a validation and test set
Validation_Interval<-createDataPartition(y=Holdout_SBUX$Sugars, p=.50, list = FALSE)
Val_SBUX<-Holdout_SBUX[Validation_Interval,]
dim(Val_SBUX)
## [1] 36 16
###Validation Set: stores 50% of the data pulled from the holdout data into our validation set
###Test Set
Test_SBUX<-Holdout_SBUX[-Validation_Interval,]
dim(Test_SBUX)
## [1] 35 16
###stores the other 50% of the data pulled from the holdout data, creating our test set
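The 70/15/15 logic above can be sketched in base R as follows. This is an assumption-laden illustration: it uses simple random sampling with sample() rather than the outcome-stratified sampling that createDataPartition performs, and toy, train_set, val_set, and test_set are made-up names on simulated data, so the set sizes need not match those printed above.

```r
set.seed(123)

# Stand-in for the cleaned data: 242 rows, as in SBUX_Data.
n <- 242
toy <- data.frame(id = seq_len(n), Sugars = round(runif(n, 0, 80)))

# 70% of the rows go to training (simple random sample, no stratification).
train_idx <- sample(seq_len(n), size = floor(0.70 * n))
train_set <- toy[train_idx, ]
holdout   <- toy[-train_idx, ]

# Split the holdout rows evenly into validation and test sets.
val_idx  <- sample(seq_len(nrow(holdout)), size = floor(0.50 * nrow(holdout)))
val_set  <- holdout[val_idx, ]
test_set <- holdout[-val_idx, ]

# The three sets are disjoint and together cover every row.
c(train = nrow(train_set), validation = nrow(val_set), test = nrow(test_set))
```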
Now we will run several linear regression models where sugars (grams) will be the dependent variable. In other words, we are looking to see what model best explains the variations in sugar content for items on the Starbucks Drink Menu.
###Regression #1
Model_1<-lm(Sugars~Calories+Total_Fat+Trans_Fat+Saturated_Fat+Sodium+Carbs+Cholesterol+Fibre+Protein+Vitamin_A+Vitamin_C+Calcium+Iron+Caffeine, Train_SBUX)
summary(Model_1)
##
## Call:
## lm(formula = Sugars ~ Calories + Total_Fat + Trans_Fat + Saturated_Fat +
## Sodium + Carbs + Cholesterol + Fibre + Protein + Vitamin_A +
## Vitamin_C + Calcium + Iron + Caffeine, data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9427 -0.5494 -0.0476 0.5464 3.8558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.060401 0.193071 -0.313 0.75482
## Calories 0.049534 0.015906 3.114 0.00219 **
## Total_Fat -1.060619 0.201247 -5.270 4.47e-07 ***
## Trans_Fat 0.691896 0.166270 4.161 5.21e-05 ***
## Saturated_Fat 6.873646 3.618059 1.900 0.05930 .
## Sodium 0.018111 0.035287 0.513 0.60850
## Carbs 0.003359 0.001655 2.030 0.04408 *
## Cholesterol 0.786661 0.062293 12.628 < 2e-16 ***
## Fibre -0.973950 0.216018 -4.509 1.27e-05 ***
## Protein -0.853339 0.097704 -8.734 3.59e-15 ***
## Vitamin_A -2.654221 1.980987 -1.340 0.18224
## Vitamin_C 1.546975 1.000513 1.546 0.12409
## Calcium 18.050381 2.901700 6.221 4.35e-09 ***
## Iron -2.192690 1.887465 -1.162 0.24713
## Caffeine -0.002025 0.001171 -1.728 0.08591 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9339 on 156 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.9978
## F-statistic: 5466 on 14 and 156 DF, p-value: < 2.2e-16
###R-squared=.998
###Left out Beverage_category because it is categorical
###Calories significant @ the .01 level
###Total_Fat & Trans_Fat significant @ the .001 level
###Carbs significant @ the .05 level
###Cholesterol & Fibre & Protein & Calcium significant @ the .001 level
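Rather than reading the significance stars by eye, the p-values can be pulled from the coefficient table programmatically. The sketch below does this on simulated data (not the Starbucks set); the variables x1, x2, x3, and y are invented for the illustration.

```r
set.seed(123)

# Simulated data: y depends on x1 and x2 but not on the noise column x3.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 * x1 - 3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

# Coefficient table as a matrix; the last column holds the p-values.
coef_table <- coef(summary(fit))
p_values   <- coef_table[, "Pr(>|t|)"]

# Predictors (intercept excluded) significant at the .05 level.
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
significant
```

The same two lines applied to summary(Model_1) would give the list of predictors to carry into a refined model, under whatever threshold is chosen.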
##Refined Model
Model_2<-lm(Sugars~Calories+Total_Fat+Trans_Fat+Carbs+Cholesterol+Fibre+Protein+Calcium, Train_SBUX)
summary(Model_2) ###R-squared=.9973
##
## Call:
## lm(formula = Sugars ~ Calories + Total_Fat + Trans_Fat + Carbs +
## Cholesterol + Fibre + Protein + Calcium, data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2379 -0.5144 0.0822 0.4502 5.1405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1855653 0.1929857 -0.962 0.337709
## Calories 0.0689675 0.0123318 5.593 9.32e-08 ***
## Total_Fat -1.0208702 0.1708485 -5.975 1.41e-08 ***
## Trans_Fat 0.6673095 0.1829494 3.648 0.000357 ***
## Carbs 0.0008673 0.0017518 0.495 0.621204
## Cholesterol 0.7156455 0.0485084 14.753 < 2e-16 ***
## Fibre -1.3930039 0.1614641 -8.627 5.53e-15 ***
## Protein -0.6727513 0.0903477 -7.446 5.38e-12 ***
## Calcium 8.7572744 2.4723594 3.542 0.000519 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.058 on 162 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972
## F-statistic: 7453 on 8 and 162 DF, p-value: < 2.2e-16
###all variables are significant @ the .001 level except for Carbs
##Refined Model 2
Model_3<-lm(Sugars~Calories+Total_Fat+Trans_Fat+Cholesterol+Fibre+Protein+Calcium, Train_SBUX)
summary(Model_3) ###R-squared=.9973
##
## Call:
## lm(formula = Sugars ~ Calories + Total_Fat + Trans_Fat + Cholesterol +
## Fibre + Protein + Calcium, data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3937 -0.5017 0.0818 0.4861 5.0824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.18411 0.19252 -0.956 0.340328
## Calories 0.07003 0.01212 5.780 3.71e-08 ***
## Total_Fat -1.03773 0.16703 -6.213 4.18e-09 ***
## Trans_Fat 0.68529 0.17889 3.831 0.000182 ***
## Cholesterol 0.71386 0.04826 14.791 < 2e-16 ***
## Fibre -1.38765 0.16073 -8.634 5.16e-15 ***
## Protein -0.67938 0.08914 -7.621 1.95e-12 ***
## Calcium 8.93430 2.44070 3.661 0.000340 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 163 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972
## F-statistic: 8558 on 7 and 163 DF, p-value: < 2.2e-16
##Refined Model 3
Model_4<-lm(Sugars~Calories+Total_Fat+Trans_Fat+Cholesterol, Train_SBUX)
summary(Model_4) ###R-Squared = .9712
##
## Call:
## lm(formula = Sugars ~ Calories + Total_Fat + Trans_Fat + Cholesterol,
## data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.1589 -0.6712 0.7179 1.8484 7.1784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95001 0.59819 1.588 0.11416
## Calories -0.04912 0.01760 -2.791 0.00588 **
## Total_Fat -0.15619 0.30382 -0.514 0.60787
## Trans_Fat 1.07691 0.37517 2.870 0.00463 **
## Cholesterol 1.12713 0.07180 15.699 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.404 on 166 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.9705
## F-statistic: 1401 on 4 and 166 DF, p-value: < 2.2e-16
###going to add back in Fibre, Protein, and Calcium and going to remove Total_Fat, Trans_Fat & Calories
##Refined Model 4
Model_5<-lm(Sugars~Cholesterol+Fibre+Protein+Calcium, Train_SBUX)
summary(Model_5) ##R-Squared = .9966
##
## Call:
## lm(formula = Sugars ~ Cholesterol + Fibre + Protein + Calcium,
## data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3553 -0.5608 0.0685 0.5227 6.2126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.19990 0.19704 -1.015 0.3118
## Cholesterol 0.99365 0.00475 209.183 < 2e-16 ***
## Fibre -1.97531 0.11325 -17.442 < 2e-16 ***
## Protein -0.27442 0.05914 -4.640 7.04e-06 ***
## Calcium 4.12106 1.73566 2.374 0.0187 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.17 on 166 degrees of freedom
## Multiple R-squared: 0.9966, Adjusted R-squared: 0.9965
## F-statistic: 1.217e+04 on 4 and 166 DF, p-value: < 2.2e-16
###Going to revise again and take out Calcium (Significant @ the .05 level) & add back calories
##Refined Model 5
Model_6<-lm(Sugars~Cholesterol+Fibre+Protein+ Calories, Train_SBUX)
summary(Model_6) ###R-Squared=.9965
##
## Call:
## lm(formula = Sugars ~ Cholesterol + Fibre + Protein + Calories,
## data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3547 -0.4882 0.0859 0.5290 6.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.141873 0.198191 -0.716 0.475
## Cholesterol 0.965883 0.018700 51.653 < 2e-16 ***
## Fibre -2.130770 0.080511 -26.466 < 2e-16 ***
## Protein -0.189444 0.036012 -5.261 4.38e-07 ***
## Calories 0.006731 0.004196 1.604 0.111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.181 on 166 degrees of freedom
## Multiple R-squared: 0.9965, Adjusted R-squared: 0.9965
## F-statistic: 1.195e+04 on 4 and 166 DF, p-value: < 2.2e-16
##Refined Model 6
Model_7<-lm(Sugars~Cholesterol+Fibre+Protein+Total_Fat+Trans_Fat, Train_SBUX)
summary(Model_7) ###R-Squared=.9965
##
## Call:
## lm(formula = Sugars ~ Cholesterol + Fibre + Protein + Total_Fat +
## Trans_Fat, data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5032 -0.5277 0.0765 0.4953 6.3331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.181411 0.215063 -0.844 0.400
## Cholesterol 0.995494 0.005366 185.511 < 2e-16 ***
## Fibre -2.187983 0.081768 -26.759 < 2e-16 ***
## Protein -0.142006 0.026962 -5.267 4.28e-07 ***
## Total_Fat 0.017771 0.076874 0.231 0.817
## Trans_Fat -0.042968 0.139006 -0.309 0.758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.193 on 165 degrees of freedom
## Multiple R-squared: 0.9965, Adjusted R-squared: 0.9964
## F-statistic: 9366 on 5 and 165 DF, p-value: < 2.2e-16
##Refined Model 7
Model_8<-lm(Sugars~I(Fibre^2)+Fibre+Cholesterol+Protein+Calcium, Train_SBUX)
summary(Model_8) ###R-Squared=.9967
##
## Call:
## lm(formula = Sugars ~ I(Fibre^2) + Fibre + Cholesterol + Protein +
## Calcium, data = Train_SBUX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4897 -0.5659 0.0341 0.4942 6.4665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.217999 0.195414 -1.116 0.26623
## I(Fibre^2) 0.073778 0.036316 2.032 0.04380 *
## Fibre -2.277312 0.186246 -12.227 < 2e-16 ***
## Cholesterol 0.995824 0.004827 206.319 < 2e-16 ***
## Protein -0.378106 0.077705 -4.866 2.64e-06 ***
## Calcium 7.478941 2.385104 3.136 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.159 on 165 degrees of freedom
## Multiple R-squared: 0.9967, Adjusted R-squared: 0.9966
## F-statistic: 9923 on 5 and 165 DF, p-value: < 2.2e-16
###Will validate Models 1, 2, 3, and 5, which are among those with the highest R-squared values
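When several candidate models are compared, it can help to collect their fit statistics in one table instead of scanning each summary. The sketch below does so for three nested models on simulated data; all object names are illustrative. Note that R-squared can never decrease when predictors are added, which is why the held-out error is the more informative check.

```r
set.seed(123)

# Simulated data with two informative predictors and one noise predictor.
n   <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Candidate models, from most to least complex.
models <- list(
  full    = lm(y ~ x1 + x2 + x3, data = dat),
  reduced = lm(y ~ x1 + x2, data = dat),
  minimal = lm(y ~ x1, data = dat)
)

# One row per model: R-squared and adjusted R-squared.
fit_table <- t(sapply(models, function(m) {
  s <- summary(m)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared)
}))
round(fit_table, 4)
```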
We will use our partitioned data to attempt to estimate out of sample error by running the models against our holdout data set.
###Validating Model_1
pred_values<-predict(Model_1, Holdout_SBUX)
#compute in-sample error
E_IN<-sqrt(sum((Model_1$fitted.values-Train_SBUX$Sugars)^2)/(length(Model_1$residuals)-3))
E_IN ##Returns .8999143
## [1] 0.8999143
E_OUT<-sqrt(sum((pred_values-Holdout_SBUX$Sugars)^2)/(length(pred_values)-3))
E_OUT ##Returns 1.063469
## [1] 1.063469
###Validating Model_2
pred_values<-predict(Model_2, Holdout_SBUX)
#compute in-sample error
E_IN<-sqrt(sum((Model_2$fitted.values-Train_SBUX$Sugars)^2)/(length(Model_2$residuals)-3))
E_IN ##Returns 1.038591
## [1] 1.038591
E_OUT<-sqrt(sum((pred_values-Holdout_SBUX$Sugars)^2)/(length(pred_values)-3))
E_OUT ##Returns 1.281684... Model 1, although it is more complicated, returns a lower out of sample error
## [1] 1.281684
###Validating Model_3
pred_values<-predict(Model_3, Holdout_SBUX)
#compute in-sample error
E_IN<-sqrt(sum((Model_3$fitted.values-Train_SBUX$Sugars)^2)/(length(Model_3$residuals)-3))
E_IN ##Returns 1.039376
## [1] 1.039376
E_OUT<-sqrt(sum((pred_values-Holdout_SBUX$Sugars)^2)/(length(pred_values)-3))
E_OUT ##Returns 1.290094
## [1] 1.290094
###Validating Model_5
pred_values<-predict(Model_5, Holdout_SBUX)
#compute in-sample error
E_IN<-sqrt(sum((Model_5$fitted.values-Train_SBUX$Sugars)^2)/(length(Model_5$residuals)-3))
E_IN ##Returns 1.163006
## [1] 1.163006
E_OUT<-sqrt(sum((pred_values-Holdout_SBUX$Sugars)^2)/(length(pred_values)-3))
E_OUT ##Returns 1.563584
## [1] 1.563584
##Concluding comments: Model_1 has the lowest E_OUT value, followed by Models 2, 3, and 5, in that order
For now, Model_1 is our chosen candidate for regression. Although it is our most complex model, it returns the lowest out-of-sample error (1.063) when run against the holdout data set, and the out-of-sample error rises as variables are removed (1.282, 1.290, and 1.564 for Models 2, 3, and 5). Several of its predictors are not individually significant, but dropping them did not improve holdout performance. We next run the model against our Test_SBUX data, partitioned earlier, to estimate the model's out-of-sample error. Note that Test_SBUX is a subset of Holdout_SBUX, so this check is not fully independent of the comparison above.
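The E_IN and E_OUT expressions repeated above can be wrapped in a small helper. The sketch below is one possible way to do that: error_rate is a hypothetical function name, it keeps the same n - 3 denominator used in the code above, and it is demonstrated on simulated data rather than on the Starbucks partitions.

```r
# Root of the summed squared error divided by (n - df_adjust),
# matching the E_IN / E_OUT expressions used above (df_adjust = 3).
error_rate <- function(predicted, actual, df_adjust = 3) {
  sqrt(sum((predicted - actual)^2) / (length(predicted) - df_adjust))
}

set.seed(123)

# Simulated train and holdout sets (stand-ins for Train_SBUX / Holdout_SBUX).
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 3 * x + rnorm(n))
}
train   <- make_data(120)
holdout <- make_data(50)

fit <- lm(y ~ x, data = train)

e_in  <- error_rate(fitted(fit), train$y)
e_out <- error_rate(predict(fit, holdout), holdout$y)
c(E_IN = e_in, E_OUT = e_out)
```

With such a helper, each validation step reduces to one call per model, and the same function could be pointed at a validation-only set if the test rows are to be kept untouched until the end.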
pred_values<-predict(Model_1, Test_SBUX)
E_OUT<-sqrt(sum((pred_values-Test_SBUX$Sugars)^2)/(length(pred_values)-3))
E_OUT
## [1] 0.958966
Model_1, when run against the Test_SBUX data, returns an E_OUT value of 0.959, meaning its predictions are off by roughly one gram of sugar in root-mean-square terms. We therefore take Model_1 as our best linear regression model for explaining variation in the amount of sugar in a Starbucks menu item.
We will now implement CART, random forest, and SVM models for our predictive analysis, in an attempt to classify which beverage category an observation belongs to.
SBUX_Data$Beverage_category<-factor(SBUX_Data$Beverage_category)
##sets our classification variable to a factor; the check below confirms the partitioned sets already store it as a factor
class(Test_SBUX$Beverage_category)
## [1] "factor"
CART_Model <- train(Beverage_category ~ ., data = Train_SBUX, method = "rpart",
trControl = trainControl("cv", number = 10),
tuneLength = 10) #tuneLength sets how many candidate cp values caret evaluates
##the "cv", number = 10 refers to 10-fold cross validation on the training data
plot(CART_Model) #produces plot of cross-validation results
CART_Model$bestTune #returns optimal complexity parameter
## cp
## 1 0
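The complexity parameter cp sets how much a split must improve the overall fit to be attempted, so cp = 0 corresponds to a tree with no complexity penalty. The sketch below illustrates this with rpart directly on the built-in iris data (not the Starbucks set); full_tree and pruned_tree are illustrative names, and the minsplit and cp values are arbitrary choices for the demonstration.

```r
library(rpart)

# Grow a classification tree on the built-in iris data with no
# complexity penalty (cp = 0), so splits are limited only by minsplit.
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# cptable lists, for each candidate cp, the number of splits and the
# cross-validated error (xerror) from rpart's internal cross-validation.
full_tree$cptable

# Pruning with a larger cp can only remove splits, never add them.
pruned_tree <- prune(full_tree, cp = 0.1)

c(full   = sum(full_tree$frame$var == "<leaf>"),
  pruned = sum(pruned_tree$frame$var == "<leaf>"))
```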
confusionMatrix(predict(CART_Model, Test_SBUX), Test_SBUX$Beverage_category) ##Evaluation on the test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction Classic Espresso Drinks Coffee
## Classic Espresso Drinks 8 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 0
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Frappuccino® Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 5
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 1
## Reference
## Prediction Frappuccino® Blended Crème
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 1
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Light Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 3
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Shaken Iced Beverages
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 1
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 1
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Signature Espresso Drinks Smoothies
## Classic Espresso Drinks 3 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 1 0
## Smoothies 1 1
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Tazo® Tea Drinks
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 2
## Smoothies 0
## Tazo® Tea Drinks 5
##
## Overall Statistics
##
## Accuracy : 0.6
## 95% CI : (0.4211, 0.7613)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : 0.0001039
##
## Kappa : 0.5105
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Classic Espresso Drinks Class: Coffee
## Sensitivity 0.8000 NA
## Specificity 0.8800 1
## Pos Pred Value 0.7273 NA
## Neg Pred Value 0.9167 NA
## Prevalence 0.2857 0
## Detection Rate 0.2286 0
## Detection Prevalence 0.3143 0
## Balanced Accuracy 0.8400 NA
## Class: Frappuccino® Blended Coffee
## Sensitivity 0.8333
## Specificity 0.9655
## Pos Pred Value 0.8333
## Neg Pred Value 0.9655
## Prevalence 0.1714
## Detection Rate 0.1429
## Detection Prevalence 0.1714
## Balanced Accuracy 0.8994
## Class: Frappuccino® Blended Crème
## Sensitivity 0.00000
## Specificity 1.00000
## Pos Pred Value NaN
## Neg Pred Value 0.97143
## Prevalence 0.02857
## Detection Rate 0.00000
## Detection Prevalence 0.00000
## Balanced Accuracy 0.50000
## Class: Frappuccino® Light Blended Coffee
## Sensitivity 0.00000
## Specificity 1.00000
## Pos Pred Value NaN
## Neg Pred Value 0.91429
## Prevalence 0.08571
## Detection Rate 0.00000
## Detection Prevalence 0.00000
## Balanced Accuracy 0.50000
## Class: Shaken Iced Beverages
## Sensitivity 0.50000
## Specificity 0.90909
## Pos Pred Value 0.25000
## Neg Pred Value 0.96774
## Prevalence 0.05714
## Detection Rate 0.02857
## Detection Prevalence 0.11429
## Balanced Accuracy 0.70455
## Class: Signature Espresso Drinks Class: Smoothies
## Sensitivity 0.20000 1.00000
## Specificity 0.86667 0.94118
## Pos Pred Value 0.20000 0.33333
## Neg Pred Value 0.86667 1.00000
## Prevalence 0.14286 0.02857
## Detection Rate 0.02857 0.02857
## Detection Prevalence 0.14286 0.08571
## Balanced Accuracy 0.53333 0.97059
## Class: Tazo® Tea Drinks
## Sensitivity 0.7143
## Specificity 0.9643
## Pos Pred Value 0.8333
## Neg Pred Value 0.9310
## Prevalence 0.2000
## Detection Rate 0.1429
## Detection Prevalence 0.1714
## Balanced Accuracy 0.8393
par(xpd=NA)
plot(CART_Model$finalModel)
text(CART_Model$finalModel, digits = 3)
###above command creates a decision tree for the CART_Model
The confusion matrix above reports an accuracy of 0.6: the CART model places 21 of the 35 test observations (60%) into the correct beverage category, compared with a no-information rate of 28.57%.
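The accuracy and no-information rate can be reproduced by hand from the counts in the CART confusion matrix. The sketch below transcribes the diagonal entries and the reference-class totals from the output above; the vector names are illustrative.

```r
# Counts read off the CART confusion matrix above (35 test observations).
# Diagonal entries: correct predictions per class, in the printed class order.
correct <- c(8, 0, 5, 0, 0, 1, 1, 1, 5)
# Column totals: how many test observations truly belong to each class.
class_totals <- c(10, 0, 6, 1, 3, 2, 5, 1, 7)

# Accuracy is the share of correct predictions: 21 / 35.
accuracy <- sum(correct) / sum(class_totals)
# The no-information rate is the share of the largest class: 10 / 35.
no_information_rate <- max(class_totals) / sum(class_totals)

round(c(accuracy = accuracy, NIR = no_information_rate), 4)
```

The no-information rate is the accuracy obtained by always predicting the most common class, which is the baseline the reported P-Value [Acc > NIR] tests against.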
#caret package implementation with 3-fold cross validation
Forest_Model <- train(Beverage_category ~ ., method="rf", trControl=trainControl(method = "cv", number = 3), preProcess=c("center", "scale"), data=Train_SBUX)
print(Forest_Model)
## Random Forest
##
## 171 samples
## 15 predictor
## 9 classes: 'Classic Espresso Drinks', 'Coffee', 'Frappuccino® Blended Coffee', 'Frappuccino® Blended Crème', 'Frappuccino® Light Blended Coffee', 'Shaken Iced Beverages', 'Signature Espresso Drinks', 'Smoothies', 'Tazo® Tea Drinks'
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 115, 114, 113
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7716778 0.7268637
## 8 0.8010256 0.7623052
## 15 0.8012272 0.7627555
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 15.
Note that Forest_Model, run with the caret implementation, achieves a cross-validated accuracy of 80.1% when mtry = 15. We now run it against the test data to see how well it performs on new data.
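For readers unfamiliar with k-fold cross-validation, the sketch below shows how 171 training rows might be divided into 3 folds in base R. It assigns folds by simple random shuffling, which is an assumption of this illustration; caret builds its folds differently, so its per-fold sample sizes need not match these.

```r
set.seed(123)

n <- 171   # number of training rows, as in Train_SBUX
k <- 3     # number of folds

# Assign each row to one of k folds at random, in near-equal sizes.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the model would be trained on the other k - 1 folds
# and scored on the held-out fold; here we only report the sizes.
sizes <- sapply(seq_len(k), function(f) {
  c(train = sum(fold_id != f), heldout = sum(fold_id == f))
})
colnames(sizes) <- paste0("fold", seq_len(k))
sizes
```

The cross-validated accuracy reported for each mtry value is the average of the k held-out accuracies obtained this way.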
confusionMatrix(predict(Forest_Model, Test_SBUX), Test_SBUX$Beverage_category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Classic Espresso Drinks Coffee
## Classic Espresso Drinks 8 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 0
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Frappuccino® Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 5
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Blended Crème
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Light Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 3
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Shaken Iced Beverages
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 2
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Signature Espresso Drinks Smoothies
## Classic Espresso Drinks 3 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 1
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Tazo® Tea Drinks
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 7
##
## Overall Statistics
##
## Accuracy : 0.8286
## 95% CI : (0.6635, 0.9344)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : 3.901e-11
##
## Kappa : 0.79
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Classic Espresso Drinks Class: Coffee
## Sensitivity 0.8000 NA
## Specificity 0.8800 1
## Pos Pred Value 0.7273 NA
## Neg Pred Value 0.9167 NA
## Prevalence 0.2857 0
## Detection Rate 0.2286 0
## Detection Prevalence 0.3143 0
## Balanced Accuracy 0.8400 NA
## Class: Frappuccino® Blended Coffee
## Sensitivity 0.8333
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 0.9667
## Prevalence 0.1714
## Detection Rate 0.1429
## Detection Prevalence 0.1429
## Balanced Accuracy 0.9167
## Class: Frappuccino® Blended Crème
## Sensitivity 1.00000
## Specificity 0.97059
## Pos Pred Value 0.50000
## Neg Pred Value 1.00000
## Prevalence 0.02857
## Detection Rate 0.02857
## Detection Prevalence 0.05714
## Balanced Accuracy 0.98529
## Class: Frappuccino® Light Blended Coffee
## Sensitivity 1.00000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 1.00000
## Prevalence 0.08571
## Detection Rate 0.08571
## Detection Prevalence 0.08571
## Balanced Accuracy 1.00000
## Class: Shaken Iced Beverages
## Sensitivity 1.00000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 1.00000
## Prevalence 0.05714
## Detection Rate 0.05714
## Detection Prevalence 0.05714
## Balanced Accuracy 1.00000
## Class: Signature Espresso Drinks Class: Smoothies
## Sensitivity 0.40000 1.00000
## Specificity 0.93333 1.00000
## Pos Pred Value 0.50000 1.00000
## Neg Pred Value 0.90323 1.00000
## Prevalence 0.14286 0.02857
## Detection Rate 0.05714 0.02857
## Detection Prevalence 0.11429 0.02857
## Balanced Accuracy 0.66667 1.00000
## Class: Tazo® Tea Drinks
## Sensitivity 1.0
## Specificity 1.0
## Pos Pred Value 1.0
## Neg Pred Value 1.0
## Prevalence 0.2
## Detection Rate 0.2
## Detection Prevalence 0.2
## Balanced Accuracy 1.0
On the test set, the accuracy of Forest_Model is 82.86% (29 of 35 observations), slightly above its cross-validated accuracy.
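The accuracy and Kappa statistics reported for Forest_Model can be recomputed from the confusion matrix itself. The sketch below transcribes the matrix from the output above (rows are predictions, columns are reference classes) and applies the standard Cohen's kappa formula; cm and cohen_kappa are illustrative names.

```r
# Forest_Model test-set confusion matrix, transcribed from the output above.
# Rows are predictions, columns are reference classes, in the printed order:
# Classic Espresso, Coffee, Frap. Blended Coffee, Frap. Blended Creme,
# Frap. Light Blended Coffee, Shaken Iced, Signature Espresso, Smoothies, Tazo Tea.
cm <- matrix(c(
  8, 0, 0, 0, 0, 0, 3, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 5, 0, 0, 0, 0, 0, 0,
  0, 0, 1, 1, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 3, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 2, 0, 0, 0,
  2, 0, 0, 0, 0, 0, 2, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 7
), nrow = 9, byrow = TRUE)

n_obs <- sum(cm)

# Observed agreement: share of predictions on the diagonal.
observed <- sum(diag(cm)) / n_obs
# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n_obs^2

# Cohen's kappa: agreement beyond chance, scaled by its maximum.
cohen_kappa <- (observed - expected) / (1 - expected)

round(c(accuracy = observed, kappa = cohen_kappa), 4)
```

Kappa discounts the agreement that would occur by chance given the class frequencies, which is why it sits below the raw accuracy.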
#random forest package implementation
Forest_Model_2 <- randomForest(Beverage_category ~., Train_SBUX)
print(Forest_Model_2)
##
## Call:
## randomForest(formula = Beverage_category ~ ., data = Train_SBUX)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.54%
## Confusion matrix:
## Classic Espresso Drinks Coffee
## Classic Espresso Drinks 30 2
## Coffee 0 4
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 8 0
## Smoothies 0 0
## Tazo® Tea Drinks 0 0
## Frappuccino® Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 23
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 3
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Frappuccino® Blended Crème
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 9
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Frappuccino® Light Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 3
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Shaken Iced Beverages
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 1
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 15
## Signature Espresso Drinks 1
## Smoothies 0
## Tazo® Tea Drinks 0
## Signature Espresso Drinks Smoothies
## Classic Espresso Drinks 8 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 1 0
## Signature Espresso Drinks 16 0
## Smoothies 0 7
## Tazo® Tea Drinks 3 0
## Tazo® Tea Drinks class.error
## Classic Espresso Drinks 0 0.25000000
## Coffee 0 0.00000000
## Frappuccino® Blended Coffee 0 0.04166667
## Frappuccino® Blended Crème 0 0.00000000
## Frappuccino® Light Blended Coffee 1 0.57142857
## Shaken Iced Beverages 0 0.06250000
## Signature Espresso Drinks 2 0.40740741
## Smoothies 0 0.00000000
## Tazo® Tea Drinks 34 0.08108108
The OOB (out-of-bag) estimate gives an error rate of 17.54%, implying a model accuracy of 82.46%. We now run Forest_Model_2 against the test set.
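The out-of-bag idea rests on bootstrap sampling: each tree is grown on rows drawn with replacement, and the rows that were never drawn serve as a built-in test set for that tree. The base R sketch below (it does not call randomForest) simulates this to show that roughly 37% of rows are left out of each bootstrap sample; all names are illustrative.

```r
set.seed(123)

n <- 171  # number of training rows, as in Train_SBUX

# One bootstrap sample: n row indices drawn with replacement.
in_bag <- sample(seq_len(n), size = n, replace = TRUE)

# Rows never drawn are "out of bag" for the tree grown on this sample.
out_of_bag <- setdiff(seq_len(n), in_bag)
length(out_of_bag) / n

# Averaged over many bootstrap samples, the out-of-bag share approaches
# (1 - 1/n)^n, which is close to exp(-1), about 0.368.
oob_share <- replicate(2000, {
  drawn <- sample(seq_len(n), size = n, replace = TRUE)
  1 - length(unique(drawn)) / n
})
c(simulated = mean(oob_share), theoretical = (1 - 1 / n)^n)
```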
confusionMatrix(predict(Forest_Model_2, Test_SBUX), Test_SBUX$Beverage_category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Classic Espresso Drinks Coffee
## Classic Espresso Drinks 8 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 0
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Frappuccino® Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 5
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Blended Crème
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Light Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 3
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Shaken Iced Beverages
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 2
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Signature Espresso Drinks Smoothies
## Classic Espresso Drinks 3 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 1
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Tazo® Tea Drinks
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 1
## Smoothies 0
## Tazo® Tea Drinks 6
##
## Overall Statistics
##
## Accuracy : 0.8
## 95% CI : (0.6306, 0.9156)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : 4.113e-10
##
## Kappa : 0.7555
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Classic Espresso Drinks Class: Coffee
## Sensitivity 0.8000 NA
## Specificity 0.8800 1
## Pos Pred Value 0.7273 NA
## Neg Pred Value 0.9167 NA
## Prevalence 0.2857 0
## Detection Rate 0.2286 0
## Detection Prevalence 0.3143 0
## Balanced Accuracy 0.8400 NA
## Class: Frappuccino® Blended Coffee
## Sensitivity 0.8333
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 0.9667
## Prevalence 0.1714
## Detection Rate 0.1429
## Detection Prevalence 0.1429
## Balanced Accuracy 0.9167
## Class: Frappuccino® Blended Crème
## Sensitivity 1.00000
## Specificity 0.97059
## Pos Pred Value 0.50000
## Neg Pred Value 1.00000
## Prevalence 0.02857
## Detection Rate 0.02857
## Detection Prevalence 0.05714
## Balanced Accuracy 0.98529
## Class: Frappuccino® Light Blended Coffee
## Sensitivity 1.00000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 1.00000
## Prevalence 0.08571
## Detection Rate 0.08571
## Detection Prevalence 0.08571
## Balanced Accuracy 1.00000
## Class: Shaken Iced Beverages
## Sensitivity 1.00000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 1.00000
## Prevalence 0.05714
## Detection Rate 0.05714
## Detection Prevalence 0.05714
## Balanced Accuracy 1.00000
## Class: Signature Espresso Drinks Class: Smoothies
## Sensitivity 0.40000 1.00000
## Specificity 0.90000 1.00000
## Pos Pred Value 0.40000 1.00000
## Neg Pred Value 0.90000 1.00000
## Prevalence 0.14286 0.02857
## Detection Rate 0.05714 0.02857
## Detection Prevalence 0.14286 0.02857
## Balanced Accuracy 0.65000 1.00000
## Class: Tazo® Tea Drinks
## Sensitivity 0.8571
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 0.9655
## Prevalence 0.2000
## Detection Rate 0.1714
## Detection Prevalence 0.1714
## Balanced Accuracy 0.9286
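On the testing set, the accuracy of Forest_Model_2 decreases to 80% (95% CI: 63.06% to 91.56%), slightly below the 82.46% implied by the OOB estimate.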
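To make the Accuracy and Kappa figures in the output above easier to audit, the following sketch rebuilds the Forest_Model_2 test-set confusion matrix by hand and recomputes both statistics in base R. The shortened class labels and the helper name accuracy_and_kappa are our own illustrative choices and are not part of the original analysis. The counts are transcribed from the printed matrix.

```r
# Shortened class labels (hypothetical abbreviations, in the order caret prints them)
classes <- c("ClassicEsp", "Coffee", "FrapCoffee", "FrapCreme", "FrapLight",
             "ShakenIced", "SignatureEsp", "Smoothies", "TazoTea")

# matrix() fills column by column, so each line below is one Reference column;
# rows are the predicted categories.
rf_cm <- matrix(c(
  8, 0, 0, 0, 0, 0, 2, 0, 0,   # Reference: Classic Espresso Drinks
  0, 0, 0, 0, 0, 0, 0, 0, 0,   # Reference: Coffee (absent from the test set)
  0, 0, 5, 1, 0, 0, 0, 0, 0,   # Reference: Frappuccino Blended Coffee
  0, 0, 0, 1, 0, 0, 0, 0, 0,   # Reference: Frappuccino Blended Creme
  0, 0, 0, 0, 3, 0, 0, 0, 0,   # Reference: Frappuccino Light Blended Coffee
  0, 0, 0, 0, 0, 2, 0, 0, 0,   # Reference: Shaken Iced Beverages
  3, 0, 0, 0, 0, 0, 2, 0, 0,   # Reference: Signature Espresso Drinks
  0, 0, 0, 0, 0, 0, 0, 1, 0,   # Reference: Smoothies
  0, 0, 0, 0, 0, 0, 1, 0, 6    # Reference: Tazo Tea Drinks
), nrow = 9, dimnames = list(Prediction = classes, Reference = classes))

# Accuracy is the share of observations on the diagonal. Cohen's Kappa rescales
# accuracy by the agreement expected if predictions and references were independent.
accuracy_and_kappa <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  c(accuracy = observed, kappa = (observed - expected) / (1 - expected))
}

rf_stats <- accuracy_and_kappa(rf_cm)
print(round(rf_stats, 4))   # accuracy 0.8000, kappa 0.7555
```

With 28 of the 35 test observations on the diagonal, the hand calculation agrees with the values reported by confusionMatrix.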
SVM1<-svm(Beverage_category~., data = Train_SBUX, cost=1000, cross = 10, gamma=.001)
confusionMatrix(predict(SVM1, Test_SBUX), Test_SBUX$Beverage_category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Classic Espresso Drinks Coffee
## Classic Espresso Drinks 9 0
## Coffee 1 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 0 0
## Smoothies 0 0
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Frappuccino® Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 5
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Blended Crème
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Light Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 3
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Shaken Iced Beverages
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 2
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Signature Espresso Drinks Smoothies
## Classic Espresso Drinks 3 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 1
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Tazo® Tea Drinks
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 7
##
## Overall Statistics
##
## Accuracy : 0.8571
## 95% CI : (0.6974, 0.9519)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : 3.071e-12
##
## Kappa : 0.825
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Classic Espresso Drinks Class: Coffee
## Sensitivity 0.9000 NA
## Specificity 0.8800 0.97143
## Pos Pred Value 0.7500 NA
## Neg Pred Value 0.9565 NA
## Prevalence 0.2857 0.00000
## Detection Rate 0.2571 0.00000
## Detection Prevalence 0.3429 0.02857
## Balanced Accuracy 0.8900 NA
## Class: Frappuccino® Blended Coffee
## Sensitivity 0.8333
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 0.9667
## Prevalence 0.1714
## Detection Rate 0.1429
## Detection Prevalence 0.1429
## Balanced Accuracy 0.9167
## Class: Frappuccino® Blended Crème
## Sensitivity 1.00000
## Specificity 0.97059
## Pos Pred Value 0.50000
## Neg Pred Value 1.00000
## Prevalence 0.02857
## Detection Rate 0.02857
## Detection Prevalence 0.05714
## Balanced Accuracy 0.98529
## Class: Frappuccino® Light Blended Coffee
## Sensitivity 1.00000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 1.00000
## Prevalence 0.08571
## Detection Rate 0.08571
## Detection Prevalence 0.08571
## Balanced Accuracy 1.00000
## Class: Shaken Iced Beverages
## Sensitivity 1.00000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 1.00000
## Prevalence 0.05714
## Detection Rate 0.05714
## Detection Prevalence 0.05714
## Balanced Accuracy 1.00000
## Class: Signature Espresso Drinks Class: Smoothies
## Sensitivity 0.40000 1.00000
## Specificity 1.00000 1.00000
## Pos Pred Value 1.00000 1.00000
## Neg Pred Value 0.90909 1.00000
## Prevalence 0.14286 0.02857
## Detection Rate 0.05714 0.02857
## Detection Prevalence 0.05714 0.02857
## Balanced Accuracy 0.70000 1.00000
## Class: Tazo® Tea Drinks
## Sensitivity 1.0
## Specificity 1.0
## Pos Pred Value 1.0
## Neg Pred Value 1.0
## Prevalence 0.2
## Detection Rate 0.2
## Detection Prevalence 0.2
## Balanced Accuracy 1.0
The SVM1 model produces an accuracy rate of 85.71% or, equivalently, a 14.29% out-of-sample error rate when attempting to classify beverage category based on nutritional characteristics.
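Because the testing set holds only 35 drinks, the confidence interval around this accuracy is wide. As a rough check, the sketch below recomputes the interval and the test against the no-information rate with base R's binom.test, which is the function caret's confusionMatrix documentation names for these statistics. The variable names are our own, and the counts are read off the SVM1 output above.

```r
# Counts read from the SVM1 confusion matrix printed above
correct <- 9 + 0 + 5 + 1 + 3 + 2 + 2 + 1 + 7   # diagonal entries, 30 in total
n_test  <- 35                                   # size of the testing set
nir     <- 10 / n_test                          # largest reference class: Classic Espresso Drinks

acc <- correct / n_test

# Exact (Clopper-Pearson) 95% interval for the accuracy
ci <- binom.test(correct, n_test)$conf.int

# One-sided exact test of whether accuracy exceeds the no-information rate
p_nir <- binom.test(correct, n_test, p = nir, alternative = "greater")$p.value

print(round(acc, 4))
print(round(ci, 4))
print(p_nir)
```

The interval should line up with the (0.6974, 0.9519) shown in the output, which is a reminder that the 85.71% figure could plausibly sit anywhere from about 70% to 95% on new data.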
#tuning the SVM (validation)
svm_tune <- tune(svm, train.x=Train_SBUX[,-1], train.y=Train_SBUX[,1],
kernel="radial", ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))
print(svm_tune) ###printed cost=10 and gamma=.5
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 0.5
##
## - best performance: 0.2689542
#re-estimate the model with the optimally tuned parameters
SVM_RETUNE<-svm(Beverage_category~., data = Train_SBUX, cost=10, cross = 10, gamma=.5)
confusionMatrix(predict(SVM_RETUNE, Test_SBUX), Test_SBUX$Beverage_category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Classic Espresso Drinks Coffee
## Classic Espresso Drinks 7 0
## Coffee 1 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 2 0
## Smoothies 0 0
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Frappuccino® Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 4
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 1
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Blended Crème
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 1
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Frappuccino® Light Blended Coffee
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 1
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 2
## Shaken Iced Beverages 0
## Signature Espresso Drinks 0
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Shaken Iced Beverages
## Classic Espresso Drinks 0
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 1
## Signature Espresso Drinks 1
## Smoothies 0
## Tazo® Tea Drinks 0
## Reference
## Prediction Signature Espresso Drinks Smoothies
## Classic Espresso Drinks 4 0
## Coffee 0 0
## Frappuccino® Blended Coffee 0 0
## Frappuccino® Blended Crème 0 0
## Frappuccino® Light Blended Coffee 0 0
## Shaken Iced Beverages 0 0
## Signature Espresso Drinks 1 0
## Smoothies 0 1
## Tazo® Tea Drinks 0 0
## Reference
## Prediction Tazo® Tea Drinks
## Classic Espresso Drinks 1
## Coffee 0
## Frappuccino® Blended Coffee 0
## Frappuccino® Blended Crème 0
## Frappuccino® Light Blended Coffee 0
## Shaken Iced Beverages 0
## Signature Espresso Drinks 1
## Smoothies 0
## Tazo® Tea Drinks 5
##
## Overall Statistics
##
## Accuracy : 0.6286
## 95% CI : (0.4492, 0.7853)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : 2.555e-05
##
## Kappa : 0.5445
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Classic Espresso Drinks Class: Coffee
## Sensitivity 0.7000 NA
## Specificity 0.8000 0.97143
## Pos Pred Value 0.5833 NA
## Neg Pred Value 0.8696 NA
## Prevalence 0.2857 0.00000
## Detection Rate 0.2000 0.00000
## Detection Prevalence 0.3429 0.02857
## Balanced Accuracy 0.7500 NA
## Class: Frappuccino® Blended Coffee
## Sensitivity 0.6667
## Specificity 0.9655
## Pos Pred Value 0.8000
## Neg Pred Value 0.9333
## Prevalence 0.1714
## Detection Rate 0.1143
## Detection Prevalence 0.1429
## Balanced Accuracy 0.8161
## Class: Frappuccino® Blended Crème
## Sensitivity 1.00000
## Specificity 0.97059
## Pos Pred Value 0.50000
## Neg Pred Value 1.00000
## Prevalence 0.02857
## Detection Rate 0.02857
## Detection Prevalence 0.05714
## Balanced Accuracy 0.98529
## Class: Frappuccino® Light Blended Coffee
## Sensitivity 0.66667
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 0.96970
## Prevalence 0.08571
## Detection Rate 0.05714
## Detection Prevalence 0.05714
## Balanced Accuracy 0.83333
## Class: Shaken Iced Beverages
## Sensitivity 0.50000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 0.97059
## Prevalence 0.05714
## Detection Rate 0.02857
## Detection Prevalence 0.02857
## Balanced Accuracy 0.75000
## Class: Signature Espresso Drinks Class: Smoothies
## Sensitivity 0.20000 1.00000
## Specificity 0.83333 1.00000
## Pos Pred Value 0.16667 1.00000
## Neg Pred Value 0.86207 1.00000
## Prevalence 0.14286 0.02857
## Detection Rate 0.02857 0.02857
## Detection Prevalence 0.17143 0.02857
## Balanced Accuracy 0.51667 1.00000
## Class: Tazo® Tea Drinks
## Sensitivity 0.7143
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 0.9333
## Prevalence 0.2000
## Detection Rate 0.1429
## Detection Prevalence 0.1429
## Balanced Accuracy 0.8571
After re-running the SVM model with the optimally tuned parameters, we see an accuracy rate for SVM_RETUNE of 62.86%, a decrease of roughly 23 percentage points from our original SVM1 model.
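One way to look into this drop is to compare the grid that tune searched with the settings used for SVM1. The sketch below, in base R only, enumerates the grid from the tune call above, checks whether the SVM1 parameters were among the candidates, and converts the reported best performance (a cross-validated classification error) into an accuracy. The variable names are our own illustrative choices.

```r
# Candidate grid passed to tune(): cost = 0.1, 1, 10, 100 and gamma = 0.5, 1, 2
tune_grid <- expand.grid(cost = 10^(-1:2), gamma = c(0.5, 1, 2))

# Parameters used for the first model, SVM1
svm1_cost  <- 1000
svm1_gamma <- 0.001

# Was the SVM1 combination ever evaluated by the grid search?
in_grid <- any(tune_grid$cost == svm1_cost & tune_grid$gamma == svm1_gamma)

# Do the SVM1 values even fall inside the searched ranges?
cost_in_range  <- svm1_cost  >= min(tune_grid$cost)  && svm1_cost  <= max(tune_grid$cost)
gamma_in_range <- svm1_gamma >= min(tune_grid$gamma) && svm1_gamma <= max(tune_grid$gamma)

# "best performance" from tune() is the 10-fold cross-validated error rate
cv_accuracy <- 1 - 0.2689542

# Gap between the two test-set accuracies, in percentage points
drop_points <- (0.8571 - 0.6286) * 100

print(nrow(tune_grid))                      # 12 candidate combinations
print(c(in_grid, cost_in_range, gamma_in_range))
print(round(cv_accuracy, 4))                # 0.731
print(round(drop_points, 2))                # 22.85
```

Since neither SVM1 value falls inside the searched ranges, the grid search could not have returned the SVM1 settings. A wider grid, for example one covering smaller gamma values and larger cost values, would be needed before treating the tuned parameters as optimal.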
When performing our analysis for the classification models, we used the following methods: CART, Random Forest, and SVM. We ran two Random Forest models because we wanted to test for a difference between the packages (caret vs. randomForest). In the end, the SVM model was the most accurate, so we are choosing it as the best candidate for classification. That said, our SVM model was more accurate at its first stage: when we re-estimated it with the tuned parameters, we obtained a model with lower accuracy. This could be a result of the true optimal parameters lying outside the range we gave the tuning function; the SVM1 settings (cost = 1000, gamma = .001) were themselves outside the grid we searched (cost from 0.1 to 100; gamma of .5, 1, or 2). Our (Cohen’s) Kappa value followed a similar pattern. It was also highest in our SVM model (0.825), meaning that this model shows the strongest agreement with the true categories after accounting for the agreement expected from chance alone.
Our linear regression model did a solid job of explaining the variation in sugar content across items on the Starbucks Drink Menu. Some variables, such as calcium and fibre, explained more of the variation than one would expect. In contrast, other variables, such as carbs and calorie content, varied in their explanatory power as we tweaked the regression models.
For our classification model, we see that the Support Vector Machine algorithm does a good job of classifying an observation into the correct beverage category, reaching 85.71% accuracy on the testing set.