Implement a machine learning algorithm such that, given a beer style (IPA, Lager, etc.) as input, the algorithm outputs a new recipe for that style.
To facilitate this, existing beer brewing recipes have been scraped from brewing websites. These have been compiled into a dataset containing the (i) style, (ii) characteristics, (iii) ingredients and (iv) instructions for brewing each beer. The dataset provided here is a sample of the collected dataset. Two important assumptions for the dataset are:
The dataset contains enough recipes to train a reliable model.
The features (variables) in the dataset are sufficient for a brewer to replicate each beer.
Give an overview of your model’s architecture. Justify why this specific architecture is recommended.
Provide a description of the training dataset that you will use to train your model. This may include additional features and/or a different structure to the sample dataset that was provided. Justify your approach.
Briefly discuss the training algorithm that would be used to train your model.
Answer: I would compare the performance of several algorithms, such as logistic regression, support vector machines and random forests, all available in the scikit-learn package in Python. I am currently only familiar with single-layer neural nets, which is why I do not consider a deep learning algorithm.
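Although the answer mentions scikit-learn, the exploration below is done in R, so here is a minimal sketch of such a comparison using the caret package; train_df is a hypothetical re-shaped training table with the class label in a style column, standing in for the dataset described later.
library(caret)
# 5-fold cross-validation for each candidate model
ctrl <- trainControl(method = "cv", number = 5)
rf_fit  <- train(style ~ ., data = train_df, method = "rf", trControl = ctrl)
svm_fit <- train(style ~ ., data = train_df, method = "svmRadial", trControl = ctrl)
log_fit <- train(style ~ ., data = train_df, method = "multinom",
                 trControl = ctrl, trace = FALSE)
# Summarise cross-validated accuracy for the three candidates
summary(resamples(list(rf = rf_fit, svm = svm_fit, logistic = log_fit)))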
Answer: It is unlikely that the model would be able to find the patterns of a beer style that is underrepresented in the data. Hence, a beer recipe generated by the algorithm for that style won't necessarily taste like a beer of that style.
In case the consequences in question (3) are deemed to be detrimental to your model’s performance, explain how you would attempt to mitigate the effects.
Discuss the performance metrics that would be used to evaluate your model. Also explain the effects of the class distribution problem in question (3) on each of the performance metrics.
Answer: The two evaluation schemes I am familiar with are micro- and macro-averaging of per-class metrics such as precision, recall and F1. For imbalances in the available data for each beer style, macro-averaging is a good fit, since it weights every class equally, putting rare styles on an equal footing with common ones.
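A small base R sketch of the difference for precision, assuming hypothetical factor vectors truth and pred of true and predicted styles with identical levels:
classes <- levels(truth)
# Per-class true positives and false positives
tp <- sapply(classes, function(k) sum(pred == k & truth == k))
fp <- sapply(classes, function(k) sum(pred == k & truth != k))
# Macro-average: each class counts equally, however rare it is
macro_precision <- mean(tp / (tp + fp), na.rm = TRUE)
# Micro-average: each instance counts equally, so frequent styles dominate
micro_precision <- sum(tp) / (sum(tp) + sum(fp))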
Explain why a deep learning approach was not considered, or deemed unsuitable for the task.
Briefly discuss the differences (in terms of predictive modelling) between an image classification task and a non-image classification task (i.e. numeric and/or categorical data).
Answer: Numeric data, such as weight, forms an ordered set. Categorical data, such as the colors red, blue and green, cannot be ordered. Several options exist in Python to deal with categorical or nominal variables. One method is to create dummy variables and assign binary values to them. So, instead of having red, blue or green under the variable color, we can create dummy variables color_red, color_blue and color_green and assign each a value of 1 or 0.
This is something I was not aware of and will explore in the future.
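Since the exploration below is done in R, note that base R's model.matrix() builds the same dummy variables; a toy sketch with the illustrative color variable:
# A nominal variable with three levels
df <- data.frame(color = factor(c("red", "blue", "green", "red")))
# One 0/1 column per level; the "-1" drops the intercept so that every
# level, not just two of the three, gets its own dummy column
dummies <- model.matrix(~ color - 1, data = df)
colnames(dummies)  # "colorblue" "colorgreen" "colorred"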
Let us first load the data and look at all the column names.
# Load csv format data from file
file_location <- 'C:\\Users\\Windows\\Dropbox\\AllStuff\\Beer_Brewing_Problem\\Data\\beerrecipe.csv'
beer_data <- read.csv(file_location)
# Columns of the data frame
names(beer_data) <- tolower(names(beer_data))
names(beer_data)
## [1] "name" "orig_gravity"
## [3] "final_gravity" "abv"
## [5] "ibu" "srm"
## [7] "style" "fermentable1"
## [9] "fermentable2" "fermentable3"
## [11] "fermentable4" "f_amount1"
## [13] "f_amount2" "f_amount3"
## [15] "f_amount4" "firstworthops1"
## [17] "firstworthops2" "fwh_amount1"
## [19] "fwh_amount2" "boilhops1"
## [21] "boilhops2" "boilhops3"
## [23] "boilhops4" "bh_amount1"
## [25] "bh_amount2" "bh_amount3"
## [27] "bh_amount4" "boil_time1"
## [29] "boil_time2" "boil_time3"
## [31] "boil_time4" "dryhop1"
## [33] "dryhop2" "dryhop_time1"
## [35] "dryhop_time2" "mashtype"
## [37] "mashamount" "mashtime"
## [39] "mashtemp" "yeast"
## [41] "yeastattenuation" "yeasttemp"
## [43] "fermentationtemperature"
The beer names are irrelevant and can be eliminated. Let us extract the various styles of beer; these will serve as the discrete class labels for each recipe. We can also see that there are missing values that need to be imputed: in the American Lager data below, one fermentation temperature is missing (a simple imputation sketch follows the printout).
unique(beer_data$style)
## [1] American IPA Sweet Stout
## [3] Special/Best/Premium Bitter Belgian Pale Ale
## [5] American Lager Irish Red Ale
## [7] English IPA Blonde Ale
## [9] American Barleywine Belgian Tripel
## 10 Levels: American Barleywine American IPA ... Sweet Stout
beer_data[beer_data$style == "American Lager", ]
## name orig_gravity final_gravity abv ibu srm style
## 6 American Lager 1.046 1.012 4.54 24.13 2.77 American Lager
## 7 Aus Lager 1.042 1.007 4.68 25.43 2.98 American Lager
## fermentable1 fermentable2 fermentable3 fermentable4 f_amount1
## 6 American - Pilsner Flaked Rice 3500
## 7 American - Pilsner American - Wheat 3800
## f_amount2 f_amount3 f_amount4 firstworthops1 firstworthops2 fwh_amount1
## 6 1000 NA NA NA
## 7 400 NA NA NA
## fwh_amount2 boilhops1 boilhops2 boilhops3 boilhops4 bh_amount1
## 6 NA Galena 15
## 7 NA Pride of Ringwood 20
## bh_amount2 bh_amount3 bh_amount4 boil_time1 boil_time2 boil_time3
## 6 NA NA NA 60 NA NA
## 7 NA NA NA NA NA NA
## boil_time4 dryhop1 dryhop2 dryhop_time1 dryhop_time2 mashtype mashamount
## 6 NA Infusion 11l
## 7 NA 28l
## mashtime mashtemp
## 6 60 68
## 7 90 64
## yeast yeastattenuation
## 6 DCL Yeast S-189 - SafLager German Lager 75.00%
## 7 Fermentis / Safale - Saflager - German Lager Yeast S-23 82.00%
## yeasttemp fermentationtemperature
## 6 19-22 NA
## 7 8.9-22 15
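Before training, such gaps need to be filled. A minimal sketch of per-style mean imputation in base R, assuming fermentationtemperature was read in as a numeric column:
# Replace a missing fermentation temperature with the mean value of
# recipes of the same style
beer_data$fermentationtemperature <- ave(
  beer_data$fermentationtemperature, beer_data$style,
  FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))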
Step 1: Re-shape the dataset.
beer_data[, c("fermentable1", "f_amount1")]
## fermentable1 f_amount1
## 1 American - Pale 2-Row 512
## 2 American - Pale 2-Row 4990
## 3 Canadian - Pale 2-Row 2950
## 4 United Kingdom - Maris Otter Pale 2400
## 5 German - Wheat Malt 3630
## 6 American - Pilsner 3500
## 7 American - Pilsner 3800
## 8 German - Pilsner 3000
## 9 United Kingdom - Maris Otter Pale 4990
## 10 American - Pale 2-Row 4130
## 11 United Kingdom - Maris Otter Pale 6580
## 12 American - Caramel / Crystal 60L 100
## 13 Belgian - Pilsner 6000
I would change the format of the dataset, as mentioned earlier, by creating dummy variables fermentable1_AmericanPale2Row, fermentable1_CanadianPale2Row, etc., with the amount used stored directly under each. The amount would be 0 if the ingredient was not used. This would make training the model more convenient, although there are perhaps other ways of getting around the fact that fermentable1 and f_amount1 are currently decoupled.
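A minimal sketch of that re-shaping for the fermentable columns, assuming the dplyr and tidyr packages are available; the same pattern would extend to the hop, mash and yeast column groups.
library(dplyr)
library(tidyr)
fermentables_wide <- beer_data %>%
  mutate(recipe_id = row_number()) %>%
  select(recipe_id, style, matches("^fermentable[1-4]$|^f_amount[1-4]$")) %>%
  # One row per (recipe, ingredient slot) pair
  pivot_longer(cols = -c(recipe_id, style),
               names_to = c(".value", "slot"),
               names_pattern = "(fermentable|f_amount)([1-4])") %>%
  filter(!is.na(fermentable), fermentable != "") %>%
  # One amount column per distinct fermentable; 0 when a recipe omits it
  pivot_wider(id_cols = c(recipe_id, style),
              names_from = fermentable, values_from = f_amount,
              names_prefix = "fermentable_", values_fill = 0,
              values_fn = sum)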
Step 2: Train various predictive models: I would test random forests, support vector machines and logistic regression.
Step 3: Evaluate each model using macro-/micro-averaged metrics.
Step 4: Decide on a model based on performance.
Step 1: Use L1 regularization in conjunction with logistic regression to find out which features are essential to a given beer style. Then find the set of variables outside of the essential features that have been used at one time or another in the same style of beer.
Step 2: For essential numerical values, generate a random number from a normal distribution centered on the mean of the training dataset, with the corresponding standard deviation.
Step 3: For essential categorical values (i.e. ingredients), determine the amount used from a normal distribution with mean and standard deviation estimated from the training data.
Step 4: For each non-essential but previously used ingredient, use a binary random variable to determine whether to include the ingredient, followed by a normally distributed random amount around the mean with the corresponding standard deviation. Once this phase is complete, we will have a suggestion for a beer recipe of that style.
Step 5: The next step is to have the machine learning algorithm predict the class of this generated recipe to check that it passes as the intended style (a sketch of Steps 2-5 follows).
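A hedged sketch of Steps 2-5, assuming the re-shaped, all-numeric table from Step 1 of the modelling pipeline: style_rows holds the training rows of the target style, essential_cols and optional_cols come from the L1 step (e.g. the non-zero coefficients from glmnet's cv.glmnet(..., family = "multinomial", alpha = 1)), and best_model stands for the classifier selected earlier. All of these names are illustrative.
generate_recipe <- function(style_rows, essential_cols, optional_cols) {
  recipe <- list()
  # Steps 2-3: draw each essential feature from a normal distribution
  # fitted to this style's training data
  for (col in essential_cols) {
    recipe[[col]] <- rnorm(1, mean = mean(style_rows[[col]], na.rm = TRUE),
                           sd = sd(style_rows[[col]], na.rm = TRUE))
  }
  # Step 4: include each optional ingredient with probability equal to
  # its historical frequency in this style, truncating amounts at zero
  for (col in optional_cols) {
    used <- style_rows[[col]] > 0
    if (rbinom(1, 1, mean(used, na.rm = TRUE)) == 1) {
      recipe[[col]] <- max(0, rnorm(1,
                                    mean = mean(style_rows[[col]][used], na.rm = TRUE),
                                    sd = sd(style_rows[[col]][used], na.rm = TRUE)))
    } else {
      recipe[[col]] <- 0
    }
  }
  as.data.frame(recipe)
}
# Step 5: sanity check - the classifier should assign the generated
# recipe to the intended style
new_recipe <- generate_recipe(style_rows, essential_cols, optional_cols)
predict(best_model, newdata = new_recipe)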
Make the beer and taste it.