Official Cookie Overview

Author

Julian (VP) & Julia (President) from Montgomery College Data Science Club

Welcome to our Data Science Club Project Page!!!!!!!

Join our club! We meet on Thursdays at 3pm in SW 304 (Rockville Campus science building). Here is a link to our club Discord and GitHub repository (most datasets are on this fork right now).

Our Spring semester project involves working with a dataset of about 10,000 cookie recipes scraped from the internet. We’re done with the web scraping but happy to share how we did it if you’re curious.

Our goal is to analyze patterns in cookie recipes and correlations between ingredients, ingredient quantities, and recipe ratings. Ultimately we want to build a statistical model that predicts a cookie recipe’s rating (our stand-in for quality) from its ingredients, so we can come up with the best, most average, and worst cookie recipes we can. We can then bake the cookies and survey people’s opinions of them.

Right now we’re in the data preparation (data cleaning) phase. We need help! More information will be coming, but for now, please come to the club or join our Discord if you’re interested! You can use any coding language you want. We also need to peer review the code that gets used to clean the data to make sure it’s being done right (which is another thing we need help with).

Informal AI Guideline

Because we want to make sure that we’re cleaning the data properly, and since this project is just for fun, we prefer that people avoid using AI excessively. We have no problem with using it as a tool or sharing AI-generated code to adjust, but please avoid using it to write your code for you. If you send AI-generated scripts, just say they’re AI-generated. This is informal, so there’s obviously no “punishment” for violating it. Just please don’t do that; it’s a bit annoying.

Data

Overall, we have 39 datasets and a total of 11294 rows. Most datasets have about 200 rows. We’re going to need to clean nearly all the columns and filter out rows that aren’t actually cookie recipes. This document includes descriptions of every individual data source/dataset, but almost all of the cleaning will be done on one combined dataset.

Overall issues

All (or nearly all) datasets have these columns (full definition on GitHub):

title     | author | rating | ratingnum | prep    | cook    | total        | yield      | totalingredients | ingredient1
Choc Chip | Sally  | 4.7    | 1813      | 15 mins | 20 mins | 2 hrs 5 mins | 26 cookies | 10               | 2 stick butter

Some of them have additional columns like genre, course, cuisine, number of steps, and list of steps. The “ingredientX” columns go up to the total number of ingredients. Some also include dates (posted, updated, etc.). These date columns don’t need to be cleaned; they’re only for reference while working with the data.

Here are some issues:

First and foremost, my computer crashed when trying to join all the datasets together, so we’ll have to figure out how to do that 🙂.
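We don't know exactly what caused the crash, but if the goal is just one long table, stacking the datasets row by row is much lighter than joining them on shared columns. Here's a rough Python sketch using only the standard library. It assumes each dataset is saved as a UTF-8 CSV with a header row; the function names and the extra `source` column are made up for illustration.

```python
import csv


def combined_header(paths):
    """Collect every column name across the files, in first-seen order."""
    header = ["source"]  # assumed extra column: which file a row came from
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for col in next(csv.reader(f), []):
                if col not in header:
                    header.append(col)
    return header


def stack_datasets(paths, out_path):
    """Append all rows from every CSV into one file, one row at a time.

    Columns a dataset lacks are left empty, and only the current row is
    held in memory. Returns the number of rows written.
    """
    header = combined_header(paths)
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        # extrasaction="ignore" skips stray cells in rows longer than their header
        writer = csv.DictWriter(
            out, fieldnames=header, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    row["source"] = path
                    writer.writerow(row)
                    written += 1
    return written
```

Calling `stack_datasets(list_of_csv_paths, "combined.csv")` would write one combined file (the output name is just an example). Because rows are appended rather than matched, a recipe that shows up in two datasets will show up twice in the result.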

  1. Ingredients that are not ingredients: Some (probably most) datasets have things like “filling:”, “optional”, “shredded”, etc. in their ingredientX columns. These also get counted in totalingredients for those recipes.
    How to fix this:
  • Remove everything that lacks numbers, then comb through what was removed (because some of those entries say “salt” or “sprinkles”, which many recipe authors don’t give measurements for)
  • Remove everything that ends with “:”. In R, ingredients[!grepl(":$", ingredients)] is one way to do that: it keeps every ingredient that does not end with “:” (:$ is regex meaning “ends with :”)
  • For every ingredient removed, subtract 1 from that recipe’s totalingredients, then shift the remaining ingredients back so the ingredientX columns have no gaps
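The three steps above could look something like this in Python. This is a minimal sketch, not a reviewed cleaning script: it assumes each recipe is a dict with `ingredient1`, `ingredient2`, ... keys holding strings, and the function names are hypothetical. Instead of subtracting 1 per removal it recounts what is left, which should give the same number as long as totalingredients originally counted every non-empty cell. As a side effect it also closes gaps left by empty or NA cells.

```python
import re
import unicodedata

ING_COL = re.compile(r"^ingredient(\d+)$")  # matches ingredient1, ingredient2, ...


def has_number(text):
    """True if the text contains a digit or a fraction symbol like ½."""
    return any(unicodedata.numeric(ch, None) is not None for ch in text)


def clean_ingredients(ingredients):
    """Return (kept, review).

    kept   -- entries that look like real ingredients, in original order
    review -- number-less entries set aside to check by hand ("salt", "optional")
    """
    kept, review = [], []
    for item in ingredients:
        text = (item or "").strip()
        if not text or text.upper() == "NA":
            continue  # empty cell: drop it so later ingredients shift left
        if text.endswith(":"):
            continue  # section label such as "filling:"
        if not has_number(text):
            review.append(text)
            continue
        kept.append(text)
    return kept, review


def repack_row(row):
    """Rewrite the ingredientX columns without gaps and recount totalingredients."""
    cols = sorted(
        (c for c in row if ING_COL.match(c)),
        key=lambda c: int(ING_COL.match(c).group(1)),
    )
    kept, review = clean_ingredients([row[c] for c in cols])
    new_row = dict(row)
    for i, col in enumerate(cols):
        new_row[col] = kept[i] if i < len(kept) else ""
    new_row["totalingredients"] = len(kept)
    return new_row, review
```

Anything that lands in `review` still needs a human look, since "salt" and "shredded" both end up there and only one of them is an ingredient.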
  2. Empty ingredient columns in the middle of a row: Some datasets (like Cookie Rookie) have empty or NA ingredientX columns in the middle of a row. For example, ingredient1-5 hold actual ingredients, ingredient6-9 are NA or empty, and then ingredient10-14 hold ingredients again.
  3. Different sources use different terms for the same ingredients, e.g. “caster sugar” vs “superfine sugar”, or “all purpose flour” vs “white flour”. The AllRecipes dataset has a standard way of referring to ingredients (thankfully), so that’s not a concern for that dataset (and we might want to follow the AllRecipes standard when cleaning everything else).
  4. Fractions are sometimes written as symbols in the ingredients, and they all need to be converted to decimals (Julia can fix this).
  5. Measurements need to be converted to grams. Most recipes use cups, tablespoons, etc. Here is an extensive ingredient conversion chart from King Arthur Baking, and here’s a Python ingredient conversion program on GitHub that I haven’t looked at yet.
  6. A lot of recipes are NOT ACTUALLY COOKIES!!!!!!!!!
    How to fix this:
  • Separate the datasets into those with a category column (many of them have a column that says if it’s a cookie or something else) and those without
  • In the datasets without, keep only recipes that have cookie words in the title. What I wrote before was grepl("biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti", title, ignore.case = TRUE), although “biscuit” is risky. “ookie” catches cookie, brookie, and other puns, including pizookie
  • Comb through what was removed, searching for sugar, molasses, honey, sweetener, etc., and pick out actual cookie recipes that got filtered out. Most cookies will turn up under sugar, but some use honey or molasses instead of sugar, e.g. the Greek cookie moustokouloura, whose name isn’t an immediately recognizable “cookie” word
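For anyone not working in R, here is a Python version of the same title filter plus the second pass over what it removes. It's a sketch under a few assumptions: each recipe is a dict with a `title` string and an `ingredients` list, the sweetener search runs over the ingredient text, and the function names and sample recipes are made up.

```python
import re

# Same word list as the grepl() call above, case-insensitive.
COOKIE_WORDS = re.compile(
    r"biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti",
    re.IGNORECASE,
)
SWEETENERS = re.compile(r"sugar|molasses|honey|sweetener", re.IGNORECASE)


def split_by_title(recipes):
    """Return (kept, removed) depending on whether the title has a cookie word."""
    kept, removed = [], []
    for recipe in recipes:
        if COOKIE_WORDS.search(recipe["title"]):
            kept.append(recipe)
        else:
            removed.append(recipe)
    return kept, removed


def rescue_candidates(removed):
    """Removed recipes that mention a sweetener, to be checked by hand."""
    return [
        recipe
        for recipe in removed
        if any(SWEETENERS.search(item) for item in recipe["ingredients"])
    ]


# Made-up sample rows, just to show the flow.
recipes = [
    {"title": "Pizookie", "ingredients": ["1 cup sugar"]},
    {"title": "Moustokouloura", "ingredients": ["1 cup grape molasses"]},
    {"title": "Roast Chicken", "ingredients": ["1 whole chicken", "2 tsp salt"]},
]
kept, removed = split_by_title(recipes)
candidates = rescue_candidates(removed)
```

The second pass only narrows down what to read; a honey-glazed dinner recipe would be flagged too, so the final call is still manual.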
  7. Measurement and ingredient will need to be separated in some way for modeling.
  8. Not every dataset has prep/cook/total time set up properly. Some websites used the same HTML element for different things, so some datasets have the label (prep, cook, etc.) stored in the rows alongside the time. These need to be reshaped into proper prep/cook/total columns (we managed to do something like this before, so we can do it again).
  9. The “yield” or “servings” columns differ between datasets, and the values vary a lot in format (e.g. “servings: 28 cookies” vs “servings: 28” vs “servings: 4 5-cookie servings”).
  10. Some recipes MIGHT be AI. I don’t think so, but I have encountered AI cookie recipes while working on this. It would be great if someone could click through the links to the data sources below and make sure they don’t see any clearly AI-generated recipes in there. If you find anything like that, send it in the club Discord!
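For the fraction symbols and the measurement/ingredient split mentioned in the list above, here's a minimal Python sketch. It assumes the quantity comes first in each ingredient string, and the `UNITS` set is a short, incomplete list invented for the example. It does not convert anything to grams and does not handle ranges like "2-3 tbsp".

```python
import re
import unicodedata
from fractions import Fraction

# Hypothetical, incomplete unit list; extend it as new units turn up.
UNITS = {
    "cup", "cups", "tbsp", "tablespoon", "tablespoons",
    "tsp", "teaspoon", "teaspoons", "stick", "sticks",
    "oz", "ounce", "ounces", "g", "gram", "grams",
}

MIXED = re.compile(r"^(?:(\d+(?:\.\d+)?)\s+)?(\d+)/(\d+)\s*(.*)$")  # "1 1/2 ..." or "1/2 ..."
PLAIN = re.compile(r"^(\d+(?:\.\d+)?)\s*(.*)$")                     # "2 ..." or "0.5 ..."


def spell_out_fractions(text):
    """Rewrite symbols like ½ as ' 1/2 ' so one parser handles both styles."""
    out = []
    for ch in text:
        if "VULGAR FRACTION" in unicodedata.name(ch, ""):
            frac = Fraction(unicodedata.numeric(ch)).limit_denominator(10)
            out.append(f" {frac} ")
        else:
            out.append(ch)
    return "".join(out).strip()


def split_quantity(text):
    """Return (decimal quantity or None, rest of the string)."""
    text = spell_out_fractions(text)
    m = MIXED.match(text)
    if m and int(m.group(3)) != 0:
        whole = float(m.group(1)) if m.group(1) else 0.0
        return whole + int(m.group(2)) / int(m.group(3)), m.group(4)
    m = PLAIN.match(text)
    if m:
        return float(m.group(1)), m.group(2)
    return None, text


def parse_ingredient(text):
    """Split an ingredient string into quantity, unit, and name."""
    qty, rest = split_quantity(text)
    first, _, remainder = rest.partition(" ")
    unit = first.lower().rstrip(".")
    if unit in UNITS:
        return {"quantity": qty, "unit": unit, "name": remainder.strip()}
    return {"quantity": qty, "unit": None, "name": rest}
```

Entries with no leading number (like a bare "salt") come back with `quantity` set to `None`, so they can be counted or reviewed separately rather than silently treated as zero.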

One of the goals is to shrink every recipe down to a 1-cookie recipe by dividing the ingredient quantities by the listed yield. Bringing every recipe to approximately the same total weight might also be useful.
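Both ideas can be sketched in a few lines of Python. This assumes the ingredient quantities have already been converted to grams and stored in a dict, that a bare number in the yield means a number of cookies, and that the yield strings look like the three examples listed earlier; the function names and the 100 g target are hypothetical.

```python
import re

PER_SERVING = re.compile(r"(\d+)\s+(\d+)-cookie", re.IGNORECASE)  # "4 5-cookie servings"
FIRST_INT = re.compile(r"\d+")


def cookies_per_batch(yield_text):
    """Pull a cookie count out of a yield/servings string, or None if there is none."""
    m = PER_SERVING.search(yield_text)
    if m:
        return int(m.group(1)) * int(m.group(2))  # servings x cookies per serving
    m = FIRST_INT.search(yield_text)
    return int(m.group()) if m else None


def per_cookie(grams, yield_text):
    """Divide every quantity by the yield to get a 1-cookie recipe."""
    n = cookies_per_batch(yield_text)
    if not n:
        return None  # unknown or zero yield: leave for manual review
    return {name: qty / n for name, qty in grams.items()}


def scale_to_weight(grams, target=100.0):
    """Rescale a recipe so its ingredients sum to the target weight in grams."""
    total = sum(grams.values())
    if total == 0:
        return None
    return {name: qty * target / total for name, qty in grams.items()}
```

Scaling by weight doesn't depend on the yield column at all, which may make it a useful fallback for recipes where the yield can't be parsed.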

Original Datasets

Most cleaning will be done on all the data joined into one dataset, but the problems with each individual dataset are detailed here. Some parts of the cleaning may also be easier to do per dataset (like filtering out non-cookie recipes or fixing the ingredient columns, maybe).

Combined dataset