Julian (VP) & Julia (President) from Montgomery College Data Science Club
Welcome to our Data Science Club Project Page!!!!!!!
Join our club! We meet on Thursdays at 3pm in SW 304 (Rockville Campus science building). Here is a link to our club Discord and GitHub repository (most datasets are on this fork right now).
Our Spring semester project involves working with a dataset of about 10,000 cookie recipes scraped from the internet. We’re done with the web scraping but happy to share how we did it if you’re curious.
Our goal is to analyze patterns in cookie recipes and correlations between ingredients, ingredient quantities, and recipe ratings. Ultimately we want to build a statistical model that predicts a cookie recipe’s quality (as measured by its ratings) from its ingredients, so we can come up with the best, most average, and worst cookie recipes we can. We can then bake the cookies and survey people’s opinions of them.
Right now we’re in the data preparation (data cleaning) phase. We need help! More information is coming, but for now, please come to the club or join our Discord if you’re interested! You can use any coding language you want. We also need people to peer review the code used to clean the data, to make sure it’s being done right.
Informal AI Guideline
Because we want to make sure that we’re cleaning the data properly, and since this project is just for fun, we prefer that people avoid using AI excessively. We have no problem with using it as a tool or sharing AI-generated code for others to adjust, but please avoid using it to write your code for you. If you send AI-generated scripts, just say they’re AI-generated. This is informal, so there’s obviously no “punishment” for violating it. Just please don’t; it’s a bit annoying.
Data
Overall, we have 39 datasets with a total of 11,294 rows. Most datasets have about 200 rows. We’re going to need to clean nearly all the columns and filter out rows that aren’t actually cookie recipes. This document includes descriptions of every individual data source/dataset, but almost all of the cleaning will be done on one combined dataset.
Overall issues
All (or nearly all) datasets have these columns (full definition on GitHub), shown here with an example value for each:
title: Choc Chip
author: Sally
rating: 4.7
ratingnum: 1813
prep: 15 mins
cook: 20 mins
total: 2 hrs 5 mins
yield: 26 cookies
totalingredients: 10
ingredient1: 2 stick butter
Some of them have additional columns like genre, course, cuisine, number of steps, and list of steps. The “ingredientX” columns go up to the total number of ingredients. Some also include date posted/updated/etc. These date columns don’t need to be cleaned; they’re only for reference while working with the data.
Here are some issues:
First and foremost, my computer crashed when trying to join all the datasets together, so we’ll have to figure out how to do that 🙂.
Ingredients that are not ingredients: Some (probably most) datasets have things like “filling:”, “optional”, “shredded” etc included in their ingredientX columns. These are also counted in totalingredients for these recipes.
How to fix this:
Remove everything that lacks numbers and comb through what’s removed (because some of them say “salt” or “sprinkles”, which many recipe authors don’t give measurements for)
Remove everything that ends with “:” (ingredients[!grepl(":$", ingredients)] is one way to do that - literally read as “remove every ingredient that ends with :”, :$ is regex meaning “ends with :”)
Make this an if statement or whatever and subtract 1 from totalingredients for every ingredient removed (and then shift everything back in the ingredientX columns)
Empty ingredient columns in the middle of a row: Some datasets (like Cookie Rookie) have empty or NA ingredientX columns in the middle of a row. For example, ingredient1-5 hold actual ingredients, ingredient6-9 are NA or empty, and then ingredient10-14 hold ingredients again.
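Here is a rough sketch of how the two ingredient fixes above (dropping non-ingredients and closing gaps) could fit together for a single recipe. It is Python rather than R, and it assumes each recipe is a plain dict keyed by column name. The names clean_ingredients and rebuild_row are made up for illustration. Entries without any number are pulled out of the row but returned in a separate review list, so a person can put real ingredients like “salt” back.

```python
import re

EMPTY = {"", "na", "nan", "none"}
# A digit or a fraction symbol counts as "has a number".
HAS_NUMBER = re.compile(r"\d|[¼½¾⅓⅔⅛⅜⅝⅞]")


def clean_ingredients(values):
    """Return (kept, review) for one recipe's ingredientX values, in order."""
    kept, review = [], []
    for value in values:
        text = (value or "").strip()
        if text.lower() in EMPTY:
            continue                  # gap in the middle of the row
        if text.endswith(":"):
            continue                  # section label such as "filling:"
        if not HAS_NUMBER.search(text):
            review.append(text)       # no quantity: set aside to check by hand
            continue
        kept.append(text)
    return kept, review


def rebuild_row(row):
    """Shift kept ingredients left and recount totalingredients."""
    names = sorted(
        (key for key in row if re.fullmatch(r"ingredient\d+", key)),
        key=lambda name: int(name[len("ingredient"):]),
    )
    kept, review = clean_ingredients([row[name] for name in names])
    for position, name in enumerate(names):
        row[name] = kept[position] if position < len(kept) else ""
    row["totalingredients"] = len(kept)
    return row, review
```

Recounting totalingredients from what is left avoids having to subtract 1 for every removal, which is easy to get wrong when gaps and labels overlap.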
Different sources use different terms for the same ingredients. Eg, “caster sugar” vs “superfine sugar”, “all purpose flour” vs “white flour”. The AllRecipes dataset has a standard way that they refer to ingredients (thankfully) so that’s not a concern for that dataset (and we might want to follow the standard for AllRecipes when cleaning everything else)
Fractions are sometimes written as symbols in the ingredients (eg “½”), and they all need to be converted to decimals (Julia can fix this.)
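In case it helps, here is one possible way to do the fraction conversion in Python. This is only a sketch, not the club’s actual fix. It handles fraction symbols (“½”), plain fractions (“1/2”), and mixed numbers (“1 1/2” or “1½”), and rounds to three decimals. The function name fractions_to_decimals is made up. One known weak spot: a string like “2 1/2-inch pieces” would be read as 2.5, so the results are worth spot checking.

```python
import re
import unicodedata

VULGAR = "¼½¾⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞"

# Either: optional whole number + fraction symbol ("1½", "1 ½", "¾"),
# or: optional whole number, whitespace, then a/b ("1 1/2", "1/3").
FRACTION = re.compile(
    rf"(?:(\d+)\s*)?([{VULGAR}])|(?<![\d.])(?:(\d+)\s+)?(\d+)/(\d+)"
)


def _to_decimal(match):
    whole_a, symbol, whole_b, numerator, denominator = match.groups()
    if symbol:
        value = int(whole_a or 0) + unicodedata.numeric(symbol)
    else:
        if int(denominator) == 0:
            return match.group(0)     # leave nonsense like "1/0" untouched
        value = int(whole_b or 0) + int(numerator) / int(denominator)
    # Three decimals, then trim trailing zeros ("1.500" becomes "1.5").
    return f"{value:.3f}".rstrip("0").rstrip(".")


def fractions_to_decimals(text):
    """Replace every fraction in an ingredient string with a decimal."""
    return FRACTION.sub(_to_decimal, text)
```

For example, fractions_to_decimals("1½ cups flour") gives "1.5 cups flour", and strings without fractions come back unchanged.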
A lot of recipes are NOT ACTUALLY COOKIES!!!!!!!!!
How to fix this:
Separate datasets into data with a category column (many of them have a column that says if it’s a cookie or something else) and datasets without
In the datasets without, filter recipes that have cookie words in the title (what I wrote before was grepl("biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti", title, ignore.case = TRUE), although “biscuit” is risky. “ookie” includes both cookie and brookie and other puns - including pizookie)
Search through what is removed and search for sugar, molasses, honey, sweetener etc and pick out actual cookie recipes that were removed (most cookies will appear just under sugar, but some cookies have honey or molasses and not sugar, eg the greek cookie moustokouloura, which isn’t called an immediately recognizable “cookie” word)
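The filter-then-rescue steps above could look something like this in Python. The title pattern is the same one as the grepl() call, just moved to Python’s re module. The recipe shape (a dict with “title” and an “ingredients” list) and the function names are assumptions for the example. The rescue step only produces candidates to look at by hand, since plenty of non-cookie desserts contain sugar too.

```python
import re

COOKIE_TITLE = re.compile(
    r"biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti",
    re.IGNORECASE,
)
SWEETENER = re.compile(r"sugar|molasses|honey|sweetener", re.IGNORECASE)


def split_by_title(recipes):
    """Split recipes into (kept, removed) based on cookie words in the title."""
    kept, removed = [], []
    for recipe in recipes:
        if COOKIE_TITLE.search(recipe["title"]):
            kept.append(recipe)
        else:
            removed.append(recipe)
    return kept, removed


def rescue_candidates(removed):
    """Removed recipes that use a sweetener, for a person to review."""
    return [
        recipe
        for recipe in removed
        if any(SWEETENER.search(item) for item in recipe["ingredients"])
    ]
```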
Measurement and ingredient will need to be separated in some way for modeling
Not every dataset has prep/cook/total time set up properly. Some websites used the same html element to refer to different things, so some datasets have the label prep/cook/etc time in the rows. These need to be turned into columns and rows (we managed to do something like this before, so we can do it again.)
The “yield” or “servings” columns are formatted differently across datasets, and the values vary a lot (eg, “servings: 28 cookies”, vs “servings: 28”, vs “servings: 4 5-cookie servings”)
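A first pass at turning those yield strings into one number might look like this. It is a sketch that only covers the formats mentioned in this document (plain counts, “N M-cookie servings”, and “dozen”). parse_yield is a made-up name, and anything it can’t read comes back as None so it can be checked by hand. It assumes a bare number like “servings: 28” means 28 cookies, which won’t always be true.

```python
import re


def parse_yield(text):
    """Best-guess number of cookies from a yield/servings string, or None."""
    if not text:
        return None
    lowered = text.lower()

    # "4 5-cookie servings" means 4 servings of 5 cookies each.
    match = re.search(r"(\d+)\s+(\d+)-cookie", lowered)
    if match:
        return int(match.group(1)) * int(match.group(2))

    # "about 5 dozen" means 5 * 12.
    match = re.search(r"(\d+)\s*dozen", lowered)
    if match:
        return int(match.group(1)) * 12

    # Otherwise take the first whole number in the string.
    match = re.search(r"\d+", lowered)
    return int(match.group()) if match else None
```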
Some recipes MIGHT be AI. I don’t think so, but I have encountered AI cookie recipes while working on this. It would be great if someone could click through the links to data sources below and make sure they don’t see any clearly AI generated recipes in there. If you find anything like that, send it in the club Discord!
One of the goals is to shrink every recipe down into a 1-cookie recipe by dividing the ingredients by the listed yield, but bringing every recipe to approximately the same weight might also be useful
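To make the 1-cookie idea concrete, here is a small Python sketch that splits the leading quantity off each ingredient and divides it by the yield. It assumes fractions have already been converted to decimals and that the yield is already a number. Both function names are made up. Ingredients with no leading quantity (like “sprinkles”) are passed through with a quantity of None. It does not convert units, so it does not address bringing recipes to the same weight.

```python
import re

LEADING_QUANTITY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.*)$")


def split_quantity(ingredient):
    """Split '2 stick butter' into (2.0, 'stick butter')."""
    match = LEADING_QUANTITY.match(ingredient)
    if not match:
        return None, ingredient.strip()
    return float(match.group(1)), match.group(2).strip()


def per_cookie(ingredients, yield_count):
    """Scale every ingredient quantity down to a single cookie."""
    if not yield_count or yield_count <= 0:
        raise ValueError("yield_count must be a positive number")
    scaled = []
    for ingredient in ingredients:
        quantity, rest = split_quantity(ingredient)
        scaled.append((None if quantity is None else quantity / yield_count, rest))
    return scaled
```

For example, a recipe that yields 4 cookies and calls for “2 stick butter” ends up with 0.5 stick butter per cookie.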
Original Datasets
Most cleaning will be done to all the data joined into one dataset, but the problems with each are detailed here. Some parts of the cleaning may also be easier to do individually (like filtering for non cookie recipes or fixing ingredients columns - maybe)
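Since joining everything at once already crashed one computer, here is one possible lower-memory approach, sketched in Python. We don’t actually know what caused the crash, so this assumes it was memory. The idea is to read only the header of each file first to build the full column list, then copy rows across one at a time, leaving blanks where a dataset lacks a column. The function name, the file names, and the extra “source” column are all made up for the example.

```python
import csv
import os
import tempfile


def combine_csvs(paths, out_path):
    """Stack CSV files row by row without loading them all into memory."""
    # Pass 1: read only each header to build the union of column names.
    columns = ["source"]
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for name in next(csv.reader(f), []):
                if name not in columns:
                    columns.append(name)

    # Pass 2: stream rows one at a time; missing columns are left blank.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    row["source"] = os.path.basename(path)
                    writer.writerow(row)
    return columns


# Tiny demo with two made-up files that have different columns.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "site_a.csv")
    b = os.path.join(tmp, "site_b.csv")
    with open(a, "w", newline="", encoding="utf-8") as f:
        f.write("title,rating,ingredient1\nChoc Chip,4.7,2 stick butter\n")
    with open(b, "w", newline="", encoding="utf-8") as f:
        f.write("title,ingredient1,ingredient2\nShortbread,1 cup flour,1 cup butter\n")
    out_path = os.path.join(tmp, "combined.csv")
    columns = combine_csvs([a, b], out_path)
    with open(out_path, newline="", encoding="utf-8") as f:
        combined_rows = list(csv.DictReader(f))
```

A row with more values than its header has columns will make the writer raise an error instead of silently dropping data, which is probably what we want while the scrapes are still messy.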
Air Fryer Brownies: 1
Air Fryer Chocolate Chip Cookies: 1
Bagels & Doughnuts: 1
Cake & Cupcakes: 3
Candy: 2
Christmas: 10
Cookies, Brownies, & Bars: 141
Easter: 2
Father's Day: 1
Halloween: 4
Keto Peanut Butter Cookies: 1
No Bake Desserts: 6
One Pan Desserts: 2
Quick & Easy: 1
Thanksgiving: 1
Trifles & Parfaits: 1
Valentine's Day: 1
Cookies, brownies, and bars. Scraped from the cookie/brownie/bar section: https://amandascookin.com/category/recipes/desserts/cookies-brownies-and-bars
Should all be cookie recipes. Scraped from the official cookie section.
title rating ratingnum numberofsteps
Length:359 Min. :3.000 Min. : 1.00 Min. : 6.00
Class :character 1st Qu.:4.000 1st Qu.: 13.00 1st Qu.:11.00
Mode :character Median :4.500 Median : 23.00 Median :12.00
Mean :4.333 Mean : 60.79 Mean :13.34
3rd Qu.:4.500 3rd Qu.: 54.50 3rd Qu.:14.00
Max. :5.000 Max. :2611.00 Max. :34.00
totalingredients
Min. : 6.00
1st Qu.:27.00
Median :30.00
Mean :31.53
3rd Qu.:36.00
Max. :63.00
Notes
1. All ingredients & steps are repeated three (3) times!
2. “Author” column needs brief cleaning; formatted like this: Watch Video_By_Charles Kelsey_Staff Pick_Comments
3. The time and yield columns are also formatted oddly. This website doesn’t have prep/cook/total like most. The time column is basically total time, but some values mention cooling, eg, “1 hour, plus 20 minutes cooling” or “1 hour, plus 9 hours chilling and cooling”
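Notes 1 and 2 could be handled along these lines in Python. Both functions are guesses based only on what is written above: extract_author assumes the name always follows a “By” piece between underscores, and collapse_triplicate assumes the whole list is repeated back to back three times. If the list doesn’t split into three identical thirds, it is returned unchanged rather than guessed at.

```python
def extract_author(raw):
    """Pull 'Charles Kelsey' out of 'Watch Video_By_Charles Kelsey_Staff Pick_Comments'."""
    parts = [part.strip() for part in raw.split("_")]
    for index, part in enumerate(parts[:-1]):
        if part.lower() == "by":
            return parts[index + 1]
    return None   # no "By" marker found: leave for manual checking


def collapse_triplicate(items):
    """Return one copy of a list that is repeated three times in a row."""
    third, remainder = divmod(len(items), 3)
    if third and remainder == 0 and items[:third] == items[third:2 * third] == items[2 * third:]:
        return items[:third]
    return items
```

After collapsing, totalingredients and the number of steps would need to be recounted, since they currently include the repeats.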
Should all be cookies. Scraped from the cookie category: https://www.amummytoo.co.uk/category/cookies
title rating ratingnum numberofsteps
Length:60 Min. :4.500 Min. : 1.000 Min. : 7.00
Class :character 1st Qu.:5.000 1st Qu.: 1.000 1st Qu.: 9.00
Mode :character Median :5.000 Median : 1.000 Median :10.00
Mean :4.909 Mean : 2.389 Mean :11.47
3rd Qu.:5.000 3rd Qu.: 2.000 3rd Qu.:13.00
Max. :5.000 Max. :17.000 Max. :22.00
NA's :6 NA's :6
totalingredients
Min. : 6.00
1st Qu.:14.00
Median :16.00
Mean :17.13
3rd Qu.:20.00
Max. :34.00
Bread, Cookies: 1
Breakfast: 2
cakes and bakes: 2
Cookies: 32
Cookies, Dessert, Snack: 1
Cookies, Dessert, Snacks: 1
Cookies, Desserts and sweet treats: 2
Cookies, Easter: 5
Cookies, Festive makes: 2
Cookies, halloween: 1
Dessert, Snack: 3
Desserts and sweet treats: 3
Festive makes: 1
Snack: 3
Snacks: 1

American: 40
American, British: 1
Austrian: 1
British: 11
Italian: 2
Scottish: 5
rating ratingnum numberofsteps totalingredients
Min. :3.000 Min. : 1.00 Min. : 2.000 Min. : 5.0
1st Qu.:4.740 1st Qu.: 3.00 1st Qu.: 6.000 1st Qu.:10.0
Median :4.920 Median : 6.00 Median : 8.000 Median :12.0
Mean :4.811 Mean : 19.36 Mean : 8.144 Mean :12.5
3rd Qu.:5.000 3rd Qu.: 15.00 3rd Qu.:10.000 3rd Qu.:14.0
Max. :5.000 Max. :441.00 Max. :27.000 Max. :23.0
American: 47
American, Baking, Chocolate, Cookies: 1
American, Baking, Cookies: 3
Baking, Chocolate, Cookies: 4
Baking, Chocolate, Cookies, Italian: 1
Baking, Cookies: 2
Baking, Cookies, Lemon: 1
Candy, Chocolate: 1
Cheesecake: 1
Chocolate: 1
Cookies: 40
French: 2
Italian: 1

Breakfast: 3
Dessert: 102
Should all be cookies. Scraped from the cookie section: https://bakerbynature.com/recipe-index/?_desserts=cookies
NOT all recipes are cookies. Website was scraped from a search for “cookies”.
title rating ratingnum totalingredients
Length:298 Min. :2.500 Min. :5 Min. : 1.00
Class :character 1st Qu.:4.000 1st Qu.:5 1st Qu.: 7.00
Mode :character Median :4.500 Median :5 Median : 9.00
Mean :4.354 Mean :5 Mean : 9.57
3rd Qu.:4.800 3rd Qu.:5 3rd Qu.:11.00
Max. :5.000 Max. :5 Max. :31.00
NA's :21 NA's :21
American: 55
American, Asian: 7
American, Asian, Filipino: 1
American, Asian, French: 1
American, Asian, Japanese: 2
American, Filipino: 2
American, Filipino, French: 1
American, French: 1
Asian: 4
Asian, Filipino: 1
Asian, French: 1
Filipino: 4

Dessert: 80
Should all be cookies. Scraped from the cookie category: https://bitesbybianca.com/category/dessert/cookies
title rating ratingnum totalingredients
Length:137 Length:137 Min. : 1.00 Min. : 3.0
Class :character Class :character 1st Qu.: 1.25 1st Qu.:10.0
Mode :character Mode :character Median : 4.00 Median :11.0
Mean : 10.09 Mean :11.9
3rd Qu.: 10.00 3rd Qu.:14.0
Max. :281.00 Max. :26.0
NA's :31
Should all be cookies. Scraped from cookie section: https://www.bostongirlbakes.com/category/cookies
Ingredients were scraped incorrectly, so there will be non-ingredients in the ingredientX columns, and they will be counted in totalingredients.
There is no NA in the ratingnum column. Instead, where rating is NA, ratingnum is 0. This is actually ideal; most of the datasets don’t do this and just have NA for both.
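If we want the other datasets to follow the same convention (ratingnum of 0 when there is no rating), something like this Python sketch could do it. It assumes each row is a dict and that missing values show up as empty strings, None, or the text “NA”. The function name is made up.

```python
MISSING = {"", "na", "nan", "none"}


def normalise_rating_count(row):
    """Set ratingnum to 0 when both rating and ratingnum are missing."""
    rating = str(row.get("rating") or "").strip().lower()
    count = str(row.get("ratingnum") or "").strip().lower()
    if rating in MISSING and count in MISSING:
        row["ratingnum"] = 0
    return row
```

The rating itself stays missing, since a recipe with no ratings has no average to report.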
rating ratingnum servings numberofsteps totalingredients
Min. :3.670 Min. : 1.00 Min. :24 Min. : 4.000 Min. : 5.00
1st Qu.:4.910 1st Qu.: 7.00 1st Qu.:24 1st Qu.: 7.000 1st Qu.: 9.00
Median :5.000 Median :11.00 Median :24 Median : 9.000 Median :11.00
Mean :4.897 Mean :14.25 Mean :24 Mean : 9.731 Mean :11.23
3rd Qu.:5.000 3rd Qu.:17.00 3rd Qu.:24 3rd Qu.:11.000 3rd Qu.:13.00
Max. :5.000 Max. :72.00 Max. :24 Max. :27.000 Max. :20.00
NA's :92
American: 3
American, International: 1
Austrian: 2
Brazilian: 1
British, Scottish: 1
Eastern European: 1
Est European: 2
Filipino: 1
French: 4
French, International: 1
Greek: 6
Greek, International: 2
International: 21
Italian: 3
Portuguese: 2
Romanian: 1
Russian: 1
South American, Spanish: 1
Spanish: 1
Swiss: 1
Turkish: 3
Venezuelan: 1
Should all be cookies. Scraped from cookie section: https://www.chefspencil.com/recipe-courses/dessert/cookies
“number of steps” seems like it might be inaccurate (although this applies to any datasets with the number of steps)
rating ratingnum numberofsteps totalingredients
Min. :4.000 Min. : 1.00 Min. : 1.000 Min. : 2.00
1st Qu.:4.980 1st Qu.: 8.00 1st Qu.: 2.000 1st Qu.: 7.00
Median :5.000 Median : 26.50 Median : 3.000 Median :12.00
Mean :4.963 Mean : 61.53 Mean : 3.336 Mean :11.89
3rd Qu.:5.000 3rd Qu.: 70.50 3rd Qu.: 4.000 3rd Qu.:16.00
Max. :5.000 Max. :610.00 Max. :10.000 Max. :34.00
NA's :2 NA's :2
Unique items in course
Appetizer, hors d’oeuvres: 1
Appetizer, Main Course, Side Dish, Soup: 1
Appetizer, Side Dish, Snack: 1
Appetizer, Snack: 3
Beverages, Drinks: 1
bread, Side Dish: 1
Breakfast: 4
Breakfast, Brunch: 1
Breakfast, Brunch, Dessert: 1
Breakfast, Brunch, Main Course: 1
Breakfast, Dessert: 2
Breakfast, Dessert, Snack: 1
Brunch, Dessert, Lunch: 1
Candy, condiment, Dessert: 1
Candy, condiment, Ingredient: 1
Candy, condiment, Snack: 1
Candy, Dessert: 1
Candy, Dessert, Snack: 1
condiment: 10
condiment, dip, Dressing, Sauce: 1
condiment, dip, glaze: 1
condiment, Seasoning: 1
condiment, Seasoning Blend: 1
condiment, Syrup: 1
Dessert: 41
Dessert, Ingredient: 1
Dessert, Main Course: 2
Dessert, Snack: 2
Dessert, Tea: 2
Dog Treats: 1
Entree, Main Dish: 1
Entree, Main Dish, Soup: 1
Ingredient: 2
Main Course: 4
Main Dish: 2
Main Dish, Stew: 1
Salad: 2
Salad, Side Dish: 1
Salad, Side Dish, Starter: 1
Seasoning Blend, Spice Mix: 1
Seasonings: 1
Side Dish: 2
Side Dish, vegetables: 1
Soup: 4
Unique items in cuisine
All: 4
All, American: 1
American: 24
American, Italian: 1
American, Southern: 1
Asian, Chinese: 1
Australian, New Zealand: 1
Austrian, French, German: 1
Austrian, German: 4
Austrian, German, Italian: 1
British: 2
British, Cornish: 1
British, english: 4
British, Scottish: 1
Chinese: 1
danish: 1
english: 2
French: 5
French, German, Italian: 1
French, Italian: 1
French, Provencal: 1
German: 20
German, Italian: 1
German, Swiss: 1
Greek: 2
Greek, Sephardic Jewish: 1
Indian: 1
Italian: 2
Jewish: 1
Mexican: 2
Middle Eastern: 1
Portuguese: 1
Welsh: 1
Not all cookies. Scraped from the search results for “cookie” on the website https://www.daringgourmet.com/?s=cookie
The ingredient columns include things like “Day#:” (for multiple day long recipes) as their own ingredient. This adds to totalingredients for the rows that have it. Filter for entries ending in “:” to remove these.
“number of steps” is incorrect on a decent amount of recipes because of weird website formatting
Filtering title by ‘cookie’ might not work with this dataset, but filtering by searching for “cookie” in any row might (some of the titles say “Authentic Pfeffernüsse”, but pfeffernüsse is a cookie)
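Searching every column instead of just the title could be sketched like this in Python. The row is assumed to be a dict, and the column names in the usage example (like “description”) are made up. One thing to watch for: if a dataset keeps the recipe URL or the search term in a column, every row might match, so those columns may need to be left out first.

```python
import re

COOKIE_WORDS = re.compile(
    r"biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti",
    re.IGNORECASE,
)


def mentions_cookie(row):
    """True if any field of the row contains a cookie word."""
    return any(
        COOKIE_WORDS.search(str(value))
        for value in row.values()
        if value is not None
    )
```

For example, a row titled “Authentic Pfeffernüsse” passes as long as some other field, such as a description, mentions cookies.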
rating ratingnum actualsteps ingredient1
Min. :2.000 Min. : 1.000 Length:704 Length:704
1st Qu.:5.000 1st Qu.: 2.000 Class :character Class :character
Median :5.000 Median : 2.000 Mode :character Mode :character
Mean :4.837 Mean : 3.822
3rd Qu.:5.000 3rd Qu.: 5.000
Max. :5.000 Max. :31.000
NA's :411 NA's :411
American: 673
American, Austrian: 1
American, Danish: 1
American, Dutch: 1
American, French: 1
American, Mexican: 1
Cookies: 2
Danish: 2
dog treats: 1
French: 4
German: 1
Italian: 2
Scottish: 1
SHOULD be all cookie recipes, because the website only has cookie recipes, but this was scraped off the website’s search results/recipe index. Also some of these are dog cookies
- ingredients scraped properly
- steps included
- very few recipes have ratings
- empty “customtime” column (remove)
rating ratingnum totalingredients
Min. :4.000 Min. : 1.00 Min. : 8.00
1st Qu.:4.670 1st Qu.: 2.00 1st Qu.:10.00
Median :4.880 Median : 4.00 Median :11.00
Mean :4.791 Mean : 10.36 Mean :11.28
3rd Qu.:5.000 3rd Qu.: 8.00 3rd Qu.:12.00
Max. :5.000 Max. :105.00 Max. :16.00
Macarons: 10
Madeleines: 3

American: 8
Austrian: 1
French: 14
German: 1
Italian: 1
title rating ratingnum numberofsteps
Length:24 Min. :4.670 Min. :1.00 Min. : 3.00
Class :character 1st Qu.:5.000 1st Qu.:1.75 1st Qu.: 6.00
Mode :character Median :5.000 Median :2.00 Median : 8.50
Mean :4.967 Mean :2.70 Mean : 8.25
3rd Qu.:5.000 3rd Qu.:3.00 3rd Qu.:10.00
Max. :5.000 Max. :8.00 Max. :14.00
NA's :4 NA's :4
totalingredients
Min. : 4.00
1st Qu.: 7.75
Median :11.00
Mean :10.46
3rd Qu.:12.00
Max. :16.00
Contains cookies without “cookie” in title (“Sitto’s Date Ma’amoul”)
rating ratingnum totalingredients
Min. :4.47 Min. : 1.00 Min. : 2.00
1st Qu.:4.99 1st Qu.: 3.00 1st Qu.:11.00
Median :5.00 Median : 8.00 Median :12.00
Mean :4.97 Mean : 27.08 Mean :12.61
3rd Qu.:5.00 3rd Qu.: 18.00 3rd Qu.:14.00
Max. :5.00 Max. :1122.00 Max. :27.00
rating ratingnum totalingredients
Min. :4.670 Min. : 1.0 Min. : 2.00
1st Qu.:4.960 1st Qu.: 19.0 1st Qu.: 8.25
Median :4.990 Median : 70.0 Median :10.00
Mean :4.968 Mean : 372.5 Mean :10.88
3rd Qu.:5.000 3rd Qu.: 238.0 3rd Qu.:13.00
Max. :5.000 Max. :6053.0 Max. :22.00
NA's :5 NA's :5
American: 106
American, British: 2
American, english, French: 1
American, French: 2
American, Greek: 1
American, Italian: 1
American, russian: 1
Austrian: 1
British: 1
British, Scottish: 3
French: 3
German: 4
Greek: 1
Italian: 7
Jewish: 1
latin: 1
Mexican: 1
Mexican, Swedish: 1
Notes
- The ingredients weren’t scraped properly, so some of them are “sections”, or bits of html eg:
Cookies / For Decorating / Optional Topping Before Baking / Topping / Easy Icing / Rolling / Maple Icing / /wp:list
An easy way to filter for these will be to isolate ingredients that don’t have any numbers in them, but there are also ingredients like “Assorted sprinkles” that have no digits and are ingredients
⚠️ Every non-ingredient will have added 1 to totalingredients for that recipe. That will need to be fixed (either by subtracting 1 every time you remove an ingredient or by redoing totalingredients after removing all non-ingredients)
⚠️ The ingredientX columns will need to be shifted when ingredients are removed from a recipe
Code used to fix this for another recipe should be usable for this too
Should all be cookies. Scraped from the cookie section: https://www.tasteofhome.com/recipes/dishes-beverages/cookies
Notes:
1. Has both “author” and “recipe submitter” (and tester, but that column can be removed)
2. The “time” and “yield” categories are off. This is what the first time column looks like: Total Time:Prep: 15 min. Bake: 10 min./batch , this is what the second time column looks like: Yield:about 5 dozen Prep:15 min Cook:10 min . “Total time” wasn’t a number given on the recipes, so it will have to be calculated by adding together the other time values.
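Here is a sketch of how the two time formats in note 2 could be parsed and added up in Python. It looks for “Prep:”, “Cook:”, or “Bake:” labels, converts what follows to minutes, and sums them into a total. Treating “Bake” as cook time is an assumption, and so are the function names. Since bake times are given per batch, the total will be too low for recipes that bake several batches.

```python
import re

# A label, a colon, then one or more "number + unit" pieces.
LABELLED_TIME = re.compile(
    r"\b(prep|cook|bake)\s*:\s*"
    r"((?:\d+\s*(?:hours?|hrs?|h|minutes?|mins?|min|m)\b\.?\s*)+)",
    re.IGNORECASE,
)


def to_minutes(text):
    """Convert text like '2 hrs 5 mins' to minutes (125)."""
    total = 0
    for number, unit in re.findall(r"(\d+)\s*([hm])", text.lower()):
        total += int(number) * (60 if unit == "h" else 1)
    return total


def parse_times(text):
    """Return prep/cook minutes found in the text plus their sum as 'total'."""
    times = {}
    for label, duration in LABELLED_TIME.findall(text):
        key = "cook" if label.lower() == "bake" else label.lower()
        times[key] = times.get(key, 0) + to_minutes(duration)
    times["total"] = sum(times.values())
    return times
```

Both example strings from note 2 give a prep of 15, a cook of 10, and a total of 25 minutes.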
rating ratingnum servings numberofsteps
Min. :4.000 Min. : 1.00 Min. : 4.00 Min. : 5.000
1st Qu.:5.000 1st Qu.: 5.00 1st Qu.:16.00 1st Qu.: 7.000
Median :5.000 Median : 9.00 Median :22.00 Median : 9.000
Mean :4.986 Mean : 17.12 Mean :20.72 Mean : 9.198
3rd Qu.:5.000 3rd Qu.: 18.00 3rd Qu.:24.00 3rd Qu.:10.000
Max. :5.000 Max. :178.00 Max. :40.00 Max. :26.000
NA's :12 NA's :12
totalingredients
Min. : 4.00
1st Qu.: 9.00
Median :11.00
Mean :10.94
3rd Qu.:12.00
Max. :39.00
American: 65
American, Austrian: 1
American, Austrian, French: 2
American, European: 1
American, European, French: 1
American, French: 2
American, German: 2
American, Italian: 1
American, Jewish: 1
American, Mexican: 1
American, Middle Eastern: 2
Austrian: 1
British, European, UK: 1
European: 14
European, Scottish: 1
French: 8
French, Italian: 1
Indian: 1
International: 8
Israeli: 3
Italian: 2
Mediterranean, Spanish: 2
Middle Eastern: 3
Scottish: 2