Julian (VP) & Julia (President) from MC Data Science Club
Welcome to our Data Science Club Project Page!!!!!!!
Join our club! We meet on Thursdays at 3pm in SW 304 (Rockville Campus science building). Here is a link to our club Discord and GitHub repository (most datasets are on this fork right now). (open links in new tab)
Our Spring semester project involves working with a dataset of about 10,000 cookie recipes scraped from the internet. We’re done with the web scraping but happy to share how we did it if you’re curious.
Our goal is to analyze patterns in cookie recipes and correlations between ingredients, ingredient quantities, and recipe ratings. Ultimately we want to build a statistical model that predicts a cookie recipe’s quality (as measured by its ratings) from its ingredients, so we can come up with the best, most average, and worst cookie recipes we can. We can then bake the cookies and survey people’s opinions of them.
Right now we’re in the data preparation (data cleaning) phase. We need help! More information is coming, but for now, please come to the club or join our Discord if you’re interested! You can use any coding language you want. We also need to peer review the code used to clean the data to make sure it’s being done right (which is another thing we need help with).
Informal AI Guideline
Because we want to make sure that we’re cleaning the data properly, and since this project is just for fun, we prefer that people avoid using AI excessively. We have no problem with using it as a tool or sharing AI generated code to adjust, but please avoid using it to write your code for you. If you send AI generated scripts, just say they’re AI generated. This is informal, so there’s obviously no “punishment” for violating it. Just please don’t do that; it’s a bit annoying.
Data
Overall, we have 58 datasets and a total of 12,346 rows. Most datasets have about 200 rows. We’re going to need to clean nearly all the columns and filter out rows that aren’t actually cookie recipes. This document describes every individual data source/dataset, but almost all of the cleaning will be done on one combined dataset.
Overall issues
All (or nearly all) datasets have these columns (full definition to be added on GitHub):
title | author | rating | ratingnum | prep | cook | total | yield | totalingredients | ingredient1
Choc Chip | Sally | 4.7 | 1813 | 15 mins | 20 mins | 2 hrs 5 mins | 26 cookies | 10 | 2 stick butter
Some of them have additional columns like genre, course, cuisine, number of steps, and list of steps. The “ingredientX” columns go up to the total number of ingredients. Some also include date posted/updated/etc. Those date columns don’t need to be cleaned; they’re only for reference while working with the data.
Here are some issues:
Ingredients that are not ingredients: Some datasets have things like “filling:”, “optional”, “shredded” etc. included in their ingredientX columns. These are also counted in totalingredients for those recipes.
How to fix this:
Remove everything that lacks numbers, then comb through what was removed (some of it will be real ingredients like “salt” or “sprinkles”, which many recipe authors don’t give measurements for)
Remove everything that ends with “:” (ingredients[!grepl(":$", ingredients)] is one way to do that - literally read as “keep every ingredient that doesn’t end with :”; :$ is regex meaning “ends with :”)
For each row where something was removed, subtract 1 from totalingredients per removed entry (and then shift the remaining entries back in the ingredientX columns so there are no gaps)
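The fix steps above could be sketched in R roughly like this. It only covers the “ends with :” rule, the recipes data frame is a made-up two-row example (not club data), and clean_row is a hypothetical helper name, so treat it as a starting point rather than the final cleaning script.

```r
# Made-up example rows; only the column names follow the real datasets.
recipes <- data.frame(
  title = c("Choc Chip", "Filled Cookies"),
  totalingredients = c(3, 4),
  ingredient1 = c("2 stick butter", "filling:"),
  ingredient2 = c("1 cup sugar", "1 cup jam"),
  ingredient3 = c("salt", "dough:"),
  ingredient4 = c(NA, "2 cups flour"),
  stringsAsFactors = FALSE
)

ing_cols <- grep("^ingredient[0-9]+$", names(recipes), value = TRUE)

# Hypothetical helper: drop entries ending in ":" and shift the rest left.
clean_row <- function(ingredients) {
  kept <- ingredients[!is.na(ingredients) & !grepl(":$", ingredients)]
  c(kept, rep(NA_character_, length(ingredients) - length(kept)))
}

# Count filled entries before and after cleaning each row.
before <- rowSums(!is.na(recipes[ing_cols]))
cleaned <- t(apply(recipes[ing_cols], 1, clean_row))

for (j in seq_along(ing_cols)) {
  recipes[[ing_cols[j]]] <- cleaned[, j]
}

# Subtract 1 from totalingredients for every entry that was removed.
removed <- before - rowSums(!is.na(cleaned))
recipes$totalingredients <- recipes$totalingredients - removed
```

In this example the second recipe loses “filling:” and “dough:”, so its totalingredients drops from 4 to 2 and its two real ingredients move into ingredient1 and ingredient2. Entries without numbers (like “salt”) are left alone here because they need a manual look first.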
NOTE: Most of this has been fixed by rewriting the code to scrape the websites that have this issue. If you look through the datasets and see any non-ingredients in the ingredient columns, please let us know!
Different sources use different terms for the same ingredients, e.g., “caster sugar” vs “superfine sugar”, or “all purpose flour” vs “white flour”. The AllRecipes dataset (thankfully) has a standard way of referring to ingredients, so that’s not a concern for that dataset, and we might want to follow the AllRecipes standard when cleaning everything else.
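One possible way to handle this is a lookup table of variants and standard terms. The sketch below is only an illustration: the synonyms vector and the standardize_ingredient function are hypothetical names, and which term counts as “standard” is an assumption until it is checked against the AllRecipes wording.

```r
# Hypothetical synonym table: names are variants, values are the standard term.
# The two pairs come from the examples above; the direction is an assumption.
synonyms <- c(
  "caster sugar" = "superfine sugar",
  "white flour"  = "all purpose flour"
)

# Replace every known variant inside each ingredient string.
standardize_ingredient <- function(x) {
  for (variant in names(synonyms)) {
    x <- gsub(variant, synonyms[[variant]], x, ignore.case = TRUE)
  }
  x
}

standardized <- standardize_ingredient(
  c("1 cup Caster Sugar", "2 cups white flour", "1 egg")
)
```

Because gsub matches substrings, a variant like “white flour” would also be replaced inside longer phrases, so the table should be reviewed before it is run on the combined dataset.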
Fractions are sometimes written as symbols in the ingredients, and all of them need to be converted to decimals (Julia can fix this).
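As a rough idea of what that conversion could look like, here is a sketch that turns fraction symbols into decimals, including mixed numbers where a whole number comes right before the symbol. The symbol list is partial, the function name fractions_to_decimal is made up for this example, and rounding to 3 decimals is an arbitrary choice.

```r
# Fraction symbols as Unicode escapes, in this order: 1/2, 1/3, 2/3, 1/4, 3/4, 1/8.
# This is a partial list; the real data may contain others.
fraction_symbols <- c("\u00bd", "\u2153", "\u2154", "\u00bc", "\u00be", "\u215b")
fraction_values  <- c(1/2, 1/3, 2/3, 1/4, 3/4, 1/8)

fractions_to_decimal <- function(x) {
  for (i in seq_along(fraction_symbols)) {
    # Optional whole number (and optional space) followed by the symbol.
    pattern <- paste0("(([0-9]+) ?)?", fraction_symbols[i])
    hits <- gregexpr(pattern, x)
    regmatches(x, hits) <- lapply(regmatches(x, hits), function(found) {
      # The second capture group is the whole number; empty means none.
      whole <- as.numeric(sub(pattern, "\\2", found))
      whole[is.na(whole)] <- 0
      as.character(round(whole + fraction_values[i], 3))
    })
  }
  x
}

converted <- fractions_to_decimal(
  c("1\u00bd cups flour", "\u00bc tsp salt", "2 \u00be cups sugar", "1 egg")
)
```

Fractions typed with a slash (like 1/2) are not touched by this sketch and would need their own rule.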
A lot of recipes are NOT ACTUALLY COOKIES!!!!!!!!!
How to fix this:
Separate the datasets into those with a category column (many of them have a column that says if it’s a cookie or something else) and those without
In the datasets without, keep recipes that have cookie words in the title (what I wrote before was grepl("biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti", title, ignore.case = TRUE), although “biscuit” is risky. “ookie” includes both cookie and brookie and other puns - including pizookie)
Search through what was removed for sugar, molasses, honey, sweetener, etc. and pick out actual cookie recipes that were removed (most cookies will show up under sugar alone, but some cookies have honey or molasses and not sugar, e.g. the Greek cookie moustokouloura, whose name isn’t an immediately recognizable “cookie” word)
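The title filter and the “rescue” search described above could look something like this. The four rows are invented for illustration (the ingredient values are not taken from any real recipe), and the sweetener pattern only contains the words listed above.

```r
# Invented example rows, not club data.
recipes <- data.frame(
  title = c("Brown Butter Cookies", "Moustokouloura", "Beef Stew", "Pizookie"),
  ingredient1 = c("1 cup sugar", "1 cup molasses", "2 lb beef", "1 cup sugar"),
  ingredient2 = c("2 stick butter", "4 cups flour", "1 onion", NA),
  stringsAsFactors = FALSE
)

cookie_words <- "biscuit|ookie|shortbread|bars|snickerdoodle|dodgers|biscotti"
sweeteners <- "sugar|molasses|honey|sweetener"

# Step 1: keep rows whose title contains a cookie word.
is_cookie_title <- grepl(cookie_words, recipes$title, ignore.case = TRUE)
kept <- recipes[is_cookie_title, ]
removed <- recipes[!is_cookie_title, ]

# Step 2: paste each removed row's ingredients together so one search covers them.
ing_cols <- grep("^ingredient[0-9]+$", names(removed), value = TRUE)
all_ingredients <- apply(removed[ing_cols], 1, paste, collapse = " ")

# Removed rows that mention a sweetener are candidates to check by hand.
to_review <- removed[grepl(sweeteners, all_ingredients, ignore.case = TRUE), ]
```

Here “Moustokouloura” fails the title filter but lands in to_review because of the molasses, while “Beef Stew” is dropped for good. The review step is still manual; the code only narrows down what to look at.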
Measurement and ingredient will need to be separated in some way for modeling
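One way this separation might work is a regular expression that pulls out a leading quantity, an optional unit, and the rest of the text. This is a sketch under assumptions: the unit list is a guess that would need extending after looking at the real data, split_ingredient is a made-up name, and it expects quantities to already be decimals.

```r
# Assumed unit list; extend it after looking at the real data.
units <- c("cups?", "sticks?", "teaspoons?", "tablespoons?",
           "tsp", "tbsp", "ounces?", "oz", "g")

pattern <- paste0(
  "^\\s*([0-9]+\\.?[0-9]*)\\s*",            # quantity, e.g. 2 or 1.5
  "(", paste(units, collapse = "|"), ")?",  # optional unit
  "\\s+(.*)$"                               # the ingredient itself
)

split_ingredient <- function(x) {
  parts <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  matched <- lengths(parts) > 0

  quantity <- rep(NA_real_, length(x))
  unit <- rep(NA_character_, length(x))
  ingredient <- x  # entries that do not parse (e.g. "salt") are kept whole

  quantity[matched] <- as.numeric(vapply(parts[matched], `[`, character(1), 2))
  unit[matched] <- tolower(vapply(parts[matched], `[`, character(1), 3))
  unit[!is.na(unit) & unit == ""] <- NA_character_
  ingredient[matched] <- vapply(parts[matched], `[`, character(1), 4)

  data.frame(quantity, unit, ingredient, stringsAsFactors = FALSE)
}

parsed <- split_ingredient(c("2 stick butter", "1.5 cups flour", "3 eggs", "salt"))
```

Rows with no quantity come back with NA in both quantity and unit, which makes them easy to pull out for a manual look.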
Not every dataset has prep/cook/total time set up properly. Some websites used the same html element to refer to different things, so some datasets have the label prep/cook/etc time in the rows. These need to be turned into columns and rows (we managed to do something like this before, so we can do it again.)
The “yield” or “servings” columns differ between datasets and their values are very inconsistent (e.g., “servings: 28 cookies” vs “servings: 28” vs “servings: 4 5-cookie servings”)
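A possible first pass at these values is to pull out every number and flag anything that isn’t a single number for manual review. The sketch below multiplies the numbers together, which happens to work for the three examples above (4 servings of 5 cookies is 20 cookies) but is only a heuristic assumption: it would be wrong for ranges like “24-30”. The function name parse_yield is made up.

```r
parse_yield <- function(x) {
  # All runs of digits in each string.
  numbers <- regmatches(x, gregexpr("[0-9]+", x))
  data.frame(
    raw = x,
    # Heuristic: multiply the numbers found; NA when there are none.
    cookies = vapply(
      numbers,
      function(n) if (length(n)) prod(as.numeric(n)) else NA_real_,
      numeric(1)
    ),
    # Anything other than exactly one number deserves a manual look.
    needs_review = lengths(numbers) != 1,
    stringsAsFactors = FALSE
  )
}

yields <- parse_yield(c(
  "servings: 28 cookies",
  "servings: 28",
  "servings: 4 5-cookie servings"
))
```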
Some recipes have ingredients that repeat themselves, like this recipe from Modern Honey that has a half-size version of the recipe listed under the normal recipe, and all of the recipes from America’s Test Kitchen. In the case of the Modern Honey recipe, one of the ingredients is entered as a blank row with only this: “▢”, so this may be a pattern throughout the dataset that can help find duplicated recipes. Note that extra entries in the ingredientX columns are all counted in the totalingredients column.
Some recipes MIGHT be AI. It would be great if someone could click through the links to data source websites below and make sure they don’t see any clearly AI generated recipes in there. If you find anything like that, send it in the club Discord!
Some ingredients might have been deleted when the datasets were joined together (and possibly some other columns too). This is more likely for datasets beginning with C.
How to fix:
Search for rows where totalingredients does not equal the number X of the last ingredientX column that is filled for that row (I’m sure that’s easier than it sounds)
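That check could be sketched like this, assuming missing ingredients show up as NA. The three rows are invented for illustration.

```r
# Invented example rows; recipe "B" claims 3 ingredients but only 2 are filled.
recipes <- data.frame(
  title = c("A", "B", "C"),
  totalingredients = c(3, 3, 2),
  ingredient1 = c("2 stick butter", "1 cup sugar", "1 egg"),
  ingredient2 = c("1 cup sugar", "2 cups flour", "1 cup flour"),
  ingredient3 = c("salt", NA, NA),
  stringsAsFactors = FALSE
)

ing_cols <- grep("^ingredient[0-9]+$", names(recipes), value = TRUE)
# Order columns by their number so ingredient10 comes after ingredient9.
ing_cols <- ing_cols[order(as.integer(sub("^ingredient", "", ing_cols)))]

# For each row, the position of the last ingredientX column that is filled.
filled <- !is.na(as.matrix(recipes[ing_cols]))
last_filled <- apply(filled, 1, function(row) if (any(row)) max(which(row)) else 0L)

# Rows where that position disagrees with totalingredients.
mismatched <- recipes[last_filled != recipes$totalingredients, ]
```

If some datasets store empty strings instead of NA, the filled matrix would need to treat "" as missing too.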
It’s also a good idea to make sure that all the sources are labeled correctly. To do this, check that every row with a given “source” shares the same first 10 or so characters of the URL in the link column.
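Here is one way that check might look. Instead of the first 10 characters, this sketch compares the domain part of each URL, which is a swapped-in variation on the idea above. The source labels and recipe paths are made up for the example; the column names source and link are the ones mentioned above.

```r
# Made-up rows: the last one is labeled "tasteofhome" but links elsewhere.
recipes <- data.frame(
  source = c("sallys", "sallys", "tasteofhome", "tasteofhome"),
  link = c(
    "https://sallysbakingaddiction.com/recipe-a",
    "https://sallysbakingaddiction.com/recipe-b",
    "https://www.tasteofhome.com/recipes/recipe-c",
    "https://sallysbakingaddiction.com/recipe-d"
  ),
  stringsAsFactors = FALSE
)

# Strip the protocol, an optional "www.", and everything after the domain.
domain <- sub("^https?://(www\\.)?([^/]+).*$", "\\2", recipes$link)

# A correctly labeled source should map to exactly one domain.
domains_per_source <- tapply(domain, recipes$source, function(d) length(unique(d)))
suspect_sources <- names(domains_per_source)[domains_per_source > 1]
```

Running table(recipes$source, domain) on the real data would show the same information as a grid, which is handy for spotting which rows are off.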
It might be a good idea to re-scrape every website to include “hours” for the prep, cook, custom and total time data…
Original Datasets
Most cleaning will be done to all the data joined into one dataset, but the problems with each are detailed here. Some parts of the cleaning may also be easier to do individually (like filtering for non cookie recipes or fixing ingredients columns - maybe)
Air Fryer Brownies | 1
Air Fryer Chocolate Chip Cookies | 1
Bagels & Doughnuts | 1
Cake & Cupcakes | 3
Candy | 2
Christmas | 10
Cookies, Brownies, & Bars | 141
Easter | 2
Father's Day | 1
Halloween | 4
Keto Peanut Butter Cookies | 1
No Bake Desserts | 6
One Pan Desserts | 2
Quick & Easy | 1
Thanksgiving | 1
Trifles & Parfaits | 1
Valentine's Day | 1
Cookies, brownies, and bars. Scraped from the cookie/brownie/bar section: https://amandascookin.com/category/recipes/desserts/cookies-brownies-and-bars
Should all be cookie recipes. Scraped from the official cookie section.
title rating ratingnum numberofsteps
Length:359 Min. :3.000 Min. : 1.00 Min. : 6.00
Class :character 1st Qu.:4.000 1st Qu.: 13.00 1st Qu.:11.00
Mode :character Median :4.500 Median : 23.00 Median :12.00
Mean :4.333 Mean : 60.79 Mean :13.34
3rd Qu.:4.500 3rd Qu.: 54.50 3rd Qu.:14.00
Max. :5.000 Max. :2611.00 Max. :34.00
totalingredients
Min. : 6.00
1st Qu.:27.00
Median :30.00
Mean :31.53
3rd Qu.:36.00
Max. :63.00
Notes
1. All ingredients & steps are repeated three (3) times!
2. “Author” column needs brief cleaning; formatted like this: Watch Video_By_Charles Kelsey_Staff Pick_Comments
3. The time and yield columns are also formatted oddly. This website doesn’t have prep/cook/total like most. The time column is basically total time, but some entries mention cooling, e.g., “1 hour, plus 20 minutes cooling” or “1 hour, plus 9 hours chilling and cooling”
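For note 2, the author name could probably be pulled out with a regular expression. This sketch assumes every author value follows the same underscore-separated pattern as the example in the note, which has not been checked against the full column. The second test value and the clean_author name are made up.

```r
# First value is the example from the note; the second is made up to show
# what happens when there is no "By_" part.
raw_author <- c("Watch Video_By_Charles Kelsey_Staff Pick_Comments", "Comments")

clean_author <- function(x) {
  has_byline <- grepl("By_", x, fixed = TRUE)
  out <- rep(NA_character_, length(x))
  # Keep only the text between "By_" and the next underscore.
  out[has_byline] <- sub("^.*By_([^_]+).*$", "\\1", x[has_byline])
  out
}

authors <- clean_author(raw_author)
```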
Should all be cookies. Scraped from the cookie category: https://www.amummytoo.co.uk/category/cookies
title rating ratingnum numberofsteps
Length:60 Min. :4.500 Min. : 1.000 Min. : 7.00
Class :character 1st Qu.:5.000 1st Qu.: 1.000 1st Qu.: 9.00
Mode :character Median :5.000 Median : 1.000 Median :10.00
Mean :4.909 Mean : 2.389 Mean :11.47
3rd Qu.:5.000 3rd Qu.: 2.000 3rd Qu.:13.00
Max. :5.000 Max. :17.000 Max. :22.00
NA's :6 NA's :6
totalingredients
Min. : 6.00
1st Qu.:14.00
Median :16.00
Mean :17.13
3rd Qu.:20.00
Max. :34.00
Bread, Cookies | 1
Breakfast | 2
cakes and bakes | 2
Cookies | 32
Cookies, Dessert, Snack | 1
Cookies, Dessert, Snacks | 1
Cookies, Desserts and sweet treats | 2
Cookies, Easter | 5
Cookies, Festive makes | 2
Cookies, halloween | 1
Dessert, Snack | 3
Desserts and sweet treats | 3
Festive makes | 1
Snack | 3
Snacks | 1

American | 40
American, British | 1
Austrian | 1
British | 11
Italian | 2
Scottish | 5
rating ratingnum numberofsteps totalingredients
Min. :3.000 Min. : 1.00 Min. : 2.000 Min. : 5.0
1st Qu.:4.740 1st Qu.: 3.00 1st Qu.: 6.000 1st Qu.:10.0
Median :4.920 Median : 6.00 Median : 8.000 Median :12.0
Mean :4.811 Mean : 19.36 Mean : 8.144 Mean :12.5
3rd Qu.:5.000 3rd Qu.: 15.00 3rd Qu.:10.000 3rd Qu.:14.0
Max. :5.000 Max. :441.00 Max. :27.000 Max. :23.0
American | 47
American, Baking, Chocolate, Cookies | 1
American, Baking, Cookies | 3
Baking, Chocolate, Cookies | 4
Baking, Chocolate, Cookies, Italian | 1
Baking, Cookies | 2
Baking, Cookies, Lemon | 1
Candy, Chocolate | 1
Cheesecake | 1
Chocolate | 1
Cookies | 40
French | 2
Italian | 1

Breakfast | 3
Dessert | 102
Should all be cookies. Scraped from the cookie section: https://bakerbynature.com/recipe-index/?_desserts=cookies
NOT all recipes are cookies. Website was scraped from a search for “cookies”.
title rating ratingnum totalingredients
Length:298 Min. :2.500 Min. :5 Min. : 1.00
Class :character 1st Qu.:4.000 1st Qu.:5 1st Qu.: 7.00
Mode :character Median :4.500 Median :5 Median : 9.00
Mean :4.354 Mean :5 Mean : 9.57
3rd Qu.:4.800 3rd Qu.:5 3rd Qu.:11.00
Max. :5.000 Max. :5 Max. :31.00
NA's :21 NA's :21
American | 55
American, Asian | 7
American, Asian, Filipino | 1
American, Asian, French | 1
American, Asian, Japanese | 2
American, Filipino | 2
American, Filipino, French | 1
American, French | 1
Asian | 4
Asian, Filipino | 1
Asian, French | 1
Filipino | 4

Dessert | 80
Should all be cookies. Scraped from the cookie category: https://bitesbybianca.com/category/dessert/cookies
title rating ratingnum totalingredients
Length:137 Length:137 Min. : 1.00 Min. : 3.0
Class :character Class :character 1st Qu.: 1.25 1st Qu.:10.0
Mode :character Mode :character Median : 4.00 Median :11.0
Mean : 10.09 Mean :11.9
3rd Qu.: 10.00 3rd Qu.:14.0
Max. :281.00 Max. :26.0
NA's :31
Should all be cookies. Scraped from cookie section: https://www.bostongirlbakes.com/category/cookies
Ingredients were scraped incorrectly, so there will be non-ingredients in the ingredientX columns, and they are counted in the totalingredients column.
There are no NAs in the ratingnum column. Instead, where rating is NA, ratingnum is 0. This is actually ideal; most of the datasets don’t do this and just have NA for both.
rating ratingnum servings numberofsteps totalingredients
Min. :3.670 Min. : 1.00 Min. :24 Min. : 4.000 Min. : 5.00
1st Qu.:4.910 1st Qu.: 7.00 1st Qu.:24 1st Qu.: 7.000 1st Qu.: 9.00
Median :5.000 Median :11.00 Median :24 Median : 9.000 Median :11.00
Mean :4.897 Mean :14.25 Mean :24 Mean : 9.731 Mean :11.23
3rd Qu.:5.000 3rd Qu.:17.00 3rd Qu.:24 3rd Qu.:11.000 3rd Qu.:13.00
Max. :5.000 Max. :72.00 Max. :24 Max. :27.000 Max. :20.00
NA's :92
American | 3
American, International | 1
Austrian | 2
Brazilian | 1
British, Scottish | 1
Eastern European | 1
Est European | 2
Filipino | 1
French | 4
French, International | 1
Greek | 6
Greek, International | 2
International | 21
Italian | 3
Portuguese | 2
Romanian | 1
Russian | 1
South American, Spanish | 1
Spanish | 1
Swiss | 1
Turkish | 3
Venezuelan | 1
Should all be cookies. Scraped from cookie section: https://www.chefspencil.com/recipe-courses/dessert/cookies
“number of steps” seems like it might be inaccurate (although this applies to any datasets with the number of steps)
rating ratingnum numberofsteps totalingredients
Min. :3.000 Min. : 1.0 Min. : 4.00 Min. : 5.00
1st Qu.:4.800 1st Qu.: 2.0 1st Qu.:11.00 1st Qu.:11.00
Median :4.960 Median : 17.0 Median :14.00 Median :15.00
Mean :4.867 Mean : 38.1 Mean :15.39 Mean :15.18
3rd Qu.:5.000 3rd Qu.: 40.0 3rd Qu.:20.50 3rd Qu.:19.00
Max. :5.000 Max. :239.0 Max. :30.00 Max. :28.00
American | 78
French | 1
Notes
- The time is taken in such a way that it doesn’t need to be re-scraped
title author rating ratingnum
Length:405 Length:405 Min. :1.50 Min. : 1.00
Class :character Class :character 1st Qu.:4.50 1st Qu.: 2.00
Mode :character Mode :character Median :4.80 Median : 4.00
Mean :4.65 Mean : 12.33
3rd Qu.:5.00 3rd Qu.: 13.00
Max. :5.00 Max. :216.00
NA's :141 NA's :141
preptime totaltime
Length:405 Length:405
Class :character Class :character
Mode :character Mode :character
Bars | 1
Breakfast, Dessert, Snack | 1
Cake | 1
Candy | 4
Cheesecake | 3
Cookie | 1
Cookie Bar | 2
Cookie Bars | 5
Cookies | 186
Desert | 1
Dessert | 110
Dog Treats | 1
Fruit | 1
Pastry | 1
Pie | 1
Snack | 1
Should all be cookies. Scraped from cookie category: https://cookiesandcups.com/recipes/cookies/
Re-scraped, but there are still some issues with the ingredients :(
Some things in the ingredientX columns aren’t ingredients, like “Cookies” or “Frosting” or “Icing”, which the author used as section headers. These non-ingredients add to the totalingredients which will need to be fixed. (Probably when cleaning the entire dataset, filter out everything that has no unit of measurement, and find things that don’t belong)
rating ratingnum numberofsteps totalingredients
Min. :4.000 Min. : 1.00 Min. : 1.000 Min. : 2.00
1st Qu.:4.980 1st Qu.: 8.00 1st Qu.: 2.000 1st Qu.: 7.00
Median :5.000 Median : 26.50 Median : 3.000 Median :12.00
Mean :4.963 Mean : 61.53 Mean : 3.336 Mean :11.89
3rd Qu.:5.000 3rd Qu.: 70.50 3rd Qu.: 4.000 3rd Qu.:16.00
Max. :5.000 Max. :610.00 Max. :10.000 Max. :34.00
NA's :2 NA's :2
Unique items in course
Appetizer, hors d’oeuvres | 1
Appetizer, Main Course, Side Dish, Soup | 1
Appetizer, Side Dish, Snack | 1
Appetizer, Snack | 3
Beverages, Drinks | 1
bread, Side Dish | 1
Breakfast | 4
Breakfast, Brunch | 1
Breakfast, Brunch, Dessert | 1
Breakfast, Brunch, Main Course | 1
Breakfast, Dessert | 2
Breakfast, Dessert, Snack | 1
Brunch, Dessert, Lunch | 1
Candy, condiment, Dessert | 1
Candy, condiment, Ingredient | 1
Candy, condiment, Snack | 1
Candy, Dessert | 1
Candy, Dessert, Snack | 1
condiment | 10
condiment, dip, Dressing, Sauce | 1
condiment, dip, glaze | 1
condiment, Seasoning | 1
condiment, Seasoning Blend | 1
condiment, Syrup | 1
Dessert | 41
Dessert, Ingredient | 1
Dessert, Main Course | 2
Dessert, Snack | 2
Dessert, Tea | 2
Dog Treats | 1
Entree, Main Dish | 1
Entree, Main Dish, Soup | 1
Ingredient | 2
Main Course | 4
Main Dish | 2
Main Dish, Stew | 1
Salad | 2
Salad, Side Dish | 1
Salad, Side Dish, Starter | 1
Seasoning Blend, Spice Mix | 1
Seasonings | 1
Side Dish | 2
Side Dish, vegetables | 1
Soup | 4
Unique items in cuisine
All | 4
All, American | 1
American | 24
American, Italian | 1
American, Southern | 1
Asian, Chinese | 1
Australian, New Zealand | 1
Austrian, French, German | 1
Austrian, German | 4
Austrian, German, Italian | 1
British | 2
British, Cornish | 1
British, english | 4
British, Scottish | 1
Chinese | 1
danish | 1
english | 2
French | 5
French, German, Italian | 1
French, Italian | 1
French, Provencal | 1
German | 20
German, Italian | 1
German, Swiss | 1
Greek | 2
Greek, Sephardic Jewish | 1
Indian | 1
Italian | 2
Jewish | 1
Mexican | 2
Middle Eastern | 1
Portuguese | 1
Welsh | 1
Not all cookies. Scraped from search results for “cookie” on the website: https://www.daringgourmet.com/?s=cookie
The ingredientX columns include things like “Day#:” (for multi-day recipes) as their own ingredient. These are counted in the totalingredients column for the affected rows. Filter for entries ending in “:” to remove these.
“number of steps” is incorrect on a decent amount of recipes because of weird website formatting
Filtering title by ‘cookie’ might not work with this dataset, but filtering by searching for “cookie” in any row might (some of the titles say “Authentic Pfeffernüsse”, but pfeffernüsse is a cookie)
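Searching every column of a row for “cookie” could be sketched like this. The two rows and the description column are made up for illustration; the real dataset’s text columns (steps, ingredients, and so on) would take the place of description.

```r
# Made-up rows; "description" is a hypothetical stand-in for any text column.
recipes <- data.frame(
  title = c("Authentic Pfeffern\u00fcsse", "German Potato Soup"),
  description = c("Traditional German spice cookies", "A hearty soup"),
  stringsAsFactors = FALSE
)

# TRUE when any column in the row mentions "cookie" (case-insensitive).
mentions_cookie <- apply(
  recipes, 1,
  function(row) any(grepl("cookie", row, ignore.case = TRUE))
)

cookie_rows <- recipes[mentions_cookie, ]
```

This keeps the Pfeffernüsse row even though its title has no cookie word. It will also keep non-cookie recipes that merely mention cookies somewhere, so the result still needs a manual pass.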
rating ratingnum actualsteps ingredient1
Min. :2.000 Min. : 1.000 Length:704 Length:704
1st Qu.:5.000 1st Qu.: 2.000 Class :character Class :character
Median :5.000 Median : 2.000 Mode :character Mode :character
Mean :4.837 Mean : 3.822
3rd Qu.:5.000 3rd Qu.: 5.000
Max. :5.000 Max. :31.000
NA's :411 NA's :411
American | 673
American, Austrian | 1
American, Danish | 1
American, Dutch | 1
American, French | 1
American, Mexican | 1
Cookies | 2
Danish | 2
dog treats | 1
French | 4
German | 1
Italian | 2
Scottish | 1
SHOULD be all cookie recipes, because the website only has cookie recipes, but this was scraped off the website’s search results/recipe index. Also some of these are dog cookies
- ingredients scraped properly
- steps included
- very few recipes have ratings
- empty “customtime” column (remove)
rating ratingnum totalingredients
Min. :4.000 Min. : 1.00 Min. : 8.00
1st Qu.:4.670 1st Qu.: 2.00 1st Qu.:10.00
Median :4.880 Median : 4.00 Median :11.00
Mean :4.791 Mean : 10.36 Mean :11.28
3rd Qu.:5.000 3rd Qu.: 8.00 3rd Qu.:12.00
Max. :5.000 Max. :105.00 Max. :16.00
Macarons | 10
Madeleines | 3

American | 8
Austrian | 1
French | 14
German | 1
Italian | 1
title rating ratingnum numberofsteps
Length:24 Min. :4.670 Min. :1.00 Min. : 3.00
Class :character 1st Qu.:5.000 1st Qu.:1.75 1st Qu.: 6.00
Mode :character Median :5.000 Median :2.00 Median : 8.50
Mean :4.967 Mean :2.70 Mean : 8.25
3rd Qu.:5.000 3rd Qu.:3.00 3rd Qu.:10.00
Max. :5.000 Max. :8.00 Max. :14.00
NA's :4 NA's :4
totalingredients
Min. : 4.00
1st Qu.: 7.75
Median :11.00
Mean :10.46
3rd Qu.:12.00
Max. :16.00
Contains cookies without “cookie” in title (“Sitto’s Date Ma’amoul”)
ratingnum cuisine servings
Min. : 1.0 Length:138 Length:138
1st Qu.: 19.0 Class :character Class :character
Median : 70.0 Mode :character Mode :character
Mean : 372.8
3rd Qu.: 238.0
Max. :6056.0
NA's :5
American | 106
American, British | 2
American, english, French | 1
American, French | 2
American, Greek | 1
American, Italian | 1
American, russian | 1
Austrian | 1
British | 1
British, Scottish | 3
French | 3
German | 4
Greek | 1
Italian | 7
Jewish | 1
latin | 1
Mexican | 1
Mexican, Swedish | 1
Need to rescrape. Some recipes are duplicated because of vector strings for prep time or similar.
Should all be cookies already - scraped from the cookie category: https://sallysbakingaddiction.com/category/desserts/cookies
description rating totaltime
Length:258 Min. :3.800 Length:258
Class :character 1st Qu.:4.600 Class :character
Mode :character Median :4.800 Mode :character
Mean :4.713
3rd Qu.:4.900
Max. :5.000
Should all be cookies. Scraped from the cookie section: https://www.tasteofhome.com/recipes/dishes-beverages/cookies
Notes:
1. Has both “author” and “recipe submitter” (and tester, but that column can be removed)
2. The “time” and “yield” categories are off. This is what the first time column looks like: Total Time:Prep: 15 min. Bake: 10 min./batch , this is what the second time column looks like: Yield:about 5 dozen Prep:15 min Cook:10 min . “Total time” wasn’t a number given on the recipes, so it will have to be calculated by adding together the other time values.
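Pulling the labeled minutes out of those strings and adding them up might look like this. The two strings are the examples from note 2; extract_minutes is a made-up helper, and it assumes every value is given in minutes (hours are not handled). Since the bake time is per batch, the sum is a lower bound on the real total time.

```r
# The two example strings from the note above.
time1 <- "Total Time:Prep: 15 min. Bake: 10 min./batch"
time2 <- "Yield:about 5 dozen Prep:15 min Cook:10 min"

# Hypothetical helper: the number of minutes that follows "<label>:".
extract_minutes <- function(x, label) {
  pattern <- paste0(label, ":\\s*([0-9]+)\\s*min")
  hit <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  vapply(hit, function(h) if (length(h)) as.numeric(h[2]) else NA_real_, numeric(1))
}

prep <- extract_minutes(time1, "Prep")
bake <- extract_minutes(time1, "Bake")
total <- prep + bake  # per-batch bake time, so this is a minimum

cook <- extract_minutes(time2, "Cook")
```

A label that is missing from a string returns NA, so rows that only have some of the times are easy to find afterwards.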
rating ratingnum servings numberofsteps
Min. :4.000 Min. : 1.00 Min. : 4.00 Min. : 5.000
1st Qu.:5.000 1st Qu.: 5.00 1st Qu.:16.00 1st Qu.: 7.000
Median :5.000 Median : 9.00 Median :22.00 Median : 9.000
Mean :4.986 Mean : 17.12 Mean :20.72 Mean : 9.198
3rd Qu.:5.000 3rd Qu.: 18.00 3rd Qu.:24.00 3rd Qu.:10.000
Max. :5.000 Max. :178.00 Max. :40.00 Max. :26.000
NA's :12 NA's :12
totalingredients
Min. : 4.00
1st Qu.: 9.00
Median :11.00
Mean :10.94
3rd Qu.:12.00
Max. :39.00
American | 65
American, Austrian | 1
American, Austrian, French | 2
American, European | 1
American, European, French | 1
American, French | 2
American, German | 2
American, Italian | 1
American, Jewish | 1
American, Mexican | 1
American, Middle Eastern | 2
Austrian | 1
British, European, UK | 1
European | 14
European, Scottish | 1
French | 8
French, Italian | 1
Indian | 1
International | 8
Israeli | 3
Italian | 2
Mediterranean, Spanish | 2
Middle Eastern | 3
Scottish | 2