In this analysis, I want to examine what factors contribute to a recipe’s success on Epicurious. We’ll build a simple model that attempts to predict a recipe’s rating based upon some common characteristics.

Dataset Cleaning

First, let’s read in the dataset and take a look at it.

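library(tidyverse)  # read_csv, dplyr verbs, keep, gather, ggplot2
library(corrplot)   # corrplot, used for the correlation plot later
recipes <- read_csv('https://chapple-datasets.s3.amazonaws.com/epicurious.csv')
#summary(recipes)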

A few observations here. First, that took a long time to read in because the file is quite large. A compressed copy will read in faster, and read_csv decompresses .gz files automatically. I’ve already stored a gzipped copy of the same file in my AWS account, so we’ll use that file next time to speed things along.

Second, almost all of the variables here are logical datatypes. I don’t want to write a column definition that repeats the datatype 600+ times, so I am going to read the dataset using a default column specification of type logical and then override that for the variables that have other datatypes. Let’s give that a try.

recipes <- read_csv('https://chapple-datasets.s3.amazonaws.com/epicurious.csv.gz', col_types = cols(.default = "l", title='c', rating='n', calories='n', protein='n', fat='n', sodium='n'))
#summary(recipes)
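
To see how a default-plus-overrides column specification behaves, here is a minimal sketch on a tiny made-up file. The file contents and the column names vegan and dessert are hypothetical stand-ins for the real dataset’s logical tag columns.

```r
library(readr)

# Hypothetical three-recipe file standing in for the real dataset
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "title,rating,calories,vegan,dessert",
  "Lentil Soup,4.375,331,TRUE,FALSE",
  "Apple Pie,3.75,586,FALSE,TRUE",
  "Green Salad,5,198,TRUE,FALSE"
), sample_path)

# Every column defaults to logical ("l") unless it is named explicitly
sample_recipes <- read_csv(
  sample_path,
  col_types = cols(.default = "l", title = "c", rating = "n", calories = "n")
)

# Column classes after parsing
sapply(sample_recipes, class)
```

Only the three named columns need their own entry; every other column picks up the logical default, which is what keeps the specification short for a 600+ column file.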

Trimming the Dataset

Now that I’ve done all of that work, I’d like to simplify my analysis. I’m going to just work with the numeric features and lop off all the logical features. This will allow me to try a simple linear regression.

recipes <- recipes %>%
  select(title,rating,calories,protein,fat,sodium)
summary(recipes)
##     title               rating         calories           protein        
##  Length:20052       Min.   :0.000   Min.   :       0   Min.   :     0.0  
##  Class :character   1st Qu.:3.750   1st Qu.:     198   1st Qu.:     3.0  
##  Mode  :character   Median :4.375   Median :     331   Median :     8.0  
##                     Mean   :3.714   Mean   :    6323   Mean   :   100.2  
##                     3rd Qu.:4.375   3rd Qu.:     586   3rd Qu.:    27.0  
##                     Max.   :5.000   Max.   :30111218   Max.   :236489.0  
##                                     NA's   :4117       NA's   :4162      
##       fat                sodium        
##  Min.   :      0.0   Min.   :       0  
##  1st Qu.:      7.0   1st Qu.:      80  
##  Median :     17.0   Median :     294  
##  Mean   :    346.9   Mean   :    6226  
##  3rd Qu.:     33.0   3rd Qu.:     711  
##  Max.   :1722763.0   Max.   :27675110  
##  NA's   :4183        NA's   :4119

I notice that there are a lot of NA values in my dataset: each nutrition column is missing for more than 4,100 of the 20,052 recipes. Those will complicate the regression, so the simplest thing I can do is eliminate the rows with NA values. Let’s do that.

recipes <- recipes %>%
  filter(!is.na(calories) & !is.na(protein) & !is.na(fat) & !is.na(sodium))
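
The same filter can be written more compactly with drop_na() from tidyr, which is loaded as part of the tidyverse. Here is a small sketch on a made-up data frame (the values are invented for illustration) comparing the two approaches.

```r
library(dplyr)
library(tidyr)

# Made-up data with a few missing nutrition values
toy <- tibble(
  title    = c("a", "b", "c", "d"),
  calories = c(100, NA, 300, 400),
  protein  = c(5, 10, NA, 20),
  fat      = c(1, 2, 3, 4),
  sodium   = c(50, 60, 70, NA)
)

# Explicit approach: keep rows where every nutrition column is present
kept_filter <- toy %>%
  filter(!is.na(calories) & !is.na(protein) & !is.na(fat) & !is.na(sodium))

# Compact approach: drop rows with an NA in any of the listed columns
kept_drop_na <- toy %>%
  drop_na(calories, protein, fat, sodium)

# Compare the two results and count the rows that were removed
all.equal(kept_filter, kept_drop_na)
nrow(toy) - nrow(kept_drop_na)
```

Counting the removed rows is worth doing on the real data too, since dropping incomplete rows can discard a sizable share of the recipes.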

Outliers

The next thing I notice in that summary is that there are some outlier values. For example, I don’t think it’s reasonable that a recipe would have 30 million calories! Let’s plot our numeric variables.

recipes %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(mapping=aes(x=key, y=value)) +
  geom_boxplot() + 
  facet_wrap(~ key, scales='free')

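Let’s remove any rows with implausibly high values and replot. I’ll keep only recipes with fewer than 2,000 calories, protein and fat below 200, and sodium below 2,000.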

recipes <- recipes %>%
  filter(calories < 2000 & protein < 200 & fat < 200 & sodium < 2000)

recipes %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(mapping=aes(x=key, y=value)) +
  geom_boxplot() + 
  facet_wrap(~ key, scales='free')
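
Before committing to cutoffs like these, it can help to check how many rows each one would remove on its own. Here is a rough sketch on made-up values; the toy data frame is invented, and the cutoffs vector simply mirrors the thresholds used in the filter.

```r
# Made-up data; the last two rows contain implausible values
toy <- data.frame(
  calories = c(250, 480, 1200, 30111218, 600),
  protein  = c(8, 22, 45, 90, 236489),
  fat      = c(10, 17, 60, 150, 30),
  sodium   = c(300, 700, 1500, 900, 27675110)
)

# Cutoffs mirroring the filter thresholds
cutoffs <- c(calories = 2000, protein = 200, fat = 200, sodium = 2000)

# For each column, count the rows at or above its cutoff
rows_removed <- sapply(
  names(cutoffs),
  function(col) sum(toy[[col]] >= cutoffs[[col]])
)
rows_removed
```

A cutoff that removes a large fraction of rows is probably trimming legitimate recipes rather than data-entry errors, which would be a reason to loosen it.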

Exploring the Dataset

OK, now I have a dataset that is fairly clean. Let’s move on and explore it some more. Let’s see what the correlation plot looks like for this dataset.

recipes %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()

This plot tells me that my original idea of predicting ratings from basic nutrition facts probably won’t work, because rating shows little correlation with any of them. If I want to pursue the ratings prediction problem, I’d need to bring in more features or try some other techniques.

I’m going to try something else instead. Let’s see if we can predict the calories of a recipe based upon the other nutrition facts. That seems like something that we should be able to do.
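
To put numbers behind a correlation plot, you can print the relevant rows of the correlation matrix directly. The sketch below uses simulated data, not the Epicurious values: calories is built from protein and fat, and rating is generated independently, purely to show the pattern of calls.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: calories depends on protein and fat, rating on nothing
sim <- data.frame(
  protein = runif(n, 0, 50),
  fat     = runif(n, 0, 60)
)
sim$calories <- 4 * sim$protein + 9 * sim$fat + rnorm(n, sd = 50)
sim$rating   <- runif(n, 0, 5)

# Correlation matrix, rounded for readability
cor_matrix <- round(cor(sim), 2)

# Rows of interest: how each candidate target relates to the other variables
cor_matrix[c("rating", "calories"), ]
```

On the real data, the equivalent call would take the numeric columns of recipes in place of sim.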

Building a Regression Model

Now I’ll build my regression model and see how accurately it predicts the number of calories in a recipe.

Separating test and training data

Before we start building a model, I want to separate my dataset into a training dataset that will be used to create the model and a test dataset that will be used to evaluate it. Evaluating on data the model has never seen lets me detect overfitting. I’ll put 80% of the data into the training dataset and the other 20% into a testing dataset.

set.seed(1842)
training_size <- floor(0.8 * nrow(recipes))  # round down to a whole number of rows

training_rows <- sample(seq_len(nrow(recipes)), size = training_size)
train <- recipes[training_rows,]
test <- recipes[-training_rows,]

My training dataset has 11880 rows and my test dataset has 2971 rows.
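
A quick sanity check on a split like this is to confirm that the two pieces are disjoint and together cover every row. Here is a sketch on a made-up data frame of 10 rows; toy and its columns are illustrative only.

```r
set.seed(1842)

# Made-up data frame standing in for recipes
toy <- data.frame(id = 1:10, calories = seq(100, 1000, by = 100))

# Sample 80% of the row indices for training, keep the rest for testing
training_size <- floor(0.8 * nrow(toy))
training_rows <- sample(seq_len(nrow(toy)), size = training_size)
toy_train <- toy[training_rows, ]
toy_test  <- toy[-training_rows, ]

# The two sets should not share rows and should cover the full data
length(intersect(toy_train$id, toy_test$id)) == 0
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```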

Creating the model

OK, let’s fit a linear regression with protein, fat, and sodium as predictors.

model <- lm(calories ~ protein + fat + sodium, data=train)
summary(model)
## 
## Call:
## lm(formula = calories ~ protein + fat + sodium, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218.34  -90.34  -29.98   57.57 1621.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.139e+02  1.743e+00   65.36   <2e-16 ***
## protein     3.727e+00  7.191e-02   51.82   <2e-16 ***
## fat         9.554e+00  6.317e-02  151.23   <2e-16 ***
## sodium      3.254e-02  3.125e-03   10.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125 on 11876 degrees of freedom
## Multiple R-squared:  0.8414, Adjusted R-squared:  0.8413 
## F-statistic: 2.1e+04 on 3 and 11876 DF,  p-value: < 2.2e-16

That looks pretty good. On my training dataset, the regression model explains about 84% of the variance in calories (an R-squared of 0.8414).
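
For reference, the Multiple R-squared that summary() reports is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces it by hand on simulated data; the sim data frame is invented, and only the formula carries over.

```r
set.seed(42)
n <- 100

# Simulated predictors and response, for illustration only
sim <- data.frame(protein = runif(n, 0, 50), fat = runif(n, 0, 60))
sim$calories <- 100 + 4 * sim$protein + 9 * sim$fat + rnorm(n, sd = 40)

fit <- lm(calories ~ protein + fat, data = sim)

# R-squared = 1 - SS_residual / SS_total
ss_residual <- sum(residuals(fit)^2)
ss_total    <- sum((sim$calories - mean(sim$calories))^2)
r_squared   <- 1 - ss_residual / ss_total

# Compare the hand-computed value with the one reported by summary()
all.equal(r_squared, summary(fit)$r.squared)
```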

Evaluating the model

And now let’s try using that model on our test dataset and see how it performs.

caloriePrediction <- predict(model, test)
error <- caloriePrediction - test$calories
rmse <- sqrt(mean(error^2))  # root mean squared error on the test set

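The RMSE for this model is about 124.7 calories. That is in line with the residual standard error of 125 on the training data, so the model doesn’t appear to be overfit. This seems like a reasonable calorie prediction error for our simple model.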
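
RMSE is the square root of the mean squared prediction error. Here is a minimal sketch with a hypothetical compute_rmse() helper and made-up numbers. It also computes the RMSE of a naive baseline that always predicts the mean, which gives a sense of scale for judging whether an error is "reasonable".

```r
# Hypothetical helper: root mean squared error between predictions and actuals
compute_rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Made-up values for illustration
actual    <- c(110, 190, 330, 420)
predicted <- c(100, 200, 300, 400)

# Error of the predictions versus error of always guessing the mean
model_rmse    <- compute_rmse(predicted, actual)
baseline_rmse <- compute_rmse(rep(mean(actual), length(actual)), actual)

model_rmse
baseline_rmse
```

On the real data, the same comparison would use caloriePrediction and test$calories in place of the made-up vectors.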