In this analysis, I want to examine what factors contribute to a recipe’s success on Epicurious. We’ll build a simple model that attempts to predict a recipe’s rating based upon some common characteristics.
First, let’s read in the dataset and take a look at it.
library(tidyverse)   # read_csv(), dplyr verbs, keep(), gather(), ggplot2
library(corrplot)    # corrplot(), used for the correlation plot later

recipes <- read_csv('https://chapple-datasets.s3.amazonaws.com/epicurious.csv')
# summary(recipes)   # output omitted: the summary spans 600+ columns
A few observations here. First, that took a long time to read in because the file is quite large. Compressing the file shrinks the download, so it will read in faster. I’ve already stored a gzip-compressed copy of the same file in my AWS account (read_csv() decompresses .gz files transparently), so we’ll use that file next time to speed things along.
Second, almost all of the variables here are logical datatypes. I don’t want to write a column definition that repeats the datatype 600+ times, so I am going to read the dataset using a default column specification of type logical and then override that for the variables that have other datatypes. Let’s give that a try.
recipes <- read_csv(
  'https://chapple-datasets.s3.amazonaws.com/epicurious.csv.gz',
  col_types = cols(
    .default = 'l',
    title = 'c',
    rating = 'n',
    calories = 'n',
    protein = 'n',
    fat = 'n',
    sodium = 'n'
  )
)
# summary(recipes)   # output again omitted for length
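As an aside, creating that gzip-compressed copy is a one-liner: write_csv() compresses automatically when the output path ends in .gz. A minimal sketch, assuming a local copy of the original file (the local file names here are illustrative):

# Read the uncompressed original and write a gzip copy;
# the compression format is inferred from the .gz extension
raw <- read_csv('epicurious.csv')
write_csv(raw, 'epicurious.csv.gz')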
Now that I’ve done all of that work, I’d like to simplify my analysis. I’m going to just work with the numeric features and lop off all the logical features. This will allow me to try a simple linear regression.
recipes <- recipes %>%
  select(title, rating, calories, protein, fat, sodium)
summary(recipes)
##     title               rating         calories            protein        
##  Length:20052       Min.   :0.000   Min.   :       0   Min.   :     0.0  
##  Class :character   1st Qu.:3.750   1st Qu.:     198   1st Qu.:     3.0  
##  Mode  :character   Median :4.375   Median :     331   Median :     8.0  
##                     Mean   :3.714   Mean   :    6323   Mean   :   100.2  
##                     3rd Qu.:4.375   3rd Qu.:     586   3rd Qu.:    27.0  
##                     Max.   :5.000   Max.   :30111218   Max.   :236489.0  
##                                     NA's   :4117       NA's   :4162      
##       fat                sodium        
##  Min.   :      0.0   Min.   :       0  
##  1st Qu.:      7.0   1st Qu.:      80  
##  Median :     17.0   Median :     294  
##  Mean   :    346.9   Mean   :    6226  
##  3rd Qu.:     33.0   3rd Qu.:     711  
##  Max.   :1722763.0   Max.   :27675110  
##  NA's   :4183        NA's   :4119      
I notice that there are a lot of NA values in my dataset, and they’re going to complicate the analysis. The simplest thing I can do here is eliminate the rows containing NA values, so let’s do that.
recipes <- recipes %>%
  filter(!is.na(calories) & !is.na(protein) & !is.na(fat) & !is.na(sodium))
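For what it’s worth, tidyr’s drop_na() expresses the same filter more concisely; this equivalent alternative would work just as well:

# Equivalent: drop any row with an NA in one of these four columns
recipes %>%
  drop_na(calories, protein, fat, sodium)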
The next thing I notice in that summary is that there are some outlier values. For example, I don’t think it’s reasonable that a recipe would have 30 million calories! Let’s plot our numeric variables.
recipes %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(mapping = aes(x = key, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = 'free')
Let’s remove any rows with values that seem too high and replot.
recipes <- recipes %>%
  filter(calories < 2000 & protein < 200 & fat < 200 & sodium < 2000)
recipes %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(mapping = aes(x = key, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = 'free')
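A note on those cutoffs: they’re judgment calls read off the boxplots. A percentile-based trim is a common alternative; here’s a sketch, not used in the rest of this analysis:

# Alternative: keep only rows below each variable's 99th percentile
recipes %>%
  filter(calories < quantile(calories, 0.99),
         protein < quantile(protein, 0.99),
         fat < quantile(fat, 0.99),
         sodium < quantile(sodium, 0.99))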
OK, now I have a dataset that is fairly clean. Let’s move on and explore it some more, starting with a correlation plot.
recipes %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()
This plot shows only weak correlations between rating and the nutrition variables, which tells me that my original idea of building a simple model to predict ratings based upon nutrition facts probably won’t work. If I want to pursue the ratings prediction problem, I’d need to bring in more features or try some other techniques. The nutrition variables, on the other hand, correlate strongly with one another.
I’m going to try something else instead. Let’s see if we can predict the calories of a recipe based upon the other nutrition facts. That seems like something we should be able to do: calorie content is driven largely by macronutrients, at roughly 4 calories per gram of protein and 9 per gram of fat, so a strong linear relationship is plausible.
Now I’ll build a regression model and see how accurately it predicts the number of calories in a recipe.
Before building the model, I want to separate my dataset into a training dataset that will be used to fit the model and a test dataset that will be used to evaluate it. Evaluating on data the model has never seen tells me whether it generalizes rather than overfits. I’ll put 80% of the data into the training dataset and the remaining 20% into the test dataset.
set.seed(1842)
training_size <- floor(0.8 * nrow(recipes))
training_rows <- sample(seq_len(nrow(recipes)), size = training_size)
train <- recipes[training_rows, ]
test <- recipes[-training_rows, ]
My training dataset has 11880 rows and my test dataset has 2971 rows.
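Those counts come straight from the two data frames:

nrow(train)
## [1] 11880
nrow(test)
## [1] 2971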
OK, let’s run a simple linear regression.
model <- lm(calories ~ protein + fat + sodium, data=train)
summary(model)
##
## Call:
## lm(formula = calories ~ protein + fat + sodium, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -218.34   -90.34   -29.98    57.57  1621.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.139e+02  1.743e+00   65.36   <2e-16 ***
## protein     3.727e+00  7.191e-02   51.82   <2e-16 ***
## fat         9.554e+00  6.317e-02  151.23   <2e-16 ***
## sodium      3.254e-02  3.125e-03   10.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125 on 11876 degrees of freedom
## Multiple R-squared:  0.8414, Adjusted R-squared:  0.8413 
## F-statistic: 2.1e+04 on 3 and 11876 DF,  p-value: < 2.2e-16
That looks pretty good. It’s telling me that on my training dataset, the simple regression model I built was able to explain about 84% of the variance. The coefficients are physically sensible, too: roughly 3.7 calories per gram of protein and 9.6 per gram of fat, close to the standard 4 and 9 calories per gram.
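As a quick sanity check, here’s what the model predicts for a hypothetical recipe; the nutrition values below are made up purely for illustration:

# Hypothetical recipe: 10 g protein, 20 g fat, 500 mg sodium
new_recipe <- tibble(protein = 10, fat = 20, sodium = 500)
predict(model, new_recipe)   # ~359 calories, per the coefficients above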
And now let’s try using that model on our test dataset and see how it performs.
caloriePrediction <- predict(model, test)
error <- caloriePrediction - test$calories
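The figure quoted below is presumably the standard root-mean-squared-error calculation over those residuals:

# RMSE: square the residuals, average them, take the square root
rmse <- sqrt(mean(error^2))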
The RMSE for this model is 124.7096784, or roughly 125 calories. This seems like a reasonable prediction error for our simple model.