Burrito Perfection

We all like burritos, but what makes them so good? Could it be the meat, the size, the quality of the salsa? Here, we will use multiple regression to attempt to answer the age old question: how do I make the best burrito?

Data Prep

Many have come before us. Some even left nice datasets behind: https://srcole.github.io/100burritos/. We will use this 2016 dataset of burritos in San Diego to help us answer our question.

Importing

#Here's a package for grabbing data from a google docs sheet
#install.packages('gsheet')
library(gsheet)
library(tidyr)

Cleaning

df <- gsheet2tbl("https://docs.google.com/spreadsheets/d/18HkrklYz1bKpDLeL-kaMrGjAhUM6LeJMIACwEljCgaw/edit#gid=1703829449")

## Warning: Duplicated column names deduplicated: 'Salsa' => 'Salsa_1' [50]

#Data Prep

#Getting rid of some stuff
df[,c(1,3,4,5,6,7,8,9,12,13,27,28,29,30,31)] <- NULL 

#Changing to True/False where appropriate
df[,17:51] <- !is.na(df[,17:51])


#Tossing incomplete observations that I don't want to deal with
df <- df %>% 
  drop_na(c(Length,Circum,Volume,Cost,Hunger,overall,Meat,Temp,Fillings,Uniformity,Salsa,Wrap))



#Getting rid of a few more variables which don't have many TRUE cases
df[,c("Tomato","Taquito","Cabbage","Carrots","Fish","Lettuce","Bell peper","Salsa_1","Ham","Pineapple","Chile relleno","Nopales","Lobster","Queso","Egg","Mushroom","Bacon","Sushi","Corn","Zucchini")] <- NULL

summary(df)

##    Burrito               Cost            Hunger         Length     
##  Length:249         Min.   : 3.990   Min.   :1.00   Min.   :15.00  
##  Class :character   1st Qu.: 6.250   1st Qu.:3.00   1st Qu.:18.50  
##  Mode  :character   Median : 7.000   Median :3.50   Median :20.00  
##                     Mean   : 7.166   Mean   :3.51   Mean   :19.99  
##                     3rd Qu.: 7.890   3rd Qu.:4.00   3rd Qu.:21.50  
##                     Max.   :25.000   Max.   :5.00   Max.   :26.00  
##      Circum          Volume          Tortilla          Temp      
##  Min.   :17.00   Min.   :0.4000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:21.00   1st Qu.:0.6800   1st Qu.:3.000   1st Qu.:3.000  
##  Median :22.00   Median :0.7700   Median :3.500   Median :4.000  
##  Mean   :22.15   Mean   :0.7859   Mean   :3.514   Mean   :3.739  
##  3rd Qu.:23.00   3rd Qu.:0.8800   3rd Qu.:4.000   3rd Qu.:4.500  
##  Max.   :29.00   Max.   :1.5400   Max.   :5.000   Max.   :5.000  
##       Meat          Fillings      Meat:filling     Uniformity   
##  Min.   :1.500   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.500  
##  Median :3.800   Median :3.500   Median :3.500   Median :3.500  
##  Mean   :3.608   Mean   :3.493   Mean   :3.474   Mean   :3.348  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      Salsa          Synergy           Wrap          overall     
##  Min.   :1.000   Min.   :1.000   Min.   :0.500   Min.   :1.500  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.500   1st Qu.:3.200  
##  Median :3.500   Median :3.800   Median :4.000   Median :3.800  
##  Mean   :3.405   Mean   :3.574   Mean   :3.956   Mean   :3.589  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     Beef            Pico            Guac           Cheese       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:150       FALSE:163       FALSE:162       FALSE:157      
##  TRUE :99        TRUE :86        TRUE :87        TRUE :92       
##                                                                 
##                                                                 
##                                                                 
##    Fries         Sour cream         Pork          Chicken       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:173       FALSE:191       FALSE:219       FALSE:237      
##  TRUE :76        TRUE :58        TRUE :30        TRUE :12       
##                                                                 
##                                                                 
##                                                                 
##    Shrimp           Rice           Beans           Sauce        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:236       FALSE:229       FALSE:240       FALSE:230      
##  TRUE :13        TRUE :20        TRUE :9         TRUE :19       
##                                                                 
##                                                                 
##                                                                 
##   Cilantro         Onion          Avocado       
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:239       FALSE:238       FALSE:241      
##  TRUE :10        TRUE :11        TRUE :8        
##                                                 
##                                                 
##

After cleaning the data, we are left with 31 variables. Many variables, including the responce variable ‘overall’, are scores ranging from 1 to 5; we’ll consider these as continuous data. There are also quite a few categorical variables which detail some of what the burritos contain. With all this, we should be able to make some models.

Modeling

Let’s start out with a simple linear model which considers all of the variables we have available. The quality, represented by ‘overall’ of a burrito ranges from 1 at lowest to 5 at highest.

model <- lm(overall ~ Cost + #I didn't put the type of borrito in here: it had little effect and made lot's of extra coefficients
              Hunger +
              Length +
              Circum +
              Volume +
              Tortilla +
              Temp +
              Meat +
              Fillings +
              `Meat:filling` +
              Uniformity +
              Salsa +
              Synergy +
              Wrap +
              Beef +
              Pico +
              Guac +
              Cheese +
              Fries +
              `Sour cream` +
              Pork +
              Chicken +
              Shrimp +
              Rice +
              Beans +
              Sauce +
              Cilantro +
              Onion +
              Avocado,
            data=df)
summary(model)

## 
## Call:
## lm(formula = overall ~ Cost + Hunger + Length + Circum + Volume + 
##     Tortilla + Temp + Meat + Fillings + `Meat:filling` + Uniformity + 
##     Salsa + Synergy + Wrap + Beef + Pico + Guac + Cheese + Fries + 
##     `Sour cream` + Pork + Chicken + Shrimp + Rice + Beans + Sauce + 
##     Cilantro + Onion + Avocado, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32647 -0.16054  0.01168  0.15164  1.14428 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.168164   1.798258  -1.206 0.229233    
## Cost              0.009699   0.013125   0.739 0.460703    
## Hunger            0.061112   0.028483   2.146 0.033011 *  
## Length            0.042572   0.044434   0.958 0.339070    
## Circum            0.070036   0.080364   0.871 0.384444    
## Volume           -0.681193   1.094136  -0.623 0.534205    
## Tortilla          0.014968   0.029942   0.500 0.617649    
## Temp              0.077866   0.023237   3.351 0.000948 ***
## Meat              0.264482   0.034522   7.661 5.88e-13 ***
## Fillings          0.202161   0.038149   5.299 2.83e-07 ***
## `Meat:filling`    0.126618   0.026183   4.836 2.50e-06 ***
## Uniformity        0.062510   0.023255   2.688 0.007739 ** 
## Salsa            -0.039165   0.027835  -1.407 0.160826    
## Synergy           0.260899   0.039201   6.655 2.23e-10 ***
## Wrap              0.053747   0.020302   2.647 0.008702 ** 
## BeefTRUE         -0.066410   0.104181  -0.637 0.524497    
## PicoTRUE         -0.178837   0.063325  -2.824 0.005179 ** 
## GuacTRUE          0.098351   0.069087   1.424 0.155989    
## CheeseTRUE        0.057516   0.108505   0.530 0.596597    
## FriesTRUE         0.074103   0.098034   0.756 0.450525    
## `Sour cream`TRUE  0.008481   0.076047   0.112 0.911304    
## PorkTRUE         -0.113465   0.106840  -1.062 0.289402    
## ChickenTRUE      -0.475406   0.138285  -3.438 0.000702 ***
## ShrimpTRUE       -0.421615   0.188868  -2.232 0.026606 *  
## RiceTRUE         -0.081457   0.109092  -0.747 0.456057    
## BeansTRUE         0.052226   0.145403   0.359 0.719804    
## SauceTRUE         0.167387   0.149789   1.117 0.265011    
## CilantroTRUE      0.336391   0.216008   1.557 0.120841    
## OnionTRUE        -0.391410   0.210758  -1.857 0.064631 .  
## AvocadoTRUE       0.100948   0.197943   0.510 0.610574    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.302 on 219 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.818 
## F-statistic: 39.43 on 29 and 219 DF,  p-value: < 2.2e-16

Honestly, I didn’t expect it to work so well on the first try. Below is a plot of the fitted values against the actual ones. As you can see, they are pretty close.

plot(df$overall ~ model$fitted.values)
abline(0,1, col = "red")

Let’s tweak the model a bit by knocking out some of the not-so-significant coefficients.

model2 <- lm(overall ~ Temp +
              Meat +
              Fillings +
              `Meat:filling` +
              Uniformity +
              Synergy +
              Wrap +
              Pico +
              Chicken +
              Shrimp +
              Cilantro +
              Onion,
            data=df)
summary(model2)

## 
## Call:
## lm(formula = overall ~ Temp + Meat + Fillings + `Meat:filling` + 
##     Uniformity + Synergy + Wrap + Pico + Chicken + Shrimp + Cilantro + 
##     Onion, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5236 -0.1683  0.0088  0.1542  1.1670 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.05805    0.14318  -0.405 0.685549    
## Temp            0.08244    0.02198   3.750 0.000222 ***
## Meat            0.23189    0.03292   7.044 2.03e-11 ***
## Fillings        0.23632    0.03712   6.367 9.95e-10 ***
## `Meat:filling`  0.11429    0.02448   4.669 5.09e-06 ***
## Uniformity      0.06214    0.02249   2.763 0.006174 ** 
## Synergy         0.26988    0.03765   7.169 9.66e-12 ***
## Wrap            0.04679    0.01869   2.503 0.012997 *  
## PicoTRUE       -0.11972    0.04277  -2.799 0.005545 ** 
## ChickenTRUE    -0.39372    0.09404  -4.187 4.00e-05 ***
## ShrimpTRUE     -0.20541    0.09238  -2.224 0.027124 *  
## CilantroTRUE    0.33484    0.19374   1.728 0.085246 .  
## OnionTRUE      -0.46496    0.18838  -2.468 0.014287 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3072 on 236 degrees of freedom
## Multiple R-squared:  0.8207, Adjusted R-squared:  0.8116 
## F-statistic: 90.01 on 12 and 236 DF,  p-value: < 2.2e-16

Much simpler now, with pretty similar stats aside from the F-statistic, which went way up. Now, for this assignment, we also need a quadratic term.

model3 <- lm(overall ~ Temp +
              I(Fillings^2) + # The I() function will let you do quadratic terms specified within the lm function
              Meat +
              Fillings +
              `Meat:filling` +
              Uniformity +
              Synergy +
              Wrap +
              Pico +
              Chicken +
              Shrimp +
              Cilantro +
              Onion,
            data=df)
summary(model3)

## 
## Call:
## lm(formula = overall ~ Temp + I(Fillings^2) + Meat + Fillings + 
##     `Meat:filling` + Uniformity + Synergy + Wrap + Pico + Chicken + 
##     Shrimp + Cilantro + Onion, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55848 -0.15079 -0.00374  0.15549  1.14394 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.95801    0.28404  -3.373 0.000870 ***
## Temp            0.08164    0.02143   3.809 0.000178 ***
## I(Fillings^2)  -0.09122    0.02507  -3.638 0.000338 ***
## Meat            0.23585    0.03212   7.344 3.39e-12 ***
## Fillings        0.85260    0.17321   4.922 1.61e-06 ***
## `Meat:filling`  0.11531    0.02387   4.831 2.45e-06 ***
## Uniformity      0.05837    0.02195   2.659 0.008372 ** 
## Synergy         0.25998    0.03681   7.063 1.83e-11 ***
## Wrap            0.03330    0.01860   1.790 0.074661 .  
## PicoTRUE       -0.12706    0.04175  -3.044 0.002604 ** 
## ChickenTRUE    -0.30507    0.09488  -3.216 0.001485 ** 
## ShrimpTRUE     -0.23757    0.09050  -2.625 0.009235 ** 
## CilantroTRUE    0.21707    0.19166   1.133 0.258541    
## OnionTRUE      -0.36168    0.18586  -1.946 0.052844 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2996 on 235 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8209 
## F-statistic: 88.41 on 13 and 235 DF,  p-value: < 2.2e-16

Slightly better having added an additional term for quadratic Fillings. Now let’s add an interaction term and see what that does.

model4 <- lm(overall ~ Temp +
              I(Fillings^2) + # The I() function will let you do quadratic terms specified within the lm function
              Meat +
              Fillings +
              `Meat:filling` +
              Meat:Chicken +
              Uniformity +
              Synergy +
              Wrap +
              Pico +
              Chicken +
              Shrimp +
              Cilantro +
              Onion,
            data=df)

summary(model4)

## 
## Call:
## lm(formula = overall ~ Temp + I(Fillings^2) + Meat + Fillings + 
##     `Meat:filling` + Meat:Chicken + Uniformity + Synergy + Wrap + 
##     Pico + Chicken + Shrimp + Cilantro + Onion, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55583 -0.16063 -0.00513  0.15138  1.14148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.80548    0.28892  -2.788 0.005740 ** 
## Temp              0.08140    0.02123   3.833 0.000162 ***
## I(Fillings^2)    -0.08272    0.02511  -3.294 0.001139 ** 
## Meat              0.21911    0.03262   6.717 1.39e-10 ***
## Fillings          0.77532    0.17478   4.436 1.41e-05 ***
## `Meat:filling`    0.12343    0.02390   5.164 5.18e-07 ***
## Uniformity        0.05204    0.02192   2.374 0.018385 *  
## Synergy           0.27838    0.03731   7.461 1.68e-12 ***
## Wrap              0.03193    0.01844   1.732 0.084661 .  
## PicoTRUE         -0.11835    0.04153  -2.850 0.004767 ** 
## ChickenTRUE      -1.17196    0.38372  -3.054 0.002518 ** 
## ShrimpTRUE       -0.23006    0.08972  -2.564 0.010969 *  
## CilantroTRUE      0.18586    0.19035   0.976 0.329888    
## OnionTRUE        -0.33842    0.18440  -1.835 0.067737 .  
## Meat:ChickenTRUE  0.25674    0.11018   2.330 0.020650 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2968 on 234 degrees of freedom
## Multiple R-squared:  0.8341, Adjusted R-squared:  0.8242 
## F-statistic: 84.03 on 14 and 234 DF,  p-value: < 2.2e-16

plot(model4$fitted.values ~ df$overall)
abline(0,1,col="red")

Also very slighty better. This is the model we’re going to go with. Let’s examine some of the stats:

Starting with the residual stats above, the median is pretty close to 0, and the first and third quartiles are farirly evenly appart, which suggests that the residuals are probably normally distributed. We’ll look into that more in a bit though.

The adjusted R-quared is 0.824, which implies that our model accounts for 82.4% of the vaiation in the quality of the burritos, adjusted for the number of variables we’re using. 82.4% is a pretty close fit.

The F-Statistic is greater than the critical F value for F here (1.734) at a signicance level of 0.05, which suggests the model is significant. Speaking of which, the p-value is as close to 0 as R will show, which is another good sign of significance.

As for the coefficients, it seems the biggest factor is the quality of the fillings (makes sense). Also, chicken burritos are relatively bad compared to others. Most of the variables are fairly significant, with p-values under 0.05 for all coefficients but wrap quality, and whether or not there are cilantro or onions.

Reiduals Analysis and Linear Assumptions

Let’s see if our assumptions for linear regression were met.

First off, I did look at each independent variable compared to the responce variable, and if there was any meaningful relationship, it appeared linear (I’ll spare you the 30 or so scatterplots).

As for independence, some burritos are from the same shop, and some are reviewed by the same reviewer. However, I’m willing to consider those as sort-of ‘hidden’ independent variables which affect the quality directy, instead of affecting other observations. It is possible that the amount of burritos reviews had (especially if they had several in one sitting) could affect observations, but I suspect that wasn’t generally the case, and there are many different reviewers

Now, let’s do some residual analysis to verify homoscedasticity and normality among residuals.

Below are plots of residuals vs fitted values, and a histogram of residuals

plot(model4, which = 1)

hist(model4$residuals, breaks = 20)

In the reiduals vs fitted plot, a few outliers are marked, but there is no discernable pattern to the residuals except between values 3.5 and 4.5. I’m not sure why it looks like that, perhaps it’s influence from one of the variables I removed, but overall I’d say things look fairly homoscedastic.

In the histogram, we can see that the residuals are pretty much normally distributed around 0; just as they should be.

Below is a Q-Q plot; another way we can examine residual normality

plot(model4, which = 2)

A fit along the dotted line is normal. We can see that it fits for the most part, but does tend to trail off a bi at either end. What this means is that we might have few more extreme values than we should given data from a normal distribution. I wonder if this is people ranking burritos as super high and super low, which is somthing known to happen with reviews whether or not it makes quantatative sense.

Let’s now look at a scale-location plot, which is another way in which we can gauge scedasticity. It’s pretty similar to a residuals vs fitted plot, but exemplifies it a bit more.

plot(model4, which = 3)

A horizontal red line with variance around it that does not change across fitted values implies homoscedasticity. One thing of note is there are definitely more values around moderate fitted values than extreme ones, but this is not nessicarily bad. My qualitative judgement of this is that it’s not perfectly homoscedastic, but is close enough.

Finally, let’s check if there are any datapoints that have undue influence over our regression.

plot(model4, which = 5)

In short, if a point has a Cook’s distance value over around 0.5, it’s having a large influence over the fit, and might be subject for removal. However, there do not appear to be any such datapoints.

Conclusions

OVerall, I would say that it is possible to model burrito quality with this dataset, and that linear regression is appropriate. Now we all know what kind of burrito we need to make next.

summary(model4)

## 
## Call:
## lm(formula = overall ~ Temp + I(Fillings^2) + Meat + Fillings + 
##     `Meat:filling` + Meat:Chicken + Uniformity + Synergy + Wrap + 
##     Pico + Chicken + Shrimp + Cilantro + Onion, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55583 -0.16063 -0.00513  0.15138  1.14148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.80548    0.28892  -2.788 0.005740 ** 
## Temp              0.08140    0.02123   3.833 0.000162 ***
## I(Fillings^2)    -0.08272    0.02511  -3.294 0.001139 ** 
## Meat              0.21911    0.03262   6.717 1.39e-10 ***
## Fillings          0.77532    0.17478   4.436 1.41e-05 ***
## `Meat:filling`    0.12343    0.02390   5.164 5.18e-07 ***
## Uniformity        0.05204    0.02192   2.374 0.018385 *  
## Synergy           0.27838    0.03731   7.461 1.68e-12 ***
## Wrap              0.03193    0.01844   1.732 0.084661 .  
## PicoTRUE         -0.11835    0.04153  -2.850 0.004767 ** 
## ChickenTRUE      -1.17196    0.38372  -3.054 0.002518 ** 
## ShrimpTRUE       -0.23006    0.08972  -2.564 0.010969 *  
## CilantroTRUE      0.18586    0.19035   0.976 0.329888    
## OnionTRUE        -0.33842    0.18440  -1.835 0.067737 .  
## Meat:ChickenTRUE  0.25674    0.11018   2.330 0.020650 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2968 on 234 degrees of freedom
## Multiple R-squared:  0.8341, Adjusted R-squared:  0.8242 
## F-statistic: 84.03 on 14 and 234 DF,  p-value: < 2.2e-16

TLDR: \(Burrito Quality^*= -0.80548 + 0.081Temp^* -0.083Fillings^{*2}+0.775Fillings^* +0.219Meat ^* \\ +0.123MeatToFillingRatio+0.052Uniformity^* +0.278Synergy^*+0.032Wrap^* \\ -0.118PicoDeGallo^{**}-1.172Chicken^{**}-0.230Shrimp^{**}+0.186Cilantro^{**}-0.338Onion^{**} \\ +0.257Meat:Chicken^{***}\)

\(*\) Quality assessment from 1 to 5

\(**\) 1 or 0 categorical variable

\(***\) Interaction term

How to Make the Perfect Burrito

David Blumenstiel

11/16/2020