Burrito Perfection
We all like burritos, but what makes them so good? Could it be the meat, the size, the quality of the salsa? Here, we will use multiple regression to attempt to answer the age old question: how do I make the best burrito?
Data Prep
Many have come before us. Some even left nice datasets behind: https://srcole.github.io/100burritos/. We will use this 2016 dataset of burritos in San Diego to help us answer our question.
Importing
Cleaning
## Warning: Duplicated column names deduplicated: 'Salsa' => 'Salsa_1' [50]
#Data Prep
#Getting rid of some stuff
df[,c(1,3,4,5,6,7,8,9,12,13,27,28,29,30,31)] <- NULL
#Changing to True/False where appropriate
df[,17:51] <- !is.na(df[,17:51])
#Tossing incomplete observations that I don't want to deal with
df <- df %>%
drop_na(c(Length,Circum,Volume,Cost,Hunger,overall,Meat,Temp,Fillings,Uniformity,Salsa,Wrap))
#Getting rid of a few more variables which don't have many TRUE cases
df[,c("Tomato","Taquito","Cabbage","Carrots","Fish","Lettuce","Bell peper","Salsa_1","Ham","Pineapple","Chile relleno","Nopales","Lobster","Queso","Egg","Mushroom","Bacon","Sushi","Corn","Zucchini")] <- NULL
summary(df)
## Burrito Cost Hunger Length
## Length:249 Min. : 3.990 Min. :1.00 Min. :15.00
## Class :character 1st Qu.: 6.250 1st Qu.:3.00 1st Qu.:18.50
## Mode :character Median : 7.000 Median :3.50 Median :20.00
## Mean : 7.166 Mean :3.51 Mean :19.99
## 3rd Qu.: 7.890 3rd Qu.:4.00 3rd Qu.:21.50
## Max. :25.000 Max. :5.00 Max. :26.00
## Circum Volume Tortilla Temp
## Min. :17.00 Min. :0.4000 Min. :1.000 Min. :1.000
## 1st Qu.:21.00 1st Qu.:0.6800 1st Qu.:3.000 1st Qu.:3.000
## Median :22.00 Median :0.7700 Median :3.500 Median :4.000
## Mean :22.15 Mean :0.7859 Mean :3.514 Mean :3.739
## 3rd Qu.:23.00 3rd Qu.:0.8800 3rd Qu.:4.000 3rd Qu.:4.500
## Max. :29.00 Max. :1.5400 Max. :5.000 Max. :5.000
## Meat Fillings Meat:filling Uniformity
## Min. :1.500 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.500
## Median :3.800 Median :3.500 Median :3.500 Median :3.500
## Mean :3.608 Mean :3.493 Mean :3.474 Mean :3.348
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Salsa Synergy Wrap overall
## Min. :1.000 Min. :1.000 Min. :0.500 Min. :1.500
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.500 1st Qu.:3.200
## Median :3.500 Median :3.800 Median :4.000 Median :3.800
## Mean :3.405 Mean :3.574 Mean :3.956 Mean :3.589
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Beef Pico Guac Cheese
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:150 FALSE:163 FALSE:162 FALSE:157
## TRUE :99 TRUE :86 TRUE :87 TRUE :92
##
##
##
## Fries Sour cream Pork Chicken
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:173 FALSE:191 FALSE:219 FALSE:237
## TRUE :76 TRUE :58 TRUE :30 TRUE :12
##
##
##
## Shrimp Rice Beans Sauce
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:236 FALSE:229 FALSE:240 FALSE:230
## TRUE :13 TRUE :20 TRUE :9 TRUE :19
##
##
##
## Cilantro Onion Avocado
## Mode :logical Mode :logical Mode :logical
## FALSE:239 FALSE:238 FALSE:241
## TRUE :10 TRUE :11 TRUE :8
##
##
##
After cleaning the data, we are left with 31 variables. Many variables, including the responce variable ‘overall’, are scores ranging from 1 to 5; we’ll consider these as continuous data. There are also quite a few categorical variables which detail some of what the burritos contain. With all this, we should be able to make some models.
Modeling
Let’s start out with a simple linear model which considers all of the variables we have available. The quality, represented by ‘overall’ of a burrito ranges from 1 at lowest to 5 at highest.
model <- lm(overall ~ Cost + #I didn't put the type of borrito in here: it had little effect and made lot's of extra coefficients
Hunger +
Length +
Circum +
Volume +
Tortilla +
Temp +
Meat +
Fillings +
`Meat:filling` +
Uniformity +
Salsa +
Synergy +
Wrap +
Beef +
Pico +
Guac +
Cheese +
Fries +
`Sour cream` +
Pork +
Chicken +
Shrimp +
Rice +
Beans +
Sauce +
Cilantro +
Onion +
Avocado,
data=df)
summary(model)
##
## Call:
## lm(formula = overall ~ Cost + Hunger + Length + Circum + Volume +
## Tortilla + Temp + Meat + Fillings + `Meat:filling` + Uniformity +
## Salsa + Synergy + Wrap + Beef + Pico + Guac + Cheese + Fries +
## `Sour cream` + Pork + Chicken + Shrimp + Rice + Beans + Sauce +
## Cilantro + Onion + Avocado, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32647 -0.16054 0.01168 0.15164 1.14428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.168164 1.798258 -1.206 0.229233
## Cost 0.009699 0.013125 0.739 0.460703
## Hunger 0.061112 0.028483 2.146 0.033011 *
## Length 0.042572 0.044434 0.958 0.339070
## Circum 0.070036 0.080364 0.871 0.384444
## Volume -0.681193 1.094136 -0.623 0.534205
## Tortilla 0.014968 0.029942 0.500 0.617649
## Temp 0.077866 0.023237 3.351 0.000948 ***
## Meat 0.264482 0.034522 7.661 5.88e-13 ***
## Fillings 0.202161 0.038149 5.299 2.83e-07 ***
## `Meat:filling` 0.126618 0.026183 4.836 2.50e-06 ***
## Uniformity 0.062510 0.023255 2.688 0.007739 **
## Salsa -0.039165 0.027835 -1.407 0.160826
## Synergy 0.260899 0.039201 6.655 2.23e-10 ***
## Wrap 0.053747 0.020302 2.647 0.008702 **
## BeefTRUE -0.066410 0.104181 -0.637 0.524497
## PicoTRUE -0.178837 0.063325 -2.824 0.005179 **
## GuacTRUE 0.098351 0.069087 1.424 0.155989
## CheeseTRUE 0.057516 0.108505 0.530 0.596597
## FriesTRUE 0.074103 0.098034 0.756 0.450525
## `Sour cream`TRUE 0.008481 0.076047 0.112 0.911304
## PorkTRUE -0.113465 0.106840 -1.062 0.289402
## ChickenTRUE -0.475406 0.138285 -3.438 0.000702 ***
## ShrimpTRUE -0.421615 0.188868 -2.232 0.026606 *
## RiceTRUE -0.081457 0.109092 -0.747 0.456057
## BeansTRUE 0.052226 0.145403 0.359 0.719804
## SauceTRUE 0.167387 0.149789 1.117 0.265011
## CilantroTRUE 0.336391 0.216008 1.557 0.120841
## OnionTRUE -0.391410 0.210758 -1.857 0.064631 .
## AvocadoTRUE 0.100948 0.197943 0.510 0.610574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.302 on 219 degrees of freedom
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.818
## F-statistic: 39.43 on 29 and 219 DF, p-value: < 2.2e-16
Honestly, I didn’t expect it to work so well on the first try. Below is a plot of the fitted values against the actual ones. As you can see, they are pretty close.
Let’s tweak the model a bit by knocking out some of the not-so-significant coefficients.
model2 <- lm(overall ~ Temp +
Meat +
Fillings +
`Meat:filling` +
Uniformity +
Synergy +
Wrap +
Pico +
Chicken +
Shrimp +
Cilantro +
Onion,
data=df)
summary(model2)
##
## Call:
## lm(formula = overall ~ Temp + Meat + Fillings + `Meat:filling` +
## Uniformity + Synergy + Wrap + Pico + Chicken + Shrimp + Cilantro +
## Onion, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5236 -0.1683 0.0088 0.1542 1.1670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05805 0.14318 -0.405 0.685549
## Temp 0.08244 0.02198 3.750 0.000222 ***
## Meat 0.23189 0.03292 7.044 2.03e-11 ***
## Fillings 0.23632 0.03712 6.367 9.95e-10 ***
## `Meat:filling` 0.11429 0.02448 4.669 5.09e-06 ***
## Uniformity 0.06214 0.02249 2.763 0.006174 **
## Synergy 0.26988 0.03765 7.169 9.66e-12 ***
## Wrap 0.04679 0.01869 2.503 0.012997 *
## PicoTRUE -0.11972 0.04277 -2.799 0.005545 **
## ChickenTRUE -0.39372 0.09404 -4.187 4.00e-05 ***
## ShrimpTRUE -0.20541 0.09238 -2.224 0.027124 *
## CilantroTRUE 0.33484 0.19374 1.728 0.085246 .
## OnionTRUE -0.46496 0.18838 -2.468 0.014287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3072 on 236 degrees of freedom
## Multiple R-squared: 0.8207, Adjusted R-squared: 0.8116
## F-statistic: 90.01 on 12 and 236 DF, p-value: < 2.2e-16
Much simpler now, with pretty similar stats aside from the F-statistic, which went way up. Now, for this assignment, we also need a quadratic term.
model3 <- lm(overall ~ Temp +
I(Fillings^2) + # The I() function will let you do quadratic terms specified within the lm function
Meat +
Fillings +
`Meat:filling` +
Uniformity +
Synergy +
Wrap +
Pico +
Chicken +
Shrimp +
Cilantro +
Onion,
data=df)
summary(model3)
##
## Call:
## lm(formula = overall ~ Temp + I(Fillings^2) + Meat + Fillings +
## `Meat:filling` + Uniformity + Synergy + Wrap + Pico + Chicken +
## Shrimp + Cilantro + Onion, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55848 -0.15079 -0.00374 0.15549 1.14394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.95801 0.28404 -3.373 0.000870 ***
## Temp 0.08164 0.02143 3.809 0.000178 ***
## I(Fillings^2) -0.09122 0.02507 -3.638 0.000338 ***
## Meat 0.23585 0.03212 7.344 3.39e-12 ***
## Fillings 0.85260 0.17321 4.922 1.61e-06 ***
## `Meat:filling` 0.11531 0.02387 4.831 2.45e-06 ***
## Uniformity 0.05837 0.02195 2.659 0.008372 **
## Synergy 0.25998 0.03681 7.063 1.83e-11 ***
## Wrap 0.03330 0.01860 1.790 0.074661 .
## PicoTRUE -0.12706 0.04175 -3.044 0.002604 **
## ChickenTRUE -0.30507 0.09488 -3.216 0.001485 **
## ShrimpTRUE -0.23757 0.09050 -2.625 0.009235 **
## CilantroTRUE 0.21707 0.19166 1.133 0.258541
## OnionTRUE -0.36168 0.18586 -1.946 0.052844 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2996 on 235 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8209
## F-statistic: 88.41 on 13 and 235 DF, p-value: < 2.2e-16
Slightly better having added an additional term for quadratic Fillings. Now let’s add an interaction term and see what that does.
model4 <- lm(overall ~ Temp +
I(Fillings^2) + # The I() function will let you do quadratic terms specified within the lm function
Meat +
Fillings +
`Meat:filling` +
Meat:Chicken +
Uniformity +
Synergy +
Wrap +
Pico +
Chicken +
Shrimp +
Cilantro +
Onion,
data=df)
summary(model4)
##
## Call:
## lm(formula = overall ~ Temp + I(Fillings^2) + Meat + Fillings +
## `Meat:filling` + Meat:Chicken + Uniformity + Synergy + Wrap +
## Pico + Chicken + Shrimp + Cilantro + Onion, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55583 -0.16063 -0.00513 0.15138 1.14148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.80548 0.28892 -2.788 0.005740 **
## Temp 0.08140 0.02123 3.833 0.000162 ***
## I(Fillings^2) -0.08272 0.02511 -3.294 0.001139 **
## Meat 0.21911 0.03262 6.717 1.39e-10 ***
## Fillings 0.77532 0.17478 4.436 1.41e-05 ***
## `Meat:filling` 0.12343 0.02390 5.164 5.18e-07 ***
## Uniformity 0.05204 0.02192 2.374 0.018385 *
## Synergy 0.27838 0.03731 7.461 1.68e-12 ***
## Wrap 0.03193 0.01844 1.732 0.084661 .
## PicoTRUE -0.11835 0.04153 -2.850 0.004767 **
## ChickenTRUE -1.17196 0.38372 -3.054 0.002518 **
## ShrimpTRUE -0.23006 0.08972 -2.564 0.010969 *
## CilantroTRUE 0.18586 0.19035 0.976 0.329888
## OnionTRUE -0.33842 0.18440 -1.835 0.067737 .
## Meat:ChickenTRUE 0.25674 0.11018 2.330 0.020650 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2968 on 234 degrees of freedom
## Multiple R-squared: 0.8341, Adjusted R-squared: 0.8242
## F-statistic: 84.03 on 14 and 234 DF, p-value: < 2.2e-16
Also very slighty better. This is the model we’re going to go with. Let’s examine some of the stats:
Starting with the residual stats above, the median is pretty close to 0, and the first and third quartiles are farirly evenly appart, which suggests that the residuals are probably normally distributed. We’ll look into that more in a bit though.
The adjusted R-quared is 0.824, which implies that our model accounts for 82.4% of the vaiation in the quality of the burritos, adjusted for the number of variables we’re using. 82.4% is a pretty close fit.
The F-Statistic is greater than the critical F value for F here (1.734) at a signicance level of 0.05, which suggests the model is significant. Speaking of which, the p-value is as close to 0 as R will show, which is another good sign of significance.
As for the coefficients, it seems the biggest factor is the quality of the fillings (makes sense). Also, chicken burritos are relatively bad compared to others. Most of the variables are fairly significant, with p-values under 0.05 for all coefficients but wrap quality, and whether or not there are cilantro or onions.
Reiduals Analysis and Linear Assumptions
Let’s see if our assumptions for linear regression were met.
First off, I did look at each independent variable compared to the responce variable, and if there was any meaningful relationship, it appeared linear (I’ll spare you the 30 or so scatterplots).
As for independence, some burritos are from the same shop, and some are reviewed by the same reviewer. However, I’m willing to consider those as sort-of ‘hidden’ independent variables which affect the quality directy, instead of affecting other observations. It is possible that the amount of burritos reviews had (especially if they had several in one sitting) could affect observations, but I suspect that wasn’t generally the case, and there are many different reviewers
Now, let’s do some residual analysis to verify homoscedasticity and normality among residuals.
Below are plots of residuals vs fitted values, and a histogram of residuals
In the reiduals vs fitted plot, a few outliers are marked, but there is no discernable pattern to the residuals except between values 3.5 and 4.5. I’m not sure why it looks like that, perhaps it’s influence from one of the variables I removed, but overall I’d say things look fairly homoscedastic.
In the histogram, we can see that the residuals are pretty much normally distributed around 0; just as they should be.
Below is a Q-Q plot; another way we can examine residual normality
A fit along the dotted line is normal. We can see that it fits for the most part, but does tend to trail off a bi at either end. What this means is that we might have few more extreme values than we should given data from a normal distribution. I wonder if this is people ranking burritos as super high and super low, which is somthing known to happen with reviews whether or not it makes quantatative sense.
Let’s now look at a scale-location plot, which is another way in which we can gauge scedasticity. It’s pretty similar to a residuals vs fitted plot, but exemplifies it a bit more.
A horizontal red line with variance around it that does not change across fitted values implies homoscedasticity. One thing of note is there are definitely more values around moderate fitted values than extreme ones, but this is not nessicarily bad. My qualitative judgement of this is that it’s not perfectly homoscedastic, but is close enough.
Finally, let’s check if there are any datapoints that have undue influence over our regression.
In short, if a point has a Cook’s distance value over around 0.5, it’s having a large influence over the fit, and might be subject for removal. However, there do not appear to be any such datapoints.
Conclusions
OVerall, I would say that it is possible to model burrito quality with this dataset, and that linear regression is appropriate. Now we all know what kind of burrito we need to make next.
##
## Call:
## lm(formula = overall ~ Temp + I(Fillings^2) + Meat + Fillings +
## `Meat:filling` + Meat:Chicken + Uniformity + Synergy + Wrap +
## Pico + Chicken + Shrimp + Cilantro + Onion, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55583 -0.16063 -0.00513 0.15138 1.14148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.80548 0.28892 -2.788 0.005740 **
## Temp 0.08140 0.02123 3.833 0.000162 ***
## I(Fillings^2) -0.08272 0.02511 -3.294 0.001139 **
## Meat 0.21911 0.03262 6.717 1.39e-10 ***
## Fillings 0.77532 0.17478 4.436 1.41e-05 ***
## `Meat:filling` 0.12343 0.02390 5.164 5.18e-07 ***
## Uniformity 0.05204 0.02192 2.374 0.018385 *
## Synergy 0.27838 0.03731 7.461 1.68e-12 ***
## Wrap 0.03193 0.01844 1.732 0.084661 .
## PicoTRUE -0.11835 0.04153 -2.850 0.004767 **
## ChickenTRUE -1.17196 0.38372 -3.054 0.002518 **
## ShrimpTRUE -0.23006 0.08972 -2.564 0.010969 *
## CilantroTRUE 0.18586 0.19035 0.976 0.329888
## OnionTRUE -0.33842 0.18440 -1.835 0.067737 .
## Meat:ChickenTRUE 0.25674 0.11018 2.330 0.020650 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2968 on 234 degrees of freedom
## Multiple R-squared: 0.8341, Adjusted R-squared: 0.8242
## F-statistic: 84.03 on 14 and 234 DF, p-value: < 2.2e-16
TLDR: \(Burrito Quality^*= -0.80548 + 0.081Temp^* -0.083Fillings^{*2}+0.775Fillings^* +0.219Meat ^* \\ +0.123MeatToFillingRatio+0.052Uniformity^* +0.278Synergy^*+0.032Wrap^* \\ -0.118PicoDeGallo^{**}-1.172Chicken^{**}-0.230Shrimp^{**}+0.186Cilantro^{**}-0.338Onion^{**} \\ +0.257Meat:Chicken^{***}\)
\(*\) Quality assessment from 1 to 5
\(**\) 1 or 0 categorical variable
\(***\) Interaction term