Food brings us together. This dataset provides a list of establishments that have burritos and tacos on the menu.
First we read in the data and do some light cleaning of restaurant names that are misspelled or miscapitalized. Since the data will be plotted, we also drop rows missing location data.
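The import chunk itself is not shown; a minimal sketch of this step, assuming a readr import, a hypothetical file name, and an example misspelling fix, might look like:
library(readr)
library(dplyr)
# file name is hypothetical; the raw export has many unnamed columns,
# which read_csv fills in as X27..X236 (see the warning below)
df <- read_csv("restaurants.csv")
# light name cleaning: trim whitespace and fix an example misspelling
df$name <- trimws(df$name)
df$name[df$name == "taco bell"] <- "Taco Bell"
# drop rows without coordinates, since locations will be plotted
df <- df %>% filter(!is.na(longitude), !is.na(latitude))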
## Warning: Missing column names filled in: 'X27' [27], 'X28' [28], ..., 'X236' [236]
First we plot all the locations provided, to gauge what kind of analysis or insight we can glean from the data. We use the ggmap library with the Google Static Maps API; the chunk containing the API key is hidden to prevent misuse. Note that the plot produces a warning about removed rows: the display is limited to the United States, so locations outside the map extent are dropped and the warning is expected.
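The hidden chunk presumably just registers the key; with ggmap this looks something like the following (the key value is a placeholder):
library(ggmap)
# never commit a real key to a public document
register_google(key = "YOUR_API_KEY")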
# columns 10 and 11 hold the coordinates
nums <- df[, c("longitude", "latitude")]
nums <- as.data.frame(sapply(nums, as.numeric))
map <- get_map(location = c(lon = mean(nums$longitude), lat = mean(nums$latitude)),
               zoom = 4, color = "bw")
ggmap(map) +
  geom_point(data = nums, aes(x = longitude, y = latitude),
             fill = "red", alpha = 0.8, size = 2, shape = 21) +
  ggtitle("Location of Given Restaurants")
## Warning: Removed 513 rows containing missing values (geom_point).
There is a good spread of restaurants, though the eastern half of the US appears denser. To get a better look, we shrink the points and lower their opacity, which gives a clearer picture of density.
ggmap(map) +
  geom_point(data = nums, aes(x = longitude, y = latitude),
             colour = "blue", alpha = 0.25, size = 0.5, shape = 21) +
  ggtitle("Density of Given Restaurants")
## Warning: Removed 513 rows containing missing values (geom_point).
It appears many restaurants are concentrated in what are likely metropolitan areas.
Let’s look at the states with the most restaurants serving tacos and/or burritos.
# ref : https://stackoverflow.com/questions/8751497/latitude-longitude-coordinates-to-state-code-in-r
library(sp)
library(maps)
library(maptools)
# map each (longitude, latitude) point to the US state containing it
latlong2state <- function(pointsDF) {
  states <- map('state', fill = TRUE, col = "transparent", plot = FALSE)
  IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
  states_sp <- map2SpatialPolygons(states, IDs = IDs,
                                   proj4string = CRS("+proj=longlat +datum=WGS84"))
  pointsSP <- SpatialPoints(pointsDF,
                            proj4string = CRS("+proj=longlat +datum=WGS84"))
  # index of the state polygon each point falls in (NA if outside the US)
  indices <- over(pointsSP, states_sp)
  stateNames <- sapply(states_sp@polygons, function(x) x@ID)
  stateNames[indices]
}
# convert to state
pts <- data.frame(df$longitude, df$latitude)
df$state <- latlong2state(pts)
count <- as.data.frame(table(df$state))
count <- count[order(-count$Freq),]
names(count)[1] <- "State"
names(count)[2] <- "Number of items"
count[1:5,]
## State Number of items
## 4 california 13813
## 42 texas 4678
## 9 florida 2182
## 12 illinois 2135
## 31 new york 2073
From this table we can see that burritos and tacos are especially popular in California, which has nearly triple the number of items in Texas.
For clustering, the obvious approach is geographic, by region, and the hierarchical and k-means approaches yield similar results.
Hierarchical clustering works for a smaller dataset but is limited by the size of the pairwise distance matrix, which grows quadratically with the number of points. The code below is included for reference, but because it runs on only the first 100 points, its results are not representative.
# ref : https://gis.stackexchange.com/questions/17638/clustering-spatial-data-in-r
library(sp)
library(rgdal)
library(geosphere)
# sample the first 100 points; a full distance matrix would be too large
x <- nums$longitude[1:100]
y <- nums$latitude[1:100]
xy <- SpatialPointsDataFrame(
  matrix(c(x, y), ncol = 2), data.frame(ID = seq_along(x)),
  proj4string = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))
# geodesic distance matrix between all sampled points
mdist <- distm(xy)
# cluster the sample with a hierarchical (complete-linkage) approach
hc <- hclust(as.dist(mdist), method = "complete")
plot(hc)
To visualize the hierarchical clustering, we plot a dendrogram. From it we judge that 5 clusters is a good cut point and cut the tree to generate them.
Note: the commented code below no longer runs: it tries to copy the 100 sampled cluster labels back into the full data frame, which fails on a length mismatch. It was also less accurate than k-means anyway; a workaround is sketched after the commented block.
# # define 5 clusters and add them to the spatial df
# xy$clust <- cutree(hc, k = 5)
#
# # add cluster number back to original dataframe
# # (this is the failing step: xy has 100 rows, nums has many more)
# nums$hc <- xy$clust
# nums$hc <- as.factor(nums$hc)
#
# # plot by color
# ggmap(map) +
#   geom_point(data = nums, aes(x = longitude, y = latitude, fill = hc),
#              alpha = 0.8, size = 5, shape = 21) +
#   guides(fill = FALSE, alpha = FALSE, size = FALSE) +
#   scale_fill_manual(breaks = c("1", "2", "3", "4", "5"),
#                     values = c("red", "blue", "green", "yellow", "black")) +
#   ggtitle("H Clusters")
library(fpc)
# set a seed since k-means cluster numbering varies between runs
set.seed(1)
kclus <- kmeans(cbind(nums$longitude, nums$latitude), centers = 5)
nums$kc <- as.factor(kclus$cluster)
# plot the k-means clusters by color
ggmap(map) +
  geom_point(data = nums, aes(x = longitude, y = latitude, fill = kc),
             alpha = 0.25, size = 0.5, shape = 21) +
  guides(fill = FALSE, alpha = FALSE, size = FALSE) +
  scale_fill_manual(breaks = c("1", "2", "3", "4", "5"),
                    values = c("red", "blue", "green", "yellow", "black")) +
  ggtitle("K Clusters")
## Warning: Removed 513 rows containing missing values (geom_point).
# add k clusters to original
df$kc <- nums$kc
The k-means clusters look cleaner than the hierarchical ones and better match geographic regions, so we add them as a cluster column to the original dataframe.
There is a pretty good balance of taco and burrito items in this dataset. We find them by searching dish names on the menus; the first count below is the number of taco items, and the second is the number of burrito items.
taco <- grep("taco", df$menus.name, ignore.case=TRUE)
burrito <- grep("burrito", df$menus.name, ignore.case=TRUE)
length(taco)
## [1] 29907
length(burrito)
## [1] 25931
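One caveat worth flagging: a menu item whose name mentions both words is counted in both totals. The overlap is easy to quantify:
# items matching both "taco" and "burrito"
length(intersect(taco, burrito))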
# order restaurant name by frequency
name <- df$name
name <- as.data.frame(table(name))
name <- name[order(-name$Freq),]
name[1:5,]
## name Freq
## 1459 Chili's Grill & Bar 7177
## 6156 Sonic Drive In 3127
## 6434 Taco Bell 2271
## 4412 McDonald's 1938
## 3334 Jack in the Box 747
b <- df[burrito, "name"]
b <- as.data.frame(table(b))
b <- b[order(-b$Freq),]
names(b)[1] <- "Name"
head(b)
## Name Freq
## 4129 Sonic Drive In 3127
## 988 Chili's Grill & Bar 2525
## 2980 McDonald's 1935
## 4309 Taco Bell 653
## 3877 Rubio's 395
## 1600 El Pollo Loco 342
t <- df[taco, "name"]
t <- as.data.frame(table(t))
t <- t[order(-t$Freq),]
names(t)[1] <- "Name"
head(t)
## Name Freq
## 1031 Chili's Grill & Bar 4652
## 4691 Taco Bell 1618
## 2393 Jack in the Box 711
## 4250 Rubio's 343
## 1645 El Pollo Loco 331
## 306 Baja Fresh 242
# region labels were assigned by inspecting the cluster map above;
# note that k-means numbering is arbitrary and can change between runs
SE <- which(df$kc == "1")
W  <- which(df$kc == "2")
MW <- which(df$kc == "3")
NE <- which(df$kc == "4")
C  <- which(df$kc == "5")
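Because the numbering is arbitrary, the labels are worth sanity-checking against the cluster centroids (a small check added here; it is not part of the original analysis):
# mean longitude/latitude of each cluster; match these against the map above
round(kclus$centers, 1)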
# overall item counts by restaurant within each region
region_counts <- function(rows, label) {
  temp <- as.data.frame(table(df[rows, "name"]))
  temp <- temp[order(-temp$Freq), ]
  names(temp)[1] <- label
  temp
}
bSE <- region_counts(SE, "Southeast")
bW  <- region_counts(W,  "West")
bMW <- region_counts(MW, "Midwest")
bNE <- region_counts(NE, "Northeast")
bC  <- region_counts(C,  "Central")
head(bSE)
## Southeast Freq
## 531 Chili's Grill & Bar 1212
## 2137 Rubio's 717
## 858 El Pollo Loco 661
## 1216 Jack in the Box 463
## 147 Baja Fresh 455
## 155 Baker's Drive Thru 389
head(bW)
## West Freq
## 289 Chili's Grill & Bar 1342
## 1156 Sonic Drive In 688
## 1205 Taco Bell 555
## 852 McDonald's 489
## 1147 Skyline Chili 198
## 257 Casey's General Store 107
head(bMW)
## Midwest Freq
## 189 Chili's Grill & Bar 1900
## 733 Sonic Drive In 802
## 765 Taco Bell 510
## 530 McDonald's 395
## 346 Frontera Mex-Mex Grill 62
## 54 Barberitos 58
head(bNE)
## Northeast Freq
## 349 Chili's Grill & Bar 1156
## 1489 Taco Bell 577
## 1020 McDonald's 447
## 1416 Sonic Drive In 164
## 113 Baja Fresh Mexican Grill 99
## 494 Dos Amigos Burritos 93
head(bC)
## Central Freq
## 167 Chili's Grill & Bar 1567
## 839 Sonic Drive In 1212
## 629 McDonald's 307
## 874 Taco Bell 276
## 456 Jack in the Box 234
## 362 Freebirds World Burrito 121
# cities with the most items
city <- df$city
city <- as.data.frame(table(city))
city <- city[order(-city$Freq),]
city[1:5,]
## city Freq
## 2645 San Diego 879
## 1709 Los Angeles 819
## 531 Chicago 671
## 2080 New York 649
## 2648 San Francisco 605
### What is in these tacos and burritos?
To figure this out, we go through the menu descriptions and find the most frequently used meaningful words.
# ref : https://stackoverflow.com/questions/26159754/using-r-to-find-top-ten-words-in-a-text
txt <- df$menus.description
txt <- na.omit(txt)
# qdap requires Java via rJava; point JAVA_HOME at a local JRE
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_221')
library(qdap)
# top 10 terms, excluding the 100 most common English words (qdap's Top100Words)
ft <- freq_terms(txt, 10, stopwords = Top100Words)
plot(ft)
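Since qdap's Java dependency can be a hassle, here is a rough dependency-light sketch of the same count, using only the qdapDictionaries data package for the stopword list (an alternative of ours, not the original method):
library(qdapDictionaries)
# tokenize descriptions to lowercase words, drop stopwords, count
words <- unlist(strsplit(tolower(txt), "[^a-z']+"))
words <- words[nchar(words) > 0 & !(words %in% Top100Words)]
head(sort(table(words), decreasing = TRUE), 10)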
Cheese is mentioned widely in dish descriptions. It seems our restaurateur can't go wrong with a dish built from cheese, tortilla, and beans. Unless, that is, they get a little too excited and try out their treasured burrito smoothie recipe, which is clearly meant for a niche audience. But that is outside the scope of this analysis, and we digress.
While we did not meet all of our objectives, there is more that can be done. We could examine prices across the different regions and try to identify regional preferences for ingredients. We could filter restaurants by overall price for consumer interest. And since we used R, we could build a Shiny app that lets a consumer interact with the data.
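As a rough illustration of that last idea, a minimal Shiny sketch reusing the objects built above (the layout and inputs are our assumptions, not a finished design):
library(shiny)
library(ggmap)
# minimal app: pick a k-means region and see its restaurants on the map
ui <- fluidPage(
  selectInput("region", "Region (cluster)", choices = levels(nums$kc)),
  plotOutput("map")
)
server <- function(input, output) {
  output$map <- renderPlot({
    pts <- nums[nums$kc == input$region, ]
    ggmap(map) +
      geom_point(data = pts, aes(x = longitude, y = latitude),
                 colour = "red", alpha = 0.5, size = 0.5) +
      ggtitle(paste("Restaurants in cluster", input$region))
  })
}
shinyApp(ui, server)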