Food brings us together. This dataset provides a list of establishments that have burritos and tacos on the menu.
First we read in the data and do some light cleaning of restaurant names that are misspelled or miscapitalized. Since the data will be plotted, we also drop rows missing location data.
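The import chunk itself is not shown; a minimal sketch of this step, assuming a readr import, a hypothetical file name, and an example misspelling fix, might look like:
library(readr)
library(dplyr)
# file name is hypothetical; the raw export has many unnamed columns,
# which read_csv fills in as X27..X236 (see the warning below)
df <- read_csv("restaurants.csv")
# light name cleaning: trim whitespace and fix an example misspelling
df$name <- trimws(df$name)
df$name[df$name == "taco bell"] <- "Taco Bell"
# drop rows without coordinates, since locations will be plotted
df <- df %>% filter(!is.na(longitude), !is.na(latitude))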
## Warning: Missing column names filled in: 'X27' [27], 'X28' [28], ..., 'X236' [236]
First we plot all the locations provided, to gauge what kind of analysis or insight we can glean from the data. We use the ggmap library with the Google Static Maps API; the chunk containing the API key is hidden to prevent misuse. Note that the plot produces a warning about removed rows: the display is limited to the United States, so locations outside the map extent are dropped and the warning is expected.
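The hidden chunk presumably just registers the key; with ggmap this looks something like the following (the key value is a placeholder):
library(ggmap)
# never commit a real key to a public document
register_google(key = "YOUR_API_KEY")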
# columns 10 and 11 hold the coordinates
nums <- df[, c("longitude", "latitude")]
nums <- as.data.frame(sapply(nums, as.numeric))
map <- get_map(location = c(lon = mean(nums$longitude), lat = mean(nums$latitude)),
               zoom = 4, color = "bw")
ggmap(map) +
  geom_point(data = nums, aes(x = longitude, y = latitude),
             fill = "red", alpha = 0.8, size = 2, shape = 21) +
  ggtitle("Location of Given Restaurants")
## Warning: Removed 513 rows containing missing values (geom_point).
There is a good spread of restaurants, though the eastern half of the US appears denser. To get a better look, we shrink the points and lower their opacity, which gives a clearer picture of density.
ggmap(map) +
  geom_point(data = nums, aes(x = longitude, y = latitude),
             colour = "blue", alpha = 0.25, size = 0.5, shape = 21) +
  ggtitle("Density of Given Restaurants")
## Warning: Removed 513 rows containing missing values (geom_point).
It appears many restaurants are concentrated in what are likely metropolitan areas.
Let’s look at the states with the most restaurants serving tacos and/or burritos.
# ref : https://stackoverflow.com/questions/8751497/latitude-longitude-coordinates-to-state-code-in-r
library(sp)
library(maps)
library(maptools)
# map each (longitude, latitude) point to the US state containing it
latlong2state <- function(pointsDF) {
  states <- map('state', fill = TRUE, col = "transparent", plot = FALSE)
  IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
  states_sp <- map2SpatialPolygons(states, IDs = IDs,
                                   proj4string = CRS("+proj=longlat +datum=WGS84"))
  pointsSP <- SpatialPoints(pointsDF,
                            proj4string = CRS("+proj=longlat +datum=WGS84"))
  # index of the state polygon each point falls in (NA if outside the US)
  indices <- over(pointsSP, states_sp)
  stateNames <- sapply(states_sp@polygons, function(x) x@ID)
  stateNames[indices]
}
# convert to state
pts <- data.frame(df$longitude, df$latitude)
df$state <- latlong2state(pts)
count <- as.data.frame(table(df$state))
count <- count[order(-count$Freq),]
names(count)[1] <- "State"
names(count)[2] <- "Number of items"
count[1:5,]
## State Number of items
## 4 california 13813
## 42 texas 4678
## 9 florida 2182
## 12 illinois 2135
## 31 new york 2073
From this table we can see that burritos and tacos are especially popular in California, which has nearly triple the number of items in Texas.
For clustering, the obvious approach is geographic, by region, and the hierarchical and k-means approaches yield similar results.
Hierarchical clustering works for a smaller dataset but is limited by the size of the pairwise distance matrix, which grows quadratically with the number of points. The code below is included for reference, but because it runs on only the first 100 points, its results are not representative.
# ref : https://gis.stackexchange.com/questions/17638/clustering-spatial-data-in-r
library(sp)
library(rgdal)
library(geosphere)
# sample the first 100 points; a full distance matrix would be too large
x <- nums$longitude[1:100]
y <- nums$latitude[1:100]
xy <- SpatialPointsDataFrame(
  matrix(c(x, y), ncol = 2), data.frame(ID = seq_along(x)),
  proj4string = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))
# geodesic distance matrix between all sampled points
mdist <- distm(xy)
# cluster the sample with a hierarchical (complete-linkage) approach
hc <- hclust(as.dist(mdist), method = "complete")
plot(hc)
To visualize the hierarchical clustering, we plot a dendrogram. From it we judge that 5 clusters is a good cut point and cut the tree to generate them.
Note: the commented code below no longer runs: it tries to copy the 100 sampled cluster labels back into the full data frame, which fails on a length mismatch. It was also less accurate than k-means anyway; a workaround is sketched after the commented block.
# # define 5 clusters and add them to the spatial df
# xy$clust <- cutree(hc, k = 5)
#
# # add cluster number back to original dataframe
# # (this is the failing step: xy has 100 rows, nums has many more)
# nums$hc <- xy$clust
# nums$hc <- as.factor(nums$hc)
#
# # plot by color
# ggmap(map) +
#   geom_point(data = nums, aes(x = longitude, y = latitude, fill = hc),
#              alpha = 0.8, size = 5, shape = 21) +
#   guides(fill = FALSE, alpha = FALSE, size = FALSE) +
#   scale_fill_manual(breaks = c("1", "2", "3", "4", "5"),
#                     values = c("red", "blue", "green", "yellow", "black")) +
#   ggtitle("H Clusters")
library(fpc)
# set a seed since k-means cluster numbering varies between runs
set.seed(1)
kclus <- kmeans(cbind(nums$longitude, nums$latitude), centers = 5)
nums$kc <- as.factor(kclus$cluster)
# plot the k-means clusters by color
ggmap(map) +
  geom_point(data = nums, aes(x = longitude, y = latitude, fill = kc),
             alpha = 0.25, size = 0.5, shape = 21) +
  guides(fill = FALSE, alpha = FALSE, size = FALSE) +
  scale_fill_manual(breaks = c("1", "2", "3", "4", "5"),
                    values = c("red", "blue", "green", "yellow", "black")) +
  ggtitle("K Clusters")
## Warning: Removed 513 rows containing missing values (geom_point).
# add k clusters to original
df$kc <- nums$kc
The k-means clusters look cleaner than the hierarchical ones and better match geographic regions, so we add them as a cluster column to the original dataframe.
There is a pretty good balance of taco and burrito items in this dataset. We find them by searching dish names on the menus; the first count below is the number of taco items, and the second is the number of burrito items.
taco <- grep("taco", df$menus.name, ignore.case=TRUE)
burrito <- grep("burrito", df$menus.name, ignore.case=TRUE)
length(taco)
## [1] 29907
length(burrito)
## [1] 25931
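One caveat worth flagging: a menu item whose name mentions both words is counted in both totals. The overlap is easy to quantify:
# items matching both "taco" and "burrito"
length(intersect(taco, burrito))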
# order restaurant name by frequency
name <- df$name
name <- as.data.frame(table(name))
name <- name[order(-name$Freq),]
name[1:5,]
## name Freq
## 1459 Chili's Grill & Bar 7177
## 6156 Sonic Drive In 3127
## 6434 Taco Bell 2271
## 4412 McDonald's 1938
## 3334 Jack in the Box 747
b <- df[burrito, "name"]
b <- as.data.frame(table(b))
b <- b[order(-b$Freq),]
names(b)[1] <- "Name"
head(b)
## Name Freq
## 4129 Sonic Drive In 3127
## 988 Chili's Grill & Bar 2525
## 2980 McDonald's 1935
## 4309 Taco Bell 653
## 3877 Rubio's 395
## 1600 El Pollo Loco 342
t <- df[taco, "name"]
t <- as.data.frame(table(t))
t <- t[order(-t$Freq),]
names(t)[1] <- "Name"
head(t)
## Name Freq
## 1031 Chili's Grill & Bar 4652
## 4691 Taco Bell 1618
## 2393 Jack in the Box 711
## 4250 Rubio's 343
## 1645 El Pollo Loco 331
## 306 Baja Fresh 242
# region labels were assigned by inspecting the cluster map above;
# note that k-means numbering is arbitrary and can change between runs
SE <- which(df$kc == "1")
W  <- which(df$kc == "2")
MW <- which(df$kc == "3")
NE <- which(df$kc == "4")
C  <- which(df$kc == "5")
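Because the numbering is arbitrary, the labels are worth sanity-checking against the cluster centroids (a small check added here; it is not part of the original analysis):
# mean longitude/latitude of each cluster; match these against the map above
round(kclus$centers, 1)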
# overall item counts by restaurant within each region
region_counts <- function(rows, label) {
  temp <- as.data.frame(table(df[rows, "name"]))
  temp <- temp[order(-temp$Freq), ]
  names(temp)[1] <- label
  temp
}
bSE <- region_counts(SE, "Southeast")
bW  <- region_counts(W,  "West")
bMW <- region_counts(MW, "Midwest")
bNE <- region_counts(NE, "Northeast")
bC  <- region_counts(C,  "Central")
head(bSE)
## Southeast Freq
## 531 Chili's Grill & Bar 1212
## 2137 Rubio's 717
## 858 El Pollo Loco 661
## 1216 Jack in the Box 463
## 147 Baja Fresh 455
## 155 Baker's Drive Thru 389
head(bW)
## West Freq
## 289 Chili's Grill & Bar 1342
## 1156 Sonic Drive In 688
## 1205 Taco Bell 555
## 852 McDonald's 489
## 1147 Skyline Chili 198
## 257 Casey's General Store 107
head(bMW)
## Midwest Freq
## 189 Chili's Grill & Bar 1900
## 733 Sonic Drive In 802
## 765 Taco Bell 510
## 530 McDonald's 395
## 346 Frontera Mex-Mex Grill 62
## 54 Barberitos 58
head(bNE)
## Northeast Freq
## 349 Chili's Grill & Bar 1156
## 1489 Taco Bell 577
## 1020 McDonald's 447
## 1416 Sonic Drive In 164
## 113 Baja Fresh Mexican Grill 99
## 494 Dos Amigos Burritos 93
head(bC)
## Central Freq
## 167 Chili's Grill & Bar 1567
## 839 Sonic Drive In 1212
## 629 McDonald's 307
## 874 Taco Bell 276
## 456 Jack in the Box 234
## 362 Freebirds World Burrito 121
# cities with the most items
city <- df$city
city <- as.data.frame(table(city))
city <- city[order(-city$Freq),]
city[1:5,]
## city Freq
## 2645 San Diego 879
## 1709 Los Angeles 819
## 531 Chicago 671
## 2080 New York 649
## 2648 San Francisco 605
### What is in these tacos and burritos?
To figure this out, we go through the menu descriptions and find the most frequently used meaningful words.
# ref : https://stackoverflow.com/questions/26159754/using-r-to-find-top-ten-words-in-a-text
txt <- df$menus.description
txt <- na.omit(txt)
# qdap requires Java via rJava; point JAVA_HOME at a local JRE
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_221')
library(qdap)
# top 10 terms, excluding the 100 most common English words (qdap's Top100Words)
ft <- freq_terms(txt, 10, stopwords = Top100Words)
plot(ft)
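Since qdap's Java dependency can be a hassle, here is a rough dependency-light sketch of the same count, using only the qdapDictionaries data package for the stopword list (an alternative of ours, not the original method):
library(qdapDictionaries)
# tokenize descriptions to lowercase words, drop stopwords, count
words <- unlist(strsplit(tolower(txt), "[^a-z']+"))
words <- words[nchar(words) > 0 & !(words %in% Top100Words)]
head(sort(table(words), decreasing = TRUE), 10)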
Cheese is mentioned widely in dish descriptions. It seems our restaurateur can't go wrong with a dish built from cheese, tortilla, and beans. Unless, that is, they get a little too excited and try out their treasured burrito smoothie recipe, which is clearly meant for a niche audience. But that is outside the scope of this analysis, and we digress.
While we did not meet all of our objectives, there is more that can be done. We could examine prices across the different regions and try to identify regional preferences for ingredients. We could filter restaurants by overall price for consumer interest. And since we used R, we could build a Shiny app that lets a consumer interact with the data.
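As a rough illustration of that last idea, a minimal Shiny sketch reusing the objects built above (the layout and inputs are our assumptions, not a finished design):
library(shiny)
library(ggmap)
# minimal app: pick a k-means region and see its restaurants on the map
ui <- fluidPage(
  selectInput("region", "Region (cluster)", choices = levels(nums$kc)),
  plotOutput("map")
)
server <- function(input, output) {
  output$map <- renderPlot({
    pts <- nums[nums$kc == input$region, ]
    ggmap(map) +
      geom_point(data = pts, aes(x = longitude, y = latitude),
                 colour = "red", alpha = 0.5, size = 0.5) +
      ggtitle(paste("Restaurants in cluster", input$region))
  })
}
shinyApp(ui, server)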