How to Open a Coffee Shop in NYC

A data analytics project — simulating a business decision with open data

Published

March 31, 2026

About this project

This project simulates the research a data analyst would do to open a coffee shop in New York City. The project incorporates multiple datasets, aiming to answer various crucial questions for the future success of the business.

Project Canvas:

📍 Where in NYC should the shop be located?
Which neighbourhoods offer the best balance of subway access and safety – where foot traffic is expected?
☕ What should be on the menu?
Which coffee and tea products actually drive the most revenue?
📅 Which holidays are worth targeting with promotions?
When does consumer spending peak — and by how much?
🌍 Where should the coffee be sourced from?
Who are the major exporters, and what do they charge per tonne?
Bonus Question!
In the future, to which countries should we consider expanding internationally?

Dataset	Source	Purpose
NYPD Arrest Data (H2)	NYC Open Data	Safety scoring by ZIP
MTA Subway Stations	NYC Open Data	Transit access scoring
Coffee Shop Sales	Kaggle	Menu product analysis
Vending Machine Sales	Kaggle	Menu product analysis
Walmart Weekly Sales	Kaggle	Holiday spending proxy
UN Comtrade API (HS 0901)	UN Statistics	Coffee export suppliers
Coffee Consumption by Country 2026	World Population Review	International exposure

Pipeline: Raw data is cleaned, aggregated into analysis-ready tables and evaluated + visualized. raw -> stage -> mart -> viz

Q1 — Where to open?

The top-scoring locations combine high subway station density with below-average crime. Long Island City (11101) ranks first with a score of 82.3/100 — 13 subway stations and moderate arrests relative to its transit density. However, it is more of an outlier, as both the number of subways aree crimes are high. Eastern Brooklyn dominates the top 10.

Show code

nyc_subway_locations <- read.csv(file.path(RAW, 'MTA_Subway_Stations.csv'))
colnames(nyc_subway_locations)
nyc_subway_locations <- nyc_subway_locations |> 
  select('Station.ID', 'GTFS.Latitude', 'GTFS.Longitude', 'Georeference', 'Complex.ID', 'Borough') |> 
  na.omit()
write.csv(nyc_subway_locations, file.path(STAGE, "stg_nyc_subway_locations.csv"), row.names = F)

crime_data <- read.csv(file.path(RAW,'NYPD_Arrest_Data__Year_to_Date_.csv'))
crime_data$ARREST_MONTH <- month(as.Date(crime_data$ARREST_DATE, , format = "%m/%d/%Y"))
unique(crime_data$ARREST_MONTH)
crime_data <- crime_data |>  
  filter(ARREST_MONTH %in% c(7,8,9,10,11,12)) |>
  select(ARREST_KEY, Latitude, Longitude, Location) |> 
  na.omit() 
write.csv(crime_data, file.path(STAGE, "stg_crime_data.csv"), row.names = F)

options(tigris_use_cache = T)
dir.create("map", showWarnings = F)

crime  <- read.csv("stage/stg_crime_data.csv")
subway <- read.csv("stage/stg_nyc_subway_locations.csv") |>
  rename(Latitude = GTFS.Latitude, Longitude = GTFS.Longitude)

nyc_zips <- zctas(state = "NY", year = 2010) |>
  st_transform(4326) |>
  select(zipcode = ZCTA5CE10, geometry)

crime_sf  <- st_as_sf(crime,  coords = c("Longitude", "Latitude"), crs = 4326)
subway_sf <- st_as_sf(subway, coords = c("Longitude", "Latitude"), crs = 4326)

crime_counts <- st_join(crime_sf, nyc_zips, left = FALSE) |>
  st_drop_geometry() |>
  count(zipcode, name = "crime_count")

subway_counts <- st_join(subway_sf, nyc_zips, left = FALSE) |>
  st_drop_geometry() |>
  count(zipcode, name = "subway_stations")

nyc_zips <- nyc_zips |>
  left_join(crime_counts,  by = "zipcode") |>
  left_join(subway_counts, by = "zipcode") |>
  mutate(
    crime_count     = coalesce(crime_count,     0L),
    subway_stations = coalesce(subway_stations, 0L)
  ) |>
  # Only exclude ZIPs with no signal at all
  filter(crime_count > 0 | subway_stations > 0)

nyc_zips <- nyc_zips |>
  mutate(
    # BUG FIX: normalize on crime_count directly, no population division
    crime_norm  = (crime_count - min(crime_count)) / (max(crime_count) - min(crime_count)),
    subway_norm = (subway_stations - min(subway_stations)) / (max(subway_stations) - min(subway_stations)),
    
    # Score: lower crime = safer (+50), more subway = better connected (+50)
    score = round((1 - crime_norm) * 50 + subway_norm * 50, 1),
    
    # BUG FIX: crime_level now based on crime_count, not crime_rate
    crime_level = case_when(
      crime_count <= quantile(crime_count, 0.33) ~ "Low",
      crime_count <= quantile(crime_count, 0.66) ~ "Medium",
      TRUE                                        ~ "High"
    )
  ) |>
  arrange(desc(score)) |>
  mutate(rank = row_number())

nyc_zips |>
  st_drop_geometry() |>
  select(rank, zipcode, crime_level, crime_count, subway_stations, score) |>
  write.csv("mart/neighborhood_scores.csv", row.names = FALSE)

pal <- colorNumeric(palette = "RdYlGn", domain = nyc_zips$score)

leaflet(nyc_zips) |>
  addProviderTiles("CartoDB.DarkMatter") |>
  addPolygons(
    fillColor   = ~pal(score),
    fillOpacity = 0.7,
    color       = "#222",
    weight      = 0.8,
    popup = ~paste0(
      "<b>ZIP ", zipcode, "</b><br>",
      "Rank: #", rank, "<br>",
      "Score: <b>", score, "</b>/100<br>",
      "Crime: <b>", crime_level, "</b> (", format(crime_count, big.mark = ","), " arrests)<br>",
      "Subway stations: <b>", subway_stations, "</b>"
    )
  ) |>
  addLegend("bottomleft", pal = pal, values = ~score,
            title = "&#9749; Location Score", opacity = 0.9) |>
  htmlwidgets::saveWidget("map/nyc_location_map.html", selfcontained = TRUE)

Interactive map

Each ZIP code is coloured from red (poor score) to green (excellent). Click any polygon to see rank, score, crime level, and subway station count.

# A tibble: 6 × 6
   rank zipcode crime_level crime_count subway_stations score
  <dbl>   <dbl> <chr>             <dbl>           <dbl> <dbl>
1     1   11101 High               1295              13  82.3
2     2   11207 High               1771              13  75.8
3     3   11215 Medium              228               7  73.8
4     4   11201 High               1984              12  69.1
5     5   11206 High               1141               9  69  
6     6   10007 Medium              628               7  68.4

How the score is calculated

\[ \text{Score} = 50 \times \underbrace{\text{(normalized safety)}}_{\text{lower crime → higher points}} + 50 \times \underbrace{\text{(normalized transit access)}}_{\text{more subway → higher points}} \]

Q2 — What to sell?

Three products score at the top revenue percentile across all channels and should anchor the menu: Barista Espresso, Latte, and Brewed Chai Tea. Americano with Milk and Gourmet Brewed Coffee are strong additions.

Show code

coffee_shop_sales <- read_excel(file.path(RAW, 'Coffee Shop Sales.xlsx'))
coffee_shop_sales <- coffee_shop_sales |> 
  filter(product_category %in% c("Coffee", "Tea")) |> 
  select(transaction_id, transaction_date, transaction_qty, unit_price,
         product_category, product_type, product_detail) |> 
  na.omit()
write.csv(coffee_shop_sales, file.path(STAGE, "stg_coffee_shop_sales.csv"), row.names = F)

vending_machine_sales <- read.csv(file.path(RAW, 'index_1.csv'))
vending_machine_sales <- vending_machine_sales |> 
  select(datetime, money, coffee_name) |> 
  na.omit()
write.csv(vending_machine_sales, file.path(STAGE, "stg_vending_machine_sales.csv"), row.names = F)


coffee_sales <- coffee_shop_sales |> 
  filter(product_category == "Coffee") |>  
  group_by(product_type) |> 
  summarise(
    num_transactions = n(),                                      
    total_units_sold = sum(transaction_qty),       
    total_revenue    = sum(transaction_qty * unit_price)) |> 
  rename(product = product_type) |> 
  mutate(share_of_sales = round(total_revenue / sum(total_revenue) * 100, 1),
         score = percent_rank(total_revenue)
  ) |> 
  arrange(desc(total_revenue))

tea_sales <- coffee_shop_sales |> 
  filter(product_category == "Tea") |> 
  group_by(product_type) |> 
  summarise(
    num_transactions = n(),                                      
    total_units_sold = sum(transaction_qty),       
    total_revenue    = sum(transaction_qty * unit_price)                         
  ) |>
  rename(product = product_type) |> 
  mutate(share_of_sales = round(total_revenue / sum(total_revenue) * 100, 1),
         score = percent_rank(total_revenue)) |>
  arrange(desc(total_revenue))

vending_machine_sales <- vending_machine_sales |> 
  group_by(coffee_name) |> 
  summarise(
    total_revenue = sum(money),
    num_transactions = n()
  ) |> 
  rename(product = coffee_name) |> 
  mutate(share_of_sales = round(total_revenue / sum(total_revenue) * 100, 1),
         score = percent_rank(total_revenue)) |>
  filter(share_of_sales >= 10) |> 
  arrange(desc(total_revenue))

How the score is calculated

Revenue is computed separately for three channels: coffee shop coffees, coffee shop teas, and vending machine drinks. Each product is ranked using percent_rank(total_revenue) within its channel — a score of 1.0 means top revenue in that channel. Only products above the 50th percentile are retained.

Show code

products_list <- bind_rows(
  coffee_sales,
  tea_sales,
  vending_machine_sales) |>
  select(product, score) |> 
  filter(score >= 0.5) |> 
  arrange(desc(score))
write.csv(products_list, file.path(MART, 'products_list.csv'), row.names = F)

Q3 — When to run promotions?

Thanksgiving (28 Nov) drives the highest average weekly spend — $66.2M, which is 53% above Christmas. Valentine’s Day and Labor Day are meaningful secondary peaks. These four dates are the clearest opportunities for seasonal menu specials, pricing adjustments, or targeted marketing.

Show code

holidays_elasticity <- read.csv(file.path(RAW, 'Walmart.csv'))
holidays_elasticity <- holidays_elasticity |>
  filter(
    Holiday_Flag == 1,
    year(as.Date(Date, format = "%d-%m-%Y")) %in% c(2010, 2011, 2012)
  ) |>
  select(Date, Weekly_Sales) |>
  na.omit() |>
  group_by(Date) |>
  summarise(
    Holiday_Sales = sum(Weekly_Sales),
    .groups = "drop"
  ) |>
  arrange(month(as.Date(Date, format = "%d-%m-%Y")),
          day(as.Date(Date, format = "%d-%m-%Y")))
write.csv(holidays_elasticity, file.path(STAGE, "stg_holidays_elasticity.csv"), row.names = F)

# A tibble: 4 × 3
  Holiday_Name    Avg_Sales Date 
  <chr>               <dbl> <chr>
1 Thanksgiving    66207304. 28-11
2 Valentine's Day 48560759. 14-02
3 Labor Day       46909228. 01-09
4 Christmas       43237490. 25-12

How this was measured

Walmart weekly store sales (2010–2012) serve as a consumer spending proxy. Only weeks flagged as Holiday_Flag == 1 are included. Sales are averaged across all three years and all stores for each holiday. This captures the relative demand curve across the year — the same pattern is expected to apply to coffee shop upsell potential.

Q4 — Where to source coffee?

Brazil is the default supplier for volume — largest export value ($14.7B), highest volume (4.2M tonnes), and the lowest average price per tonne ($3,466). For premium or specialty offerings, Colombia and Ethiopia command higher prices but have strong export track records and globally recognised quality reputations.

Show code

API_KEY <- "9ee67ca3bb544e37bca294e322b3aa7f"  # https://comtradedeveloper.un.org/
YEAR    <- "2023"
HS_CODE <- "0901"  # Coffee

resp <- request("https://comtradeapi.un.org/data/v1/get/C/A/HS") |>
  req_url_query(
    cmdCode     = HS_CODE,
    flowCode    = "X",       # exports
    period      = YEAR,
    partnerCode = "0",       # world total
    partner2Code = "0",
    includeDesc = "True",
    `subscription-key` = API_KEY
  ) |>
  req_perform() |>
  resp_body_json(simplifyVector = T)

re_exporters <- c("Germany", "Switzerland", "Italy", "France", "Belgium",
                  "Netherlands", "Spain", "USA", "Canada",
                  "United Kingdom", "Sweden", "Poland", "Austria", 
                  "Estonia", "United Arab Emirates", "Slovenia", "Slovakia",
                  "Bulgaria", "Portugal", "Luxembourg", "Czechia", 
                  "Lithuania", "Hungary", "Denmark", "Türkiye")

country_exports <- resp$data |>
  as_tibble() |>
  select(
    Country      = reporterDesc,
    Country_Code = reporterCode,
    Year         = period,
    Value_USD    = primaryValue,
    Weight_kg    = netWgt
  ) |>
  group_by(Country, Country_Code, Year) |>
  summarise(Value_USD = sum(Value_USD), Weight_kg = sum(Weight_kg), .groups = "drop") |>
  mutate(
    Weight_tons = Weight_kg / 1000,
    Avg_Price_USD_per_ton = round(Value_USD / Weight_tons, 2)
  ) |>
  filter(!is.na(Value_USD), !is.na(Weight_kg), Value_USD > 0, Weight_kg > 0,
         Avg_Price_USD_per_ton >= 3400, Weight_tons >= 10000, Value_USD >= 100e6, !Country %in% re_exporters) |>
  arrange(Avg_Price_USD_per_ton) |>
  mutate(Rank = row_number()) |>
  select(Rank, everything(), -Weight_kg)

write_csv(country_exports, file.path(MART, "coffee_country_exports.csv"))

# A tibble: 6 × 7
   Rank Country   Country_Code  Year Value_USD Weight_tons Avg_Price_USD_per_ton
  <dbl> <chr>            <dbl> <dbl>     <dbl>       <dbl>                 <dbl>
1     1 Brazil              76  2023   1.47e10    4242198.                 3466.
2     2 Papua Ne…          598  2023   2.27e 8      59016.                 3853.
3     3 Peru               604  2023   8.29e 8     204798.                 4049.
4     4 Honduras           340  2023   1.49e 9     343209.                 4335.
5     5 Nicaragua          558  2023   6.09e 8     140378.                 4341.
6     6 Colombia           170  2023   1.17e10    2366659.                 4926.

Data source

Pulled from the UN Comtrade API (HS code 0901 — coffee, roasted/unroasted, 2023). Re-exporters (Germany, Italy, USA, Netherlands, etc.) are excluded to show only true origin countries. Filtered to exporters with ≥ 10,000 tonnes, ≥ $100M value, and avg. price ≥ $3,400/t to remove noise.

Bonus — Where to expand after NYC?

If the NYC shop succeeds, Hong Kong (43.23), Laos (22.59), and Guinea (9.97) top the list of international expansion candidates by coffee consumption rate. Among larger, more accessible markets, Canada (6.84), Italy (6.11), and Australia (5.88) combine high consumption with stable business environments.