This workbook documents the full analytical process behind the HubSpot Integrations Model Exercise, from data ingestion and cleaning through to synthesis and recommendations. Each section follows a consistent structure: setup and rationale, code with inline comments, and post-analysis observations interpreting the results. All data is the synthetic HubSpot dataset provided for this exercise (January–June 2024, n=13,651 customers).

1: Setup & Data Ingestion

Data is synthetic and provided by HubSpot for recruitment purposes. This section loads all required libraries and imports the two datasets from Google Drive. Both files contain monthly snapshots of HubSpot customer data across a six month observation period (January–June 2024).

# ============================================================
# SECTION 1: SETUP & DATA INGESTION
# ============================================================

# --- STEP 1: SETUP & PACKAGE INSTALLATION ---

# Install packages (only runs if not already installed)
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("readxl")) install.packages("readxl")
if (!require("scales")) install.packages("scales")
if (!require("googledrive")) install.packages("googledrive")
if (!require("googlesheets4")) install.packages("googlesheets4")

# Load libraries
library(tidyverse)
library(readxl)
library(scales)
library(googledrive)
library(googlesheets4)

# --- STEP 2: GOOGLE DRIVE AUTHORIZATION & FILE IMPORT ---

# Authenticate with Google Drive
# A browser prompt will open asking you to authorize — sign in
# with your Google account and allow access
drive_auth()

# Download files from Google Drive to your local session
drive_download(
  file = "customer_metadata.xlsx",
  path = "customer_metadata.xlsx",
  overwrite = TRUE
)

drive_download(
  file = "customer_integrations.xlsx",
  path = "customer_integrations.xlsx",
  overwrite = TRUE
)

# Load into dataframes
df_metadata <- read_excel("customer_metadata.xlsx")
df_integrations <- read_excel("customer_integrations.xlsx")

# Quick sanity check
glimpse(df_metadata)
## Rows: 62,339
## Columns: 8
## $ CUSTOMER_ID                <chr> "C00020862", "C00020863", "C00020864", "C00…
## $ SNAPSHOT_MONTH             <dttm> 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ MARKETING_HUB_TIER         <chr> "Ent", "STarter", "STarter", "Pro", "Pro", …
## $ MARKETING_HUB_MRR          <chr> "674.14", "14.16", "9.49", "346.97", "295.9…
## $ MARKETING_HUB_MRR_CANCELED <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "…
## $ SALES_HUB_TIER             <chr> "Enterprise", "NA", "NA", "NA", "NA", "NA",…
## $ SALES_HUB_MRR              <chr> "-17.55", "NA", "NA", "NA", "NA", "NA", "NA…
## $ SALES_HUB_MRR_CANCELED     <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "…
glimpse(df_integrations)
## Rows: 207,506
## Columns: 3
## $ CUSTOMER_ID    <chr> "C00020862", "C00020862", "C00020862", "C00020862", "C0…
## $ SNAPSHOT_MONTH <dttm> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ INTEGRATION    <chr> "GoogleCalendar", "FacebookAds", "GoogleAds", "Calendly…

2: Data Cleaning

Pre-Clean Observations

Upon initial inspection of both datasets, the following issues were identified and will be addressed before analysis:

Customer Metadata:

  1. MARKETING_HUB_TIER contains inconsistent naming conventions for the same tiers: Ent / Enterprise, STarter / Starter, Pro / Professional - will normalize to a single standard.

  2. MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED, SALES_HUB_MRR, and SALES_HUB_MRR_CANCELED were imported as character strings instead of numeric. This is due to "NA" being stored as literal text rather than a true NA value. Will convert to numeric after replacing.

  3. SALES_HUB_TIER also contains string "NA" values where a customer has no Sales Hub product - will convert to true NA.

  4. SALES_HUB_MRR contains at least one negative value (-17.55), which may represent a credit or adjustment. Will flag and monitor but not remove, as it may be intentional.

Customer Integrations:

  1. INTEGRATION contains duplicate naming formats for three integrations: GoogleAds / Google Ads, FacebookAds / Facebook Ads, GoogleCalendar / Google Calendar - will standardize to spaced format for consistency and display.

Cleaning

# ============================================================
# SECTION 2: DATA CLEANING
# ============================================================

## --- CLEANING: CUSTOMER_METADATA ---

df_metadata_clean <- df_metadata %>%
  
  # 1. Normalize Marketing Hub tier naming
  mutate(MARKETING_HUB_TIER = case_when(
    MARKETING_HUB_TIER %in% c("Ent", "Enterprise") ~ "Enterprise",
    MARKETING_HUB_TIER %in% c("STarter", "Starter") ~ "Starter",
    MARKETING_HUB_TIER %in% c("Pro", "Professional") ~ "Professional",
    TRUE ~ NA_character_
  )) %>%
  
  # 2. Normalize Sales Hub tier naming (already cleaner but apply same pattern)
  mutate(SALES_HUB_TIER = case_when(
    SALES_HUB_TIER == "Enterprise" ~ "Enterprise",
    SALES_HUB_TIER == "Starter" ~ "Starter",
    SALES_HUB_TIER == "Professional" ~ "Professional",
    TRUE ~ NA_character_
  )) %>%
  
  # 3. Replace string "NA" with true NA, then convert MRR columns to numeric
  mutate(across(
    c(MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED,
      SALES_HUB_MRR, SALES_HUB_MRR_CANCELED),
    ~ as.numeric(na_if(., "NA"))
  )) %>%
  
  # 4. Ensure SNAPSHOT_MONTH is stored as Date (not datetime)
  mutate(SNAPSHOT_MONTH = as.Date(SNAPSHOT_MONTH))


# --- CLEANING: CUSTOMER_INTEGRATIONS ---

df_integrations_clean <- df_integrations %>%
  
  # 5. Standardize integration names to spaced format
  mutate(INTEGRATION = case_when(
    INTEGRATION == "GoogleAds" ~ "Google Ads",
    INTEGRATION == "FacebookAds" ~ "Facebook Ads",
    INTEGRATION == "GoogleCalendar" ~ "Google Calendar",
    TRUE ~ INTEGRATION
  )) %>%

  # 6. Ensure SNAPSHOT_MONTH is stored as Date
  mutate(SNAPSHOT_MONTH = as.Date(SNAPSHOT_MONTH))


# --- JOIN VALIDATION ---

# Check unique customers in each dataset
meta_customers <- df_metadata_clean %>% distinct(CUSTOMER_ID)
int_customers  <- df_integrations_clean %>% distinct(CUSTOMER_ID)


# --- POST-CLEANING VALIDATION ---

# Confirm Normalization --- Marketing Hub Tiers ---
print(table(df_metadata_clean$MARKETING_HUB_TIER, useNA = "always"))
## 
##   Enterprise Professional      Starter         <NA> 
##         7299        23361        29175         2504
# Confirm Normalization --- Sales Hub Tiers ---
print(table(df_metadata_clean$SALES_HUB_TIER, useNA = "always"))
## 
##   Enterprise Professional      Starter         <NA> 
##         4918         6049         6101        45271
# Confirm --- Integration Names ---
print(sort(unique(df_integrations_clean$INTEGRATION)))
##  [1] "Calendly"        "Canva"           "Eventbrite"      "Facebook Ads"   
##  [5] "Gmail"           "Google Ads"      "Google Calendar" "Instagram"      
##  [9] "LinkedIn"        "Mailchimp"       "Typeform"        "Zapier"
# Confirm --- MRR Column Types ---
print(sapply(df_metadata_clean[, c(
            "MARKETING_HUB_MRR", 
            "MARKETING_HUB_MRR_CANCELED", 
            "SALES_HUB_MRR", 
            "SALES_HUB_MRR_CANCELED")], class))
##          MARKETING_HUB_MRR MARKETING_HUB_MRR_CANCELED 
##                  "numeric"                  "numeric" 
##              SALES_HUB_MRR     SALES_HUB_MRR_CANCELED 
##                  "numeric"                  "numeric"
# Check Count --- Negative MRR Values in Sales Hub ---
df_metadata_clean %>%
  filter(!is.na(SALES_HUB_MRR) & SALES_HUB_MRR < 0) %>%
  summarise(count = n(), total_negative_mrr = sum(SALES_HUB_MRR)) %>%
  print()
## # A tibble: 1 × 2
##   count total_negative_mrr
##   <int>              <dbl>
## 1     4              -70.2
# Review Coverage Counts
cat("--- Customer Coverage ---\n",
    "Unique customers in metadata:", 
    nrow(meta_customers), "\n",
    "Unique customers in integrations:", 
    nrow(int_customers), "\n",
    "Customers in BOTH datasets:", 
    nrow(inner_join(meta_customers, int_customers, by = "CUSTOMER_ID")), "\n",
    "Customers in metadata ONLY (no integrations):", 
    nrow(anti_join(meta_customers, int_customers, by = "CUSTOMER_ID")), "\n",
    "Customers in integrations ONLY (not in metadata):", 
    nrow(anti_join(int_customers, meta_customers, by = "CUSTOMER_ID")), "\n")
## --- Customer Coverage ---
##  Unique customers in metadata: 13651 
##  Unique customers in integrations: 13368 
##  Customers in BOTH datasets: 13368 
##  Customers in metadata ONLY (no integrations): 283 
##  Customers in integrations ONLY (not in metadata): 0
# Check for duplicate CUSTOMER_ID + SNAPSHOT_MONTH in metadata
# (should be zero — one row per customer per month)
meta_dupes <- df_metadata_clean %>%
  group_by(CUSTOMER_ID, SNAPSHOT_MONTH) %>%
  filter(n() > 1) %>%
  ungroup()

cat("--- Metadata Duplicate Check ---\n",
    "Duplicate CUSTOMER_ID + SNAPSHOT_MONTH rows:", nrow(meta_dupes), "\n")
## --- Metadata Duplicate Check ---
##  Duplicate CUSTOMER_ID + SNAPSHOT_MONTH rows: 0

Post-Clean Observations

Marketing Hub Tiers: All three tier values normalized cleanly into Enterprise, Professional, and Starter. The 2,504 NA values are expected and represent customers who hold a Sales Hub subscription only, with no Marketing Hub product.

Sales Hub Tiers: Normalization applied successfully. The large NA count (45,271) is expected. The majority of customers in this dataset are Marketing Hub only, which is confirmed by the ~73% NA rate in SALES_HUB_TIER. As a result, Sales Hub analysis will be treated as directional only throughout this workbook, with Marketing Hub MRR serving as the primary and more statistically reliable signal.

Integration Names: All 12 integrations are now cleanly and distinctly named. The three duplicate naming formats (GoogleAds, FacebookAds, GoogleCalendar) have been standardized to their spaced equivalents for consistency and display readability.

MRR Column Types: All four MRR columns (MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED, SALES_HUB_MRR, SALES_HUB_MRR_CANCELED) successfully converted from character strings to numeric. String "NA" values have been replaced with true R NA.

Negative MRR Values: Four rows were identified with a combined Sales Hub MRR of -$70.20. Given the scale of the dataset (62,000+ rows), this is negligible and most likely represents a credit or billing adjustment. These rows have been flagged and will be monitored but not removed, as removal could introduce bias into aggregated MRR calculations.

Join Validation: The join between both datasets is clean and behaves as expected:

  • 13,651 unique customers in metadata
  • 13,368 customers appear in both datasets, confirming at least one active integration
  • 283 customers appear in metadata only - these are zero-integration customers (~2% of the base) and will serve as the baseline comparison group throughout the retention analysis
  • 0 customers appear in integrations only, confirming a clean one-to-many relationship with no orphaned records
  • 0 duplicate CUSTOMER_ID + SNAPSHOT_MONTH combinations - the metadata grain is confirmed as one row per customer per month, making it safe to join without aggregation concerns

3: Integration Landscape

# ============================================================
# SECTION 3: INTEGRATION LANDSCAPE
# ============================================================

# --- 3.1 OVERALL ADOPTION RATE ---
# What percentage of total customer base used at least
# one integration in any given month?

adoption_by_month <- df_metadata_clean %>%
  left_join(
    df_integrations_clean %>%
      distinct(CUSTOMER_ID, SNAPSHOT_MONTH) %>%
      mutate(has_integration = TRUE),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  ) %>%
  mutate(has_integration = replace_na(has_integration, FALSE)) %>%
  group_by(SNAPSHOT_MONTH) %>%
  summarise(
    total_customers     = n_distinct(CUSTOMER_ID),
    with_integration    = sum(has_integration),
    adoption_rate       = with_integration / total_customers
  )

# Adoption by month
knitr::kable(adoption_by_month,
             digits = 3,
             caption = "Monthly Integration Adoption Rate",
             col.names = c("Snapshot Month", "Total Customers", 
                           "With Integration", "Adoption Rate"))
Monthly Integration Adoption Rate
Snapshot Month Total Customers With Integration Adoption Rate
2024-01-01 9996 9622 0.963
2024-02-01 10070 9726 0.966
2024-03-01 10229 9921 0.970
2024-04-01 10395 10119 0.973
2024-05-01 10675 10404 0.975
2024-06-01 10974 10747 0.979
# Overall adoption rate across full period
overall_adoption <- df_metadata_clean %>%
  left_join(
    df_integrations_clean %>%
      distinct(CUSTOMER_ID) %>%
      mutate(has_integration = TRUE),
    by = "CUSTOMER_ID"
  ) %>%
  mutate(has_integration = replace_na(has_integration, FALSE)) %>%
  summarise(
    total_customers  = n_distinct(CUSTOMER_ID),
    with_integration = sum(has_integration) / 6, # divide by 6 months
    adoption_rate    = (n_distinct(CUSTOMER_ID[has_integration]) / 
                        n_distinct(CUSTOMER_ID))
  )

# Overall adoption
knitr::kable(overall_adoption,
             digits = 3,
             caption = "Overall Adoption Rate (Full Period)",
             col.names = c("Total Customers", "With Integration", 
                           "Adoption Rate"))
Overall Adoption Rate (Full Period)
Total Customers With Integration Adoption Rate
13651 10221.33 0.979
# --- 3.2 INTEGRATION POPULARITY & TRENDS ---
# Which integrations are most used, and how has usage
# trended across the 6-month observation period?

# Overall popularity ranking
integration_popularity <- df_integrations_clean %>%
  group_by(INTEGRATION) %>%
  summarise(
    total_usage      = n(),
    unique_customers = n_distinct(CUSTOMER_ID)
  ) %>%
  arrange(desc(unique_customers))

# --- Integration Popularity Ranking ---
knitr::kable(integration_popularity %>% 
             select(INTEGRATION, total_usage, unique_customers),
             digits = 0,
             caption = "Integration Popularity Ranking",
             col.names = c("Integration", "Total Usage", 
                           "Unique Customers"))
Integration Popularity Ranking
Integration Total Usage Unique Customers
Mailchimp 19296 5528
Typeform 23483 5416
Canva 19949 4598
LinkedIn 19786 4590
Google Ads 18855 4390
Eventbrite 17963 4185
Facebook Ads 13915 4142
Instagram 16593 4047
Gmail 16957 3954
Google Calendar 16135 3729
Calendly 14314 3404
Zapier 10260 2365
# Monthly trend by integration
integration_trends <- df_integrations_clean %>%
  group_by(SNAPSHOT_MONTH, INTEGRATION) %>%
  summarise(
    unique_customers = n_distinct(CUSTOMER_ID),
    .groups = "drop"
  )

# --- Monthly Trends by Integration (first rows) ---
print(head(integration_trends, 20))
## # A tibble: 20 × 3
##    SNAPSHOT_MONTH INTEGRATION     unique_customers
##    <date>         <chr>                      <int>
##  1 2024-01-01     Calendly                    2180
##  2 2024-01-01     Canva                       3173
##  3 2024-01-01     Eventbrite                  2803
##  4 2024-01-01     Facebook Ads                1301
##  5 2024-01-01     Gmail                       2632
##  6 2024-01-01     Google Ads                  2952
##  7 2024-01-01     Google Calendar             2516
##  8 2024-01-01     Instagram                   2721
##  9 2024-01-01     LinkedIn                    3102
## 10 2024-01-01     Mailchimp                   2015
## 11 2024-01-01     Typeform                    3708
## 12 2024-01-01     Zapier                      1455
## 13 2024-02-01     Calendly                    2230
## 14 2024-02-01     Canva                       3205
## 15 2024-02-01     Eventbrite                  2851
## 16 2024-02-01     Facebook Ads                1638
## 17 2024-02-01     Gmail                       2694
## 18 2024-02-01     Google Ads                  2993
## 19 2024-02-01     Google Calendar             2560
## 20 2024-02-01     Instagram                   2684
# --- 3.3 INTEGRATION COUNT DISTRIBUTION ---
# How many integrations does each customer use per month?
# Bucketed into 0, 1, 2, 3+ for cleaner storytelling.

integration_counts <- df_metadata_clean %>%
  left_join(
    df_integrations_clean %>%
      group_by(CUSTOMER_ID, SNAPSHOT_MONTH) %>%
      summarise(int_count = n_distinct(INTEGRATION), .groups = "drop"),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  ) %>%
  mutate(
    int_count  = replace_na(int_count, 0),
    int_bucket = case_when(
      int_count == 0 ~ "0 Integrations",
      int_count == 1 ~ "1 Integration",
      int_count == 2 ~ "2 Integrations",
      int_count >= 3 ~ "3+ Integrations"
    ),
    int_bucket = factor(int_bucket, levels = c(
      "0 Integrations", "1 Integration",
      "2 Integrations", "3+ Integrations"
    ))
  )

# Distribution summary
int_distribution <- integration_counts %>%
  group_by(int_bucket) %>%
  summarise(
    customer_months = n(),
    pct = customer_months / nrow(integration_counts)
  )

# Integration distribution
knitr::kable(int_distribution,
             digits = 3,
             caption = "Integration Count Distribution",
             col.names = c("Integration Depth", "Customer-Months", 
                           "% of Total"))
Integration Count Distribution
Integration Depth Customer-Months % of Total
0 Integrations 1800 0.029
1 Integration 7165 0.115
2 Integrations 12078 0.194
3+ Integrations 41296 0.662
# note: This is a customer-month table, not distinct customers.
# With 13,651 customers × 6 months = ~62,339
# It tells us the share of customer-months spent at each integration depth.


# A--- Average Integrations Per Customer Per Month ---
integration_counts %>%
  summarise(avg_integrations = mean(int_count)) %>%
  print()
## # A tibble: 1 × 1
##   avg_integrations
##              <dbl>
## 1             3.33

Observations

Adoption Rate: Integration adoption is near-universal, ranging from 96.3% in January to 97.9% by June 2024 and trending upward each month. This suggests integrations are not a niche feature, they are a core behavior across the customer base.

Integration Popularity: Mailchimp and Typeform lead in unique customers, while Zapier ranks last. Notably, Mailchimp’s high unique customer count relative to its total usage rows suggests broad but moderate adoption. Zapier’s low ranking is worth investigating further - as a workflow automation tool, it may represent an underutilized but high-value integration given its operational requirement.

Integration Depth: 66.2% of customers use 3 or more integrations per month, with an average of 3.33 integrations per customer. This signals genuine platform embedding rather than casual or exploratory usage. The zero-integration segment is small (2.9%, 283 customers) but analytically important as a baseline comparison group.

Visualizations

# ============================================================
# SECTION 3: VISUALIZATIONS
# ============================================================

hs_orange <- "#FF7A59"
hs_aqua   <- "#00BDA5"
hs_navy   <- "#2D3E50"
hs_peach  <- "#FFF1EE"

# --- VIZ 1: Monthly Adoption Rate Trend ---
# Using orange line on peach background band for brand colors

ggplot(adoption_by_month, aes(x = SNAPSHOT_MONTH, y = adoption_rate)) +
  annotate("rect",
    xmin = min(adoption_by_month$SNAPSHOT_MONTH) - 15,
    xmax = max(adoption_by_month$SNAPSHOT_MONTH) + 15,
    ymin = 0.95, ymax = 1.00,
    fill = hs_peach, alpha = 0.5
  ) +
  geom_line(color = hs_orange, linewidth = 1.5) +
  geom_point(color = hs_orange, size = 3) +
  geom_text(
    aes(label = percent(adoption_rate, accuracy = 0.1)),
    vjust = -1.2, size = 3.5, fontface = "bold", color = hs_navy
  ) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0.95, 1.00)
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title    = "Integration Adoption Rate is Growing Month Over Month",
    subtitle = "% of customers with at least one active integration",
    x        = NULL,
    y        = "Adoption Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", color = hs_navy),
    plot.subtitle    = element_text(color = "gray50"),
    axis.text        = element_text(color = hs_navy),
    panel.grid.minor = element_blank()
  )

ggsave("viz1_adoption_trend.png", width = 10, height = 6, dpi = 300)


# --- VIZ 2: Integration Popularity Ranking ---
# Highlight top 3 in orange, remaining in aqua to create
# visual hierarchy without losing brand cohesion

integration_popularity <- integration_popularity %>%
  mutate(bar_color = ifelse(
    rank(desc(unique_customers)) <= 3, hs_orange, hs_aqua
  ))

ggplot(integration_popularity,
       aes(x = reorder(INTEGRATION, unique_customers),
           y = unique_customers,
           fill = bar_color)) +
  geom_col() +
  geom_text(
    aes(label = comma(unique_customers)),
    hjust = -0.1, size = 3.5, color = hs_navy
  ) +
  scale_fill_identity() +
  coord_flip() +
  scale_y_continuous(
    labels = comma_format(),
    limits = c(0, max(integration_popularity$unique_customers) * 1.15)
  ) +
  labs(
    title    = "Mailchimp and Typeform Lead Integration Adoption",
    subtitle = "Unique customers using each integration (Jan–Jun 2024)",
    x        = NULL,
    y        = "Unique Customers"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.y = element_blank()
  )

ggsave("viz2_integration_popularity.png", width = 10, height = 6, dpi = 300)

# --- VIZ 3: Integration Count Distribution ---
# Gradient effect: deepen color as integration depth increases
# to visually reinforce "more is better"

dist_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_orange,
  "3+ Integrations" = hs_navy
)

ggplot(int_distribution,
       aes(x = int_bucket, y = pct, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(pct, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = dist_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.75)
  ) +
  labs(
    title    = "Most Customers Use 3 or More Integrations",
    subtitle = "Distribution of integration depth per customer-month",
    x        = "Integration Depth",
    y        = "% of Customer-Months"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz3_integration_distribution.png", width = 10, height = 6, dpi = 300)

4: Do Integrations Matter?

# ============================================================
# SECTION 4: DO INTEGRATIONS MATTER? (CORE ANALYSIS)
# ============================================================

# --- SETUP: Build the master analysis table ---
# Join metadata with integration depth buckets created in 
# Section 3, so every customer-month has an integration
# bucket label alongside their MRR and cancellation data

master <- df_metadata_clean %>%
  left_join(
    integration_counts %>%
      select(CUSTOMER_ID, SNAPSHOT_MONTH, int_count, int_bucket),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  )

# Confirm Join --- Master Table Dimensions --
glimpse(master)
## Rows: 62,339
## Columns: 10
## $ CUSTOMER_ID                <chr> "C00020862", "C00020863", "C00020864", "C00…
## $ SNAPSHOT_MONTH             <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ MARKETING_HUB_TIER         <chr> "Enterprise", "Starter", "Starter", "Profes…
## $ MARKETING_HUB_MRR          <dbl> 674.14, 14.16, 9.49, 346.97, 295.94, 19.18,…
## $ MARKETING_HUB_MRR_CANCELED <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SALES_HUB_TIER             <chr> "Enterprise", NA, NA, NA, NA, NA, NA, NA, N…
## $ SALES_HUB_MRR              <dbl> -17.55, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SALES_HUB_MRR_CANCELED     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ int_count                  <int> 6, 2, 3, 2, 4, 3, 2, 3, 0, 1, 3, 1, 3, 4, 0…
## $ int_bucket                 <fct> 3+ Integrations, 2 Integrations, 3+ Integra…
# --- 4.1 CANCELLATION RATE BY INTEGRATION DEPTH ---
# Cancellation rate = canceled MRR / total MRR
# Using Marketing Hub as primary lens (dominant product)
# A customer-month is a "cancellation event" if
# MARKETING_HUB_MRR_CANCELED is not NA (i.e. has a value)

cancellation_by_bucket <- master %>%
  group_by(int_bucket) %>%
  summarise(
    total_customer_months  = n(),
    cancellation_events    = sum(!is.na(MARKETING_HUB_MRR_CANCELED)),
    total_mrr              = sum(MARKETING_HUB_MRR, na.rm = TRUE),
    canceled_mrr           = sum(MARKETING_HUB_MRR_CANCELED, na.rm = TRUE),
    cancellation_rate      = cancellation_events / total_customer_months,
    mrr_cancellation_rate  = canceled_mrr / total_mrr
  ) %>%
  arrange(int_bucket)

# --- Cancellation Rate by Integration Depth ---
knitr::kable(cancellation_by_bucket, 
             digits = 4,
             caption = "Cancellation Rate by Integration Depth")
Cancellation Rate by Integration Depth
int_bucket total_customer_months cancellation_events total_mrr canceled_mrr cancellation_rate mrr_cancellation_rate
0 Integrations 1800 171 157247.8 13195.61 0.0950 0.0839
1 Integration 7165 519 814699.0 50777.29 0.0724 0.0623
2 Integrations 12078 959 2023784.4 151109.68 0.0794 0.0747
3+ Integrations 41296 2325 12663783.2 580640.17 0.0563 0.0459
# --- 4.2 MRR RETENTION BY INTEGRATION DEPTH ---
# Customer Dollar Retention (C$R) as defined in brief:
# (1 - (lost value / starting value)) ^ 12
# First calculate monthly retention, then annualize

mrr_retention_by_bucket <- master %>%
  group_by(int_bucket, SNAPSHOT_MONTH) %>%
  summarise(
    starting_mrr      = sum(MARKETING_HUB_MRR, na.rm = TRUE),
    canceled_mrr      = sum(MARKETING_HUB_MRR_CANCELED, na.rm = TRUE),
    monthly_retention = 1 - (canceled_mrr / starting_mrr),
    .groups = "drop"
  ) %>%
  group_by(int_bucket) %>%
  summarise(
    avg_monthly_retention  = mean(monthly_retention, na.rm = TRUE),
    annualized_retention   = avg_monthly_retention ^ 12
  ) %>%
  arrange(int_bucket)

# MRR retention by bucket
knitr::kable(mrr_retention_by_bucket,
             digits = 4,
             caption = "MRR Retention by Integration Depth",
             col.names = c("Integration Depth", "Avg Monthly Retention",
                           "Annualized Retention (C$R)"))
MRR Retention by Integration Depth
Integration Depth Avg Monthly Retention Annualized Retention (C$R)
0 Integrations 0.9170 0.3536
1 Integration 0.9378 0.4629
2 Integrations 0.9254 0.3942
3+ Integrations 0.9541 0.5688
# --- 4.3 AVERAGE MRR BY INTEGRATION DEPTH ---
# Are higher-integration customers also higher value?
# Important context for VP of CS

mrr_by_bucket <- master %>%
  group_by(int_bucket) %>%
  summarise(
    avg_mrr    = mean(MARKETING_HUB_MRR, na.rm = TRUE),
    median_mrr = median(MARKETING_HUB_MRR, na.rm = TRUE),
    total_mrr  = sum(MARKETING_HUB_MRR, na.rm = TRUE)
  ) %>%
  arrange(int_bucket)

# MRR by bucket
knitr::kable(mrr_by_bucket,
             digits = 2,
             caption = "Average MRR by Integration Depth",
             col.names = c("Integration Depth", "Avg MRR",
                           "Median MRR", "Total MRR"))
Average MRR by Integration Depth
Integration Depth Avg MRR Median MRR Total MRR
0 Integrations 91.37 15.66 157247.8
1 Integration 119.04 17.92 814699.0
2 Integrations 174.95 25.62 2023784.4
3+ Integrations 318.97 295.57 12663783.2

Observations

Cancellation Rate by Integration Depth: A clear pattern emerges: customers with zero integrations cancel at 9.5% monthly, declining to 5.6% for customers with 3+ integrations. That represents a ~41% reduction in cancellation rate at peak integration depth. One nuance worth noting: the 2-integration bucket (7.9%) slightly underperforms the 1-integration bucket (7.2%), which reappears in later analysis and appears to be a consistent pattern rather than noise.

Annualized Retention: The gap widens significantly when annualized. Customers with zero integrations retain at 35.4% annually versus 56.9% for customers with 3+ integrations (a 21.5 percentage point difference) that represents a meaningful business impact at scale.

Average MRR: Deeply integrated customers are not only more likely to stay, they are worth significantly more. Average MRR climbs from $91 for zero-integration customers to $319 for those with 3+ integrations, a 3.5x difference. This combination of higher value and lower churn makes the 3+ integration segment the most strategically important cohort in the dataset.

Visualizations

# ============================================================
# SECTION 4: VISUALIZATIONS
# ============================================================

# --- VIZ 4: Cancellation Rate by Integration Depth ---
# Highlight the 3+ bucket in orange, that's the headline

cancellation_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(cancellation_by_bucket,
       aes(x = int_bucket, y = cancellation_rate, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(cancellation_rate, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = cancellation_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.12)
  ) +
  labs(
    title    = "Customers with 3+ Integrations Cancel at Nearly Half the Rate",
    subtitle = "Monthly cancellation rate by integration depth | Marketing Hub",
    x        = "Integration Depth",
    y        = "Cancellation Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz4_cancellation_by_depth.png", width = 10, height = 6, dpi = 300)

# --- VIZ 5: Annualized MRR Retention by Integration Depth ---

retention_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(mrr_retention_by_bucket,
       aes(x = int_bucket, y = annualized_retention, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(annualized_retention, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = retention_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.70)
  ) +
  labs(
    title    = "3+ Integrations Drives 57% Annualized Revenue Retention",
    subtitle = "Annualized Customer Dollar Retention (C$R) by integration depth | Marketing Hub",
    x        = "Integration Depth",
    y        = "Annualized C$R"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz5_mrr_retention_by_depth.png", width = 10, height = 6, dpi = 300)


# --- VIZ 6: Average MRR by Integration Depth ---
# more integrations = higher value customer

mrr_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(mrr_by_bucket,
       aes(x = int_bucket, y = avg_mrr, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = dollar(avg_mrr, accuracy = 1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = mrr_colors) +
  scale_y_continuous(
    labels = dollar_format(accuracy = 1),
    limits = c(0, 380)
  ) +
  labs(
    title    = "Deeply Integrated Customers Are Worth 3.5x More",
    subtitle = "Average monthly Marketing Hub MRR by integration depth",
    x        = "Integration Depth",
    y        = "Average MRR"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz6_avg_mrr_by_depth.png", width = 10, height = 6, dpi = 300)

5: Deeper Cuts

# ============================================================
# SECTION 5: DEEPER CUTS
# ============================================================

# --- 5.1 TIER x INTEGRATION DEPTH ---
# Are Enterprise customers more deeply integrated?
# This helps CS understand whether integration depth is
# driven by customer sophistication or is an opportunity
# to develop across all tiers

tier_integration <- master %>%
  filter(!is.na(MARKETING_HUB_TIER)) %>%
  group_by(MARKETING_HUB_TIER, int_bucket) %>%
  summarise(
    customer_months = n(),
    .groups = "drop"
  ) %>%
  group_by(MARKETING_HUB_TIER) %>%
  mutate(
    pct_within_tier = customer_months / sum(customer_months)
  ) %>%
  ungroup() %>%
  mutate(MARKETING_HUB_TIER = factor(
    MARKETING_HUB_TIER,
    levels = c("Starter", "Professional", "Enterprise")
  ))

# Tier x integration depth
knitr::kable(tier_integration,
             digits = 3,
             caption = "Integration Depth by Marketing Hub Tier",
             col.names = c("Tier", "Integration Depth",
                           "Customer-Months", "% Within Tier"))
Integration Depth by Marketing Hub Tier
Tier Integration Depth Customer-Months % Within Tier
Enterprise 0 Integrations 38 0.005
Enterprise 1 Integration 179 0.025
Enterprise 2 Integrations 625 0.086
Enterprise 3+ Integrations 6457 0.885
Professional 0 Integrations 297 0.013
Professional 1 Integration 1635 0.070
Professional 2 Integrations 3798 0.163
Professional 3+ Integrations 17631 0.755
Starter 0 Integrations 1386 0.048
Starter 1 Integration 5030 0.172
Starter 2 Integrations 7145 0.245
Starter 3+ Integrations 15614 0.535
# Average integration count by tier
avg_int_by_tier <- master %>%
  filter(!is.na(MARKETING_HUB_TIER)) %>%
  group_by(MARKETING_HUB_TIER) %>%
  summarise(
    avg_integrations = mean(int_count, na.rm = TRUE),
    median_integrations = median(int_count, na.rm = TRUE)
  ) %>%
  mutate(MARKETING_HUB_TIER = factor(
    MARKETING_HUB_TIER,
    levels = c("Starter", "Professional", "Enterprise")
  )) %>%
  arrange(MARKETING_HUB_TIER)

# Average integration count by tier
knitr::kable(avg_int_by_tier,
             digits = 2,
             caption = "Average Integration Count by Tier",
             col.names = c("Tier", "Avg Integrations",
                           "Median Integrations"))
Average Integration Count by Tier
Tier Avg Integrations Median Integrations
Starter 2.75 3
Professional 3.68 4
Enterprise 4.61 5
# --- 5.2 INTEGRATION x RETENTION ---
# Which specific integrations correlate with
# the best MRR retention?
# Approach: for each integration, compare the cancellation
# rate of customers who used it vs. those who did not

# Build a customer-month level flag for each integration
integration_retention <- df_integrations_clean %>%
  distinct(CUSTOMER_ID, SNAPSHOT_MONTH, INTEGRATION) %>%
  left_join(
    master %>%
      select(CUSTOMER_ID, SNAPSHOT_MONTH,
             MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  ) %>%
  group_by(INTEGRATION) %>%
  summarise(
    customer_months     = n(),
    cancellation_events = sum(!is.na(MARKETING_HUB_MRR_CANCELED)),
    total_mrr           = sum(MARKETING_HUB_MRR, na.rm = TRUE),
    canceled_mrr        = sum(MARKETING_HUB_MRR_CANCELED, na.rm = TRUE),
    cancellation_rate   = cancellation_events / customer_months,
    mrr_cancellation_rate = canceled_mrr / total_mrr
  ) %>%
  arrange(cancellation_rate)

# Integration x retention
knitr::kable(integration_retention,
             digits = 4,
             caption = "Cancellation Rate by Integration",
             col.names = c("Integration", "Customer-Months",
                           "Cancellation Events", "Total MRR",
                           "Canceled MRR", "Cancellation Rate",
                           "MRR Cancellation Rate"))
Cancellation Rate by Integration
Integration Customer-Months Cancellation Events Total MRR Canceled MRR Cancellation Rate MRR Cancellation Rate
Zapier 10260 378 3639402 95822.68 0.0368 0.0263
Facebook Ads 13915 724 4527552 183194.00 0.0520 0.0405
Canva 19949 1066 6044117 238670.10 0.0534 0.0395
Google Calendar 16135 864 5014105 212112.69 0.0535 0.0423
Mailchimp 19296 1054 5792311 249575.21 0.0546 0.0431
Gmail 16957 939 5174913 229987.97 0.0554 0.0444
Calendly 14314 797 4601354 218160.08 0.0557 0.0474
Google Ads 18855 1059 5491260 244888.18 0.0562 0.0446
LinkedIn 19786 1131 6090605 263425.37 0.0572 0.0433
Eventbrite 17963 1030 5529064 250285.27 0.0573 0.0453
Typeform 23483 1354 6973749 315886.64 0.0577 0.0453
Instagram 16593 1274 5116288 312826.41 0.0768 0.0611
# --- 5.3 SALES HUB LENS ---
# Does the integration-retention relationship hold
# for Sales Hub customers?
# Note: smaller sample

sales_hub_retention <- master %>%
  filter(!is.na(SALES_HUB_TIER)) %>%
  group_by(int_bucket) %>%
  summarise(
    customer_months     = n(),
    cancellation_events = sum(!is.na(SALES_HUB_MRR_CANCELED)),
    total_mrr           = sum(SALES_HUB_MRR, na.rm = TRUE),
    canceled_mrr        = sum(SALES_HUB_MRR_CANCELED, na.rm = TRUE),
    cancellation_rate   = cancellation_events / customer_months,
    avg_mrr             = mean(SALES_HUB_MRR, na.rm = TRUE)
  ) %>%
  arrange(int_bucket)

# Sales Hub retention
knitr::kable(sales_hub_retention,
             digits = 4,
             caption = "Sales Hub Retention by Integration Depth",
             col.names = c("Integration Depth", "Customer-Months",
                           "Cancellation Events", "Total MRR",
                           "Canceled MRR", "Cancellation Rate",
                           "Avg MRR"))
Sales Hub Retention by Integration Depth
Integration Depth Customer-Months Cancellation Events Total MRR Canceled MRR Cancellation Rate Avg MRR
0 Integrations 432 36 142427.5 12343.77 0.0833 329.6933
1 Integration 2009 131 953531.4 42480.21 0.0652 474.6299
2 Integrations 3164 235 1537452.8 95858.05 0.0743 485.9206
3+ Integrations 11463 652 5756460.9 286567.28 0.0569 502.1775

Observations

Tier × Integration Depth: Integration depth scales predictably with tier: 88.5% of Enterprise customers use 3+ integrations compared to 53.5% of Starter customers. However, the more interesting finding is that even at the Starter tier, the majority of customers are deeply integrated. This suggests integrations are a universal retention lever accessible across all tiers, not exclusive to sophisticated high-value accounts.

Integration × Retention: Zapier stands out significantly, with a 3.7% monthly cancellation rate, roughly half the rate of most other integrations. It’s workflow automation nature likely creates operational dependency on HubSpot, resulting in deeper attachment than a data-connection integration. Instagram sits at the opposite end at 7.7%, though this likely reflects a customer profile effect. Smaller, more transactional businesses - rather than a causal relationship between Instagram usage and churn.

Sales Hub Lens: The integration-retention relationship holds directionally in Sales Hub, with cancellation declining from 8.3% at zero integrations to 5.7% at 3+. The 1→2 integration bump observed in Marketing Hub reappears here, suggesting it is a genuine pattern worth further investigation. Given the smaller Sales Hub sample (~27% of customers), these findings should be treated as directional only.

Visualizations

# ============================================================
# SECTION 5: VISUALIZATIONS
# ============================================================

# --- VIZ 7: Integration Depth by Tier (Stacked Bar) ---
# Shows how integration behavior scales with tier
# Stacked to show full composition within each tier

ggplot(tier_integration,
       aes(x = MARKETING_HUB_TIER,
           y = pct_within_tier,
           fill = int_bucket)) +
  geom_col(position = "stack") +
  geom_text(
    aes(label = ifelse(pct_within_tier > 0.05,
                       percent(pct_within_tier, accuracy = 0.1), ""),
        color = ifelse(int_bucket %in% c("0 Integrations", "1 Integration"),
                       hs_navy, "white")),
    position = position_stack(vjust = 0.5),
    size = 3.5, fontface = "bold"
  ) +
  scale_color_identity() +
  scale_fill_manual(values = c(
    "0 Integrations"  = "gray80",
    "1 Integration"   = hs_aqua,
    "2 Integrations"  = hs_orange,
    "3+ Integrations" = hs_navy
  )) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title    = "Enterprise Leads Integration Depth, But 3+ Is Common Across All Tiers",
    subtitle = "% of customer-months by integration bucket within each Marketing Hub tier",
    x        = "Marketing Hub Tier",
    y        = "% of Customer-Months",
    fill     = "Integration Depth"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", color = hs_navy),
    plot.subtitle    = element_text(color = "gray50"),
    axis.text        = element_text(color = hs_navy),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position  = "bottom"
  )

ggsave("viz7_tier_x_depth.png", width = 10, height = 6, dpi = 300)


# --- VIZ 8: Cancellation Rate by Integration ---
# Zapier highlighted in orange as the standout
# Instagram highlighted in navy as the outlier to flag

integration_bar_colors <- integration_retention %>%
  mutate(bar_color = case_when(
    INTEGRATION == "Zapier"    ~ hs_orange,
    INTEGRATION == "Instagram" ~ hs_navy,
    TRUE                       ~ hs_aqua
  )) %>%
  pull(bar_color)

ggplot(integration_retention,
       aes(x = reorder(INTEGRATION, cancellation_rate),
           y = cancellation_rate)) +
  geom_col(fill = integration_bar_colors) +
  geom_text(
    aes(label = percent(cancellation_rate, accuracy = 0.1)),
    hjust = -0.1, size = 3.5, fontface = "bold", color = hs_navy
  ) +
  coord_flip() +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.10)
  ) +
  labs(
    title    = "Zapier Users Cancel at Half the Rate & Instagram Warrants Attention",
    subtitle = "Monthly cancellation rate by integration | Marketing Hub customers",
    x        = NULL,
    y        = "Cancellation Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.y = element_blank(),
    legend.position    = "none"
  )

ggsave("viz8_cancellation_by_integration.png", width = 10, height = 6, dpi = 300)


# --- VIZ 9: Sales Hub Retention by Integration Depth ---
# Directional comparison — mirrors Marketing Hub finding

sales_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(sales_hub_retention,
       aes(x = int_bucket,
           y = cancellation_rate,
           fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(cancellation_rate, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = sales_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.11)
  ) +
  labs(
    title    = "Integration Depth Drives Lower Cancellation in Sales Hub Too",
    subtitle = "Monthly cancellation rate by integration depth | Sales Hub (smaller sample)",
    x        = "Integration Depth",
    y        = "Cancellation Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz9_sales_hub_retention.png", width = 10, height = 6, dpi = 300)

6: Next Steps and Data Gaps

Key Findings Summary

Finding 1: Integration adoption is near-universal and growing

  • 97.9% of customers used at least one integration as of June 2024, up from 96.3% in January 2024.
  • Average of 3.33 integrations per customer per month.
  • The majority of customers (66.2%) use 3+ integrations, suggesting deep platform embedding is the norm.

Finding 2: Integration depth is strongly associated with lower cancellation rates

  • Customers with 0 integrations cancel at 9.5% monthly vs. 5.6% for customers with 3+ integrations.
  • That represents a ~41% reduction in cancellation rate at peak integration depth.
  • Pattern holds directionally in Sales Hub as well.

Finding 3: Integration depth tracks directly with customer value

  • Average MRR for 0-integration customers: $91.
  • Average MRR for 3+ integration customers: $319.
  • Deeply integrated customers are worth 3.5x more on average — making them both more valuable and more likely to stay.

Finding 4: Zapier is the highest-retention integration; Instagram warrants monitoring

  • Zapier users cancel at just 3.7% monthly - nearly half the rate of most other integrations.
  • Zapier’s workflow automation nature embeds HubSpot into operational processes, creating deeper attachment.
  • Instagram users cancel at 7.7% - the highest of any integration. May reflect a different customer profile rather than a causal relationship.

Finding 5: Integration depth scales with tier but is broadly adopted across all tiers

  • 88.5% of Enterprise customers use 3+ integrations vs. 53.5% of Starter customers.
  • Even at the Starter tier, the majority of customers are deeply integrated - suggesting integrations are a universal retention lever, not exclusive to high-value accounts.

Analytical Limitations

1. Correlation, Not Causation

Integration depth correlates with better retention but does not prove causation. It is possible that customers who intend to stay longer are simply more likely to invest in integrations - rather than integrations themselves driving retention. A controlled experiment or longer longitudinal study would be needed to establish causality.

2. Short Observation Window

Six months of data limits our ability to assess true retention patterns. Churn events often lag integration behavior by months or quarters. A 12–24 month window would produce more reliable C$R calculations.

3. Integration Timing Unknown

We do not know when a customer first adopted an integration relative to their subscription start date or any churn signals. Early integration adoption may be a stronger predictor of retention than late-stage adoption. Time-to-first-integration analysis would require historical onboarding data.

4. Sales Hub Sample Size

Only ~27% of customers have Sales Hub data, making Sales Hub findings directional only. Conclusions should not be generalized without a larger sample.

5. Instagram Interpretation

Instagram’s elevated cancellation rate likely reflects customer profile differences rather than a causal relationship. Smaller, more transactional businesses may gravitate toward social integrations while also having higher natural churn rates. Controlling for tier and company size would clarify this finding.

Proposed Next Steps

Immediate Opportunities (with current data)

1. Identify and Prioritize Zero-Integration Customers

The 283 customers with no integrations represent an immediate CS intervention opportunity. These customers cancel at 9.5% monthly and generate the lowest average MRR. A targeted outreach campaign to drive even one integration adoption could meaningfully improve retention for this segment.

2. Build a Zapier Promotion Strategy

Given Zapier’s outsized retention correlation (3.7% cancellation rate), CS and Product should evaluate whether active promotion of Zapier during onboarding or QBRs could accelerate retention improvement - particularly for Professional and Starter tier customers who use it least.

3. Monitor the Instagram Cohort

Flag Instagram-only customers for proactive CS outreach. Investigate whether this segment skews toward Starter tier and shorter tenure - if so, targeted onboarding support could reduce early churn.

Longer Term Investigations (additional data needed)

4. Time-to-First-Integration Analysis

With onboarding timestamp data, we could measure whether customers who adopt integrations earlier in their lifecycle retain better than late adopters. This would inform whether integration adoption should be a formal onboarding milestone.

5. Integration Sequence Modeling

Do customers who adopt integrations in a specific order (e.g., Gmail → Google Calendar → Zapier) show better retention than those who adopt them in other sequences? Sequence analysis could inform a prescriptive integration journey for CS playbooks.

6. Controlled Retention Experiment

To move from correlation to causation, design an A/B test where one customer cohort receives active integration adoption support during onboarding and another follows the standard path. Measure the 12-month retention delta between cohorts.

7. Expand to Full Product Suite

Current data covers Marketing Hub and Sales Hub only. Extending this analysis to Service Hub, CMS Hub, and Operations Hub would give a complete picture of integration impact across the HubSpot ecosystem.