This workbook documents the full analytical process behind the HubSpot Integrations Model Exercise, from data ingestion and cleaning through to synthesis and recommendations. Each section follows a consistent structure: setup and rationale, code with inline comments, and post-analysis observations interpreting the results. All data is the synthetic HubSpot dataset provided for this exercise (January–June 2024, n=13,651 customers).

1: Setup & Data Ingestion

Data is synthetic and provided by HubSpot for recruitment purposes. This section loads all required libraries and imports the two datasets from Google Drive. Both files contain monthly snapshots of HubSpot customer data across a six month observation period (January–June 2024).

# ============================================================
# SECTION 1: SETUP & DATA INGESTION
# ============================================================

# --- STEP 1: SETUP & PACKAGE INSTALLATION ---

# Install packages (only runs if not already installed)
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("readxl")) install.packages("readxl")
if (!require("scales")) install.packages("scales")
if (!require("googledrive")) install.packages("googledrive")
if (!require("googlesheets4")) install.packages("googlesheets4")

# Load libraries
library(tidyverse)
library(readxl)
library(scales)
library(googledrive)
library(googlesheets4)

# --- STEP 2: GOOGLE DRIVE AUTHORIZATION & FILE IMPORT ---

# Authenticate with Google Drive
# A browser prompt will open asking you to authorize — sign in
# with your Google account and allow access
drive_auth()

# Download files from Google Drive to your local session
drive_download(
  file = "customer_metadata.xlsx",
  path = "customer_metadata.xlsx",
  overwrite = TRUE
)

drive_download(
  file = "customer_integrations.xlsx",
  path = "customer_integrations.xlsx",
  overwrite = TRUE
)

# Load into dataframes
df_metadata <- read_excel("customer_metadata.xlsx")
df_integrations <- read_excel("customer_integrations.xlsx")

# Quick sanity check
glimpse(df_metadata)

## Rows: 62,339
## Columns: 8
## $ CUSTOMER_ID                <chr> "C00020862", "C00020863", "C00020864", "C00…
## $ SNAPSHOT_MONTH             <dttm> 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ MARKETING_HUB_TIER         <chr> "Ent", "STarter", "STarter", "Pro", "Pro", …
## $ MARKETING_HUB_MRR          <chr> "674.14", "14.16", "9.49", "346.97", "295.9…
## $ MARKETING_HUB_MRR_CANCELED <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "…
## $ SALES_HUB_TIER             <chr> "Enterprise", "NA", "NA", "NA", "NA", "NA",…
## $ SALES_HUB_MRR              <chr> "-17.55", "NA", "NA", "NA", "NA", "NA", "NA…
## $ SALES_HUB_MRR_CANCELED     <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "…

glimpse(df_integrations)

## Rows: 207,506
## Columns: 3
## $ CUSTOMER_ID    <chr> "C00020862", "C00020862", "C00020862", "C00020862", "C0…
## $ SNAPSHOT_MONTH <dttm> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ INTEGRATION    <chr> "GoogleCalendar", "FacebookAds", "GoogleAds", "Calendly…

2: Data Cleaning

Pre-Clean Observations

Upon initial inspection of both datasets, the following issues were identified and will be addressed before analysis:

Customer Metadata:

MARKETING_HUB_TIER contains inconsistent naming conventions for the same tiers: Ent / Enterprise, STarter / Starter, Pro / Professional - will normalize to a single standard.
MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED, SALES_HUB_MRR, and SALES_HUB_MRR_CANCELED were imported as character strings instead of numeric. This is due to "NA" being stored as literal text rather than a true NA value. Will convert to numeric after replacing.
SALES_HUB_TIER also contains string "NA" values where a customer has no Sales Hub product - will convert to true NA.
SALES_HUB_MRR contains at least one negative value (-17.55), which may represent a credit or adjustment. Will flag and monitor but not remove, as it may be intentional.

Customer Integrations:

INTEGRATION contains duplicate naming formats for three integrations: GoogleAds / Google Ads, FacebookAds / Facebook Ads, GoogleCalendar / Google Calendar - will standardize to spaced format for consistency and display.

Cleaning

# ============================================================
# SECTION 2: DATA CLEANING
# ============================================================

## --- CLEANING: CUSTOMER_METADATA ---

df_metadata_clean <- df_metadata %>%
  
  # 1. Normalize Marketing Hub tier naming
  mutate(MARKETING_HUB_TIER = case_when(
    MARKETING_HUB_TIER %in% c("Ent", "Enterprise") ~ "Enterprise",
    MARKETING_HUB_TIER %in% c("STarter", "Starter") ~ "Starter",
    MARKETING_HUB_TIER %in% c("Pro", "Professional") ~ "Professional",
    TRUE ~ NA_character_
  )) %>%
  
  # 2. Normalize Sales Hub tier naming (already cleaner but apply same pattern)
  mutate(SALES_HUB_TIER = case_when(
    SALES_HUB_TIER == "Enterprise" ~ "Enterprise",
    SALES_HUB_TIER == "Starter" ~ "Starter",
    SALES_HUB_TIER == "Professional" ~ "Professional",
    TRUE ~ NA_character_
  )) %>%
  
  # 3. Replace string "NA" with true NA, then convert MRR columns to numeric
  mutate(across(
    c(MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED,
      SALES_HUB_MRR, SALES_HUB_MRR_CANCELED),
    ~ as.numeric(na_if(., "NA"))
  )) %>%
  
  # 4. Ensure SNAPSHOT_MONTH is stored as Date (not datetime)
  mutate(SNAPSHOT_MONTH = as.Date(SNAPSHOT_MONTH))


# --- CLEANING: CUSTOMER_INTEGRATIONS ---

df_integrations_clean <- df_integrations %>%
  
  # 5. Standardize integration names to spaced format
  mutate(INTEGRATION = case_when(
    INTEGRATION == "GoogleAds" ~ "Google Ads",
    INTEGRATION == "FacebookAds" ~ "Facebook Ads",
    INTEGRATION == "GoogleCalendar" ~ "Google Calendar",
    TRUE ~ INTEGRATION
  )) %>%

  # 6. Ensure SNAPSHOT_MONTH is stored as Date
  mutate(SNAPSHOT_MONTH = as.Date(SNAPSHOT_MONTH))


# --- JOIN VALIDATION ---

# Check unique customers in each dataset
meta_customers <- df_metadata_clean %>% distinct(CUSTOMER_ID)
int_customers  <- df_integrations_clean %>% distinct(CUSTOMER_ID)


# --- POST-CLEANING VALIDATION ---

# Confirm Normalization --- Marketing Hub Tiers ---
print(table(df_metadata_clean$MARKETING_HUB_TIER, useNA = "always"))

## 
##   Enterprise Professional      Starter         <NA> 
##         7299        23361        29175         2504

# Confirm Normalization --- Sales Hub Tiers ---
print(table(df_metadata_clean$SALES_HUB_TIER, useNA = "always"))

## 
##   Enterprise Professional      Starter         <NA> 
##         4918         6049         6101        45271

# Confirm --- Integration Names ---
print(sort(unique(df_integrations_clean$INTEGRATION)))

##  [1] "Calendly"        "Canva"           "Eventbrite"      "Facebook Ads"   
##  [5] "Gmail"           "Google Ads"      "Google Calendar" "Instagram"      
##  [9] "LinkedIn"        "Mailchimp"       "Typeform"        "Zapier"

# Confirm --- MRR Column Types ---
print(sapply(df_metadata_clean[, c(
            "MARKETING_HUB_MRR", 
            "MARKETING_HUB_MRR_CANCELED", 
            "SALES_HUB_MRR", 
            "SALES_HUB_MRR_CANCELED")], class))

##          MARKETING_HUB_MRR MARKETING_HUB_MRR_CANCELED 
##                  "numeric"                  "numeric" 
##              SALES_HUB_MRR     SALES_HUB_MRR_CANCELED 
##                  "numeric"                  "numeric"

# Check Count --- Negative MRR Values in Sales Hub ---
df_metadata_clean %>%
  filter(!is.na(SALES_HUB_MRR) & SALES_HUB_MRR < 0) %>%
  summarise(count = n(), total_negative_mrr = sum(SALES_HUB_MRR)) %>%
  print()

## # A tibble: 1 × 2
##   count total_negative_mrr
##   <int>              <dbl>
## 1     4              -70.2

# Review Coverage Counts
cat("--- Customer Coverage ---\n",
    "Unique customers in metadata:", 
    nrow(meta_customers), "\n",
    "Unique customers in integrations:", 
    nrow(int_customers), "\n",
    "Customers in BOTH datasets:", 
    nrow(inner_join(meta_customers, int_customers, by = "CUSTOMER_ID")), "\n",
    "Customers in metadata ONLY (no integrations):", 
    nrow(anti_join(meta_customers, int_customers, by = "CUSTOMER_ID")), "\n",
    "Customers in integrations ONLY (not in metadata):", 
    nrow(anti_join(int_customers, meta_customers, by = "CUSTOMER_ID")), "\n")

## --- Customer Coverage ---
##  Unique customers in metadata: 13651 
##  Unique customers in integrations: 13368 
##  Customers in BOTH datasets: 13368 
##  Customers in metadata ONLY (no integrations): 283 
##  Customers in integrations ONLY (not in metadata): 0

# Check for duplicate CUSTOMER_ID + SNAPSHOT_MONTH in metadata
# (should be zero — one row per customer per month)
meta_dupes <- df_metadata_clean %>%
  group_by(CUSTOMER_ID, SNAPSHOT_MONTH) %>%
  filter(n() > 1) %>%
  ungroup()

cat("--- Metadata Duplicate Check ---\n",
    "Duplicate CUSTOMER_ID + SNAPSHOT_MONTH rows:", nrow(meta_dupes), "\n")

## --- Metadata Duplicate Check ---
##  Duplicate CUSTOMER_ID + SNAPSHOT_MONTH rows: 0

Post-Clean Observations

Marketing Hub Tiers: All three tier values normalized cleanly into Enterprise, Professional, and Starter. The 2,504 NA values are expected and represent customers who hold a Sales Hub subscription only, with no Marketing Hub product.

Sales Hub Tiers: Normalization applied successfully. The large NA count (45,271) is expected. The majority of customers in this dataset are Marketing Hub only, which is confirmed by the ~73% NA rate in SALES_HUB_TIER. As a result, Sales Hub analysis will be treated as directional only throughout this workbook, with Marketing Hub MRR serving as the primary and more statistically reliable signal.

Integration Names: All 12 integrations are now cleanly and distinctly named. The three duplicate naming formats (GoogleAds, FacebookAds, GoogleCalendar) have been standardized to their spaced equivalents for consistency and display readability.

MRR Column Types: All four MRR columns (MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED, SALES_HUB_MRR, SALES_HUB_MRR_CANCELED) successfully converted from character strings to numeric. String "NA" values have been replaced with true R NA.

Negative MRR Values: Four rows were identified with a combined Sales Hub MRR of -$70.20. Given the scale of the dataset (62,000+ rows), this is negligible and most likely represents a credit or billing adjustment. These rows have been flagged and will be monitored but not removed, as removal could introduce bias into aggregated MRR calculations.

Join Validation: The join between both datasets is clean and behaves as expected:

13,651 unique customers in metadata
13,368 customers appear in both datasets, confirming at least one active integration
283 customers appear in metadata only - these are zero-integration customers (~2% of the base) and will serve as the baseline comparison group throughout the retention analysis
0 customers appear in integrations only, confirming a clean one-to-many relationship with no orphaned records
0 duplicate CUSTOMER_ID + SNAPSHOT_MONTH combinations - the metadata grain is confirmed as one row per customer per month, making it safe to join without aggregation concerns

3: Integration Landscape

# ============================================================
# SECTION 3: INTEGRATION LANDSCAPE
# ============================================================

# --- 3.1 OVERALL ADOPTION RATE ---
# What percentage of total customer base used at least
# one integration in any given month?

adoption_by_month <- df_metadata_clean %>%
  left_join(
    df_integrations_clean %>%
      distinct(CUSTOMER_ID, SNAPSHOT_MONTH) %>%
      mutate(has_integration = TRUE),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  ) %>%
  mutate(has_integration = replace_na(has_integration, FALSE)) %>%
  group_by(SNAPSHOT_MONTH) %>%
  summarise(
    total_customers     = n_distinct(CUSTOMER_ID),
    with_integration    = sum(has_integration),
    adoption_rate       = with_integration / total_customers
  )

# Adoption by month
knitr::kable(adoption_by_month,
             digits = 3,
             caption = "Monthly Integration Adoption Rate",
             col.names = c("Snapshot Month", "Total Customers", 
                           "With Integration", "Adoption Rate"))

Monthly Integration Adoption Rate
Snapshot Month	Total Customers	With Integration	Adoption Rate
2024-01-01	9996	9622	0.963
2024-02-01	10070	9726	0.966
2024-03-01	10229	9921	0.970
2024-04-01	10395	10119	0.973
2024-05-01	10675	10404	0.975
2024-06-01	10974	10747	0.979

# Overall adoption rate across full period
overall_adoption <- df_metadata_clean %>%
  left_join(
    df_integrations_clean %>%
      distinct(CUSTOMER_ID) %>%
      mutate(has_integration = TRUE),
    by = "CUSTOMER_ID"
  ) %>%
  mutate(has_integration = replace_na(has_integration, FALSE)) %>%
  summarise(
    total_customers  = n_distinct(CUSTOMER_ID),
    with_integration = sum(has_integration) / 6, # divide by 6 months
    adoption_rate    = (n_distinct(CUSTOMER_ID[has_integration]) / 
                        n_distinct(CUSTOMER_ID))
  )

# Overall adoption
knitr::kable(overall_adoption,
             digits = 3,
             caption = "Overall Adoption Rate (Full Period)",
             col.names = c("Total Customers", "With Integration", 
                           "Adoption Rate"))

Overall Adoption Rate (Full Period)
Total Customers	With Integration	Adoption Rate
13651	10221.33	0.979

# --- 3.2 INTEGRATION POPULARITY & TRENDS ---
# Which integrations are most used, and how has usage
# trended across the 6-month observation period?

# Overall popularity ranking
integration_popularity <- df_integrations_clean %>%
  group_by(INTEGRATION) %>%
  summarise(
    total_usage      = n(),
    unique_customers = n_distinct(CUSTOMER_ID)
  ) %>%
  arrange(desc(unique_customers))

# --- Integration Popularity Ranking ---
knitr::kable(integration_popularity %>% 
             select(INTEGRATION, total_usage, unique_customers),
             digits = 0,
             caption = "Integration Popularity Ranking",
             col.names = c("Integration", "Total Usage", 
                           "Unique Customers"))

Integration Popularity Ranking
Integration	Total Usage	Unique Customers
Mailchimp	19296	5528
Typeform	23483	5416
Canva	19949	4598
LinkedIn	19786	4590
Google Ads	18855	4390
Eventbrite	17963	4185
Facebook Ads	13915	4142
Instagram	16593	4047
Gmail	16957	3954
Google Calendar	16135	3729
Calendly	14314	3404
Zapier	10260	2365

# Monthly trend by integration
integration_trends <- df_integrations_clean %>%
  group_by(SNAPSHOT_MONTH, INTEGRATION) %>%
  summarise(
    unique_customers = n_distinct(CUSTOMER_ID),
    .groups = "drop"
  )

# --- Monthly Trends by Integration (first rows) ---
print(head(integration_trends, 20))

## # A tibble: 20 × 3
##    SNAPSHOT_MONTH INTEGRATION     unique_customers
##    <date>         <chr>                      <int>
##  1 2024-01-01     Calendly                    2180
##  2 2024-01-01     Canva                       3173
##  3 2024-01-01     Eventbrite                  2803
##  4 2024-01-01     Facebook Ads                1301
##  5 2024-01-01     Gmail                       2632
##  6 2024-01-01     Google Ads                  2952
##  7 2024-01-01     Google Calendar             2516
##  8 2024-01-01     Instagram                   2721
##  9 2024-01-01     LinkedIn                    3102
## 10 2024-01-01     Mailchimp                   2015
## 11 2024-01-01     Typeform                    3708
## 12 2024-01-01     Zapier                      1455
## 13 2024-02-01     Calendly                    2230
## 14 2024-02-01     Canva                       3205
## 15 2024-02-01     Eventbrite                  2851
## 16 2024-02-01     Facebook Ads                1638
## 17 2024-02-01     Gmail                       2694
## 18 2024-02-01     Google Ads                  2993
## 19 2024-02-01     Google Calendar             2560
## 20 2024-02-01     Instagram                   2684

# --- 3.3 INTEGRATION COUNT DISTRIBUTION ---
# How many integrations does each customer use per month?
# Bucketed into 0, 1, 2, 3+ for cleaner storytelling.

integration_counts <- df_metadata_clean %>%
  left_join(
    df_integrations_clean %>%
      group_by(CUSTOMER_ID, SNAPSHOT_MONTH) %>%
      summarise(int_count = n_distinct(INTEGRATION), .groups = "drop"),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  ) %>%
  mutate(
    int_count  = replace_na(int_count, 0),
    int_bucket = case_when(
      int_count == 0 ~ "0 Integrations",
      int_count == 1 ~ "1 Integration",
      int_count == 2 ~ "2 Integrations",
      int_count >= 3 ~ "3+ Integrations"
    ),
    int_bucket = factor(int_bucket, levels = c(
      "0 Integrations", "1 Integration",
      "2 Integrations", "3+ Integrations"
    ))
  )

# Distribution summary
int_distribution <- integration_counts %>%
  group_by(int_bucket) %>%
  summarise(
    customer_months = n(),
    pct = customer_months / nrow(integration_counts)
  )

# Integration distribution
knitr::kable(int_distribution,
             digits = 3,
             caption = "Integration Count Distribution",
             col.names = c("Integration Depth", "Customer-Months", 
                           "% of Total"))

Integration Count Distribution
Integration Depth	Customer-Months	% of Total
0 Integrations	1800	0.029
1 Integration	7165	0.115
2 Integrations	12078	0.194
3+ Integrations	41296	0.662

# note: This is a customer-month table, not distinct customers.
# With 13,651 customers × 6 months = ~62,339
# It tells us the share of customer-months spent at each integration depth.


# A--- Average Integrations Per Customer Per Month ---
integration_counts %>%
  summarise(avg_integrations = mean(int_count)) %>%
  print()

## # A tibble: 1 × 1
##   avg_integrations
##              <dbl>
## 1             3.33

Observations

Adoption Rate: Integration adoption is near-universal, ranging from 96.3% in January to 97.9% by June 2024 and trending upward each month. This suggests integrations are not a niche feature, they are a core behavior across the customer base.

Integration Popularity: Mailchimp and Typeform lead in unique customers, while Zapier ranks last. Notably, Mailchimp’s high unique customer count relative to its total usage rows suggests broad but moderate adoption. Zapier’s low ranking is worth investigating further - as a workflow automation tool, it may represent an underutilized but high-value integration given its operational requirement.

Integration Depth: 66.2% of customers use 3 or more integrations per month, with an average of 3.33 integrations per customer. This signals genuine platform embedding rather than casual or exploratory usage. The zero-integration segment is small (2.9%, 283 customers) but analytically important as a baseline comparison group.

Visualizations

# ============================================================
# SECTION 3: VISUALIZATIONS
# ============================================================

hs_orange <- "#FF7A59"
hs_aqua   <- "#00BDA5"
hs_navy   <- "#2D3E50"
hs_peach  <- "#FFF1EE"

# --- VIZ 1: Monthly Adoption Rate Trend ---
# Using orange line on peach background band for brand colors

ggplot(adoption_by_month, aes(x = SNAPSHOT_MONTH, y = adoption_rate)) +
  annotate("rect",
    xmin = min(adoption_by_month$SNAPSHOT_MONTH) - 15,
    xmax = max(adoption_by_month$SNAPSHOT_MONTH) + 15,
    ymin = 0.95, ymax = 1.00,
    fill = hs_peach, alpha = 0.5
  ) +
  geom_line(color = hs_orange, linewidth = 1.5) +
  geom_point(color = hs_orange, size = 3) +
  geom_text(
    aes(label = percent(adoption_rate, accuracy = 0.1)),
    vjust = -1.2, size = 3.5, fontface = "bold", color = hs_navy
  ) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0.95, 1.00)
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title    = "Integration Adoption Rate is Growing Month Over Month",
    subtitle = "% of customers with at least one active integration",
    x        = NULL,
    y        = "Adoption Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", color = hs_navy),
    plot.subtitle    = element_text(color = "gray50"),
    axis.text        = element_text(color = hs_navy),
    panel.grid.minor = element_blank()
  )

ggsave("viz1_adoption_trend.png", width = 10, height = 6, dpi = 300)


# --- VIZ 2: Integration Popularity Ranking ---
# Highlight top 3 in orange, remaining in aqua to create
# visual hierarchy without losing brand cohesion

integration_popularity <- integration_popularity %>%
  mutate(bar_color = ifelse(
    rank(desc(unique_customers)) <= 3, hs_orange, hs_aqua
  ))

ggplot(integration_popularity,
       aes(x = reorder(INTEGRATION, unique_customers),
           y = unique_customers,
           fill = bar_color)) +
  geom_col() +
  geom_text(
    aes(label = comma(unique_customers)),
    hjust = -0.1, size = 3.5, color = hs_navy
  ) +
  scale_fill_identity() +
  coord_flip() +
  scale_y_continuous(
    labels = comma_format(),
    limits = c(0, max(integration_popularity$unique_customers) * 1.15)
  ) +
  labs(
    title    = "Mailchimp and Typeform Lead Integration Adoption",
    subtitle = "Unique customers using each integration (Jan–Jun 2024)",
    x        = NULL,
    y        = "Unique Customers"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.y = element_blank()
  )

ggsave("viz2_integration_popularity.png", width = 10, height = 6, dpi = 300)

# --- VIZ 3: Integration Count Distribution ---
# Gradient effect: deepen color as integration depth increases
# to visually reinforce "more is better"

dist_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_orange,
  "3+ Integrations" = hs_navy
)

ggplot(int_distribution,
       aes(x = int_bucket, y = pct, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(pct, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = dist_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.75)
  ) +
  labs(
    title    = "Most Customers Use 3 or More Integrations",
    subtitle = "Distribution of integration depth per customer-month",
    x        = "Integration Depth",
    y        = "% of Customer-Months"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz3_integration_distribution.png", width = 10, height = 6, dpi = 300)

4: Do Integrations Matter?

# ============================================================
# SECTION 4: DO INTEGRATIONS MATTER? (CORE ANALYSIS)
# ============================================================

# --- SETUP: Build the master analysis table ---
# Join metadata with integration depth buckets created in 
# Section 3, so every customer-month has an integration
# bucket label alongside their MRR and cancellation data

master <- df_metadata_clean %>%
  left_join(
    integration_counts %>%
      select(CUSTOMER_ID, SNAPSHOT_MONTH, int_count, int_bucket),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  )

# Confirm Join --- Master Table Dimensions --
glimpse(master)

## Rows: 62,339
## Columns: 10
## $ CUSTOMER_ID                <chr> "C00020862", "C00020863", "C00020864", "C00…
## $ SNAPSHOT_MONTH             <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ MARKETING_HUB_TIER         <chr> "Enterprise", "Starter", "Starter", "Profes…
## $ MARKETING_HUB_MRR          <dbl> 674.14, 14.16, 9.49, 346.97, 295.94, 19.18,…
## $ MARKETING_HUB_MRR_CANCELED <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SALES_HUB_TIER             <chr> "Enterprise", NA, NA, NA, NA, NA, NA, NA, N…
## $ SALES_HUB_MRR              <dbl> -17.55, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SALES_HUB_MRR_CANCELED     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ int_count                  <int> 6, 2, 3, 2, 4, 3, 2, 3, 0, 1, 3, 1, 3, 4, 0…
## $ int_bucket                 <fct> 3+ Integrations, 2 Integrations, 3+ Integra…

# --- 4.1 CANCELLATION RATE BY INTEGRATION DEPTH ---
# Cancellation rate = canceled MRR / total MRR
# Using Marketing Hub as primary lens (dominant product)
# A customer-month is a "cancellation event" if
# MARKETING_HUB_MRR_CANCELED is not NA (i.e. has a value)

cancellation_by_bucket <- master %>%
  group_by(int_bucket) %>%
  summarise(
    total_customer_months  = n(),
    cancellation_events    = sum(!is.na(MARKETING_HUB_MRR_CANCELED)),
    total_mrr              = sum(MARKETING_HUB_MRR, na.rm = TRUE),
    canceled_mrr           = sum(MARKETING_HUB_MRR_CANCELED, na.rm = TRUE),
    cancellation_rate      = cancellation_events / total_customer_months,
    mrr_cancellation_rate  = canceled_mrr / total_mrr
  ) %>%
  arrange(int_bucket)

# --- Cancellation Rate by Integration Depth ---
knitr::kable(cancellation_by_bucket, 
             digits = 4,
             caption = "Cancellation Rate by Integration Depth")

Cancellation Rate by Integration Depth
int_bucket	total_customer_months	cancellation_events	total_mrr	canceled_mrr	cancellation_rate	mrr_cancellation_rate
0 Integrations	1800	171	157247.8	13195.61	0.0950	0.0839
1 Integration	7165	519	814699.0	50777.29	0.0724	0.0623
2 Integrations	12078	959	2023784.4	151109.68	0.0794	0.0747
3+ Integrations	41296	2325	12663783.2	580640.17	0.0563	0.0459

# --- 4.2 MRR RETENTION BY INTEGRATION DEPTH ---
# Customer Dollar Retention (C$R) as defined in brief:
# (1 - (lost value / starting value)) ^ 12
# First calculate monthly retention, then annualize

mrr_retention_by_bucket <- master %>%
  group_by(int_bucket, SNAPSHOT_MONTH) %>%
  summarise(
    starting_mrr      = sum(MARKETING_HUB_MRR, na.rm = TRUE),
    canceled_mrr      = sum(MARKETING_HUB_MRR_CANCELED, na.rm = TRUE),
    monthly_retention = 1 - (canceled_mrr / starting_mrr),
    .groups = "drop"
  ) %>%
  group_by(int_bucket) %>%
  summarise(
    avg_monthly_retention  = mean(monthly_retention, na.rm = TRUE),
    annualized_retention   = avg_monthly_retention ^ 12
  ) %>%
  arrange(int_bucket)

# MRR retention by bucket
knitr::kable(mrr_retention_by_bucket,
             digits = 4,
             caption = "MRR Retention by Integration Depth",
             col.names = c("Integration Depth", "Avg Monthly Retention",
                           "Annualized Retention (C$R)"))

MRR Retention by Integration Depth
Integration Depth	Avg Monthly Retention	Annualized Retention (C$R)
0 Integrations	0.9170	0.3536
1 Integration	0.9378	0.4629
2 Integrations	0.9254	0.3942
3+ Integrations	0.9541	0.5688

# --- 4.3 AVERAGE MRR BY INTEGRATION DEPTH ---
# Are higher-integration customers also higher value?
# Important context for VP of CS

mrr_by_bucket <- master %>%
  group_by(int_bucket) %>%
  summarise(
    avg_mrr    = mean(MARKETING_HUB_MRR, na.rm = TRUE),
    median_mrr = median(MARKETING_HUB_MRR, na.rm = TRUE),
    total_mrr  = sum(MARKETING_HUB_MRR, na.rm = TRUE)
  ) %>%
  arrange(int_bucket)

# MRR by bucket
knitr::kable(mrr_by_bucket,
             digits = 2,
             caption = "Average MRR by Integration Depth",
             col.names = c("Integration Depth", "Avg MRR",
                           "Median MRR", "Total MRR"))

Average MRR by Integration Depth
Integration Depth	Avg MRR	Median MRR	Total MRR
0 Integrations	91.37	15.66	157247.8
1 Integration	119.04	17.92	814699.0
2 Integrations	174.95	25.62	2023784.4
3+ Integrations	318.97	295.57	12663783.2

Observations

Cancellation Rate by Integration Depth: A clear pattern emerges: customers with zero integrations cancel at 9.5% monthly, declining to 5.6% for customers with 3+ integrations. That represents a ~41% reduction in cancellation rate at peak integration depth. One nuance worth noting: the 2-integration bucket (7.9%) slightly underperforms the 1-integration bucket (7.2%), which reappears in later analysis and appears to be a consistent pattern rather than noise.

Annualized Retention: The gap widens significantly when annualized. Customers with zero integrations retain at 35.4% annually versus 56.9% for customers with 3+ integrations (a 21.5 percentage point difference) that represents a meaningful business impact at scale.

Average MRR: Deeply integrated customers are not only more likely to stay, they are worth significantly more. Average MRR climbs from $91 for zero-integration customers to $319 for those with 3+ integrations, a 3.5x difference. This combination of higher value and lower churn makes the 3+ integration segment the most strategically important cohort in the dataset.

Visualizations

# ============================================================
# SECTION 4: VISUALIZATIONS
# ============================================================

# --- VIZ 4: Cancellation Rate by Integration Depth ---
# Highlight the 3+ bucket in orange, that's the headline

cancellation_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(cancellation_by_bucket,
       aes(x = int_bucket, y = cancellation_rate, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(cancellation_rate, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = cancellation_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.12)
  ) +
  labs(
    title    = "Customers with 3+ Integrations Cancel at Nearly Half the Rate",
    subtitle = "Monthly cancellation rate by integration depth | Marketing Hub",
    x        = "Integration Depth",
    y        = "Cancellation Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz4_cancellation_by_depth.png", width = 10, height = 6, dpi = 300)

# --- VIZ 5: Annualized MRR Retention by Integration Depth ---

retention_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(mrr_retention_by_bucket,
       aes(x = int_bucket, y = annualized_retention, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(annualized_retention, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = retention_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.70)
  ) +
  labs(
    title    = "3+ Integrations Drives 57% Annualized Revenue Retention",
    subtitle = "Annualized Customer Dollar Retention (C$R) by integration depth | Marketing Hub",
    x        = "Integration Depth",
    y        = "Annualized C$R"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz5_mrr_retention_by_depth.png", width = 10, height = 6, dpi = 300)


# --- VIZ 6: Average MRR by Integration Depth ---
# more integrations = higher value customer

mrr_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(mrr_by_bucket,
       aes(x = int_bucket, y = avg_mrr, fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = dollar(avg_mrr, accuracy = 1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = mrr_colors) +
  scale_y_continuous(
    labels = dollar_format(accuracy = 1),
    limits = c(0, 380)
  ) +
  labs(
    title    = "Deeply Integrated Customers Are Worth 3.5x More",
    subtitle = "Average monthly Marketing Hub MRR by integration depth",
    x        = "Integration Depth",
    y        = "Average MRR"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz6_avg_mrr_by_depth.png", width = 10, height = 6, dpi = 300)

5: Deeper Cuts

# ============================================================
# SECTION 5: DEEPER CUTS
# ============================================================

# --- 5.1 TIER x INTEGRATION DEPTH ---
# Are Enterprise customers more deeply integrated?
# This helps CS understand whether integration depth is
# driven by customer sophistication or is an opportunity
# to develop across all tiers

tier_integration <- master %>%
  filter(!is.na(MARKETING_HUB_TIER)) %>%
  group_by(MARKETING_HUB_TIER, int_bucket) %>%
  summarise(
    customer_months = n(),
    .groups = "drop"
  ) %>%
  group_by(MARKETING_HUB_TIER) %>%
  mutate(
    pct_within_tier = customer_months / sum(customer_months)
  ) %>%
  ungroup() %>%
  mutate(MARKETING_HUB_TIER = factor(
    MARKETING_HUB_TIER,
    levels = c("Starter", "Professional", "Enterprise")
  ))

# Tier x integration depth
knitr::kable(tier_integration,
             digits = 3,
             caption = "Integration Depth by Marketing Hub Tier",
             col.names = c("Tier", "Integration Depth",
                           "Customer-Months", "% Within Tier"))

Integration Depth by Marketing Hub Tier
Tier	Integration Depth	Customer-Months	% Within Tier
Enterprise	0 Integrations	38	0.005
Enterprise	1 Integration	179	0.025
Enterprise	2 Integrations	625	0.086
Enterprise	3+ Integrations	6457	0.885
Professional	0 Integrations	297	0.013
Professional	1 Integration	1635	0.070
Professional	2 Integrations	3798	0.163
Professional	3+ Integrations	17631	0.755
Starter	0 Integrations	1386	0.048
Starter	1 Integration	5030	0.172
Starter	2 Integrations	7145	0.245
Starter	3+ Integrations	15614	0.535

# Average integration count by tier
avg_int_by_tier <- master %>%
  filter(!is.na(MARKETING_HUB_TIER)) %>%
  group_by(MARKETING_HUB_TIER) %>%
  summarise(
    avg_integrations = mean(int_count, na.rm = TRUE),
    median_integrations = median(int_count, na.rm = TRUE)
  ) %>%
  mutate(MARKETING_HUB_TIER = factor(
    MARKETING_HUB_TIER,
    levels = c("Starter", "Professional", "Enterprise")
  )) %>%
  arrange(MARKETING_HUB_TIER)

# Average integration count by tier
knitr::kable(avg_int_by_tier,
             digits = 2,
             caption = "Average Integration Count by Tier",
             col.names = c("Tier", "Avg Integrations",
                           "Median Integrations"))

Average Integration Count by Tier
Tier	Avg Integrations	Median Integrations
Starter	2.75	3
Professional	3.68	4
Enterprise	4.61	5

# --- 5.2 INTEGRATION x RETENTION ---
# Which specific integrations correlate with
# the best MRR retention?
# Approach: for each integration, compare the cancellation
# rate of customers who used it vs. those who did not

# Build a customer-month level flag for each integration
integration_retention <- df_integrations_clean %>%
  distinct(CUSTOMER_ID, SNAPSHOT_MONTH, INTEGRATION) %>%
  left_join(
    master %>%
      select(CUSTOMER_ID, SNAPSHOT_MONTH,
             MARKETING_HUB_MRR, MARKETING_HUB_MRR_CANCELED),
    by = c("CUSTOMER_ID", "SNAPSHOT_MONTH")
  ) %>%
  group_by(INTEGRATION) %>%
  summarise(
    customer_months     = n(),
    cancellation_events = sum(!is.na(MARKETING_HUB_MRR_CANCELED)),
    total_mrr           = sum(MARKETING_HUB_MRR, na.rm = TRUE),
    canceled_mrr        = sum(MARKETING_HUB_MRR_CANCELED, na.rm = TRUE),
    cancellation_rate   = cancellation_events / customer_months,
    mrr_cancellation_rate = canceled_mrr / total_mrr
  ) %>%
  arrange(cancellation_rate)

# Integration x retention
knitr::kable(integration_retention,
             digits = 4,
             caption = "Cancellation Rate by Integration",
             col.names = c("Integration", "Customer-Months",
                           "Cancellation Events", "Total MRR",
                           "Canceled MRR", "Cancellation Rate",
                           "MRR Cancellation Rate"))

Cancellation Rate by Integration
Integration	Customer-Months	Cancellation Events	Total MRR	Canceled MRR	Cancellation Rate	MRR Cancellation Rate
Zapier	10260	378	3639402	95822.68	0.0368	0.0263
Facebook Ads	13915	724	4527552	183194.00	0.0520	0.0405
Canva	19949	1066	6044117	238670.10	0.0534	0.0395
Google Calendar	16135	864	5014105	212112.69	0.0535	0.0423
Mailchimp	19296	1054	5792311	249575.21	0.0546	0.0431
Gmail	16957	939	5174913	229987.97	0.0554	0.0444
Calendly	14314	797	4601354	218160.08	0.0557	0.0474
Google Ads	18855	1059	5491260	244888.18	0.0562	0.0446
LinkedIn	19786	1131	6090605	263425.37	0.0572	0.0433
Eventbrite	17963	1030	5529064	250285.27	0.0573	0.0453
Typeform	23483	1354	6973749	315886.64	0.0577	0.0453
Instagram	16593	1274	5116288	312826.41	0.0768	0.0611

# --- 5.3 SALES HUB LENS ---
# Does the integration-retention relationship hold
# for Sales Hub customers?
# Note: smaller sample

sales_hub_retention <- master %>%
  filter(!is.na(SALES_HUB_TIER)) %>%
  group_by(int_bucket) %>%
  summarise(
    customer_months     = n(),
    cancellation_events = sum(!is.na(SALES_HUB_MRR_CANCELED)),
    total_mrr           = sum(SALES_HUB_MRR, na.rm = TRUE),
    canceled_mrr        = sum(SALES_HUB_MRR_CANCELED, na.rm = TRUE),
    cancellation_rate   = cancellation_events / customer_months,
    avg_mrr             = mean(SALES_HUB_MRR, na.rm = TRUE)
  ) %>%
  arrange(int_bucket)

# Sales Hub retention
knitr::kable(sales_hub_retention,
             digits = 4,
             caption = "Sales Hub Retention by Integration Depth",
             col.names = c("Integration Depth", "Customer-Months",
                           "Cancellation Events", "Total MRR",
                           "Canceled MRR", "Cancellation Rate",
                           "Avg MRR"))

Sales Hub Retention by Integration Depth
Integration Depth	Customer-Months	Cancellation Events	Total MRR	Canceled MRR	Cancellation Rate	Avg MRR
0 Integrations	432	36	142427.5	12343.77	0.0833	329.6933
1 Integration	2009	131	953531.4	42480.21	0.0652	474.6299
2 Integrations	3164	235	1537452.8	95858.05	0.0743	485.9206
3+ Integrations	11463	652	5756460.9	286567.28	0.0569	502.1775

Observations

Tier × Integration Depth: Integration depth scales predictably with tier: 88.5% of Enterprise customers use 3+ integrations compared to 53.5% of Starter customers. However, the more interesting finding is that even at the Starter tier, the majority of customers are deeply integrated. This suggests integrations are a universal retention lever accessible across all tiers, not exclusive to sophisticated high-value accounts.

Integration × Retention: Zapier stands out significantly, with a 3.7% monthly cancellation rate, roughly half the rate of most other integrations. It’s workflow automation nature likely creates operational dependency on HubSpot, resulting in deeper attachment than a data-connection integration. Instagram sits at the opposite end at 7.7%, though this likely reflects a customer profile effect. Smaller, more transactional businesses - rather than a causal relationship between Instagram usage and churn.

Sales Hub Lens: The integration-retention relationship holds directionally in Sales Hub, with cancellation declining from 8.3% at zero integrations to 5.7% at 3+. The 1→2 integration bump observed in Marketing Hub reappears here, suggesting it is a genuine pattern worth further investigation. Given the smaller Sales Hub sample (~27% of customers), these findings should be treated as directional only.

Visualizations

# ============================================================
# SECTION 5: VISUALIZATIONS
# ============================================================

# --- VIZ 7: Integration Depth by Tier (Stacked Bar) ---
# Shows how integration behavior scales with tier
# Stacked to show full composition within each tier

ggplot(tier_integration,
       aes(x = MARKETING_HUB_TIER,
           y = pct_within_tier,
           fill = int_bucket)) +
  geom_col(position = "stack") +
  geom_text(
    aes(label = ifelse(pct_within_tier > 0.05,
                       percent(pct_within_tier, accuracy = 0.1), ""),
        color = ifelse(int_bucket %in% c("0 Integrations", "1 Integration"),
                       hs_navy, "white")),
    position = position_stack(vjust = 0.5),
    size = 3.5, fontface = "bold"
  ) +
  scale_color_identity() +
  scale_fill_manual(values = c(
    "0 Integrations"  = "gray80",
    "1 Integration"   = hs_aqua,
    "2 Integrations"  = hs_orange,
    "3+ Integrations" = hs_navy
  )) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title    = "Enterprise Leads Integration Depth, But 3+ Is Common Across All Tiers",
    subtitle = "% of customer-months by integration bucket within each Marketing Hub tier",
    x        = "Marketing Hub Tier",
    y        = "% of Customer-Months",
    fill     = "Integration Depth"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", color = hs_navy),
    plot.subtitle    = element_text(color = "gray50"),
    axis.text        = element_text(color = hs_navy),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position  = "bottom"
  )

ggsave("viz7_tier_x_depth.png", width = 10, height = 6, dpi = 300)


# --- VIZ 8: Cancellation Rate by Integration ---
# Zapier highlighted in orange as the standout
# Instagram highlighted in navy as the outlier to flag

integration_bar_colors <- integration_retention %>%
  mutate(bar_color = case_when(
    INTEGRATION == "Zapier"    ~ hs_orange,
    INTEGRATION == "Instagram" ~ hs_navy,
    TRUE                       ~ hs_aqua
  )) %>%
  pull(bar_color)

ggplot(integration_retention,
       aes(x = reorder(INTEGRATION, cancellation_rate),
           y = cancellation_rate)) +
  geom_col(fill = integration_bar_colors) +
  geom_text(
    aes(label = percent(cancellation_rate, accuracy = 0.1)),
    hjust = -0.1, size = 3.5, fontface = "bold", color = hs_navy
  ) +
  coord_flip() +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.10)
  ) +
  labs(
    title    = "Zapier Users Cancel at Half the Rate & Instagram Warrants Attention",
    subtitle = "Monthly cancellation rate by integration | Marketing Hub customers",
    x        = NULL,
    y        = "Cancellation Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.y = element_blank(),
    legend.position    = "none"
  )

ggsave("viz8_cancellation_by_integration.png", width = 10, height = 6, dpi = 300)


# --- VIZ 9: Sales Hub Retention by Integration Depth ---
# Directional comparison — mirrors Marketing Hub finding

sales_colors <- c(
  "0 Integrations"  = "gray80",
  "1 Integration"   = hs_aqua,
  "2 Integrations"  = hs_aqua,
  "3+ Integrations" = hs_orange
)

ggplot(sales_hub_retention,
       aes(x = int_bucket,
           y = cancellation_rate,
           fill = int_bucket)) +
  geom_col() +
  geom_text(
    aes(label = percent(cancellation_rate, accuracy = 0.1)),
    vjust = -0.5, size = 4, fontface = "bold", color = hs_navy
  ) +
  scale_fill_manual(values = sales_colors) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 0.11)
  ) +
  labs(
    title    = "Integration Depth Drives Lower Cancellation in Sales Hub Too",
    subtitle = "Monthly cancellation rate by integration depth | Sales Hub (smaller sample)",
    x        = "Integration Depth",
    y        = "Cancellation Rate"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title         = element_text(face = "bold", color = hs_navy),
    plot.subtitle      = element_text(color = "gray50"),
    axis.text          = element_text(color = hs_navy),
    panel.grid.minor   = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position    = "none"
  )

ggsave("viz9_sales_hub_retention.png", width = 10, height = 6, dpi = 300)

6: Next Steps and Data Gaps

Key Findings Summary

Finding 1: Integration adoption is near-universal and growing

97.9% of customers used at least one integration as of June 2024, up from 96.3% in January 2024.
Average of 3.33 integrations per customer per month.
The majority of customers (66.2%) use 3+ integrations, suggesting deep platform embedding is the norm.

Finding 2: Integration depth is strongly associated with lower cancellation rates

Customers with 0 integrations cancel at 9.5% monthly vs. 5.6% for customers with 3+ integrations.
That represents a ~41% reduction in cancellation rate at peak integration depth.
Pattern holds directionally in Sales Hub as well.

Finding 3: Integration depth tracks directly with customer value

Average MRR for 0-integration customers: $91.
Average MRR for 3+ integration customers: $319.
Deeply integrated customers are worth 3.5x more on average — making them both more valuable and more likely to stay.

Finding 4: Zapier is the highest-retention integration; Instagram warrants monitoring

Zapier users cancel at just 3.7% monthly - nearly half the rate of most other integrations.
Zapier’s workflow automation nature embeds HubSpot into operational processes, creating deeper attachment.
Instagram users cancel at 7.7% - the highest of any integration. May reflect a different customer profile rather than a causal relationship.

Finding 5: Integration depth scales with tier but is broadly adopted across all tiers

88.5% of Enterprise customers use 3+ integrations vs. 53.5% of Starter customers.
Even at the Starter tier, the majority of customers are deeply integrated - suggesting integrations are a universal retention lever, not exclusive to high-value accounts.

Analytical Limitations

1. Correlation, Not Causation

Integration depth correlates with better retention but does not prove causation. It is possible that customers who intend to stay longer are simply more likely to invest in integrations - rather than integrations themselves driving retention. A controlled experiment or longer longitudinal study would be needed to establish causality.

2. Short Observation Window

Six months of data limits our ability to assess true retention patterns. Churn events often lag integration behavior by months or quarters. A 12–24 month window would produce more reliable C$R calculations.

3. Integration Timing Unknown

We do not know when a customer first adopted an integration relative to their subscription start date or any churn signals. Early integration adoption may be a stronger predictor of retention than late-stage adoption. Time-to-first-integration analysis would require historical onboarding data.

4. Sales Hub Sample Size

Only ~27% of customers have Sales Hub data, making Sales Hub findings directional only. Conclusions should not be generalized without a larger sample.

5. Instagram Interpretation

Instagram’s elevated cancellation rate likely reflects customer profile differences rather than a causal relationship. Smaller, more transactional businesses may gravitate toward social integrations while also having higher natural churn rates. Controlling for tier and company size would clarify this finding.

Proposed Next Steps

Immediate Opportunities (with current data)

1. Identify and Prioritize Zero-Integration Customers

The 283 customers with no integrations represent an immediate CS intervention opportunity. These customers cancel at 9.5% monthly and generate the lowest average MRR. A targeted outreach campaign to drive even one integration adoption could meaningfully improve retention for this segment.

2. Build a Zapier Promotion Strategy

Given Zapier’s outsized retention correlation (3.7% cancellation rate), CS and Product should evaluate whether active promotion of Zapier during onboarding or QBRs could accelerate retention improvement - particularly for Professional and Starter tier customers who use it least.

3. Monitor the Instagram Cohort

Flag Instagram-only customers for proactive CS outreach. Investigate whether this segment skews toward Starter tier and shorter tenure - if so, targeted onboarding support could reduce early churn.

Longer Term Investigations (additional data needed)

4. Time-to-First-Integration Analysis

With onboarding timestamp data, we could measure whether customers who adopt integrations earlier in their lifecycle retain better than late adopters. This would inform whether integration adoption should be a formal onboarding milestone.

5. Integration Sequence Modeling

Do customers who adopt integrations in a specific order (e.g., Gmail → Google Calendar → Zapier) show better retention than those who adopt them in other sequences? Sequence analysis could inform a prescriptive integration journey for CS playbooks.

6. Controlled Retention Experiment

To move from correlation to causation, design an A/B test where one customer cohort receives active integration adoption support during onboarding and another follows the standard path. Measure the 12-month retention delta between cohorts.

7. Expand to Full Product Suite

Current data covers Marketing Hub and Sales Hub only. Extending this analysis to Service Hub, CMS Hub, and Operations Hub would give a complete picture of integration impact across the HubSpot ecosystem.

HubSpot Integrations Model Exercise

Laura Sinkler

April 2026

1: Setup & Data Ingestion

2: Data Cleaning

Pre-Clean Observations

Cleaning

Post-Clean Observations

3: Integration Landscape

Observations

Visualizations

4: Do Integrations Matter?

Observations

Visualizations

5: Deeper Cuts

Observations

Visualizations

6: Next Steps and Data Gaps

Key Findings Summary

Analytical Limitations

Proposed Next Steps

Immediate Opportunities (with current data)

Longer Term Investigations (additional data needed)