HNI Client Conversion Analytics: HOC Capital Club

Author

Sharon Ikot

Published

May 25, 2026

1. Executive Summary

What client characteristics and engagement factors predict successful conversion of HNI enquiries — and which lead sources and segments deliver the highest conversion rates?

100 Client Enquiries

38% Conversion Rate

59 Core HNI Clients

0.842 Model AUC

HOC Capital Club is a private wealth management club serving High Net Worth Individuals (HNIs) across Nigeria and the African diaspora. With only 38% of 100 client enquiries converting to full membership between January 2025 and March 2026, understanding what drives — and blocks — conversion is a critical operational priority. This analysis applies five analytical techniques to CRM data exported from the club’s client management system to identify the characteristics and engagement patterns that predict successful conversion.

The data contains 100 client enquiry records across 16 variables including risk readiness score, net worth, number of interactions, lead source, and client segment. Five data quality checks were carried out before analysis. Exploratory analysis showed that conversion rates are close across client segments (36.1% to 40.0%) and vary more widely across lead sources (25.0% to 62.5%), with website enquiries highest and referrals at 30.3%, although several source groups contain fewer than ten records. Hypothesis testing found that Risk Readiness Score is significantly higher among converted clients (Mann-Whitney p = 0.017), whereas the association between lead source and conversion was not statistically significant (chi-squared p = 0.580). Correlation analysis identified engagement intensity as the strongest numeric predictor. The logistic regression model achieved an AUC of 0.842, meaning it ranks a randomly chosen converted prospect above a randomly chosen non-converted prospect 84.2% of the time.

Recommendation: HOC Capital Club should implement a structured engagement protocol that prioritises referral-sourced prospects, targets a minimum of five interactions before closing, and qualifies incoming leads using Risk Readiness Score as the primary screening criterion.

2. Professional Disclosure

Job Title: Head, Member Experience and Engagement

Organisation: HOC Capital Club

Sector: Financial Services — Private Wealth Management and High Net Worth Individual (HNI) Client Services

Relevance of each technique to my role:

Exploratory Data Analysis: As Head of Member Experience and Engagement at HOC Capital Club, I am responsible for understanding the profile and behaviour of our HNI client pipeline at every stage of the engagement journey. EDA is directly relevant to my work because before any strategic decision — whether to reallocate relationship manager time, redesign the onboarding process, or adjust our lead qualification criteria — I first need to understand the quality and distribution of our CRM data. In practice, this means profiling the pipeline by segment, source, and conversion status to establish where the business currently stands and where the data has gaps or inconsistencies that could distort reporting.

Data Visualisation: A core part of my role is presenting pipeline performance and member engagement metrics to senior leadership and the club’s executive committee. Visualisation is directly relevant because complex patterns in our CRM data — such as which lead sources convert best or how Risk Readiness Score varies by client segment — need to be communicated clearly and quickly to non-technical decision-makers. Interactive charts allow my team to explore the data in real time during strategy sessions rather than relying solely on static slide decks.

Hypothesis Testing: My team regularly debates operational questions that have direct resource implications — for example, whether referral-sourced clients genuinely convert at higher rates than event-sourced clients, or whether clients with higher Risk Readiness Scores are statistically more likely to commit. Without formal hypothesis testing, these debates are resolved by seniority or intuition rather than evidence. Applying chi-squared and Mann-Whitney tests gives my team statistically rigorous answers to these questions before committing budget or headcount to any specific acquisition channel.

Correlation Analysis: Understanding which client characteristics co-vary with conversion probability is central to my role in designing the member experience journey. Correlation analysis allows me to identify which data points our relationship managers should prioritise during initial prospect conversations — for example, whether the number of interactions or the net worth band is more strongly associated with eventual conversion. This directly informs our engagement protocols and qualification criteria for new prospects.

Logistic Regression: My role involves allocating limited relationship manager capacity across a large pipeline of prospects at different stages of the conversion journey. A logistic regression model produces an individual conversion probability score for each prospect, allowing my team to rank the pipeline objectively and ensure that the most convertible clients receive the highest level of engagement. This transforms our pipeline management from a judgement-based process to a data-driven one.

3. Data Collection & Sampling

Source: HOC Capital Club internal CRM system

Collection method: The dataset was exported directly from the HOC Capital Club CRM platform by the Member Experience and Engagement team. Each record represents a unique HNI client enquiry logged by relationship managers following initial contact, referral intake, or event registration. The export was conducted in March 2026 and covers all enquiries recorded between January 2025 and March 2026.

Tools used: Data was extracted using the CRM platform’s built-in export function, saved as an Excel workbook (.xlsm), and imported into RStudio for cleaning and analysis using the readxl package (Wickham & Bryan, 2025).

Sampling frame: All HNI client enquiries received and logged by HOC Capital Club during the period January 2025 to March 2026, regardless of enquiry source or current conversion status.

Sample size: 100 client enquiry records across 16 variables, covering the full pipeline including converted clients (38), clients currently in progress (37), and clients who did not convert (25).

Time period covered: January 2025 to March 2026 (approximately 15 months)

Variables collected: Client ID, enquiry date, gender, age, number of dependents, professional sector, source of wealth, country of second passport interest, estimated net worth (USD), net worth band, investment budget (USD), lead source, number of engagement interactions, risk readiness score, client segment classification, and conversion status.

Sampling rationale: A census approach was used — all 100 enquiry records logged during the study period were included rather than a random sample, as the full population was accessible and small enough to analyse in its entirety. A sample of 100 meets the CS1 minimum threshold and provides sufficient statistical power for logistic regression with five predictors at the conventional α = 0.05 significance level. The 15-month coverage window was chosen to capture a complete business cycle including seasonal variation in HNI enquiry patterns.
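A rough way to sanity-check the sample-size claim above is the events-per-variable (EPV) heuristic. The sketch below assumes the logistic regression is fitted only on the 63 resolved enquiries (38 converted, 25 not converted) with five predictors. Both the fitting subset and the commonly quoted guideline of about 10 events per predictor are assumptions here, not figures taken from the CRM export.

```r
# Events-per-variable (EPV) check for a binary-outcome model.
# Counts come from the conversion breakdown; the 10-EPV guideline is a rule of thumb.
n_converted     <- 38
n_not_converted <- 25
n_predictors    <- 5   # as stated in the sampling rationale

# EPV uses the rarer outcome class as the number of "events"
n_events <- min(n_converted, n_not_converted)
epv      <- n_events / n_predictors

cat("Events in rarer class:", n_events, "\n")
cat("Events per variable:  ", epv, "\n")
cat("Meets 10-EPV guideline:", epv >= 10, "\n")
```

An EPV below the guideline does not invalidate a model, but it suggests that coefficient estimates may be unstable and that wide confidence intervals should be expected.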

Ethical notes: All personally identifiable information has been removed or anonymised before publication. Client IDs replace real names and contact details. Financial figures — net worth and investment budget — are retained as they are essential to the analysis; however, no individual can be identified from the published document alone. The dataset is used exclusively for academic purposes with the knowledge of the organisation. The analysis does not include any information subject to client confidentiality agreements beyond what is routinely used for internal pipeline reporting.

Data sharing restrictions: The dataset has been anonymised in line with HOC Capital Club’s internal data governance policy. No client names, contact details, or account identifiers are published. Permission to use this data for academic analysis was obtained from the organisation prior to submission.

Dataset citation: Ikot, S. (2026). HOC Capital Club HNI client enquiry records, January 2025 – March 2026 [Dataset]. Member Experience and Engagement Department, HOC Capital Club, Lagos, Nigeria. Data available on request from the author.

4. Data Description

library(tidyverse)
library(readxl)
library(skimr)
library(lubridate)
library(plotly)
library(rstatix)
library(broom)
library(kableExtra)
library(heatmaply)
library(pROC)

df_raw <- read_excel("HOC_Capital_Club_Client_Dataset.xlsm",
                     sheet = 1)

cat("Rows:", nrow(df_raw), "\n")
Rows: 100 
cat("Columns:", ncol(df_raw), "\n")
Columns: 16 
glimpse(df_raw)
Rows: 100
Columns: 16
$ ClientID                     <chr> "Client001", "Client002", "Client003", "C…
$ `Enquiry Date`               <dttm> 2026-01-02, 2025-07-14, 2025-04-27, 2025…
$ Gender                       <chr> "Female", "Female", "Male", "Female", "Ma…
$ Age                          <dbl> 47, 55, 57, 38, 36, 61, 58, 50, 34, 44, 6…
$ Dependents                   <dbl> 1, 0, 4, 4, 2, 4, 4, 4, 4, 2, 0, 1, 5, 0,…
$ Sector                       <chr> "Real Estate", "Real Estate", "Manufactur…
$ `Source of Wealth`           <chr> "Business ownership", "Employment income"…
$ `Country of Second Passport` <chr> "St Lucia", "Dominica", "St Kitts and Nev…
$ `Estimated Net Worth USD`    <dbl> 6416449, 8140298, 4875671, 2493724, 35351…
$ `Net Worth Band`             <chr> "$5m-$10m", "$5m-$10m", "$1m-$5m", "$1m-$…
$ `Investment Budget USD`      <dbl> 545819, 866384, 1077517, 483607, 316431, …
$ `Lead Source`                <chr> "Event/Seminar", "Website enquiry", "Even…
$ `Number of Interactions`     <dbl> 2, 2, 6, 6, 5, 5, 6, 2, 4, 6, 1, 6, 1, 7,…
$ `Risk Readiness Score`       <dbl> 50, 64, 52, 51, 46, 65, 79, 53, 65, 68, 5…
$ `Client Segment`             <chr> "Core HNI", "Core HNI", "Emerging HNI", "…
$ `Conversion Status`          <chr> "In progress", "Converted", "In progress"…

5. Exploratory Data Analysis

Technique 1 — Exploratory Data Analysis (Adi, 2026, Ch. 9 — markanalytics.online)

Theory: EDA is the process of summarising, visualising, and understanding the structure of a dataset before formal modelling. It involves identifying missing values, outliers, distributional patterns, and data quality issues that could bias results if left unaddressed (Adi, 2026, Ch. 9).

Business justification: Before drawing any conclusions about conversion drivers, I must first understand the quality of the CRM data, identify any inconsistencies introduced during data entry by different relationship managers, and establish baseline conversion rates across all key pipeline dimensions.

Technique justification: EDA is the appropriate first technique because the dataset is a CRM export with potential entry inconsistencies across multiple relationship managers. Without profiling the data first, any subsequent tests or models could be built on flawed foundations.

missing_tbl <- df_raw |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing") |>
  filter(Missing > 0) |>
  arrange(desc(Missing))

if (nrow(missing_tbl) == 0) {
  cat("Issue 1: No missing values detected — complete CRM data entry confirmed.\n")
} else {
  missing_tbl |>
    kbl(caption = "Issue 1 — Missing Values by Variable") |>
    kable_styling(
      bootstrap_options = c("striped","hover","condensed"),
      full_width = FALSE) |>
    print()
}
Issue 1: No missing values detected — complete CRM data entry confirmed.
cat("\nIssue 2 — Country of Second Passport variants:\n")

Issue 2 — Country of Second Passport variants:
print(table(df_raw$`Country of Second Passport`,
            useNA = "always"))

 Antigua & Barbuda             Canada             Cyprus           Dominica 
                10                  2                  3                  9 
           Grenada   St Kitts & Nevis St Kitts and Nevis           St Lucia 
                13                 14                 19                 16 
               UAE     United Kingdom               <NA> 
                 8                  6                  0 
cat("\nIssue 3 — Sample Net Worth values (raw):\n")

Issue 3 — Sample Net Worth values (raw):
print(head(df_raw$`Estimated Net Worth USD`, 8))
[1]  6416449  8140298  4875671  2493724  3535164  6899547 11981723  7767736
df <- df_raw |>
  rename(
    client_id         = ClientID,
    enquiry_date      = `Enquiry Date`,
    gender            = Gender,
    age               = Age,
    dependents        = Dependents,
    sector            = Sector,
    source_of_wealth  = `Source of Wealth`,
    country_passport  = `Country of Second Passport`,
    net_worth_raw     = `Estimated Net Worth USD`,
    net_worth_band    = `Net Worth Band`,
    invest_budget_raw = `Investment Budget USD`,
    lead_source       = `Lead Source`,
    num_interactions  = `Number of Interactions`,
    risk_score        = `Risk Readiness Score`,
    client_segment    = `Client Segment`,
    conversion_status = `Conversion Status`
  ) |>
  mutate(
    country_passport = str_replace_all(
      country_passport,
      "St Kitts & Nevis", "St Kitts and Nevis")
  ) |>
  mutate(
    net_worth     = net_worth_raw |>
      str_remove_all("[$,]") |> as.numeric(),
    invest_budget = invest_budget_raw |>
      str_remove_all("[$,]") |> as.numeric()
  ) |>
  mutate(
    converted_bin = case_when(
      conversion_status == "Converted"     ~ 1L,
      conversion_status == "Not converted" ~ 0L,
      TRUE                                 ~ NA_integer_
    ),
    converted_factor = case_when(
      conversion_status == "Converted"     ~ "Converted",
      conversion_status == "Not converted" ~ "Not Converted",
      TRUE                                 ~ NA_character_
    )
  ) |>
  mutate(
    enquiry_date  = as.Date(enquiry_date),
    enquiry_month = floor_date(enquiry_date, "month"),
    enquiry_year  = year(enquiry_date)
  ) |>
  mutate(
    age              = as.numeric(age),
    dependents       = as.numeric(dependents),
    num_interactions = as.numeric(num_interactions),
    risk_score       = as.numeric(risk_score)
  )

cat("Cleaned rows:", nrow(df), "\n")
Cleaned rows: 100 
cat("Rows removed:", nrow(df_raw) - nrow(df), "\n\n")
Rows removed: 0 
cat("Conversion breakdown:\n")
Conversion breakdown:
print(table(df$conversion_status, useNA = "always"))

    Converted   In progress Not converted          <NA> 
           38            37            25             0 
cat("\nCountry (after cleaning):\n")

Country (after cleaning):
print(table(df$country_passport, useNA = "always"))

 Antigua & Barbuda             Canada             Cyprus           Dominica 
                10                  2                  3                  9 
           Grenada St Kitts and Nevis           St Lucia                UAE 
                13                 33                 16                  8 
    United Kingdom               <NA> 
                 6                  0 
df |>
  select(net_worth, invest_budget,
         num_interactions, risk_score, age) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    Q1       = round(quantile(Value, 0.25, na.rm = TRUE), 0),
    Median   = round(median(Value, na.rm = TRUE), 0),
    Q3       = round(quantile(Value, 0.75, na.rm = TRUE), 0),
    Max      = round(max(Value, na.rm = TRUE), 0),
    Outliers = sum(
      Value < quantile(Value, 0.25, na.rm = TRUE) -
                1.5 * IQR(Value, na.rm = TRUE) |
      Value > quantile(Value, 0.75, na.rm = TRUE) +
                1.5 * IQR(Value, na.rm = TRUE),
      na.rm = TRUE)
  ) |>
  kbl(caption = "Issue 4 — Outlier Detection (IQR Method)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Issue 4 — Outlier Detection (IQR Method)
Variable          Q1       Median   Q3        Max       Outliers
age               40       50       58        67        0
invest_budget     564679   1042776  1672782   11879792  6
net_worth         4012996  6437498  10670429  52000000  5
num_interactions  3        4        6         8         0
risk_score        51       56       64        100       2
df |>
  select(net_worth, invest_budget,
         num_interactions, risk_score, age) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    Mean     = round(mean(Value, na.rm = TRUE), 1),
    SD       = round(sd(Value, na.rm = TRUE), 1),
    Skewness = round(
      (3 * (mean(Value, na.rm = TRUE) -
            median(Value, na.rm = TRUE))) /
        sd(Value, na.rm = TRUE), 3)
  ) |>
  kbl(caption = "Issue 5 — Skewness (|skew| > 1 = highly skewed)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Issue 5 — Skewness (|skew| > 1 = highly skewed)
Variable          Mean       SD         Skewness
age               50.0       10.2       -0.143
invest_budget     1465047.6  1561710.4   0.811
net_worth         8548643.2  7415623.7   0.854
num_interactions  4.2        1.9         0.364
risk_score        58.6       10.9        0.710
skim(df |> select(age, dependents, net_worth, invest_budget,
                  num_interactions, risk_score, gender,
                  sector, lead_source, client_segment,
                  conversion_status))
Data summary
Name select(…)
Number of rows 100
Number of columns 11
_______________________
Column type frequency:
character 5
numeric 6
________________________
Group variables None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
gender             0          1              4    17   0      3         0
sector             0          1              9    21   0      10        0
lead_source        0          1              8    15   0      6         0
client_segment     0          1              8    12   0      3         0
conversion_status  0          1              9    13   0      3         0

Variable type: numeric

skim_variable     n_missing  complete_rate  mean        sd          p0       p25        p50        p75          p100      hist
age               0          1              50.01       10.25       34       40.0       50.5       58.25        67        ▇▅▃▆▆
dependents        0          1              2.55        1.68        0        1.0        2.5        4.00         5         ▇▅▃▆▃
net_worth         0          1              8548643.18  7415623.74  1388043  4012995.8  6437498.0  10670429.00  52000000  ▇▂▁▁▁
invest_budget     0          1              1465047.64  1561710.36  185263   564678.8   1042776.5  1672782.00   11879792  ▇▁▁▁▁
num_interactions  0          1              4.23        1.90        1        3.0        4.0        6.00         8         ▅▅▇▃▃
risk_score        0          1              58.57       10.87       39       51.0       56.0       64.25        100       ▅▇▃▁▁
df |>
  count(conversion_status) |>
  mutate(pct = round(n / sum(n) * 100, 1)) |>
  kbl(caption = "Overall Conversion Status",
      col.names = c("Status", "Count", "Percentage (%)")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)
Overall Conversion Status
Status         Count  Percentage (%)
Converted      38     38
In progress    37     37
Not converted  25     25
df |>
  group_by(client_segment) |>
  summarise(
    Total     = n(),
    Converted = sum(conversion_status == "Converted"),
    `Rate (%)` = round(
      sum(conversion_status == "Converted") / n() * 100, 1)
  ) |>
  arrange(desc(`Rate (%)`)) |>
  kbl(caption = "Conversion Rate by Client Segment") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Conversion Rate by Client Segment
client_segment  Total  Converted  Rate (%)
Ultra HNI       5      2          40.0
Core HNI        59     23         39.0
Emerging HNI    36     13         36.1
df |>
  group_by(lead_source) |>
  summarise(
    Total     = n(),
    Converted = sum(conversion_status == "Converted"),
    `Rate (%)` = round(
      sum(conversion_status == "Converted") / n() * 100, 1)
  ) |>
  arrange(desc(`Rate (%)`)) |>
  kbl(caption = "Conversion Rate by Lead Source") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Conversion Rate by Lead Source
lead_source      Total  Converted  Rate (%)
Website enquiry  8      5          62.5
LinkedIn         6      3          50.0
Event/Seminar    23     11         47.8
Existing member  10     4          40.0
Referral         33     10         30.3
Private banker   20     5          25.0
df |>
  group_by(sector) |>
  summarise(
    Total     = n(),
    Converted = sum(conversion_status == "Converted"),
    `Rate (%)` = round(
      sum(conversion_status == "Converted") / n() * 100, 1)
  ) |>
  arrange(desc(`Rate (%)`)) |>
  kbl(caption = "Conversion Rate by Sector") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Conversion Rate by Sector
sector                 Total  Converted  Rate (%)
Trade & Commerce       10     6          60.0
Real Estate            20     9          45.0
Oil & Gas              23     10         43.5
Financial Services     7      3          42.9
Manufacturing          11     4          36.4
Technology             13     4          30.8
Professional Services  4      1          25.0
Entertainment & Media  6      1          16.7
Agriculture            2      0          0.0
Healthcare             4      0          0.0

EDA interpretation: Five data quality checks were run before analysis. Issue 1: no missing values were detected in any variable, confirming complete CRM data entry for this export period. Issue 2: two inconsistent country name variants (St Kitts & Nevis, 14 records; St Kitts and Nevis, 19 records) were standardised to a single value covering 33 records, to prevent duplicate categories in analysis. Issue 3: the Net Worth and Investment Budget columns were passed through a step that strips currency symbols and commas; both columns were already numeric on import (see the glimpse output in Section 4), so this step left the values unchanged. Issue 4: outlier detection using the 1.5 × IQR rule flagged 5 values in net_worth, 6 in invest_budget, and 2 in risk_score. All were retained; the net_worth outliers represent genuine Ultra HNI client profiles, not data entry errors. Issue 5: the median-based skewness coefficient, 3 × (mean − median) / SD, is 0.854 for net_worth and 0.811 for invest_budget. Both are right-skewed, although below the |skew| > 1 threshold for high skew, which supports the use of Spearman rather than Pearson correlation in Section 8.
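The skewness figure used in Issue 5 is Pearson's second (median-based) coefficient. As a cross-check, the sketch below compares it with the more common moment-based skewness on simulated right-skewed data. The helper names `pearson_median_skew()` and `moment_skew()` and the log-normal sample are illustrative assumptions, not part of the original analysis.

```r
# Two skewness measures compared on simulated right-skewed data (illustrative only).
pearson_median_skew <- function(x) {
  x <- x[!is.na(x)]
  3 * (mean(x) - median(x)) / sd(x)
}

moment_skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  # third central moment divided by the cubed (population) standard deviation
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
x <- rlnorm(100, meanlog = 15.7, sdlog = 0.7)  # hypothetical net-worth-like values

cat("Pearson median skewness:", round(pearson_median_skew(x), 3), "\n")
cat("Moment-based skewness:  ", round(moment_skew(x), 3), "\n")
```

The median-based coefficient is bounded between -3 and 3 and is often smaller in magnitude than the moment-based measure for heavy-tailed data, so a |skew| > 1 threshold is not directly comparable between the two.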

The cleaned dataset has 100 records with an overall conversion rate of 38%. Descriptive tables show that conversion rates are similar across client segments (36.1% to 40.0%) but vary more across lead sources (25.0% to 62.5%) and sectors (0% to 60.0%), although several groups contain fewer than ten records; the lead-source pattern is formally tested in Section 7. The distribution of the key outcome variable (Conversion Status) shows that HOC Capital Club has so far converted fewer than four in ten enquiries, with 25% not converted and 37% still in progress, suggesting pipeline leakage that this analysis seeks to explain.
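Because several groups are small, a confidence interval around each conversion rate helps show how much of the observed difference could reflect sampling noise. The sketch below takes the segment counts from the table above and attaches 95% Wilson score intervals using base R's `prop.test()` with `correct = FALSE`. The data frame name `segments` is an assumption for illustration.

```r
# 95% Wilson score intervals for conversion rate by client segment.
# Counts are copied from the "Conversion Rate by Client Segment" table.
segments <- data.frame(
  segment   = c("Ultra HNI", "Core HNI", "Emerging HNI"),
  total     = c(5, 59, 36),
  converted = c(2, 23, 13)
)

# prop.test() without continuity correction returns the Wilson score interval;
# warnings about small counts are suppressed because only the interval is used
wilson <- t(mapply(function(x, n) {
  ci <- suppressWarnings(prop.test(x, n, correct = FALSE))$conf.int
  c(lower = ci[1], upper = ci[2])
}, segments$converted, segments$total))

segments$rate_pct  <- round(100 * segments$converted / segments$total, 1)
segments$lower_pct <- round(100 * wilson[, "lower"], 1)
segments$upper_pct <- round(100 * wilson[, "upper"], 1)

print(segments)
```

With only five Ultra HNI enquiries, the interval for that segment is very wide and overlaps the other two segments almost entirely, so the ranking of segments should be read with caution.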

6. Visualisation

Technique 2 — Data Visualisation (Adi, 2026, Ch. 10 — markanalytics.online)

Theory: Effective data visualisation translates complex patterns into clear, communicable insights using the grammar of graphics — matching chart type to data structure and audience (Adi, 2026, Ch. 10). Interactive charts allow stakeholders to explore the data directly rather than relying solely on static summaries.

Business justification: Pipeline reporting at HOC Capital Club requires communicating conversion patterns to the executive committee in a format that supports quick decision-making. The five plots below tell one cohesive story: who converts, from where, and what engagement patterns predict success.

Technique justification: Bar charts were selected for counts and conversion rates because the categories are discrete and unordered. A violin-box combination was chosen for Risk Readiness Score because it simultaneously shows the full distribution shape and the median. A scatter plot was used for interactions vs net worth because it reveals the joint distribution of two continuous variables coloured by a third categorical dimension (Adi, 2026, Ch. 10).

theme_hoc <- function() {
  theme_minimal(base_size = 13) +
    theme(
      plot.title       = element_text(face = "bold",
                                      color = "#0d1b2a",
                                      size = 14),
      plot.subtitle    = element_text(color = "#555555",
                                      size = 11),
      plot.caption     = element_text(color = "#888888",
                                      size = 9),
      axis.title       = element_text(color = "#444444",
                                      size = 11),
      axis.text        = element_text(color = "#444444"),
      panel.grid.major = element_line(color = "#f3ede0",
                                      linewidth = 0.5),
      panel.grid.minor = element_blank(),
      plot.background  = element_rect(fill = "white",
                                      color = NA),
      panel.background = element_rect(fill = "white",
                                      color = NA),
      legend.position  = "none",
      plot.margin      = margin(16, 16, 12, 16)
    )
}

pal_conv <- c("Converted"     = "#1a7a6e",
              "In progress"   = "#c8a951",
              "Not converted" = "#8b1a1a")

p1 <- df |>
  count(conversion_status) |>
  mutate(pct   = round(n / sum(n) * 100, 1),
         label = paste0(n, "\n(", pct, "%)")) |>
  ggplot(aes(x = reorder(conversion_status, n),
             y = n, fill = conversion_status)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.3,
            size = 3.8, fontface = "bold",
            color = "#0d1b2a") +
  scale_fill_manual(values = pal_conv) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title    = "100 Enquiries — Only 38 Converted",
       subtitle = "25% not converted; 37% still in progress",
       x = NULL, y = "Number of Clients",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p1, tooltip = c("x","y")) |>
  layout(hoverlabel = list(bgcolor = "white"))
p2 <- df |>
  group_by(client_segment) |>
  summarise(
    Rate  = round(
      sum(conversion_status == "Converted") / n() * 100, 1),
    Total = n()
  ) |>
  ggplot(aes(x = reorder(client_segment, Rate),
             y = Rate, fill = Rate,
             text = paste0(client_segment,
                           "<br>Rate: ", Rate,
                           "%<br>n = ", Total))) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(Rate, "%")),
            hjust = -0.2, size = 3.8,
            fontface = "bold", color = "#0d1b2a") +
  scale_fill_gradient(low = "#8b1a1a", high = "#1a7a6e") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title    = "Conversion Rate by Client Segment",
       subtitle = "Ultra HNI clients convert at the highest rate",
       x = NULL, y = "Conversion Rate (%)",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p2, tooltip = "text") |>
  layout(hoverlabel = list(bgcolor = "white"))
p3 <- df |>
  group_by(lead_source) |>
  summarise(
    Rate  = round(
      sum(conversion_status == "Converted") / n() * 100, 1),
    Total = n()
  ) |>
  ggplot(aes(x = reorder(lead_source, Rate),
             y = Rate, fill = Rate,
             text = paste0(lead_source,
                           "<br>Rate: ", Rate,
                           "%<br>n = ", Total))) +
  geom_col(width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = paste0(Rate, "%")),
            hjust = -0.2, size = 3.8,
            fontface = "bold", color = "#0d1b2a") +
  scale_fill_gradient(low = "#8b1a1a", high = "#1a7a6e") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title    = "Which Lead Sources Convert Best?",
       subtitle = "Website enquiries and LinkedIn lead; referrals convert at 30.3%",
       x = NULL, y = "Conversion Rate (%)",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p3, tooltip = "text") |>
  layout(hoverlabel = list(bgcolor = "white"))
p4 <- df |>
  filter(!is.na(risk_score),
         conversion_status != "In progress") |>
  ggplot(aes(x = conversion_status, y = risk_score,
             fill = conversion_status)) +
  geom_violin(alpha = 0.25, width = 0.7) +
  geom_boxplot(width = 0.18, outlier.shape = 21,
               outlier.size = 1.5,
               outlier.alpha = 0.35) +
  scale_fill_manual(
    values = c("Converted"     = "#1a7a6e",
               "Not converted" = "#8b1a1a")) +
  labs(title    = "Converted Clients Have Higher Risk Readiness",
       subtitle = "Risk Readiness Score differs visibly between outcomes",
       x = NULL, y = "Risk Readiness Score",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p4, tooltip = "y") |>
  layout(hoverlabel = list(bgcolor = "white"))
p5 <- df |>
  filter(!is.na(num_interactions), !is.na(net_worth),
         conversion_status != "In progress") |>
  ggplot(aes(x = num_interactions,
             y = net_worth / 1e6,
             color = conversion_status,
             text  = paste0(
               "Status: ", conversion_status,
               "<br>Interactions: ", num_interactions,
               "<br>Net Worth: $",
               round(net_worth/1e6, 1), "M"))) +
  geom_point(alpha = 0.7, size = 2.8) +
  scale_color_manual(
    values = c("Converted"     = "#1a7a6e",
               "Not converted" = "#8b1a1a")) +
  scale_y_continuous(
    labels = scales::label_comma(suffix = "M",
                                 prefix = "$")) +
  labs(title    = "More Interactions Correlate With Conversion",
       subtitle = "Converted clients tend to have more engagement touchpoints",
       x = "Number of Interactions",
       y = "Net Worth (USD Millions)",
       color   = "Conversion Status",
       caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc() +
  theme(legend.position = "right")
ggplotly(p5, tooltip = "text") |>
  layout(hoverlabel = list(bgcolor = "white"))

Visualisation interpretation: The five plots together tell one story: conversion at HOC Capital Club is not random, and it is most clearly associated with risk readiness and engagement intensity, while segment and lead-source differences are smaller or rest on small groups. Plot 1 establishes the scale: only 38 of 100 enquiries converted, with 25 lost entirely and 37 still in progress, representing significant pipeline leakage. Plot 2 shows that Ultra HNI clients have the highest conversion rate (40.0%), but the gap to Core HNI (39.0%) and Emerging HNI (36.1%) is small and the Ultra HNI figure rests on only five enquiries, so segment alone is a weak gating criterion. Plot 3 shows that website enquiries (62.5%), LinkedIn (50.0%), and events (47.8%) have the highest observed conversion rates, while referrals (30.3%) and private bankers (25.0%) have the lowest; the top two channels have fewer than ten enquiries each, so this ranking should be treated as provisional. Plot 4 shows that converted clients have higher Risk Readiness Scores (median 64 versus 55), which makes this variable a promising qualifying filter. Plot 5 suggests that converted clients cluster at higher interaction counts regardless of net worth, indicating that engagement intensity, not wealth alone, is associated with conversion.

Chart selection rationale: bar charts for categorical comparisons (Plots 1–3) because they clearly encode magnitude for unordered categories; violin-box for the distributional comparison (Plot 4) because it shows both shape and median simultaneously; scatter plot (Plot 5) to reveal the joint relationship between two continuous variables across a third categorical dimension (Adi, 2026, Ch. 10).

7. Hypothesis Testing

Technique 3 — Hypothesis Testing (Adi, 2026, Ch. 11 — markanalytics.online)

Theory: Hypothesis testing determines whether observed differences in sample data reflect true population differences or are attributable to chance. We state H₀ and H₁, select a test based on data type and distributional assumptions, and report p-value and effect size (Adi, 2026, Ch. 11).

Business justification: HOC Capital Club’s executive team needs statistically rigorous evidence — not descriptive patterns alone — to justify reallocating acquisition budgets toward higher-converting channels and prioritising high-risk-readiness prospects.

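Technique justification: Mann-Whitney U was chosen for H1 because Risk Readiness Score is a bounded score with high-end outliers (Section 5), and a rank-based test makes no normality assumption. Note that the Shapiro-Wilk test reported below (p = 0.121) does not reject normality at α = 0.05, so a t-test would also have been defensible; the non-parametric test is the more conservative choice. Chi-squared was chosen for H2 because both variables, lead source and conversion status, are categorical, and no assumption of normality applies.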

H1 — H₀: Median Risk Readiness Score is the same for Converted and Not Converted clients

H₁: Converted clients have a higher median Risk Readiness Score

Test: Mann-Whitney U (non-parametric; run two-sided, the wilcox.test default; Shapiro-Wilk p = 0.121 does not reject normality, so the test is used as a conservative choice)


H2 — H₀: Conversion rate is the same across all lead sources

H₁: Conversion rate differs significantly across lead sources

Test: Chi-squared (two categorical variables)

df_conv <- df |>
  filter(conversion_status %in%
           c("Converted", "Not converted"))

shapiro_res <- shapiro.test(df_conv$risk_score)
cat("Shapiro-Wilk p-value:", round(shapiro_res$p.value, 4),
    "— non-normal if p < 0.05\n\n")
Shapiro-Wilk p-value: 0.1213 — non-normal if p < 0.05
h1_test <- wilcox.test(risk_score ~ converted_factor,
                        data = df_conv)
cat("H1 Mann-Whitney statistic:",
    round(h1_test$statistic, 1))
H1 Mann-Whitney statistic: 645.5
cat("\nH1 p-value:", round(h1_test$p.value, 6), "\n\n")

H1 p-value: 0.016802 
df_conv |>
  group_by(converted_factor) |>
  summarise(
    n           = n(),
    Median_Risk = median(risk_score, na.rm = TRUE),
    Mean_Risk   = round(mean(risk_score, na.rm = TRUE), 1)
  ) |>
  kbl(caption = "H1: Risk Readiness Score by Outcome") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
H1: Risk Readiness Score by Outcome
converted_factor n Median_Risk Mean_Risk
Converted 38 64 64.7
Not Converted 25 55 57.4
df_conv |>
  wilcox_effsize(risk_score ~ converted_factor) |>
  kbl(caption = "H1 Effect Size (Wilcoxon r = Z / sqrt(N))") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
H1 Effect Size (Wilcoxon r = Z / sqrt(N))
.y. group1 group2 effsize n1 n2 magnitude
risk_score Converted Not Converted 0.3021264 38 25 moderate
h2_table <- table(df_conv$lead_source,
                  df_conv$converted_factor)
h2_test  <- chisq.test(h2_table)
cat("\nH2 Chi-squared statistic:",
    round(h2_test$statistic, 3))

H2 Chi-squared statistic: 3.789
cat("\nH2 p-value:", round(h2_test$p.value, 6))

H2 p-value: 0.580215
cat("\nH2 Cramer's V:",
    round(cramer_v(h2_table), 3), "\n\n")

H2 Cramer's V: 0.245 
as.data.frame(h2_table) |>
  pivot_wider(names_from = Var2, values_from = Freq) |>
  rename(Lead_Source = Var1) |>
  kbl(caption = "H2: Lead Source vs Conversion Counts") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
H2: Lead Source vs Conversion Counts
Lead_Source        Converted  Not Converted
Event/Seminar             11              4
Existing member            4              2
LinkedIn                   3              1
Private banker             5              6
Referral                  10             10
Website enquiry            5              2

H1 interpretation: The Shapiro-Wilk test does not reject normality for Risk Readiness Score (p = 0.121), so the Mann-Whitney U test is used here as a conservative, assumption-light choice rather than because normality failed. The null hypothesis is rejected: converted clients have a statistically significantly higher Risk Readiness Score than non-converted clients (median 64 vs 55; W = 645.5, two-sided p = 0.017). The test was run two-sided, and because the observed difference lies in the hypothesised direction, the result also supports the directional H₁. The effect size (r = 0.30, moderate) indicates a meaningful practical difference, not merely a statistical one. Business implication: Risk Readiness Score should be used as a formal qualifying criterion during the initial prospect conversation. Relationship managers should prioritise prospects scoring above the median and apply a lighter-touch follow-up protocol to those below it, rather than treating all enquiries equally.
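
As a rough cross-check of the H1 effect size, an r-type effect size can be approximated from the two-sided p-value and the total sample size, since r = Z / sqrt(N). The sketch below uses only figures printed in this section (p = 0.016802, n = 38 + 25). Recovering Z from the p-value is an approximation and is not necessarily how rstatix computes it internally, so a small difference from the tabulated value is expected.

```r
# Figures copied from the H1 output in this section
p_two_sided <- 0.016802
n_total     <- 38 + 25

# Recover |Z| from the two-sided p-value (normal approximation)
z_abs <- qnorm(1 - p_two_sided / 2)

# r-type effect size: |Z| divided by the square root of the total sample size
r_approx <- z_abs / sqrt(n_total)

# The observed difference is in the hypothesised direction (median 64 vs 55),
# so the one-sided p-value for the directional H1 is half the two-sided value
p_one_sided <- p_two_sided / 2

cat("Approximate |Z|:", round(z_abs, 2), "\n")
cat("Approximate r:  ", round(r_approx, 2), "\n")
cat("One-sided p:    ", round(p_one_sided, 4), "\n")
```

The approximate r lands close to 0.30, in line with the "moderate" label in the effect size table.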

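H2 interpretation: The null hypothesis is not rejected: the data do not show a statistically significant difference in conversion rate across lead sources (Chi-squared = 3.789, p = 0.580). Cramér's V of 0.245 suggests a small-to-moderate association in this sample, but with only 63 resolved enquiries spread over six sources it is estimated imprecisely, and several cells have expected counts below 5, which weakens the chi-squared approximation. Among resolved enquiries, the contingency table shows Event/Seminar (11 of 15), LinkedIn (3 of 4), Website enquiry (5 of 7), and Existing member (4 of 6) converting at higher rates than Referral (10 of 20) and Private banker (5 of 11), although most of these counts are too small to rank the channels reliably. Business implication: the current evidence does not justify reallocating acquisition budget between channels. HOC Capital Club should keep recording lead source, revisit the comparison once the 37 in-progress enquiries resolve, and base any channel reallocation on a larger sample.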
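
Because several lead sources have only a handful of resolved enquiries, the chi-squared approximation may be unreliable for H2. The sketch below re-enters the counts from the contingency table above, computes the conversion rate per source, counts the cells whose expected frequency is below 5, and runs Fisher's exact test with a simulated p-value as a sensitivity check. The object names (`h2_counts`, `conv_rate`) and the seed are illustrative choices, not part of the original analysis.

```r
# Counts copied from the "Lead Source vs Conversion Counts" table
h2_counts <- matrix(
  c(11,  4,
     4,  2,
     3,  1,
     5,  6,
    10, 10,
     5,  2),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Lead_Source = c("Event/Seminar", "Existing member", "LinkedIn",
                    "Private banker", "Referral", "Website enquiry"),
    Outcome     = c("Converted", "Not Converted")
  )
)

# Conversion rate per lead source among resolved enquiries
conv_rate <- round(h2_counts[, "Converted"] / rowSums(h2_counts), 2)
print(conv_rate)

# Chi-squared test; suppressWarnings() hides the small-expected-count warning
chi_res <- suppressWarnings(chisq.test(h2_counts))
n_small <- sum(chi_res$expected < 5)
cat("Cells with expected count below 5:", n_small,
    "of", length(chi_res$expected), "\n")

# Fisher's exact test with a Monte Carlo p-value as a sensitivity check
set.seed(2026)
fisher_res <- fisher.test(h2_counts, simulate.p.value = TRUE, B = 10000)
cat("Chi-squared p-value:", round(chi_res$p.value, 3), "\n")
cat("Fisher p-value (simulated):", round(fisher_res$p.value, 3), "\n")
```

If both p-values stay well above 0.05, the conclusion does not depend on the chi-squared approximation.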

8. Correlation Analysis

Technique 4 — Correlation Analysis (Adi, 2026, Ch. 13 — markanalytics.online)

Theory: Spearman correlation measures the strength and direction of monotonic relationships between variables. Coefficients range from −1 to +1; values near 0 indicate no relationship. Correlation does not imply causation (Adi, 2026, Ch. 13).

Business justification: Understanding which client characteristics co-vary with conversion probability allows my team to identify the profile of a high-probability prospect and direct relationship manager time toward those most likely to convert.

Technique justification: Spearman is chosen over Pearson because skewness analysis in Section 5 confirmed that net_worth and invest_budget are right-skewed, violating the normality assumption required for Pearson correlation. All numeric variables are included to rank predictors before building the regression model.
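
To make the choice of Spearman concrete: it is the Pearson correlation computed on ranks, which is why it is unaffected by right skew or by any monotonic transformation such as a log. The sketch below demonstrates this on simulated data; the values are invented for illustration and are not the HOC Capital Club records.

```r
# Illustrative data only: simulated, NOT the HOC Capital Club records
set.seed(1)
net_worth     <- rlnorm(50, meanlog = 13, sdlog = 1)            # right-skewed
invest_budget <- net_worth * runif(50, min = 0.05, max = 0.30)  # noisy share of wealth

rho_spearman <- cor(net_worth, invest_budget, method = "spearman")
rho_by_ranks <- cor(rank(net_worth), rank(invest_budget))  # Pearson on ranks
r_pearson    <- cor(net_worth, invest_budget)              # sensitive to skew

# Spearman is unchanged by a monotonic transformation such as log()
rho_logged <- cor(log(net_worth), log(invest_budget), method = "spearman")

cat("Spearman:", round(rho_spearman, 3),
    "| Pearson on ranks:", round(rho_by_ranks, 3),
    "| Spearman after log:", round(rho_logged, 3),
    "| Pearson:", round(r_pearson, 3), "\n")
```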

df_corr <- df |>
  filter(conversion_status %in%
           c("Converted", "Not converted")) |>
  select(converted_bin, age, net_worth, invest_budget,
         num_interactions, risk_score, dependents) |>
  drop_na()

cat("Rows in correlation dataset:", nrow(df_corr), "\n\n")
Rows in correlation dataset: 63 
cor_matrix <- cor(df_corr, method = "spearman")

round(cor_matrix, 3) |>
  kbl(caption = "Spearman Correlation Matrix") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Spearman Correlation Matrix
                  converted_bin     age  net_worth  invest_budget  num_interactions  risk_score  dependents
converted_bin             1.000   0.196     -0.059         -0.102             0.384       0.305      -0.079
age                       0.196   1.000      0.085          0.144             0.144       0.227      -0.019
net_worth                -0.059   0.085      1.000          0.878            -0.161       0.411       0.095
invest_budget            -0.102   0.144      0.878          1.000            -0.139       0.409       0.076
num_interactions          0.384   0.144     -0.161         -0.139             1.000       0.172       0.006
risk_score                0.305   0.227      0.411          0.409             0.172       1.000      -0.030
dependents               -0.079  -0.019      0.095          0.076             0.006      -0.030       1.000
heatmaply_cor(
  cor_matrix,
  main = "Spearman Correlation — HNI Conversion Drivers"
)

Correlation interpretation: the two strongest correlations with converted_bin, and the strongest correlation overall:

(1) num_interactions ↔︎ converted_bin (ρ = 0.384): the strongest association with conversion among the numeric variables. Clients with more engagement touchpoints are more likely to convert. The direction of causation is unclear: more contact may help convert a prospect, but prospects who are already likely to convert may also attract more contact. Business implication: Set a minimum interaction threshold of five contacts before classifying a prospect as low-probability.

(2) risk_score ↔︎ converted_bin (ρ = 0.305): the second strongest. Higher Risk Readiness Score is positively associated with conversion probability, consistent with the hypothesis test result in Section 7. Business implication: Use risk score as a screening filter at initial enquiry to focus relationship manager time on higher-readiness prospects.

(3) net_worth ↔︎ invest_budget (ρ = 0.878): the strongest correlation in the matrix, between the two financial variables, as expected. It confirms data consistency but does not involve conversion. Both financial variables are only weakly and negatively correlated with converted_bin (ρ = -0.059 and -0.102), so wealth alone is not associated with conversion in this sample. The next strongest correlation with converted_bin is age (ρ = 0.196). Correlation does not imply causation (Adi, 2026, Ch. 13).
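
The matrix reports coefficients but not whether they are distinguishable from zero at n = 63. A minimal sketch, using the coefficients printed in the matrix above and the large-sample t approximation t = ρ · sqrt((n − 2) / (1 − ρ²)), gives indicative p-values. The approximation is rough when a variable has many ties, as the binary converted_bin does, so the results are a guide rather than exact tests.

```r
# Spearman coefficients with converted_bin, copied from the matrix above
rho <- c(num_interactions = 0.384, risk_score = 0.305, age = 0.196,
         dependents = -0.079, net_worth = -0.059, invest_budget = -0.102)
n <- 63  # rows in the correlation dataset

# Large-sample t approximation for a correlation coefficient
t_stat  <- rho * sqrt((n - 2) / (1 - rho^2))
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(round(data.frame(rho, t_stat, p_value), 3))
```

Under this approximation only num_interactions and risk_score fall below the 0.05 level, which matches the two variables singled out in the interpretation.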

9. Logistic Regression

Technique 5 — Logistic Regression (Adi, 2026, Ch. 18 — markanalytics.online)

Theory: Logistic regression models the probability of a binary outcome via odds ratios — the multiplicative change in outcome odds for a one-unit predictor increase (Adi, 2026, Ch. 18). AUC-ROC assesses model performance (1.0 = perfect, 0.5 = chance). Diagnostic plots check residual patterns and influential observations.

Business justification: A logistic regression model produces an individual conversion probability score per prospect, enabling my team to rank the pipeline objectively and allocate relationship manager capacity to the highest-probability clients first.

Technique justification: Logistic regression is chosen because the outcome variable is binary (Converted vs Not Converted). It is preferred over more complex models at this sample size because coefficient interpretability is essential for translating findings into operational pipeline management decisions.

df_model <- df |>
  filter(conversion_status %in%
           c("Converted", "Not converted")) |>
  mutate(
    converted_bin  = as.factor(converted_bin),
    client_segment = factor(
      client_segment,
      levels = c("Emerging HNI", "Core HNI", "Ultra HNI")),
    lead_source    = factor(lead_source)
  ) |>
  drop_na(converted_bin, risk_score, num_interactions,
          net_worth, client_segment, lead_source)

model <- glm(
  converted_bin ~ risk_score + num_interactions +
    net_worth + client_segment + lead_source,
  data   = df_model,
  family = binomial
)

tidy(model, exponentiate = TRUE, conf.int = TRUE) |>
  kbl(digits = 3,
      caption = "Logistic Regression — Odds Ratios") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) |>
  row_spec(0, bold = TRUE)
Logistic Regression — Odds Ratios
term                          estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)                      0.001      2.481     -2.936    0.003     0.000      0.063
risk_score                       1.107      0.038      2.677    0.007     1.032      1.201
num_interactions                 1.722      0.197      2.755    0.006     1.202      2.637
net_worth                        1.000      0.000     -0.714    0.475     1.000      1.000
client_segmentCore HNI           1.874      0.905      0.694    0.488     0.323     11.911
client_segmentUltra HNI          0.204      3.074     -0.518    0.605     0.000     60.534
lead_sourceExisting member       1.670      1.292      0.397    0.691     0.135     24.736
lead_sourceLinkedIn              0.883      1.566     -0.079    0.937     0.044     30.053
lead_sourcePrivate banker        0.215      1.109     -1.386    0.166     0.021      1.751
lead_sourceReferral              0.290      1.004     -1.233    0.218     0.034      1.925
lead_sourceWebsite enquiry       0.627      1.245     -0.375    0.708     0.053      8.129
cat("\nAIC:", round(AIC(model), 1))

AIC: 81.6
cat("\nNull deviance:", round(model$null.deviance, 1))

Null deviance: 84.6
cat("\nResidual deviance:", round(model$deviance, 1))

Residual deviance: 59.6
pred_probs <- predict(model, type = "response")
roc_obj    <- roc(df_model$converted_bin, pred_probs,
                  quiet = TRUE)
cat("\nAUC:", round(auc(roc_obj), 3), "\n")

AUC: 0.842 
plot(roc_obj,
     main = paste("ROC Curve — AUC =",
                  round(auc(roc_obj), 3)),
     col  = "#c8a951", lwd = 2.5,
     cex.main = 1, font.main = 1,
     col.main = "#0d1b2a")

par(mfrow = c(1, 2))
plot(model, which = 1, col = "#1a7a6e",
     pch = 16, cex = 0.6,
     main = "Residuals vs Fitted")
plot(model, which = 2, col = "#1a7a6e",
     pch = 16, cex = 0.6,
     main = "Normal Q-Q")

par(mfrow = c(1, 1))

Regression interpretation:

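The model achieved an AUC of 0.842, meaning it ranks a randomly chosen converted client above a randomly chosen non-converted client 84.2% of the time. This is well above the 0.70 threshold for operational deployment and indicates strong discriminative performance (Adi, 2026, Ch. 18). The AUC is computed on the same 63 enquiries used to fit the model, which estimates 11 coefficients, so it is likely optimistic and should be confirmed on held-out or newly resolved enquiries.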
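
The pairwise-ranking reading of AUC can be checked by hand. The sketch below uses five hypothetical predicted probabilities (invented for illustration, not model output) and computes the AUC two ways: as the share of converted/non-converted pairs ranked correctly, and from the Mann-Whitney statistic divided by the number of pairs.

```r
# Hypothetical predicted probabilities, for illustration only
p_converted     <- c(0.9, 0.8, 0.4)
p_not_converted <- c(0.5, 0.3)

# Compare every converted prospect with every non-converted prospect
diffs <- outer(p_converted, p_not_converted, "-")

# AUC = share of pairs ranked correctly, counting ties as half
auc_manual <- mean((diffs > 0) + 0.5 * (diffs == 0))

# Same quantity from the Mann-Whitney statistic: W / (n1 * n2)
w <- wilcox.test(p_converted, p_not_converted)$statistic
auc_from_w <- unname(w) / (length(p_converted) * length(p_not_converted))

cat("AUC by pair counting:", round(auc_manual, 3), "\n")
cat("AUC from Mann-Whitney W:", round(auc_from_w, 3), "\n")
```

Five of the six pairs are ordered correctly here, so both routes give 5/6.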

Key coefficient interpretations:

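  • risk_score (OR = 1.107, 95% CI 1.032 to 1.201, p = 0.007): Each one-point increase in Risk Readiness Score multiplies the odds of conversion by about 1.11, holding the other predictors constant. Action: require relationship managers to record and report risk score at first contact and use it as a formal gating criterion.

  • num_interactions (OR = 1.722, 95% CI 1.202 to 2.637, p = 0.006): Each additional engagement interaction raises conversion odds by about 72%, holding the other predictors constant. Action: set a minimum engagement protocol of five interactions before closing or deprioritising any prospect.

  • client_segment (Core and Ultra HNI vs Emerging HNI): Neither segment coefficient is statistically significant (Core HNI OR = 1.874, p = 0.488; Ultra HNI OR = 0.204, p = 0.605), and the very wide Ultra HNI interval (0.000 to 60.534) indicates too few resolved Ultra HNI enquiries to estimate the effect. After adjusting for risk score and interactions, the model provides no evidence that segment changes conversion odds. Action: treat the fast-tracking of Ultra HNI enquiries to senior relationship managers as a service decision rather than a model-backed one, and revisit it once more Ultra HNI outcomes are available.

  • lead_source (vs Event/Seminar, the reference level): No lead source coefficient is statistically significant (all p > 0.16). The point estimates for Referral (OR = 0.290) and Private banker (OR = 0.215) are below 1, so the model does not support the claim that referrals convert at higher odds than other sources. Action: pilot a member referral incentive programme and measure its conversion rate before committing significant budget.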
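
To translate the two significant odds ratios into more intuitive terms, the sketch below converts them to percentage changes in odds, compounds them over larger changes, and shows the effect on probability for one assumed baseline. The odds ratios are copied from the regression table; the 30% baseline probability is a hypothetical value chosen for illustration, and compounding assumes the effect stays linear on the log-odds scale.

```r
# Odds ratios copied from the regression table above
or <- c(risk_score = 1.107, num_interactions = 1.722)

# Percentage change in odds for a one-unit increase
pct_change <- (or - 1) * 100

# Odds ratios compound multiplicatively over larger changes
or_10_risk_points <- or[["risk_score"]]^10
or_5_interactions <- or[["num_interactions"]]^5

# Effect of one extra interaction on probability, from an assumed 30% baseline
baseline_p    <- 0.30                          # hypothetical, for illustration
baseline_odds <- baseline_p / (1 - baseline_p)
new_odds      <- baseline_odds * or[["num_interactions"]]
new_p         <- new_odds / (1 + new_odds)

cat("Change in odds per unit (%):", round(pct_change, 1), "\n")
cat("OR for +10 risk points:", round(or_10_risk_points, 2), "\n")
cat("OR for +5 interactions:", round(or_5_interactions, 2), "\n")
cat("Probability after one extra interaction:", round(new_p, 3), "\n")
```

The same odds ratio produces different probability changes at different baselines, which is why the figures are best reported as changes in odds.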

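Diagnostic plots: The Residuals vs Fitted plot shows no strong systematic pattern, indicating the model is reasonably well specified. The Normal Q-Q plot is of limited use here, because residuals from a logistic regression with a binary outcome are not expected to be normally distributed. The two plots shown do not measure influence, so influential observations have not been formally checked; Cook's distance or a leverage plot would be needed to support the claim that no outliers require removal.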
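
An influence check for a logistic model can be sketched with Cook's distance from base R. The example below fits a two-predictor model to simulated data (invented for illustration, not the HOC Capital Club records) and flags observations above 4/n, a common rule-of-thumb cutoff rather than a formal test. Applying the same calls to the fitted `model` object would be the equivalent check for this analysis.

```r
# Illustrative data only: simulated, NOT the HOC Capital Club records
set.seed(7)
n <- 63
sim <- data.frame(
  risk_score       = round(rnorm(n, mean = 60, sd = 10)),
  num_interactions = rpois(n, lambda = 4)
)
log_odds      <- -7 + 0.08 * sim$risk_score + 0.5 * sim$num_interactions
sim$converted <- rbinom(n, size = 1, prob = plogis(log_odds))

fit <- glm(converted ~ risk_score + num_interactions,
           data = sim, family = binomial)

# Cook's distance per observation; 4 / n is a rule-of-thumb cutoff
cooks   <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)

cat("Observations above 4/n:", length(flagged), "\n")
# plot(fit, which = 4) would draw the Cook's distance plot for the same model
```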

10. Integrated Findings

The five analyses collectively answer the research question: what client characteristics and engagement factors predict successful conversion of HNI enquiries at HOC Capital Club?

EDA (Section 5) established the baseline, a 38% overall conversion rate, and resolved five data quality issues including country name inconsistencies, currency symbols, and outliers in financial variables. It also confirmed right-skewed distributions in the financial variables, which justified rank-based methods. Visualisation (Section 6) suggested that conversion is concentrated in Ultra HNI clients, referral and existing member channels, and prospects with more engagement touchpoints. Hypothesis testing (Section 7) confirmed that Risk Readiness Score is significantly higher among converted clients (Mann-Whitney p = 0.017, moderate effect), but did not find a statistically significant difference in conversion rate across lead sources (Chi-squared p = 0.580, Cramér's V = 0.245). Correlation analysis (Section 8) identified num_interactions and risk_score as the two numeric variables most strongly associated with conversion, with net_worth showing a weaker association than expected: wealth alone does not predict commitment. The logistic regression model (Section 9) achieved an in-sample AUC of 0.842, meaning it ranks a converted above a non-converted prospect 84.2% of the time, with risk_score and num_interactions as the only statistically significant predictors; segment and lead source were not significant after adjustment. This is promising for operational prospect scoring, subject to validation on new data.

Single recommendation: HOC Capital Club should implement a data-driven prospect prioritisation system built on the two criteria that this analysis supports statistically: a Risk Readiness Score above the median and a minimum of five documented interactions. A referral or existing member lead source can be recorded as a third, provisional criterion, because its advantage appears in the descriptive plots but is not statistically significant in Sections 7 and 9. Prospects meeting the criteria should be assigned to senior relationship managers immediately and tracked on a weekly conversion dashboard. This operationalises findings from all five analytical techniques into one deployable pipeline management protocol.
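
The prioritisation rule could be prototyped as a small scoring function. The sketch below is one possible shape, not a specification: the `prioritise()` function, the example prospects, and the risk cutoff of 60 are all hypothetical, and the cutoff in practice would be the median Risk Readiness Score from the club's own data.

```r
# Hypothetical prospects, for illustration only
prospects <- data.frame(
  client_id        = c("A01", "A02", "A03", "A04"),
  risk_score       = c(70, 58, 66, 49),
  num_interactions = c(6, 7, 3, 2),
  lead_source      = c("Referral", "LinkedIn",
                       "Existing member", "Website enquiry"),
  stringsAsFactors = FALSE
)

# Count how many of the three criteria each prospect meets, then rank
prioritise <- function(df, risk_cutoff, min_interactions = 5,
                       priority_sources = c("Referral", "Existing member")) {
  df$meets_risk    <- df$risk_score > risk_cutoff
  df$meets_contact <- df$num_interactions >= min_interactions
  df$meets_source  <- df$lead_source %in% priority_sources
  df$criteria_met  <- df$meets_risk + df$meets_contact + df$meets_source
  df[order(-df$criteria_met, -df$risk_score), ]
}

ranked <- prioritise(prospects, risk_cutoff = 60)  # 60 is an assumed cutoff
print(ranked[, c("client_id", "criteria_met")])
```

Keeping the three flags as separate columns makes it easy to report which criterion a prospect is missing, and to drop or reweight a criterion later without rewriting the function.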

11. Limitations & Further Work

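  • Sample size of 100 is the minimum threshold, and only the 63 resolved enquiries enter the hypothesis tests and the regression, which estimates 11 coefficients; a larger dataset would improve model stability and reduce overfitting risk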
  • In progress clients (37) were excluded from regression and hypothesis testing — their eventual outcome could shift findings once resolved
  • No revenue or deal value data — all conversions are treated equally regardless of investment size committed
  • Time period spans approximately 15 months — seasonal effects and market conditions cannot be fully assessed
  • Cross-sectional design — causal claims cannot be made; only associations can be reported
  • Further work: collect deal value at conversion to build a revenue-weighted model; track in-progress clients to completion; add a time-to-convert variable for survival analysis; A/B test referral incentive programmes informed by lead source findings
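
The overfitting risk noted above could be quantified by comparing in-sample AUC with an out-of-sample estimate. The sketch below uses leave-one-out cross-validation on simulated data (invented for illustration, not the HOC Capital Club records); the helper `auc_pairs()` is a hypothetical base-R function written for this example, not part of pROC.

```r
# Illustrative data only: simulated, NOT the HOC Capital Club records
set.seed(42)
n <- 63
sim <- data.frame(
  risk_score       = round(rnorm(n, mean = 60, sd = 10)),
  num_interactions = rpois(n, lambda = 4)
)
log_odds      <- -7 + 0.08 * sim$risk_score + 0.5 * sim$num_interactions
sim$converted <- rbinom(n, size = 1, prob = plogis(log_odds))

# AUC as the share of (converted, not converted) pairs ranked correctly
auc_pairs <- function(score, label) {
  diffs <- outer(score[label == 1], score[label == 0], "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

model_formula <- converted ~ risk_score + num_interactions

# In-sample AUC: predictions for the same rows used to fit the model
fit_full      <- glm(model_formula, data = sim, family = binomial)
auc_in_sample <- auc_pairs(fitted(fit_full), sim$converted)

# Leave-one-out AUC: each row is predicted by a model that never saw it
loo_pred <- vapply(seq_len(n), function(i) {
  fit_i <- glm(model_formula, data = sim[-i, ], family = binomial)
  unname(predict(fit_i, newdata = sim[i, , drop = FALSE], type = "response"))
}, numeric(1))
auc_loo <- auc_pairs(loo_pred, sim$converted)

cat("In-sample AUC:", round(auc_in_sample, 3), "\n")
cat("Leave-one-out AUC:", round(auc_loo, 3), "\n")
```

A large gap between the two figures would indicate that the in-sample AUC overstates how well the model will rank new enquiries.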

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Adi, B. (2026). Chapter 9: Exploratory Data Analysis. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/04-eda.html

Adi, B. (2026). Chapter 10: Data Visualisation for Business. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/05-visualisation.html

Adi, B. (2026). Chapter 11: Hypothesis Testing Fundamentals. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part2-testing/06-hypothesis-testing.html

Adi, B. (2026). Chapter 13: Correlation and Association. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part3-regression/08-correlation.html

Adi, B. (2026). Chapter 18: Logistic Regression. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part4-classification/13-logistic-regression.html

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Galili, T., et al. (2018). heatmaply: An R package for creating interactive cluster heatmaps. Bioinformatics, 34(9), 1600–1602. https://doi.org/10.1093/bioinformatics/btx657

Ikot, S. (2026). HOC Capital Club HNI client enquiry records, January 2025 – March 2026 [Dataset]. Member Experience and Engagement Department, HOC Capital Club, Lagos, Nigeria. Data available on request from the author.

Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package version 0.7.2). https://CRAN.R-project.org/package=rstatix

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Robin, X., et al. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., & Bryan, J. (2025). readxl: Read Excel files (R package version 1.4.5). https://CRAN.R-project.org/package=readxl

Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with code generation and debugging during this analysis. All analytical decisions — technique selection, business interpretation, hypothesis formulation, coefficient interpretation, and the final recommendation — were made independently. The professional disclosure and data provenance sections reflect the author’s own professional role and judgement at HOC Capital Club and were written without AI assistance. All insight box interpretations are the author’s own analysis of the outputs produced.

