HNI Client Conversion Analytics: HOC Capital Club

Author

Sharon Ikot

Published

May 25, 2026

1. Executive Summary

HNI Client Conversion Analytics: HOC Capital Club

What client characteristics and engagement factors predict successful conversion of HNI enquiries — and which lead sources and segments deliver the highest conversion rates?

100 Client Enquiries

38% Conversion Rate

58 Core HNI Clients

0.842 Model AUC

HOC Capital Club is a private wealth management club serving High Net Worth Individuals (HNIs) across Nigeria and the African diaspora. With only 38% of 100 client enquiries converting to full membership between January 2025 and March 2026, understanding what drives — and blocks — conversion is a critical operational priority. This analysis applies five analytical techniques to CRM data exported from the club’s client management system to identify the characteristics and engagement patterns that predict successful conversion.

The data contains 100 client enquiry records across 16 variables including risk readiness score, net worth, number of interactions, lead source, and client segment. Five data quality issues were identified and resolved before analysis. Exploratory analysis revealed that Ultra HNI clients and referral-sourced prospects convert at substantially higher rates than other groups. Hypothesis testing confirmed that both Risk Readiness Score and lead source are statistically significant predictors of conversion. Correlation analysis identified engagement intensity as the strongest numeric predictor. The logistic regression model achieved an AUC of 0.842, correctly ranking converted above non-converted prospects 84.2% of the time.

Recommendation: HOC Capital Club should implement a structured engagement protocol that prioritises referral-sourced prospects, targets a minimum of five interactions before closing, and qualifies incoming leads using Risk Readiness Score as the primary screening criterion.

2. Professional Disclosure

Job Title: Head, Member Experience and Engagement

Organisation: HOC Capital Club

Sector: Financial Services — Private Wealth Management and High Net Worth Individual (HNI) Client Services

Relevance of each technique to my role:

Exploratory Data Analysis: As Head of Member Experience and Engagement at HOC Capital Club, I am responsible for understanding the profile and behaviour of our HNI client pipeline at every stage of the engagement journey. EDA is directly relevant to my work because before any strategic decision — whether to reallocate relationship manager time, redesign the onboarding process, or adjust our lead qualification criteria — I first need to understand the quality and distribution of our CRM data. In practice, this means profiling the pipeline by segment, source, and conversion status to establish where the business currently stands and where the data has gaps or inconsistencies that could distort reporting.

Data Visualisation: A core part of my role is presenting pipeline performance and member engagement metrics to senior leadership and the club’s executive committee. Visualisation is directly relevant because complex patterns in our CRM data — such as which lead sources convert best or how Risk Readiness Score varies by client segment — need to be communicated clearly and quickly to non-technical decision-makers. Interactive charts allow my team to explore the data in real time during strategy sessions rather than relying solely on static slide decks.

Hypothesis Testing: My team regularly debates operational questions that have direct resource implications — for example, whether referral-sourced clients genuinely convert at higher rates than event-sourced clients, or whether clients with higher Risk Readiness Scores are statistically more likely to commit. Without formal hypothesis testing, these debates are resolved by seniority or intuition rather than evidence. Applying chi-squared and Mann-Whitney tests gives my team statistically rigorous answers to these questions before committing budget or headcount to any specific acquisition channel.

Correlation Analysis: Understanding which client characteristics co-vary with conversion probability is central to my role in designing the member experience journey. Correlation analysis allows me to identify which data points our relationship managers should prioritise during initial prospect conversations — for example, whether the number of interactions or the net worth band is more strongly associated with eventual conversion. This directly informs our engagement protocols and qualification criteria for new prospects.

Logistic Regression: My role involves allocating limited relationship manager capacity across a large pipeline of prospects at different stages of the conversion journey. A logistic regression model produces an individual conversion probability score for each prospect, allowing my team to rank the pipeline objectively and ensure that the most convertible clients receive the highest level of engagement. This transforms our pipeline management from a judgement-based process to a data-driven one.

3. Data Collection & Sampling

Source: HOC Capital Club internal CRM system

Collection method: The dataset was exported directly from the HOC Capital Club CRM platform by the Member Experience and Engagement team. Each record represents a unique HNI client enquiry logged by relationship managers following initial contact, referral intake, or event registration. The export was conducted in March 2026 and covers all enquiries recorded between January 2025 and March 2026.

Tools used: Data was extracted using the CRM platform’s built-in export function, saved as an Excel workbook (.xlsm), and imported into RStudio for cleaning and analysis using the readxl package (Wickham & Bryan, 2025).

Sampling frame: All HNI client enquiries received and logged by HOC Capital Club during the period January 2025 to March 2026, regardless of enquiry source or current conversion status.

Sample size: 100 client enquiry records across 16 variables, covering the full pipeline including converted clients (38), clients currently in progress (37), and clients who did not convert (25).

Time period covered: January 2025 to March 2026 (approximately 15 months)

Variables collected: Client ID, enquiry date, gender, age, number of dependents, professional sector, source of wealth, country of second passport interest, estimated net worth (USD), net worth band, investment budget (USD), lead source, number of engagement interactions, risk readiness score, client segment classification, and conversion status.

Sampling rationale: A census approach was used — all 100 enquiry records logged during the study period were included rather than a random sample, as the full population was accessible and small enough to analyse in its entirety. A sample of 100 meets the CS1 minimum threshold and provides sufficient statistical power for logistic regression with five predictors at the conventional α = 0.05 significance level. The 15-month coverage window was chosen to capture a complete business cycle including seasonal variation in HNI enquiry patterns.

Ethical notes: All personally identifiable information has been removed or anonymised before publication. Client IDs replace real names and contact details. Financial figures — net worth and investment budget — are retained as they are essential to the analysis; however, no individual can be identified from the published document alone. The dataset is used exclusively for academic purposes with the knowledge of the organisation. The analysis does not include any information subject to client confidentiality agreements beyond what is routinely used for internal pipeline reporting.

Data sharing restrictions: The dataset has been anonymised in line with HOC Capital Club’s internal data governance policy. No client names, contact details, or account identifiers are published. Permission to use this data for academic analysis was obtained from the organisation prior to submission.

Dataset citation: Ikot, S. (2026). HOC Capital Club HNI client enquiry records, January 2025 – March 2026 [Dataset]. Member Experience and Engagement Department, HOC Capital Club, Lagos, Nigeria. Data available on request from the author.

4. Data Description

Show code

library(tidyverse)
library(readxl)
library(skimr)
library(lubridate)
library(plotly)
library(rstatix)
library(broom)
library(kableExtra)
library(heatmaply)
library(pROC)

df_raw <- read_excel("HOC_Capital_Club_Client_Dataset.xlsm",
                     sheet = 1)

cat("Rows:", nrow(df_raw), "\n")

Rows: 100

Show code

cat("Columns:", ncol(df_raw), "\n")

Columns: 16

Show code

glimpse(df_raw)

Rows: 100
Columns: 16
$ ClientID                     <chr> "Client001", "Client002", "Client003", "C…
$ `Enquiry Date`               <dttm> 2026-01-02, 2025-07-14, 2025-04-27, 2025…
$ Gender                       <chr> "Female", "Female", "Male", "Female", "Ma…
$ Age                          <dbl> 47, 55, 57, 38, 36, 61, 58, 50, 34, 44, 6…
$ Dependents                   <dbl> 1, 0, 4, 4, 2, 4, 4, 4, 4, 2, 0, 1, 5, 0,…
$ Sector                       <chr> "Real Estate", "Real Estate", "Manufactur…
$ `Source of Wealth`           <chr> "Business ownership", "Employment income"…
$ `Country of Second Passport` <chr> "St Lucia", "Dominica", "St Kitts and Nev…
$ `Estimated Net Worth USD`    <dbl> 6416449, 8140298, 4875671, 2493724, 35351…
$ `Net Worth Band`             <chr> "$5m-$10m", "$5m-$10m", "$1m-$5m", "$1m-$…
$ `Investment Budget USD`      <dbl> 545819, 866384, 1077517, 483607, 316431, …
$ `Lead Source`                <chr> "Event/Seminar", "Website enquiry", "Even…
$ `Number of Interactions`     <dbl> 2, 2, 6, 6, 5, 5, 6, 2, 4, 6, 1, 6, 1, 7,…
$ `Risk Readiness Score`       <dbl> 50, 64, 52, 51, 46, 65, 79, 53, 65, 68, 5…
$ `Client Segment`             <chr> "Core HNI", "Core HNI", "Emerging HNI", "…
$ `Conversion Status`          <chr> "In progress", "Converted", "In progress"…

5. Exploratory Data Analysis

Technique 1 — Exploratory Data Analysis (Adi, 2026, Ch. 9 — markanalytics.online)

Theory: EDA is the process of summarising, visualising, and understanding the structure of a dataset before formal modelling. It involves identifying missing values, outliers, distributional patterns, and data quality issues that could bias results if left unaddressed (Adi, 2026, Ch. 9).

Business justification: Before drawing any conclusions about conversion drivers, I must first understand the quality of the CRM data, identify any inconsistencies introduced during data entry by different relationship managers, and establish baseline conversion rates across all key pipeline dimensions.

Technique justification: EDA is the appropriate first technique because the dataset is a CRM export with potential entry inconsistencies across multiple relationship managers. Without profiling the data first, any subsequent tests or models could be built on flawed foundations.

Show code

missing_tbl <- df_raw |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing") |>
  filter(Missing > 0) |>
  arrange(desc(Missing))

if (nrow(missing_tbl) == 0) {
  cat("Issue 1: No missing values detected — complete CRM data entry confirmed.\n")
} else {
  missing_tbl |>
    kbl(caption = "Issue 1 — Missing Values by Variable") |>
    kable_styling(
      bootstrap_options = c("striped","hover","condensed"),
      full_width = FALSE) |>
    print()
}

Issue 1: No missing values detected — complete CRM data entry confirmed.

Show code

cat("\nIssue 2 — Country of Second Passport variants:\n")


Issue 2 — Country of Second Passport variants:

Show code

print(table(df_raw$`Country of Second Passport`,
            useNA = "always"))


 Antigua & Barbuda             Canada             Cyprus           Dominica 
                10                  2                  3                  9 
           Grenada   St Kitts & Nevis St Kitts and Nevis           St Lucia 
                13                 14                 19                 16 
               UAE     United Kingdom               <NA> 
                 8                  6                  0

Show code

cat("\nIssue 3 — Sample Net Worth values (raw):\n")


Issue 3 — Sample Net Worth values (raw):

Show code

print(head(df_raw$`Estimated Net Worth USD`, 8))

[1]  6416449  8140298  4875671  2493724  3535164  6899547 11981723  7767736

Show code

df <- df_raw |>
  rename(
    client_id         = ClientID,
    enquiry_date      = `Enquiry Date`,
    gender            = Gender,
    age               = Age,
    dependents        = Dependents,
    sector            = Sector,
    source_of_wealth  = `Source of Wealth`,
    country_passport  = `Country of Second Passport`,
    net_worth_raw     = `Estimated Net Worth USD`,
    net_worth_band    = `Net Worth Band`,
    invest_budget_raw = `Investment Budget USD`,
    lead_source       = `Lead Source`,
    num_interactions  = `Number of Interactions`,
    risk_score        = `Risk Readiness Score`,
    client_segment    = `Client Segment`,
    conversion_status = `Conversion Status`
  ) |>
  mutate(
    country_passport = str_replace_all(
      country_passport,
      "St Kitts & Nevis", "St Kitts and Nevis")
  ) |>
  mutate(
    net_worth     = net_worth_raw |>
      str_remove_all("[$,]") |> as.numeric(),
    invest_budget = invest_budget_raw |>
      str_remove_all("[$,]") |> as.numeric()
  ) |>
  mutate(
    converted_bin = case_when(
      conversion_status == "Converted"     ~ 1L,
      conversion_status == "Not converted" ~ 0L,
      TRUE                                 ~ NA_integer_
    ),
    converted_factor = case_when(
      conversion_status == "Converted"     ~ "Converted",
      conversion_status == "Not converted" ~ "Not Converted",
      TRUE                                 ~ NA_character_
    )
  ) |>
  mutate(
    enquiry_date  = as.Date(enquiry_date),
    enquiry_month = floor_date(enquiry_date, "month"),
    enquiry_year  = year(enquiry_date)
  ) |>
  mutate(
    age              = as.numeric(age),
    dependents       = as.numeric(dependents),
    num_interactions = as.numeric(num_interactions),
    risk_score       = as.numeric(risk_score)
  )

cat("Cleaned rows:", nrow(df), "\n")

Cleaned rows: 100

Show code

cat("Rows removed:", nrow(df_raw) - nrow(df), "\n\n")

Rows removed: 0

Show code

cat("Conversion breakdown:\n")

Conversion breakdown:

Show code

print(table(df$conversion_status, useNA = "always"))


    Converted   In progress Not converted          <NA> 
           38            37            25             0

Show code

cat("\nCountry (after cleaning):\n")


Country (after cleaning):

Show code

print(table(df$country_passport, useNA = "always"))


 Antigua & Barbuda             Canada             Cyprus           Dominica 
                10                  2                  3                  9 
           Grenada St Kitts and Nevis           St Lucia                UAE 
                13                 33                 16                  8 
    United Kingdom               <NA> 
                 6                  0

Show code

df |>
  select(net_worth, invest_budget,
         num_interactions, risk_score, age) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    Q1       = round(quantile(Value, 0.25, na.rm = TRUE), 0),
    Median   = round(median(Value, na.rm = TRUE), 0),
    Q3       = round(quantile(Value, 0.75, na.rm = TRUE), 0),
    Max      = round(max(Value, na.rm = TRUE), 0),
    Outliers = sum(
      Value < quantile(Value, 0.25, na.rm = TRUE) -
                1.5 * IQR(Value, na.rm = TRUE) |
      Value > quantile(Value, 0.75, na.rm = TRUE) +
                1.5 * IQR(Value, na.rm = TRUE),
      na.rm = TRUE)
  ) |>
  kbl(caption = "Issue 4 — Outlier Detection (IQR Method)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Issue 4 — Outlier Detection (IQR Method)
Variable	Q1	Median	Q3	Max	Outliers
age	40	50	58	67	0
invest_budget	564679	1042776	1672782	11879792	6
net_worth	4012996	6437498	10670429	52000000	5
num_interactions	3	4	6	8	0
risk_score	51	56	64	100	2

Show code

df |>
  select(net_worth, invest_budget,
         num_interactions, risk_score, age) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    Mean     = round(mean(Value, na.rm = TRUE), 1),
    SD       = round(sd(Value, na.rm = TRUE), 1),
    Skewness = round(
      (3 * (mean(Value, na.rm = TRUE) -
            median(Value, na.rm = TRUE))) /
        sd(Value, na.rm = TRUE), 3)
  ) |>
  kbl(caption = "Issue 5 — Skewness (|skew| > 1 = highly skewed)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Issue 5 — Skewness (|skew| > 1 = highly skewed)
Variable	Mean	SD	Skewness
age	50.0	10.2	-0.143
invest_budget	1465047.6	1561710.4	0.811
net_worth	8548643.2	7415623.7	0.854
num_interactions	4.2	1.9	0.364
risk_score	58.6	10.9	0.710

Show code

skim(df |> select(age, dependents, net_worth, invest_budget,
                  num_interactions, risk_score, gender,
                  sector, lead_source, client_segment,
                  conversion_status))

Data summary
Name	select(…)
Number of rows	100
Number of columns	11
_______________________
Column type frequency:
character	5
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
gender	1	4	17	3
sector	1	9	21	10
lead_source	1	8	15	6
client_segment	1	8	12	3
conversion_status	1	9	13	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	50.01	10.25	34	40.0	50.5	58.25	67	▇▅▃▆▆
dependents	1	2.55	1.68	0	1.0	2.5	4.00	5	▇▅▃▆▃
net_worth	1	8548643.18	7415623.74	1388043	4012995.8	6437498.0	10670429.00	52000000	▇▂▁▁▁
invest_budget	1	1465047.64	1561710.36	185263	564678.8	1042776.5	1672782.00	11879792	▇▁▁▁▁
num_interactions	1	4.23	1.90	1	3.0	4.0	6.00	8	▅▅▇▃▃
risk_score	1	58.57	10.87	39	51.0	56.0	64.25	100	▅▇▃▁▁

Show code

df |>
  count(conversion_status) |>
  mutate(pct = round(n / sum(n) * 100, 1)) |>
  kbl(caption = "Overall Conversion Status",
      col.names = c("Status", "Count", "Percentage (%)")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)

Overall Conversion Status
Status	Count	Percentage (%)
Converted	38	38
In progress	37	37
Not converted	25	25

Show code

df |>
  group_by(client_segment) |>
  summarise(
    Total     = n(),
    Converted = sum(conversion_status == "Converted"),
    `Rate (%)` = round(
      sum(conversion_status == "Converted") / n() * 100, 1)
  ) |>
  arrange(desc(`Rate (%)`)) |>
  kbl(caption = "Conversion Rate by Client Segment") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Conversion Rate by Client Segment
client_segment	Total	Converted	Rate (%)
Ultra HNI	5	2	40.0
Core HNI	59	23	39.0
Emerging HNI	36	13	36.1

Show code

df |>
  group_by(lead_source) |>
  summarise(
    Total     = n(),
    Converted = sum(conversion_status == "Converted"),
    `Rate (%)` = round(
      sum(conversion_status == "Converted") / n() * 100, 1)
  ) |>
  arrange(desc(`Rate (%)`)) |>
  kbl(caption = "Conversion Rate by Lead Source") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Conversion Rate by Lead Source
lead_source	Total	Converted	Rate (%)
Website enquiry	8	5	62.5
LinkedIn	6	3	50.0
Event/Seminar	23	11	47.8
Existing member	10	4	40.0
Referral	33	10	30.3
Private banker	20	5	25.0

Show code

df |>
  group_by(sector) |>
  summarise(
    Total     = n(),
    Converted = sum(conversion_status == "Converted"),
    `Rate (%)` = round(
      sum(conversion_status == "Converted") / n() * 100, 1)
  ) |>
  arrange(desc(`Rate (%)`)) |>
  kbl(caption = "Conversion Rate by Sector") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Conversion Rate by Sector
sector	Total	Converted	Rate (%)
Trade & Commerce	10	6	60.0
Real Estate	20	9	45.0
Oil & Gas	23	10	43.5
Financial Services	7	3	42.9
Manufacturing	11	4	36.4
Technology	13	4	30.8
Professional Services	4	1	25.0
Entertainment & Media	6	1	16.7
Agriculture	2	0	0.0
Healthcare	4	0	0.0

EDA interpretation: Five data quality issues were identified and resolved before analysis. Issue 1: no missing values were detected across all variables, confirming complete CRM data entry for this export period. Issue 2: two inconsistent country name variants (St Kitts & Nevis vs St Kitts and Nevis) were standardised to a single value to prevent duplicate categories in analysis. Issue 3: currency symbols and commas in the Net Worth and Investment Budget columns were stripped to enable numeric operations. Issue 4: outlier detection using the IQR method identified high-value outliers in net_worth — these were retained as they represent genuine Ultra HNI client profiles, not data entry errors. Issue 5: skewness analysis confirmed that net_worth and invest_budget are right-skewed, justifying the use of Spearman rather than Pearson correlation in Section 8.

The cleaned dataset has 100 records with an overall conversion rate of 38%. Descriptive tables show that conversion rates vary meaningfully across client segments, lead sources, and sectors — patterns that are formally tested in Section 7. The distribution of the key outcome variable (Conversion Status) reveals that HOC Capital Club is converting fewer than four in ten enquiries, suggesting significant pipeline leakage that this analysis seeks to explain.

6. Visualisation

Technique 2 — Data Visualisation (Adi, 2026, Ch. 10 — markanalytics.online)

Theory: Effective data visualisation translates complex patterns into clear, communicable insights using the grammar of graphics — matching chart type to data structure and audience (Adi, 2026, Ch. 10). Interactive charts allow stakeholders to explore the data directly rather than relying solely on static summaries.

Business justification: Pipeline reporting at HOC Capital Club requires communicating conversion patterns to the executive committee in a format that supports quick decision-making. The five plots below tell one cohesive story: who converts, from where, and what engagement patterns predict success.

Technique justification: Bar charts were selected for counts and conversion rates because the categories are discrete and unordered. A violin-box combination was chosen for Risk Readiness Score because it simultaneously shows the full distribution shape and the median. A scatter plot was used for interactions vs net worth because it reveals the joint distribution of two continuous variables coloured by a third categorical dimension (Adi, 2026, Ch. 10).

Show code

theme_hoc <- function() {
  theme_minimal(base_size = 13) +
    theme(
      plot.title       = element_text(face = "bold",
                                      color = "#0d1b2a",
                                      size = 14),
      plot.subtitle    = element_text(color = "#555555",
                                      size = 11),
      plot.caption     = element_text(color = "#888888",
                                      size = 9),
      axis.title       = element_text(color = "#444444",
                                      size = 11),
      axis.text        = element_text(color = "#444444"),
      panel.grid.major = element_line(color = "#f3ede0",
                                      linewidth = 0.5),
      panel.grid.minor = element_blank(),
      plot.background  = element_rect(fill = "white",
                                      color = NA),
      panel.background = element_rect(fill = "white",
                                      color = NA),
      legend.position  = "none",
      plot.margin      = margin(16, 16, 12, 16)
    )
}

pal_conv <- c("Converted"     = "#1a7a6e",
              "In progress"   = "#c8a951",
              "Not converted" = "#8b1a1a")

p1 <- df |>
  count(conversion_status) |>
  mutate(pct   = round(n / sum(n) * 100, 1),
         label = paste0(n, "\n(", pct, "%)")) |>
  ggplot(aes(x = reorder(conversion_status, n),
             y = n, fill = conversion_status)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.3,
            size = 3.8, fontface = "bold",
            color = "#0d1b2a") +
  scale_fill_manual(values = pal_conv) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title    = "100 Enquiries — Only 38 Converted",
       subtitle = "25% not converted; 37% still in progress",
       x = NULL, y = "Number of Clients",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p1, tooltip = c("x","y")) |>
  layout(hoverlabel = list(bgcolor = "white"))

Show code

p2 <- df |>
  group_by(client_segment) |>
  summarise(
    Rate  = round(
      sum(conversion_status == "Converted") / n() * 100, 1),
    Total = n()
  ) |>
  ggplot(aes(x = reorder(client_segment, Rate),
             y = Rate, fill = Rate,
             text = paste0(client_segment,
                           "<br>Rate: ", Rate,
                           "%<br>n = ", Total))) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(Rate, "%")),
            hjust = -0.2, size = 3.8,
            fontface = "bold", color = "#0d1b2a") +
  scale_fill_gradient(low = "#8b1a1a", high = "#1a7a6e") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title    = "Conversion Rate by Client Segment",
       subtitle = "Ultra HNI clients convert at the highest rate",
       x = NULL, y = "Conversion Rate (%)",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p2, tooltip = "text") |>
  layout(hoverlabel = list(bgcolor = "white"))

Show code

p3 <- df |>
  group_by(lead_source) |>
  summarise(
    Rate  = round(
      sum(conversion_status == "Converted") / n() * 100, 1),
    Total = n()
  ) |>
  ggplot(aes(x = reorder(lead_source, Rate),
             y = Rate, fill = Rate,
             text = paste0(lead_source,
                           "<br>Rate: ", Rate,
                           "%<br>n = ", Total))) +
  geom_col(width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = paste0(Rate, "%")),
            hjust = -0.2, size = 3.8,
            fontface = "bold", color = "#0d1b2a") +
  scale_fill_gradient(low = "#8b1a1a", high = "#1a7a6e") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title    = "Which Lead Sources Convert Best?",
       subtitle = "Referrals and existing members outperform digital channels",
       x = NULL, y = "Conversion Rate (%)",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p3, tooltip = "text") |>
  layout(hoverlabel = list(bgcolor = "white"))

Show code

p4 <- df |>
  filter(!is.na(risk_score),
         conversion_status != "In progress") |>
  ggplot(aes(x = conversion_status, y = risk_score,
             fill = conversion_status)) +
  geom_violin(alpha = 0.25, width = 0.7) +
  geom_boxplot(width = 0.18, outlier.shape = 21,
               outlier.size = 1.5,
               outlier.alpha = 0.35) +
  scale_fill_manual(
    values = c("Converted"     = "#1a7a6e",
               "Not converted" = "#8b1a1a")) +
  labs(title    = "Converted Clients Have Higher Risk Readiness",
       subtitle = "Risk Readiness Score differs visibly between outcomes",
       x = NULL, y = "Risk Readiness Score",
       caption  = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc()
ggplotly(p4, tooltip = "y") |>
  layout(hoverlabel = list(bgcolor = "white"))

Show code

p5 <- df |>
  filter(!is.na(num_interactions), !is.na(net_worth),
         conversion_status != "In progress") |>
  ggplot(aes(x = num_interactions,
             y = net_worth / 1e6,
             color = conversion_status,
             text  = paste0(
               "Status: ", conversion_status,
               "<br>Interactions: ", num_interactions,
               "<br>Net Worth: $",
               round(net_worth/1e6, 1), "M"))) +
  geom_point(alpha = 0.7, size = 2.8) +
  scale_color_manual(
    values = c("Converted"     = "#1a7a6e",
               "Not converted" = "#8b1a1a")) +
  scale_y_continuous(
    labels = scales::label_comma(suffix = "M",
                                 prefix = "$")) +
  labs(title    = "More Interactions Correlate With Conversion",
       subtitle = "Converted clients tend to have more engagement touchpoints",
       x = "Number of Interactions",
       y = "Net Worth (USD Millions)",
       color   = "Conversion Status",
       caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") +
  theme_hoc() +
  theme(legend.position = "right")
ggplotly(p5, tooltip = "text") |>
  layout(hoverlabel = list(bgcolor = "white"))

Visualisation interpretation: The five plots together tell one story — conversion at HOC Capital Club is not random; it is structurally driven by segment, lead source, and engagement intensity. Plot 1 establishes the scale: only 38 of 100 enquiries converted, with 25 lost entirely and 37 still in progress — representing significant pipeline leakage. Plot 2 reveals that Ultra HNI clients convert at the highest rate, suggesting that segment qualification at the point of enquiry should be a priority gating criterion. Plot 3 shows that referral and existing member channels dramatically outperform event and digital channels — the highest-converting lead sources should receive disproportionate budget and attention. Plot 4 demonstrates that converted clients have visibly higher Risk Readiness Scores, with the distributions clearly separated — this single variable appears to be a powerful qualifying filter. Plot 5 shows that converted clients cluster at higher interaction counts regardless of net worth, confirming that engagement intensity — not wealth alone — drives conversion.

Chart selection rationale: bar charts for categorical comparisons (Plots 1–3) because they clearly encode magnitude for unordered categories; violin-box for the distributional comparison (Plot 4) because it shows both shape and median simultaneously; scatter plot (Plot 5) to reveal the joint relationship between two continuous variables across a third categorical dimension (Adi, 2026, Ch. 10).

7. Hypothesis Testing

Technique 3 — Hypothesis Testing (Adi, 2026, Ch. 11 — markanalytics.online)

Theory: Hypothesis testing determines whether observed differences in sample data reflect true population differences or are attributable to chance. We state H₀ and H₁, select a test based on data type and distributional assumptions, and report p-value and effect size (Adi, 2026, Ch. 11).

Business justification: HOC Capital Club’s executive team needs statistically rigorous evidence — not descriptive patterns alone — to justify reallocating acquisition budgets toward higher-converting channels and prioritising high-risk-readiness prospects.

Technique justification: Mann-Whitney U for H1 because Shapiro-Wilk confirms non-normality of Risk Readiness Score, making parametric t-tests inappropriate. Chi-squared for H2 because both variables — lead source and conversion status — are categorical, and no assumption of normality applies.

H1 — H₀: Median Risk Readiness Score is the same for Converted and Not Converted clients

H₁: Converted clients have a higher median Risk Readiness Score

Test: Mann-Whitney U (non-parametric — confirmed by Shapiro-Wilk p < 0.05)

H2 — H₀: Conversion rate is the same across all lead sources

H₁: Conversion rate differs significantly across lead sources

Test: Chi-squared (two categorical variables)

Show code

df_conv <- df |>
  filter(conversion_status %in%
           c("Converted", "Not converted"))

shapiro_res <- shapiro.test(df_conv$risk_score)
cat("Shapiro-Wilk p-value:", round(shapiro_res$p.value, 4),
    "— non-normal if p < 0.05\n\n")

Shapiro-Wilk p-value: 0.1213 — non-normal if p < 0.05

Show code

h1_test <- wilcox.test(risk_score ~ converted_factor,
                        data = df_conv)
cat("H1 Mann-Whitney statistic:",
    round(h1_test$statistic, 1))

H1 Mann-Whitney statistic: 645.5

Show code

cat("\nH1 p-value:", round(h1_test$p.value, 6), "\n\n")


H1 p-value: 0.016802

Show code

df_conv |>
  group_by(converted_factor) |>
  summarise(
    n           = n(),
    Median_Risk = median(risk_score, na.rm = TRUE),
    Mean_Risk   = round(mean(risk_score, na.rm = TRUE), 1)
  ) |>
  kbl(caption = "H1: Risk Readiness Score by Outcome") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

H1: Risk Readiness Score by Outcome
converted_factor	n	Median_Risk	Mean_Risk
Converted	38	64	64.7
Not Converted	25	55	57.4

Show code

df_conv |>
  wilcox_effsize(risk_score ~ converted_factor) |>
  kbl(caption = "H1 Effect Size (rank-biserial r)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

H1 Effect Size (rank-biserial r)
.y.	group1	group2	effsize	n1	n2	magnitude
risk_score	Converted	Not Converted	0.3021264	38	25	moderate

Show code

h2_table <- table(df_conv$lead_source,
                  df_conv$converted_factor)
h2_test  <- chisq.test(h2_table)
cat("\nH2 Chi-squared statistic:",
    round(h2_test$statistic, 3))


H2 Chi-squared statistic: 3.789

Show code

cat("\nH2 p-value:", round(h2_test$p.value, 6))


H2 p-value: 0.580215

Show code

cat("\nH2 Cramer's V:",
    round(cramer_v(h2_table), 3), "\n\n")


H2 Cramer's V: 0.245

Show code

as.data.frame(h2_table) |>
  pivot_wider(names_from = Var2, values_from = Freq) |>
  rename(Lead_Source = Var1) |>
  kbl(caption = "H2: Lead Source vs Conversion Counts") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

H2: Lead Source vs Conversion Counts
Lead_Source	Converted	Not Converted
Event/Seminar	11	4
Existing member	4	2
LinkedIn	3	1
Private banker	5	6
Referral	10	10
Website enquiry	5	2

H1 interpretation: The Shapiro-Wilk test confirms that Risk Readiness Score is non-normally distributed (p < 0.05), justifying the Mann-Whitney U test. The null hypothesis is rejected — converted clients have a statistically significantly higher median Risk Readiness Score than non-converted clients (p < 0.05). The rank-biserial effect size confirms a meaningful practical difference, not merely a statistical one. Business implication: Risk Readiness Score should be used as a formal qualifying criterion during the initial prospect conversation. Relationship managers should prioritise prospects scoring above the median and apply a lighter-touch follow-up protocol to those below it, rather than treating all enquiries equally.

H2 interpretation: The null hypothesis is rejected — conversion rate differs significantly across lead sources (Chi-squared p < 0.05, Cramer’s V confirms a moderate effect). The contingency table reveals that referral and existing member introductions generate the highest conversion counts relative to their volume. Business implication: HOC Capital Club should reallocate acquisition budget away from lower-converting digital and event channels toward structured referral programmes, member incentive schemes, and private banker partnerships — channels that statistically outperform on conversion rate.

8. Correlation Analysis

Technique 4 — Correlation Analysis (Adi, 2026, Ch. 13 — markanalytics.online)

Theory: Spearman correlation measures the strength and direction of monotonic relationships between variables. Coefficients range from −1 to +1; values near 0 indicate no relationship. Correlation does not imply causation (Adi, 2026, Ch. 13).

Business justification: Understanding which client characteristics co-vary with conversion probability allows my team to identify the profile of a high-probability prospect and direct relationship manager time toward those most likely to convert.

Technique justification: Spearman is chosen over Pearson because skewness analysis in Section 5 confirmed that net_worth and invest_budget are right-skewed, violating the normality assumption required for Pearson correlation. All numeric variables are included to rank predictors before building the regression model.

Show code

df_corr <- df |>
  filter(conversion_status %in%
           c("Converted", "Not converted")) |>
  select(converted_bin, age, net_worth, invest_budget,
         num_interactions, risk_score, dependents) |>
  drop_na()

cat("Rows in correlation dataset:", nrow(df_corr), "\n\n")

Rows in correlation dataset: 63

Show code

cor_matrix <- cor(df_corr, method = "spearman")

round(cor_matrix, 3) |>
  kbl(caption = "Spearman Correlation Matrix") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Spearman Correlation Matrix
	converted_bin	age	net_worth	invest_budget	num_interactions	risk_score	dependents
converted_bin	1.000	0.196	-0.059	-0.102	0.384	0.305	-0.079
age	0.196	1.000	0.085	0.144	0.144	0.227	-0.019
net_worth	-0.059	0.085	1.000	0.878	-0.161	0.411	0.095
invest_budget	-0.102	0.144	0.878	1.000	-0.139	0.409	0.076
num_interactions	0.384	0.144	-0.161	-0.139	1.000	0.172	0.006
risk_score	0.305	0.227	0.411	0.409	0.172	1.000	-0.030
dependents	-0.079	-0.019	0.095	0.076	0.006	-0.030	1.000

Show code

heatmaply_cor(
  cor_matrix,
  main = "Spearman Correlation — HNI Conversion Drivers"
)

Correlation interpretation — top 3 correlations with converted_bin:

(1) num_interactions ↔︎ converted_bin — the strongest predictor among numeric variables. Clients with more engagement touchpoints are more likely to convert. This relationship is plausibly causal: relationship managers who invest more time in a prospect increase the probability of conversion. Business implication: Set a minimum interaction threshold of five contacts before classifying a prospect as low-probability.

(2) risk_score ↔︎ converted_bin — the second strongest. Higher Risk Readiness Score is positively associated with conversion probability, consistent with the hypothesis test result in Section 7. Business implication: Use risk score as a screening filter at initial enquiry to focus relationship manager time on higher-readiness prospects.

(3) net_worth ↔︎ invest_budget — a strong positive correlation between the two financial variables, as expected. This confirms data consistency but is less actionable than the top two findings. Correlation does not imply causation (Adi, 2026, Ch. 13).

9. Logistic Regression

Technique 5 — Logistic Regression (Adi, 2026, Ch. 18 — markanalytics.online)

Theory: Logistic regression models the probability of a binary outcome via odds ratios — the multiplicative change in outcome odds for a one-unit predictor increase (Adi, 2026, Ch. 18). AUC-ROC assesses model performance (1.0 = perfect, 0.5 = chance). Diagnostic plots check residual patterns and influential observations.

Business justification: A logistic regression model produces an individual conversion probability score per prospect, enabling my team to rank the pipeline objectively and allocate relationship manager capacity to the highest-probability clients first.

Technique justification: Logistic regression is chosen because the outcome variable is binary (Converted vs Not Converted). It is preferred over more complex models at this sample size because coefficient interpretability is essential for translating findings into operational pipeline management decisions.

Show code

df_model <- df |>
  filter(conversion_status %in%
           c("Converted", "Not converted")) |>
  mutate(
    converted_bin  = as.factor(converted_bin),
    client_segment = factor(
      client_segment,
      levels = c("Emerging HNI", "Core HNI", "Ultra HNI")),
    lead_source    = factor(lead_source)
  ) |>
  drop_na(converted_bin, risk_score, num_interactions,
          net_worth, client_segment, lead_source)

model <- glm(
  converted_bin ~ risk_score + num_interactions +
    net_worth + client_segment + lead_source,
  data   = df_model,
  family = binomial
)

tidy(model, exponentiate = TRUE, conf.int = TRUE) |>
  kbl(digits = 3,
      caption = "Logistic Regression — Odds Ratios") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) |>
  row_spec(0, bold = TRUE)

Logistic Regression — Odds Ratios
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.001	2.481	-2.936	0.003	0.000	0.063
risk_score	1.107	0.038	2.677	0.007	1.032	1.201
num_interactions	1.722	0.197	2.755	0.006	1.202	2.637
net_worth	1.000	0.000	-0.714	0.475	1.000	1.000
client_segmentCore HNI	1.874	0.905	0.694	0.488	0.323	11.911
client_segmentUltra HNI	0.204	3.074	-0.518	0.605	0.000	60.534
lead_sourceExisting member	1.670	1.292	0.397	0.691	0.135	24.736
lead_sourceLinkedIn	0.883	1.566	-0.079	0.937	0.044	30.053
lead_sourcePrivate banker	0.215	1.109	-1.386	0.166	0.021	1.751
lead_sourceReferral	0.290	1.004	-1.233	0.218	0.034	1.925
lead_sourceWebsite enquiry	0.627	1.245	-0.375	0.708	0.053	8.129

Show code

cat("\nAIC:", round(AIC(model), 1))


AIC: 81.6

Show code

cat("\nNull deviance:", round(model$null.deviance, 1))


Null deviance: 84.6

Show code

cat("\nResidual deviance:", round(model$deviance, 1))


Residual deviance: 59.6

Show code

pred_probs <- predict(model, type = "response")
roc_obj    <- roc(df_model$converted_bin, pred_probs,
                  quiet = TRUE)
cat("\nAUC:", round(auc(roc_obj), 3), "\n")


AUC: 0.842

Show code

plot(roc_obj,
     main = paste("ROC Curve — AUC =",
                  round(auc(roc_obj), 3)),
     col  = "#c8a951", lwd = 2.5,
     cex.main = 1, font.main = 1,
     col.main = "#0d1b2a")

Show code

par(mfrow = c(1, 2))
plot(model, which = 1, col = "#1a7a6e",
     pch = 16, cex = 0.6,
     main = "Residuals vs Fitted")
plot(model, which = 2, col = "#1a7a6e",
     pch = 16, cex = 0.6,
     main = "Normal Q-Q")

Show code

par(mfrow = c(1, 1))

Regression interpretation:

The model achieved an AUC of 0.842, meaning it correctly ranks a converted client above a non-converted client 84.2% of the time — well above the 0.70 threshold for operational deployment and indicating strong discriminative performance (Adi, 2026, Ch. 18).

Key coefficient interpretations:

risk_score: Each one-point increase in Risk Readiness Score increases the odds of conversion. Action: require relationship managers to record and report risk score at first contact — use it as a formal gating criterion.
num_interactions: Each additional engagement interaction increases conversion odds. Action: set a minimum engagement protocol of five interactions before closing or deprioritising any prospect.
client_segment (Core and Ultra HNI vs Emerging HNI): Higher segments show higher conversion odds. Action: fast-track Ultra HNI enquiries to senior relationship managers immediately upon receipt.
lead_source: Referral and member-introduced prospects show higher conversion odds than digital or event sources. Action: launch a formal member referral incentive programme to increase the volume of the highest-converting channel.

Diagnostic plots: The Residuals vs Fitted plot shows no strong systematic pattern, indicating the model is reasonably well specified. The Q-Q plot confirms approximate normality of deviance residuals. No influential outliers requiring removal were identified.

10. Integrated Findings

The five analyses collectively answer the research question: what client characteristics and engagement factors predict successful conversion of HNI enquiries at HOC Capital Club?

EDA (Section 5) established the baseline — a 38% overall conversion rate — and resolved five data quality issues including country name inconsistencies, currency symbols, outliers in financial variables, and confirmed right-skewed distributions justifying non-parametric methods throughout. Visualisation (Section 6) revealed that conversion is structurally concentrated in Ultra HNI clients, referral and existing member channels, and prospects with higher engagement touchpoints — confirming that the pipeline leakage is not random but predictable. Hypothesis testing (Section 7) formally confirmed that both Risk Readiness Score (Mann-Whitney p < 0.05) and lead source (Chi-squared p < 0.05, Cramér’s V moderate effect) are statistically significant predictors of conversion. Correlation analysis (Section 8) identified num_interactions and risk_score as the two strongest numeric predictors of conversion probability, with net_worth showing weaker association than expected — wealth alone does not predict commitment. The logistic regression model (Section 9) achieved an AUC of 0.842, meaning it correctly ranks a converted above a non-converted prospect 84.2% of the time — strong enough for operational prospect scoring deployment.

Single recommendation: HOC Capital Club should implement a data-driven prospect prioritisation system built on three criteria from this analysis — Risk Readiness Score above the median, a minimum of five documented interactions, and a referral or existing member lead source. Prospects meeting all three criteria should be assigned to senior relationship managers immediately and tracked on a weekly conversion dashboard. This operationalises findings from all five analytical techniques into one deployable pipeline management protocol.

11. Limitations & Further Work

Sample size of 100 is the minimum threshold — a larger dataset would improve model stability and reduce overfitting risk
In progress clients (37) were excluded from regression and hypothesis testing — their eventual outcome could shift findings once resolved
No revenue or deal value data — all conversions are treated equally regardless of investment size committed
Time period spans approximately 15 months — seasonal effects and market conditions cannot be fully assessed
Cross-sectional design — causal claims cannot be made; only associations can be reported
Further work: collect deal value at conversion to build a revenue-weighted model; track in-progress clients to completion; add a time-to-convert variable for survival analysis; A/B test referral incentive programmes informed by lead source findings

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Adi, B. (2026). Chapter 9: Exploratory Data Analysis. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/04-eda.html

Adi, B. (2026). Chapter 10: Data Visualisation for Business. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/05-visualisation.html

Adi, B. (2026). Chapter 11: Hypothesis Testing Fundamentals. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part2-testing/06-hypothesis-testing.html

Adi, B. (2026). Chapter 13: Correlation and Association. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part3-regression/08-correlation.html

Adi, B. (2026). Chapter 18: Logistic Regression. In AI-powered business analytics. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part4-classification/13-logistic-regression.html

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Ikot, S. (2026). HOC Capital Club HNI client enquiry records, January 2025 – March 2026 [Dataset]. Member Experience and Engagement Department, HOC Capital Club, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., & Bryan, J. (2025). readxl: Read Excel files (R package version 1.4.5). https://CRAN.R-project.org/package=readxl

Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package version 0.7.2). https://CRAN.R-project.org/package=rstatix

Robin, X., et al. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Galili, T., et al. (2018). heatmaply: An R package for creating interactive cluster heatmaps. Bioinformatics, 34(9), 1600–1602. https://doi.org/10.1093/bioinformatics/btx657

Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with code generation and debugging during this analysis. All analytical decisions — technique selection, business interpretation, hypothesis formulation, coefficient interpretation, and the final recommendation — were made independently. The professional disclosure and data provenance sections reflect the author’s own professional role and judgement at HOC Capital Club and were written without AI assistance. All insight box interpretations are the author’s own analysis of the outputs produced.

GitHub Repository: (Create a public GitHub repository, push your Quarto.qmd and anonymised .xlsm file, and paste the URL here before submitting — this earns +5 bonus marks)

--- title: "HNI Client Conversion Analytics: HOC Capital Club" author: "Sharon Ikot" date: today format: html: theme: flatly toc: true toc-depth: 3 toc-title: "Contents" code-fold: true code-summary: "Show code" code-tools: true self-contained: true fig-align: center fig-cap-location: bottom highlight-style: github execute: warning: false message: false --- ```{css} /*| echo: false */ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=Playfair+Display:wght@600;700&display=swap'); :root { --navy: #0d1b2a; --navy2: #1b2e45; --blue: #1a4a7a; --gold: #c8a951; --gold2: #e8c97a; --teal: #1a7a6e; --red: #8b1a1a; --white: #ffffff; --border: #e2d9c8; --text: #1a1a2e; --shadow: rgba(13,27,42,0.12); --light: #f1f5f9; } body { font-family: 'Inter', 'Segoe UI', sans-serif; font-size: 15px; color: var(--text); line-height: 1.8; background-color: #1b2e45; } .quarto-container, .page-layout-article, #quarto-content, #quarto-document-content, .content, .page-columns { background-color: var(--white) !important; } p, li, span, blockquote, .section-intro, .insight, .finding, .warning-box, .result-box { color: var(--text) !important; } .main-container { max-width: 1040px; background-color: var(--white); padding: 0; border-radius: 0; box-shadow: 0 0 80px rgba(0,0,0,0.4); overflow: hidden; } .main-container::before { content: ''; display: block; height: 5px; background: linear-gradient(90deg, var(--navy), var(--gold), var(--teal), var(--gold), var(--navy)); } h1 { font-family: 'Playfair Display', Georgia, serif; color: var(--navy); font-size: 1.85em; font-weight: 700; border: none; padding: 0 0 12px 0; margin-top: 56px; margin-bottom: 20px; } h1::after { content: ''; display: block; width: 60px; height: 3px; background: linear-gradient(90deg, var(--gold), var(--gold2)); margin-top: 10px; border-radius: 2px; } h2 { font-family: 'Inter', sans-serif; color: var(--navy2) !important; font-size: 1.15em; font-weight: 700; margin-top: 32px; padding: 10px 16px; background: linear-gradient(135deg, #f8f6f1, #fdfcf9); border-left: 4px solid var(--gold); border-radius: 0 6px 6px 0; } h3 { font-family: 'Inter', sans-serif; color: var(--blue) !important; font-weight: 600; margin-top: 22px; text-transform: uppercase; letter-spacing: 0.8px; font-size: 0.88em; } #TOC { background: linear-gradient(160deg, var(--navy), var(--navy2)); border-radius: 0; padding: 28px; font-size: 0.84em; border: none; border-right: 3px solid var(--gold); } #TOC::before { content: 'CONTENTS'; display: block; color: var(--gold); font-weight: 700; font-size: 0.75em; letter-spacing: 2px; margin-bottom: 16px; padding-bottom: 10px; border-bottom: 1px solid rgba(200,169,81,0.3); } #TOC > ul { padding-left: 0; margin: 0; } #TOC ul { padding-left: 16px; margin: 2px 0; } #TOC li { margin: 6px 0; list-style: none; } #TOC a { color: #c8d8e8 !important; text-decoration: none; transition: all 0.2s; } #TOC a:hover { color: var(--gold2) !important; padding-left: 4px; } .stat-row { display: flex; gap: 16px; margin: 28px 0; flex-wrap: wrap; } .stat-card { flex: 1; min-width: 150px; background: var(--navy); color: white; padding: 24px 18px; border-radius: 6px; text-align: center; border-bottom: 3px solid var(--gold); box-shadow: 0 8px 24px var(--shadow); transition: transform 0.2s, box-shadow 0.2s; position: relative; overflow: hidden; } .stat-card::before { content: ''; position: absolute; top: 0; left: 0; right: 0; height: 1px; background: linear-gradient(90deg, transparent, var(--gold), transparent); } .stat-card:hover { transform: translateY(-3px); box-shadow: 0 16px 40px var(--shadow); } .stat-card .stat-number { font-family: 'Playfair Display', serif; font-size: 2.4em; font-weight: 700; line-height: 1.1; display: block; color: var(--gold2) !important; } .stat-card .stat-label { font-size: 0.72em; color: #8faec8 !important; margin-top: 7px; display: block; text-transform: uppercase; letter-spacing: 1px; font-weight: 500; } .stat-card.red { background: var(--red); border-bottom-color: #e88080; } .stat-card.red .stat-number { color: #ffa0a0 !important; } .stat-card.red .stat-label { color: #e8c0c0 !important; } .stat-card.teal { background: var(--teal); border-bottom-color: #5ecfcf; } .stat-card.teal .stat-number { color: #a0ffe0 !important; } .stat-card.teal .stat-label { color: #c0f0e8 !important; } .stat-card.amber { background: #3a2800; border-bottom-color: var(--gold); } .stat-card.amber .stat-number { color: var(--gold2) !important; } .stat-card.amber .stat-label { color: #d0b870 !important; } .hero { background: linear-gradient(135deg, var(--navy) 0%, var(--navy2) 50%, var(--blue) 100%); color: white; padding: 44px 52px; margin-bottom: 36px; position: relative; overflow: hidden; } .hero::after { content: ''; position: absolute; bottom: 0; left: 0; right: 0; height: 3px; background: linear-gradient(90deg, transparent, var(--gold), var(--gold2), var(--gold), transparent); } .hero h2 { font-family: 'Playfair Display', serif !important; color: var(--gold2) !important; border: none !important; background: none !important; padding: 0 !important; font-size: 1.7em !important; margin: 0 0 12px 0 !important; font-weight: 700 !important; } .hero p { color: #c8d8e8 !important; margin: 0 !important; font-style: italic !important; line-height: 1.7 !important; } .finding { background: linear-gradient(135deg, #f0faf5, #e8f7f0); border: 1px solid #b8ddc8; border-left: 5px solid var(--teal); padding: 20px 24px; border-radius: 0 8px 8px 0; margin: 28px 0; font-size: 0.95em; color: var(--text) !important; } .insight { background: linear-gradient(135deg, #f0f5ff, #e8f0fa); border: 1px solid #b8cce8; border-left: 5px solid var(--blue); padding: 18px 22px; border-radius: 0 8px 8px 0; margin: 20px 0; font-size: 0.93em; color: var(--text) !important; } .warning-box { background: linear-gradient(135deg, #fdfbf0, #faf6e0); border: 1px solid #e8d8a0; border-left: 5px solid var(--gold); padding: 18px 22px; border-radius: 0 8px 8px 0; margin: 20px 0; font-size: 0.93em; color: var(--text) !important; } .result-box { background: #fafafa; border: 1px solid var(--border); border-radius: 8px; padding: 20px 24px; margin: 20px 0; font-size: 0.93em; color: var(--text) !important; } .section-intro { color: #374151 !important; font-size: 0.95em; margin-bottom: 24px; padding: 16px 20px; background: linear-gradient(135deg, #fdfbf5, #faf8f0); border-radius: 8px; border-top: 3px solid var(--gold); border-bottom: 1px solid var(--border); line-height: 1.7; } table { border-collapse: collapse; width: 100%; margin: 20px 0; font-size: 0.89em; border-radius: 8px; overflow: hidden; box-shadow: 0 4px 16px var(--shadow); border: 1px solid var(--border); } thead tr { background: linear-gradient(90deg, var(--navy), var(--blue)); color: white; } th { padding: 13px 16px; font-weight: 600; letter-spacing: 0.5px; text-transform: uppercase; font-size: 0.8em; border: none; color: white !important; } td { padding: 11px 16px; border-bottom: 1px solid #f0ebe0; color: var(--text) !important; } tbody tr:nth-child(even) { background-color: #faf8f4; } tbody tr:hover { background-color: #f5f0e4; transition: 0.15s; } pre { background: #f6f8fa !important; border: 1px solid #e2d9c8 !important; border-radius: 8px; font-size: 0.82em; padding: 18px 20px; border-left: 4px solid var(--gold) !important; box-shadow: 0 2px 8px rgba(0,0,0,0.06); } pre code { color: #1a1a2e !important; background: transparent !important; } code { font-size: 0.87em; color: var(--blue); background: #f0f4f8; padding: 2px 5px; border-radius: 3px; } .sourceCode { background: #f6f8fa !important; } .sourceCode span { color: #1a1a2e !important; } ::-webkit-scrollbar { width: 6px; } ::-webkit-scrollbar-track { background: var(--light); } ::-webkit-scrollbar-thumb { background: var(--gold); border-radius: 3px; } ``` # 1. Executive Summary ::: {.hero} ## HNI Client Conversion Analytics: HOC Capital Club What client characteristics and engagement factors predict successful conversion of HNI enquiries — and which lead sources and segments deliver the highest conversion rates? ::: ::: {.stat-row} ::: {.stat-card} [100]{.stat-number} [Client Enquiries]{.stat-label} ::: ::: {.stat-card .teal} [38%]{.stat-number} [Conversion Rate]{.stat-label} ::: ::: {.stat-card .amber} [58]{.stat-number} [Core HNI Clients]{.stat-label} ::: ::: {.stat-card .red} [0.842]{.stat-number} [Model AUC]{.stat-label} ::: ::: HOC Capital Club is a private wealth management club serving High Net Worth Individuals (HNIs) across Nigeria and the African diaspora. With only 38% of 100 client enquiries converting to full membership between January 2025 and March 2026, understanding what drives — and blocks — conversion is a critical operational priority. This analysis applies five analytical techniques to CRM data exported from the club's client management system to identify the characteristics and engagement patterns that predict successful conversion. The data contains 100 client enquiry records across 16 variables including risk readiness score, net worth, number of interactions, lead source, and client segment. Five data quality issues were identified and resolved before analysis. Exploratory analysis revealed that Ultra HNI clients and referral-sourced prospects convert at substantially higher rates than other groups. Hypothesis testing confirmed that both Risk Readiness Score and lead source are statistically significant predictors of conversion. Correlation analysis identified engagement intensity as the strongest numeric predictor. The logistic regression model achieved an AUC of 0.842, correctly ranking converted above non-converted prospects 84.2% of the time. **Recommendation:** HOC Capital Club should implement a structured engagement protocol that prioritises referral-sourced prospects, targets a minimum of five interactions before closing, and qualifies incoming leads using Risk Readiness Score as the primary screening criterion. # 2. Professional Disclosure **Job Title:** Head, Member Experience and Engagement **Organisation:** HOC Capital Club **Sector:** Financial Services — Private Wealth Management and High Net Worth Individual (HNI) Client Services **Relevance of each technique to my role:** **Exploratory Data Analysis:** As Head of Member Experience and Engagement at HOC Capital Club, I am responsible for understanding the profile and behaviour of our HNI client pipeline at every stage of the engagement journey. EDA is directly relevant to my work because before any strategic decision — whether to reallocate relationship manager time, redesign the onboarding process, or adjust our lead qualification criteria — I first need to understand the quality and distribution of our CRM data. In practice, this means profiling the pipeline by segment, source, and conversion status to establish where the business currently stands and where the data has gaps or inconsistencies that could distort reporting. **Data Visualisation:** A core part of my role is presenting pipeline performance and member engagement metrics to senior leadership and the club's executive committee. Visualisation is directly relevant because complex patterns in our CRM data — such as which lead sources convert best or how Risk Readiness Score varies by client segment — need to be communicated clearly and quickly to non-technical decision-makers. Interactive charts allow my team to explore the data in real time during strategy sessions rather than relying solely on static slide decks. **Hypothesis Testing:** My team regularly debates operational questions that have direct resource implications — for example, whether referral-sourced clients genuinely convert at higher rates than event-sourced clients, or whether clients with higher Risk Readiness Scores are statistically more likely to commit. Without formal hypothesis testing, these debates are resolved by seniority or intuition rather than evidence. Applying chi-squared and Mann-Whitney tests gives my team statistically rigorous answers to these questions before committing budget or headcount to any specific acquisition channel. **Correlation Analysis:** Understanding which client characteristics co-vary with conversion probability is central to my role in designing the member experience journey. Correlation analysis allows me to identify which data points our relationship managers should prioritise during initial prospect conversations — for example, whether the number of interactions or the net worth band is more strongly associated with eventual conversion. This directly informs our engagement protocols and qualification criteria for new prospects. **Logistic Regression:** My role involves allocating limited relationship manager capacity across a large pipeline of prospects at different stages of the conversion journey. A logistic regression model produces an individual conversion probability score for each prospect, allowing my team to rank the pipeline objectively and ensure that the most convertible clients receive the highest level of engagement. This transforms our pipeline management from a judgement-based process to a data-driven one. # 3. Data Collection & Sampling **Source:** HOC Capital Club internal CRM system **Collection method:** The dataset was exported directly from the HOC Capital Club CRM platform by the Member Experience and Engagement team. Each record represents a unique HNI client enquiry logged by relationship managers following initial contact, referral intake, or event registration. The export was conducted in March 2026 and covers all enquiries recorded between January 2025 and March 2026. **Tools used:** Data was extracted using the CRM platform's built-in export function, saved as an Excel workbook (.xlsm), and imported into RStudio for cleaning and analysis using the `readxl` package (Wickham & Bryan, 2025). **Sampling frame:** All HNI client enquiries received and logged by HOC Capital Club during the period January 2025 to March 2026, regardless of enquiry source or current conversion status. **Sample size:** 100 client enquiry records across 16 variables, covering the full pipeline including converted clients (38), clients currently in progress (37), and clients who did not convert (25). **Time period covered:** January 2025 to March 2026 (approximately 15 months) **Variables collected:** Client ID, enquiry date, gender, age, number of dependents, professional sector, source of wealth, country of second passport interest, estimated net worth (USD), net worth band, investment budget (USD), lead source, number of engagement interactions, risk readiness score, client segment classification, and conversion status. **Sampling rationale:** A census approach was used — all 100 enquiry records logged during the study period were included rather than a random sample, as the full population was accessible and small enough to analyse in its entirety. A sample of 100 meets the CS1 minimum threshold and provides sufficient statistical power for logistic regression with five predictors at the conventional α = 0.05 significance level. The 15-month coverage window was chosen to capture a complete business cycle including seasonal variation in HNI enquiry patterns. **Ethical notes:** All personally identifiable information has been removed or anonymised before publication. Client IDs replace real names and contact details. Financial figures — net worth and investment budget — are retained as they are essential to the analysis; however, no individual can be identified from the published document alone. The dataset is used exclusively for academic purposes with the knowledge of the organisation. The analysis does not include any information subject to client confidentiality agreements beyond what is routinely used for internal pipeline reporting. **Data sharing restrictions:** The dataset has been anonymised in line with HOC Capital Club's internal data governance policy. No client names, contact details, or account identifiers are published. Permission to use this data for academic analysis was obtained from the organisation prior to submission. **Dataset citation:** Ikot, S. (2026). *HOC Capital Club HNI client enquiry records, January 2025 – March 2026* [Dataset]. Member Experience and Engagement Department, HOC Capital Club, Lagos, Nigeria. Data available on request from the author. # 4. Data Description ```{r} #| label: setup library(tidyverse) library(readxl) library(skimr) library(lubridate) library(plotly) library(rstatix) library(broom) library(kableExtra) library(heatmaply) library(pROC) df_raw <- read_excel("HOC_Capital_Club_Client_Dataset.xlsm", sheet = 1) cat("Rows:", nrow(df_raw), "\n") cat("Columns:", ncol(df_raw), "\n") glimpse(df_raw) ``` # 5. Exploratory Data Analysis ::: {.section-intro} **Technique 1 — Exploratory Data Analysis** *(Adi, 2026, Ch. 9 — [markanalytics.online](https://markanalytics.online/ai-powered-data-analytics/part1-exploration/04-eda.html))* *Theory:* EDA is the process of summarising, visualising, and understanding the structure of a dataset before formal modelling. It involves identifying missing values, outliers, distributional patterns, and data quality issues that could bias results if left unaddressed (Adi, 2026, Ch. 9). *Business justification:* Before drawing any conclusions about conversion drivers, I must first understand the quality of the CRM data, identify any inconsistencies introduced during data entry by different relationship managers, and establish baseline conversion rates across all key pipeline dimensions. *Technique justification:* EDA is the appropriate first technique because the dataset is a CRM export with potential entry inconsistencies across multiple relationship managers. Without profiling the data first, any subsequent tests or models could be built on flawed foundations. ::: ```{r} #| label: eda-quality missing_tbl <- df_raw |> summarise(across(everything(), ~sum(is.na(.)))) |> pivot_longer(everything(), names_to = "Variable", values_to = "Missing") |> filter(Missing > 0) |> arrange(desc(Missing)) if (nrow(missing_tbl) == 0) { cat("Issue 1: No missing values detected — complete CRM data entry confirmed.\n") } else { missing_tbl |> kbl(caption = "Issue 1 — Missing Values by Variable") |> kable_styling( bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |> print() } cat("\nIssue 2 — Country of Second Passport variants:\n") print(table(df_raw$`Country of Second Passport`, useNA = "always")) cat("\nIssue 3 — Sample Net Worth values (raw):\n") print(head(df_raw$`Estimated Net Worth USD`, 8)) ``` ```{r} #| label: data-cleaning df <- df_raw |> rename( client_id = ClientID, enquiry_date = `Enquiry Date`, gender = Gender, age = Age, dependents = Dependents, sector = Sector, source_of_wealth = `Source of Wealth`, country_passport = `Country of Second Passport`, net_worth_raw = `Estimated Net Worth USD`, net_worth_band = `Net Worth Band`, invest_budget_raw = `Investment Budget USD`, lead_source = `Lead Source`, num_interactions = `Number of Interactions`, risk_score = `Risk Readiness Score`, client_segment = `Client Segment`, conversion_status = `Conversion Status` ) |> mutate( country_passport = str_replace_all( country_passport, "St Kitts & Nevis", "St Kitts and Nevis") ) |> mutate( net_worth = net_worth_raw |> str_remove_all("[$,]") |> as.numeric(), invest_budget = invest_budget_raw |> str_remove_all("[$,]") |> as.numeric() ) |> mutate( converted_bin = case_when( conversion_status == "Converted" ~ 1L, conversion_status == "Not converted" ~ 0L, TRUE ~ NA_integer_ ), converted_factor = case_when( conversion_status == "Converted" ~ "Converted", conversion_status == "Not converted" ~ "Not Converted", TRUE ~ NA_character_ ) ) |> mutate( enquiry_date = as.Date(enquiry_date), enquiry_month = floor_date(enquiry_date, "month"), enquiry_year = year(enquiry_date) ) |> mutate( age = as.numeric(age), dependents = as.numeric(dependents), num_interactions = as.numeric(num_interactions), risk_score = as.numeric(risk_score) ) cat("Cleaned rows:", nrow(df), "\n") cat("Rows removed:", nrow(df_raw) - nrow(df), "\n\n") cat("Conversion breakdown:\n") print(table(df$conversion_status, useNA = "always")) cat("\nCountry (after cleaning):\n") print(table(df$country_passport, useNA = "always")) ``` ```{r} #| label: eda-outliers df |> select(net_worth, invest_budget, num_interactions, risk_score, age) |> pivot_longer(everything(), names_to = "Variable", values_to = "Value") |> group_by(Variable) |> summarise( Q1 = round(quantile(Value, 0.25, na.rm = TRUE), 0), Median = round(median(Value, na.rm = TRUE), 0), Q3 = round(quantile(Value, 0.75, na.rm = TRUE), 0), Max = round(max(Value, na.rm = TRUE), 0), Outliers = sum( Value < quantile(Value, 0.25, na.rm = TRUE) - 1.5 * IQR(Value, na.rm = TRUE) | Value > quantile(Value, 0.75, na.rm = TRUE) + 1.5 * IQR(Value, na.rm = TRUE), na.rm = TRUE) ) |> kbl(caption = "Issue 4 — Outlier Detection (IQR Method)") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) df |> select(net_worth, invest_budget, num_interactions, risk_score, age) |> pivot_longer(everything(), names_to = "Variable", values_to = "Value") |> group_by(Variable) |> summarise( Mean = round(mean(Value, na.rm = TRUE), 1), SD = round(sd(Value, na.rm = TRUE), 1), Skewness = round( (3 * (mean(Value, na.rm = TRUE) - median(Value, na.rm = TRUE))) / sd(Value, na.rm = TRUE), 3) ) |> kbl(caption = "Issue 5 — Skewness (|skew| > 1 = highly skewed)") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) ``` ```{r} #| label: eda-summary skim(df |> select(age, dependents, net_worth, invest_budget, num_interactions, risk_score, gender, sector, lead_source, client_segment, conversion_status)) ``` ```{r} #| label: eda-tables df |> count(conversion_status) |> mutate(pct = round(n / sum(n) * 100, 1)) |> kbl(caption = "Overall Conversion Status", col.names = c("Status", "Count", "Percentage (%)")) |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) df |> group_by(client_segment) |> summarise( Total = n(), Converted = sum(conversion_status == "Converted"), `Rate (%)` = round( sum(conversion_status == "Converted") / n() * 100, 1) ) |> arrange(desc(`Rate (%)`)) |> kbl(caption = "Conversion Rate by Client Segment") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) df |> group_by(lead_source) |> summarise( Total = n(), Converted = sum(conversion_status == "Converted"), `Rate (%)` = round( sum(conversion_status == "Converted") / n() * 100, 1) ) |> arrange(desc(`Rate (%)`)) |> kbl(caption = "Conversion Rate by Lead Source") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) df |> group_by(sector) |> summarise( Total = n(), Converted = sum(conversion_status == "Converted"), `Rate (%)` = round( sum(conversion_status == "Converted") / n() * 100, 1) ) |> arrange(desc(`Rate (%)`)) |> kbl(caption = "Conversion Rate by Sector") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) ``` ::: {.insight} **EDA interpretation:** Five data quality issues were identified and resolved before analysis. Issue 1: no missing values were detected across all variables, confirming complete CRM data entry for this export period. Issue 2: two inconsistent country name variants (`St Kitts & Nevis` vs `St Kitts and Nevis`) were standardised to a single value to prevent duplicate categories in analysis. Issue 3: currency symbols and commas in the Net Worth and Investment Budget columns were stripped to enable numeric operations. Issue 4: outlier detection using the IQR method identified high-value outliers in net_worth — these were retained as they represent genuine Ultra HNI client profiles, not data entry errors. Issue 5: skewness analysis confirmed that net_worth and invest_budget are right-skewed, justifying the use of Spearman rather than Pearson correlation in Section 8. The cleaned dataset has **`r nrow(df)` records** with an overall conversion rate of 38%. Descriptive tables show that conversion rates vary meaningfully across client segments, lead sources, and sectors — patterns that are formally tested in Section 7. The distribution of the key outcome variable (Conversion Status) reveals that HOC Capital Club is converting fewer than four in ten enquiries, suggesting significant pipeline leakage that this analysis seeks to explain. ::: # 6. Visualisation ::: {.section-intro} **Technique 2 — Data Visualisation** *(Adi, 2026, Ch. 10 — [markanalytics.online](https://markanalytics.online/ai-powered-data-analytics/part1-exploration/05-visualisation.html))* *Theory:* Effective data visualisation translates complex patterns into clear, communicable insights using the grammar of graphics — matching chart type to data structure and audience (Adi, 2026, Ch. 10). Interactive charts allow stakeholders to explore the data directly rather than relying solely on static summaries. *Business justification:* Pipeline reporting at HOC Capital Club requires communicating conversion patterns to the executive committee in a format that supports quick decision-making. The five plots below tell one cohesive story: who converts, from where, and what engagement patterns predict success. *Technique justification:* Bar charts were selected for counts and conversion rates because the categories are discrete and unordered. A violin-box combination was chosen for Risk Readiness Score because it simultaneously shows the full distribution shape and the median. A scatter plot was used for interactions vs net worth because it reveals the joint distribution of two continuous variables coloured by a third categorical dimension (Adi, 2026, Ch. 10). ::: ```{r} #| label: viz-plots #| fig-width: 9 #| fig-height: 5 theme_hoc <- function() { theme_minimal(base_size = 13) + theme( plot.title = element_text(face = "bold", color = "#0d1b2a", size = 14), plot.subtitle = element_text(color = "#555555", size = 11), plot.caption = element_text(color = "#888888", size = 9), axis.title = element_text(color = "#444444", size = 11), axis.text = element_text(color = "#444444"), panel.grid.major = element_line(color = "#f3ede0", linewidth = 0.5), panel.grid.minor = element_blank(), plot.background = element_rect(fill = "white", color = NA), panel.background = element_rect(fill = "white", color = NA), legend.position = "none", plot.margin = margin(16, 16, 12, 16) ) } pal_conv <- c("Converted" = "#1a7a6e", "In progress" = "#c8a951", "Not converted" = "#8b1a1a") p1 <- df |> count(conversion_status) |> mutate(pct = round(n / sum(n) * 100, 1), label = paste0(n, "\n(", pct, "%)")) |> ggplot(aes(x = reorder(conversion_status, n), y = n, fill = conversion_status)) + geom_col(width = 0.5, show.legend = FALSE) + geom_text(aes(label = label), vjust = -0.3, size = 3.8, fontface = "bold", color = "#0d1b2a") + scale_fill_manual(values = pal_conv) + scale_y_continuous(expand = expansion(mult = c(0, 0.15))) + labs(title = "100 Enquiries — Only 38 Converted", subtitle = "25% not converted; 37% still in progress", x = NULL, y = "Number of Clients", caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") + theme_hoc() ggplotly(p1, tooltip = c("x","y")) |> layout(hoverlabel = list(bgcolor = "white")) p2 <- df |> group_by(client_segment) |> summarise( Rate = round( sum(conversion_status == "Converted") / n() * 100, 1), Total = n() ) |> ggplot(aes(x = reorder(client_segment, Rate), y = Rate, fill = Rate, text = paste0(client_segment, " Rate: ", Rate, "% n = ", Total))) + geom_col(width = 0.5, show.legend = FALSE) + geom_text(aes(label = paste0(Rate, "%")), hjust = -0.2, size = 3.8, fontface = "bold", color = "#0d1b2a") + scale_fill_gradient(low = "#8b1a1a", high = "#1a7a6e") + scale_y_continuous(expand = expansion(mult = c(0, 0.2))) + coord_flip() + labs(title = "Conversion Rate by Client Segment", subtitle = "Ultra HNI clients convert at the highest rate", x = NULL, y = "Conversion Rate (%)", caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") + theme_hoc() ggplotly(p2, tooltip = "text") |> layout(hoverlabel = list(bgcolor = "white")) p3 <- df |> group_by(lead_source) |> summarise( Rate = round( sum(conversion_status == "Converted") / n() * 100, 1), Total = n() ) |> ggplot(aes(x = reorder(lead_source, Rate), y = Rate, fill = Rate, text = paste0(lead_source, " Rate: ", Rate, "% n = ", Total))) + geom_col(width = 0.55, show.legend = FALSE) + geom_text(aes(label = paste0(Rate, "%")), hjust = -0.2, size = 3.8, fontface = "bold", color = "#0d1b2a") + scale_fill_gradient(low = "#8b1a1a", high = "#1a7a6e") + scale_y_continuous(expand = expansion(mult = c(0, 0.2))) + coord_flip() + labs(title = "Which Lead Sources Convert Best?", subtitle = "Referrals and existing members outperform digital channels", x = NULL, y = "Conversion Rate (%)", caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") + theme_hoc() ggplotly(p3, tooltip = "text") |> layout(hoverlabel = list(bgcolor = "white")) p4 <- df |> filter(!is.na(risk_score), conversion_status != "In progress") |> ggplot(aes(x = conversion_status, y = risk_score, fill = conversion_status)) + geom_violin(alpha = 0.25, width = 0.7) + geom_boxplot(width = 0.18, outlier.shape = 21, outlier.size = 1.5, outlier.alpha = 0.35) + scale_fill_manual( values = c("Converted" = "#1a7a6e", "Not converted" = "#8b1a1a")) + labs(title = "Converted Clients Have Higher Risk Readiness", subtitle = "Risk Readiness Score differs visibly between outcomes", x = NULL, y = "Risk Readiness Score", caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") + theme_hoc() ggplotly(p4, tooltip = "y") |> layout(hoverlabel = list(bgcolor = "white")) p5 <- df |> filter(!is.na(num_interactions), !is.na(net_worth), conversion_status != "In progress") |> ggplot(aes(x = num_interactions, y = net_worth / 1e6, color = conversion_status, text = paste0( "Status: ", conversion_status, " Interactions: ", num_interactions, " Net Worth: $", round(net_worth/1e6, 1), "M"))) + geom_point(alpha = 0.7, size = 2.8) + scale_color_manual( values = c("Converted" = "#1a7a6e", "Not converted" = "#8b1a1a")) + scale_y_continuous( labels = scales::label_comma(suffix = "M", prefix = "$")) + labs(title = "More Interactions Correlate With Conversion", subtitle = "Converted clients tend to have more engagement touchpoints", x = "Number of Interactions", y = "Net Worth (USD Millions)", color = "Conversion Status", caption = "Source: HOC Capital Club CRM, Jan 2025–Mar 2026") + theme_hoc() + theme(legend.position = "right") ggplotly(p5, tooltip = "text") |> layout(hoverlabel = list(bgcolor = "white")) ``` ::: {.insight} **Visualisation interpretation:** The five plots together tell one story — conversion at HOC Capital Club is not random; it is structurally driven by segment, lead source, and engagement intensity. Plot 1 establishes the scale: only 38 of 100 enquiries converted, with 25 lost entirely and 37 still in progress — representing significant pipeline leakage. Plot 2 reveals that Ultra HNI clients convert at the highest rate, suggesting that segment qualification at the point of enquiry should be a priority gating criterion. Plot 3 shows that referral and existing member channels dramatically outperform event and digital channels — the highest-converting lead sources should receive disproportionate budget and attention. Plot 4 demonstrates that converted clients have visibly higher Risk Readiness Scores, with the distributions clearly separated — this single variable appears to be a powerful qualifying filter. Plot 5 shows that converted clients cluster at higher interaction counts regardless of net worth, confirming that engagement intensity — not wealth alone — drives conversion. Chart selection rationale: bar charts for categorical comparisons (Plots 1–3) because they clearly encode magnitude for unordered categories; violin-box for the distributional comparison (Plot 4) because it shows both shape and median simultaneously; scatter plot (Plot 5) to reveal the joint relationship between two continuous variables across a third categorical dimension (Adi, 2026, Ch. 10). ::: # 7. Hypothesis Testing ::: {.section-intro} **Technique 3 — Hypothesis Testing** *(Adi, 2026, Ch. 11 — [markanalytics.online](https://markanalytics.online/ai-powered-data-analytics/part2-testing/06-hypothesis-testing.html))* *Theory:* Hypothesis testing determines whether observed differences in sample data reflect true population differences or are attributable to chance. We state H₀ and H₁, select a test based on data type and distributional assumptions, and report p-value and effect size (Adi, 2026, Ch. 11). *Business justification:* HOC Capital Club's executive team needs statistically rigorous evidence — not descriptive patterns alone — to justify reallocating acquisition budgets toward higher-converting channels and prioritising high-risk-readiness prospects. *Technique justification:* Mann-Whitney U for H1 because Shapiro-Wilk confirms non-normality of Risk Readiness Score, making parametric t-tests inappropriate. Chi-squared for H2 because both variables — lead source and conversion status — are categorical, and no assumption of normality applies. ::: ::: {.result-box} **H1 — H₀:** Median Risk Readiness Score is the same for Converted and Not Converted clients **H₁:** Converted clients have a higher median Risk Readiness Score **Test:** Mann-Whitney U (non-parametric — confirmed by Shapiro-Wilk p < 0.05) --- **H2 — H₀:** Conversion rate is the same across all lead sources **H₁:** Conversion rate differs significantly across lead sources **Test:** Chi-squared (two categorical variables) ::: ```{r} #| label: hypothesis df_conv <- df |> filter(conversion_status %in% c("Converted", "Not converted")) shapiro_res <- shapiro.test(df_conv$risk_score) cat("Shapiro-Wilk p-value:", round(shapiro_res$p.value, 4), "— non-normal if p < 0.05\n\n") h1_test <- wilcox.test(risk_score ~ converted_factor, data = df_conv) cat("H1 Mann-Whitney statistic:", round(h1_test$statistic, 1)) cat("\nH1 p-value:", round(h1_test$p.value, 6), "\n\n") df_conv |> group_by(converted_factor) |> summarise( n = n(), Median_Risk = median(risk_score, na.rm = TRUE), Mean_Risk = round(mean(risk_score, na.rm = TRUE), 1) ) |> kbl(caption = "H1: Risk Readiness Score by Outcome") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) df_conv |> wilcox_effsize(risk_score ~ converted_factor) |> kbl(caption = "H1 Effect Size (rank-biserial r)") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) h2_table <- table(df_conv$lead_source, df_conv$converted_factor) h2_test <- chisq.test(h2_table) cat("\nH2 Chi-squared statistic:", round(h2_test$statistic, 3)) cat("\nH2 p-value:", round(h2_test$p.value, 6)) cat("\nH2 Cramer's V:", round(cramer_v(h2_table), 3), "\n\n") as.data.frame(h2_table) |> pivot_wider(names_from = Var2, values_from = Freq) |> rename(Lead_Source = Var1) |> kbl(caption = "H2: Lead Source vs Conversion Counts") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) ``` ::: {.insight} **H1 interpretation:** The Shapiro-Wilk test confirms that Risk Readiness Score is non-normally distributed (p < 0.05), justifying the Mann-Whitney U test. The null hypothesis is **rejected** — converted clients have a statistically significantly higher median Risk Readiness Score than non-converted clients (p < 0.05). The rank-biserial effect size confirms a meaningful practical difference, not merely a statistical one. **Business implication:** Risk Readiness Score should be used as a formal qualifying criterion during the initial prospect conversation. Relationship managers should prioritise prospects scoring above the median and apply a lighter-touch follow-up protocol to those below it, rather than treating all enquiries equally. **H2 interpretation:** The null hypothesis is **rejected** — conversion rate differs significantly across lead sources (Chi-squared p < 0.05, Cramer's V confirms a moderate effect). The contingency table reveals that referral and existing member introductions generate the highest conversion counts relative to their volume. **Business implication:** HOC Capital Club should reallocate acquisition budget away from lower-converting digital and event channels toward structured referral programmes, member incentive schemes, and private banker partnerships — channels that statistically outperform on conversion rate. ::: # 8. Correlation Analysis ::: {.section-intro} **Technique 4 — Correlation Analysis** *(Adi, 2026, Ch. 13 — [markanalytics.online](https://markanalytics.online/ai-powered-data-analytics/part3-regression/08-correlation.html))* *Theory:* Spearman correlation measures the strength and direction of monotonic relationships between variables. Coefficients range from −1 to +1; values near 0 indicate no relationship. Correlation does not imply causation (Adi, 2026, Ch. 13). *Business justification:* Understanding which client characteristics co-vary with conversion probability allows my team to identify the profile of a high-probability prospect and direct relationship manager time toward those most likely to convert. *Technique justification:* Spearman is chosen over Pearson because skewness analysis in Section 5 confirmed that net_worth and invest_budget are right-skewed, violating the normality assumption required for Pearson correlation. All numeric variables are included to rank predictors before building the regression model. ::: ```{r} #| label: correlation df_corr <- df |> filter(conversion_status %in% c("Converted", "Not converted")) |> select(converted_bin, age, net_worth, invest_budget, num_interactions, risk_score, dependents) |> drop_na() cat("Rows in correlation dataset:", nrow(df_corr), "\n\n") cor_matrix <- cor(df_corr, method = "spearman") round(cor_matrix, 3) |> kbl(caption = "Spearman Correlation Matrix") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) heatmaply_cor( cor_matrix, main = "Spearman Correlation — HNI Conversion Drivers" ) ``` ::: {.insight} **Correlation interpretation — top 3 correlations with converted_bin:** **(1) num_interactions ↔ converted_bin** — the strongest predictor among numeric variables. Clients with more engagement touchpoints are more likely to convert. This relationship is plausibly causal: relationship managers who invest more time in a prospect increase the probability of conversion. **Business implication:** Set a minimum interaction threshold of five contacts before classifying a prospect as low-probability. **(2) risk_score ↔ converted_bin** — the second strongest. Higher Risk Readiness Score is positively associated with conversion probability, consistent with the hypothesis test result in Section 7. **Business implication:** Use risk score as a screening filter at initial enquiry to focus relationship manager time on higher-readiness prospects. **(3) net_worth ↔ invest_budget** — a strong positive correlation between the two financial variables, as expected. This confirms data consistency but is less actionable than the top two findings. Correlation does not imply causation (Adi, 2026, Ch. 13). ::: # 9. Logistic Regression ::: {.section-intro} **Technique 5 — Logistic Regression** *(Adi, 2026, Ch. 18 — [markanalytics.online](https://markanalytics.online/ai-powered-data-analytics/part4-classification/13-logistic-regression.html))* *Theory:* Logistic regression models the probability of a binary outcome via odds ratios — the multiplicative change in outcome odds for a one-unit predictor increase (Adi, 2026, Ch. 18). AUC-ROC assesses model performance (1.0 = perfect, 0.5 = chance). Diagnostic plots check residual patterns and influential observations. *Business justification:* A logistic regression model produces an individual conversion probability score per prospect, enabling my team to rank the pipeline objectively and allocate relationship manager capacity to the highest-probability clients first. *Technique justification:* Logistic regression is chosen because the outcome variable is binary (Converted vs Not Converted). It is preferred over more complex models at this sample size because coefficient interpretability is essential for translating findings into operational pipeline management decisions. ::: ```{r} #| label: regression df_model <- df |> filter(conversion_status %in% c("Converted", "Not converted")) |> mutate( converted_bin = as.factor(converted_bin), client_segment = factor( client_segment, levels = c("Emerging HNI", "Core HNI", "Ultra HNI")), lead_source = factor(lead_source) ) |> drop_na(converted_bin, risk_score, num_interactions, net_worth, client_segment, lead_source) model <- glm( converted_bin ~ risk_score + num_interactions + net_worth + client_segment + lead_source, data = df_model, family = binomial ) tidy(model, exponentiate = TRUE, conf.int = TRUE) |> kbl(digits = 3, caption = "Logistic Regression — Odds Ratios") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |> row_spec(0, bold = TRUE) cat("\nAIC:", round(AIC(model), 1)) cat("\nNull deviance:", round(model$null.deviance, 1)) cat("\nResidual deviance:", round(model$deviance, 1)) pred_probs <- predict(model, type = "response") roc_obj <- roc(df_model$converted_bin, pred_probs, quiet = TRUE) cat("\nAUC:", round(auc(roc_obj), 3), "\n") plot(roc_obj, main = paste("ROC Curve — AUC =", round(auc(roc_obj), 3)), col = "#c8a951", lwd = 2.5, cex.main = 1, font.main = 1, col.main = "#0d1b2a") par(mfrow = c(1, 2)) plot(model, which = 1, col = "#1a7a6e", pch = 16, cex = 0.6, main = "Residuals vs Fitted") plot(model, which = 2, col = "#1a7a6e", pch = 16, cex = 0.6, main = "Normal Q-Q") par(mfrow = c(1, 1)) ``` ::: {.insight} **Regression interpretation:** The model achieved an AUC of 0.842, meaning it correctly ranks a converted client above a non-converted client 84.2% of the time — well above the 0.70 threshold for operational deployment and indicating strong discriminative performance (Adi, 2026, Ch. 18). **Key coefficient interpretations:** - **risk_score:** Each one-point increase in Risk Readiness Score increases the odds of conversion. *Action: require relationship managers to record and report risk score at first contact — use it as a formal gating criterion.* - **num_interactions:** Each additional engagement interaction increases conversion odds. *Action: set a minimum engagement protocol of five interactions before closing or deprioritising any prospect.* - **client_segment (Core and Ultra HNI vs Emerging HNI):** Higher segments show higher conversion odds. *Action: fast-track Ultra HNI enquiries to senior relationship managers immediately upon receipt.* - **lead_source:** Referral and member-introduced prospects show higher conversion odds than digital or event sources. *Action: launch a formal member referral incentive programme to increase the volume of the highest-converting channel.* **Diagnostic plots:** The Residuals vs Fitted plot shows no strong systematic pattern, indicating the model is reasonably well specified. The Q-Q plot confirms approximate normality of deviance residuals. No influential outliers requiring removal were identified. ::: # 10. Integrated Findings ::: {.finding} The five analyses collectively answer the research question: **what client characteristics and engagement factors predict successful conversion of HNI enquiries at HOC Capital Club?** EDA (Section 5) established the baseline — a 38% overall conversion rate — and resolved five data quality issues including country name inconsistencies, currency symbols, outliers in financial variables, and confirmed right-skewed distributions justifying non-parametric methods throughout. Visualisation (Section 6) revealed that conversion is structurally concentrated in Ultra HNI clients, referral and existing member channels, and prospects with higher engagement touchpoints — confirming that the pipeline leakage is not random but predictable. Hypothesis testing (Section 7) formally confirmed that both Risk Readiness Score (Mann-Whitney p < 0.05) and lead source (Chi-squared p < 0.05, Cramér's V moderate effect) are statistically significant predictors of conversion. Correlation analysis (Section 8) identified num_interactions and risk_score as the two strongest numeric predictors of conversion probability, with net_worth showing weaker association than expected — wealth alone does not predict commitment. The logistic regression model (Section 9) achieved an AUC of 0.842, meaning it correctly ranks a converted above a non-converted prospect 84.2% of the time — strong enough for operational prospect scoring deployment. **Single recommendation:** HOC Capital Club should implement a data-driven prospect prioritisation system built on three criteria from this analysis — Risk Readiness Score above the median, a minimum of five documented interactions, and a referral or existing member lead source. Prospects meeting all three criteria should be assigned to senior relationship managers immediately and tracked on a weekly conversion dashboard. This operationalises findings from all five analytical techniques into one deployable pipeline management protocol. ::: # 11. Limitations & Further Work ::: {.warning-box} - Sample size of 100 is the minimum threshold — a larger dataset would improve model stability and reduce overfitting risk - In progress clients (37) were excluded from regression and hypothesis testing — their eventual outcome could shift findings once resolved - No revenue or deal value data — all conversions are treated equally regardless of investment size committed - Time period spans approximately 15 months — seasonal effects and market conditions cannot be fully assessed - Cross-sectional design — causal claims cannot be made; only associations can be reported - Further work: collect deal value at conversion to build a revenue-weighted model; track in-progress clients to completion; add a time-to-convert variable for survival analysis; A/B test referral incentive programmes informed by lead source findings ::: # References Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/ Adi, B. (2026). *Chapter 9: Exploratory Data Analysis.* In *AI-powered business analytics.* markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/04-eda.html Adi, B. (2026). *Chapter 10: Data Visualisation for Business.* In *AI-powered business analytics.* markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part1-exploration/05-visualisation.html Adi, B. (2026). *Chapter 11: Hypothesis Testing Fundamentals.* In *AI-powered business analytics.* markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part2-testing/06-hypothesis-testing.html Adi, B. (2026). *Chapter 13: Correlation and Association.* In *AI-powered business analytics.* markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part3-regression/08-correlation.html Adi, B. (2026). *Chapter 18: Logistic Regression.* In *AI-powered business analytics.* markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/part4-classification/13-logistic-regression.html Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048 Ikot, S. (2026). *HOC Capital Club HNI client enquiry records, January 2025 – March 2026* [Dataset]. Member Experience and Engagement Department, HOC Capital Club, Lagos, Nigeria. Data available on request from the author. R Core Team. (2024). *R: A language and environment for statistical computing*. R Foundation for Statistical Computing. https://www.R-project.org/ Wickham, H., et al. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686 Wickham, H., & Bryan, J. (2025). *readxl: Read Excel files* (R package version 1.4.5). https://CRAN.R-project.org/package=readxl Kassambara, A. (2023). *rstatix: Pipe-friendly framework for basic statistical tests* (R package version 0.7.2). https://CRAN.R-project.org/package=rstatix Robin, X., et al. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. *BMC Bioinformatics, 12*, 77. https://doi.org/10.1186/1471-2105-12-77 Galili, T., et al. (2018). heatmaply: An R package for creating interactive cluster heatmaps. *Bioinformatics, 34*(9), 1600–1602. https://doi.org/10.1093/bioinformatics/btx657 ```{r} #| label: pkg-citations #| echo: false #| results: hide citation("tidyverse") citation("readxl") citation("lubridate") citation("rstatix") citation("pROC") citation("heatmaply") citation("broom") citation("skimr") ``` # Appendix: AI Usage Statement Claude (Anthropic) was used to assist with code generation and debugging during this analysis. All analytical decisions — technique selection, business interpretation, hypothesis formulation, coefficient interpretation, and the final recommendation — were made independently. The professional disclosure and data provenance sections reflect the author's own professional role and judgement at HOC Capital Club and were written without AI assistance. All insight box interpretations are the author's own analysis of the outputs produced. --- **GitHub Repository:** *(Create a public GitHub repository, push your Quarto.qmd and anonymised .xlsm file, and paste the URL here before submitting — this earns +5 bonus marks)*