Exploratory and Inferential Analysis of Sales Pipeline Performance — Dochase ADX
Case Study 1 — Data Analytics II | Lagos Business School
Executive Summary
Dochase ADX is a Nigerian B2B advertising-technology company operating across programmatic display, Rich SMS, and USSD channels. This report analyses 135 live CRM deal records extracted from Zoho CRM (January–December 2025) to answer one operational question: what drives deal value in our sales pipeline, and where should management intervene?
Five analytical techniques are applied — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — forming a progressive ladder from description to inference to quantified prediction.
Key findings: deal amounts are severely right-skewed (median NGN 1.72M; maximum NGN 1B); deal values differ significantly across pipeline stages under the rank-based Kruskal-Wallis test (p < 0.001), although the one-way ANOVA on raw amounts falls short of significance (p = 0.074); Stage Rank is the dominant predictor in the log-transformed regression (each stage advancement is associated with roughly a 92% increase in expected deal value); deal age alone does not predict deal value.
Single recommendation: Dochase ADX management should implement a stage-prioritisation framework, with weekly executive review of Negotiation/Review deals and automated CRM alerts for deals stagnant longer than 30 days in any mid-pipeline stage. Stage management — not time elapsed — is the evidence-backed lever for revenue improvement.
0.1 Reproducibility & GitHub Repository
The full project workflow, including the Quarto report source, R scripts, Python integration, and supporting assets, is publicly available on GitHub for reproducibility and version control purposes:
1 Professional Disclosure
1.1 Job Role and Organisational Context
I am the Brand Growth and Communications Manager at Dochase ADX, a technology company headquartered in Lagos, Nigeria, that operates in the programmatic advertising, Rich SMS, and USSD sectors across Sub-Saharan Africa. My role spans brand strategy, lead generation, client communications, and revenue marketing — placing me at the intersection of marketing activity and commercial pipeline outcomes.
Because I work directly with the sales organisation and have authorised access to our Zoho CRM system, I am uniquely positioned to collect, interpret, and act on pipeline data. The five analytical techniques chosen below are not selected for academic convenience — they map directly to the decisions I face in managing growth.
1.2 Technique Justification by Operational Relevance
2 Data Preparation and Validation
2.1 Python Validation of CRM Dataset
To ensure data integrity before statistical modelling, Python was used for preliminary validation. This step confirms dataset structure, identifies missing values, and visualises the distribution of deal amounts to detect skewness and potential outliers prior to formal inferential analysis.
2.2 Key Observations from Python Validation
- The dataset contains 135 CRM deal records with 11 variables.
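- Among the variables used in the analysis, missing values occur only in Amount (16 records, 11.9%), which points to early-stage deals with no pricing attached. The Contact Name fields are also incomplete (38.5%) but are excluded from the analysis.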
- Deal amounts are heavily right-skewed, with a small number of high-value outliers.
- Log transformation is therefore necessary for reliable regression modelling.
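The arithmetic behind these observations can be sketched in a few lines of standard-library Python. The counts are the ones reported above, the Amount range is the minimum and maximum given later in the data description, and the variable names are illustrative; this is not the validation script used for the report.

```python
import math

# Counts reported in the observations above
n_records = 135
n_missing_amount = 16

# Share of records with no Amount recorded
missing_pct = round(100 * n_missing_amount / n_records, 1)
print(f"Amount missing: {missing_pct}%")

# log1p(x) = log(1 + x) stays defined at zero and compresses the right tail.
# The reported Amount range (0 to 1,000,000,000 NGN) shrinks to about 0 to 20.7.
log_min = math.log1p(0)
log_max = math.log1p(1_000_000_000)
print(f"log1p range: {log_min:.1f} to {log_max:.1f}")
```

On the log scale the NGN 1B deal is about 20.7 units from zero rather than a billion, which is why a log-transformed outcome is less sensitive to the two extreme campaigns.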
1. Exploratory Data Analysis (EDA): Before any investment in marketing spend or sales resource is justified, I need to understand the baseline state of our pipeline. EDA allows me to quantify how many deals are active, where value is concentrated, which data fields are incomplete, and whether the distribution of deal amounts is normal or skewed. This directly informs how I frame performance targets and how I report to the executive team.
2. Data Visualisation: The CFO, Sales Director, and CEO consume insights through charts, not tables of p-values. My role requires me to communicate pipeline health visually and compellingly. A well-designed set of five charts can replace a 20-page written report in a monthly business review. Visualisation is not supplementary to my analysis — it is the primary communication channel for this work.
3. Hypothesis Testing: Marketing investment decisions — campaign budgets, sales headcount allocation, stage-specific training programmes — require evidence that observed differences are real. I need to know whether the higher average deal value I observe in Negotiation/Review compared to Qualification is a genuine structural difference or a coincidence of sampling. ANOVA provides that evidence formally.
4. Correlation Analysis: I regularly propose which KPIs to track on our sales dashboard. Correlation analysis tells me which pairs of metrics move together — enabling me to select a small number of leading indicators that reliably predict the outcomes that matter. If deal age and deal value are weakly correlated, I can stop monitoring pipeline age as a performance proxy.
5. Linear Regression: Regression is the bridge from “these variables are associated” to “here is a quantified, directional forecast.” The ability to tell the Sales Director “each stage advancement is associated with an X% increase in expected deal value” is far more actionable than a correlation coefficient. It also enables scenario planning: what revenue would we expect if we advanced 10 more deals from Qualification to Proposal this quarter?
3 Data Collection & Sampling
3.1 Data Source and Collection Method
The dataset was extracted directly from Zoho CRM — Dochase ADX’s live customer relationship management system — by the analyst on 29 April 2026 using the CRM’s native CSV export function. No intermediary cleaning was applied before export; the raw file is used as-is in this analysis with all transformations documented in Section 4.
The extraction captured all deal records in the system with a scheduled closing date falling within the calendar year 2025. This boundary was chosen to reflect a complete operational year and to exclude in-flight 2026 deals whose closing data would be incomplete.
3.2 Sampling Frame and Sample Size
| Attribute | Detail |
|---|---|
| Population | All commercial deal opportunities ever created in Dochase ADX’s Zoho CRM |
| Sampling frame | All deals with a closing date in calendar year 2025 |
| Sample size | 135 deal records |
| Sampling method | Census — all qualifying records extracted (no random sampling required) |
| Unit of observation | One CRM deal record |
| Time period | 1 January 2025 – 31 December 2025 |
| Extraction date | 29 April 2026 |
Because the extraction captured all 2025 deals rather than a random sample, the dataset is effectively a census of the 2025 pipeline. Statistical inference is still valid and meaningful here: the 135 deals can be treated as a sample from the broader process that generates Dochase ADX deals over time, and the goal is to infer properties of that generative process — not merely to describe these 135 records.
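The sample of 135 exceeds the minimum requirement of 100 observations specified in the assessment brief. With 9 pipeline stages, the average group size for ANOVA is 15, but the groups are very uneven (from 1 deal in Closed Lost to Competition to 49 in Closed Won). The Central Limit Theorem therefore offers only partial support for inference on the smaller groups, which is why the ANOVA in Section 7 is paired with a rank-based Kruskal-Wallis check.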
3.3 Variable Inventory
| Variable | Type | Description |
|---|---|---|
| Record Id | Character (ID) | Unique CRM deal identifier |
| Deal Name | Character | Name of the campaign or deal |
| Amount | Numeric (NGN) | Monetary value of the deal — outcome variable |
| Stage | Categorical (9 levels) | Current pipeline stage |
| Closing Date | Date (MM/DD/YYYY) | Scheduled or actual deal close date |
| Account Name | Categorical | Client organisation |
| Contact Name | Categorical | Named client contact (52 missing — structural) |
| Deal Owner | Categorical (7 levels) | Assigned sales representative |
The dataset contains 3 numeric variables (Amount, and two engineered: Deal_Age_Days, Stage_Rank), 2 categorical variables (Stage, Deal Owner), and 1 date variable (Closing Date) — satisfying the minimum data structure requirement.
3.4 Ethical Statement and Consent
The data is an internal operational export used under authorised CRM access granted as part of my employment at Dochase ADX.
- No external clients were contacted or surveyed; data collection involved no human subjects.
- Client names (Account Name) are retained in the internal analysis file but would be replaced with anonymised codes (Client_001, etc.) for any external publication.
- Contact name fields are excluded from all analysis — they are structurally incomplete and not required for any technique applied.
- No written external consent was required. The analyst confirms that use of this data for the academic purposes of this assessment is consistent with their professional responsibilities.
4 Data Description
4.1 Data Loading and Preparation
4.1.1 Qualitative Rationale for Preparation Choices
Three data quality issues were identified and addressed before analysis:
Issue 1 — Missing Amount values (n = 16, 11.9%): These likely reflect early-stage deals where no formal proposal has been issued. Median imputation (NGN 1,720,000) was applied. The median was chosen over the mean because the Amount distribution is severely right-skewed — the mean (≈ NGN 14.4M) is inflated by two extreme campaigns (NGN 348M and NGN 1B) and would represent a poor substitute for a typical missing value. Row deletion was rejected as it would bias the remaining sample toward more advanced stages.
Issue 2 — Date format (MM/DD/YYYY): The CRM exports dates in American format, so mdy() from lubridate must be used. Parsing with dmy() would return NA for every date whose day exceeds 12 and would silently swap day and month for the rest, corrupting the temporal analysis.
Issue 3 — Stage encoding: Pipeline stages are nominal strings. A Stage_Rank variable (0–7) was engineered to enable regression and correlation: 0 = Closed Lost outcomes, 1–6 = active progression, 7 = Closed Won.
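The three preparation steps can be sketched as follows in standard-library Python. The three deal records are invented for illustration, and the ordering of the six active stages is an assumption taken from the order in which the report's stage table lists them; it is not the report's own R code.

```python
from datetime import datetime
from statistics import median

# Hypothetical mini-extract: field names mirror the CRM export, values are invented
deals = [
    {"Amount": 500_000, "Stage": "Qualification", "Closing Date": "03/15/2025"},
    {"Amount": None, "Stage": "Needs Analysis", "Closing Date": "07/04/2025"},
    {"Amount": 3_000_000, "Stage": "Closed Won", "Closing Date": "11/30/2025"},
]

# Issue 1: median imputation for missing Amount values
fill_value = median(d["Amount"] for d in deals if d["Amount"] is not None)
for d in deals:
    if d["Amount"] is None:
        d["Amount"] = fill_value

# Issue 2: parse MM/DD/YYYY explicitly, month first
for d in deals:
    d["Closing Date"] = datetime.strptime(d["Closing Date"], "%m/%d/%Y").date()

# Issue 3: Stage_Rank lookup (active-stage order is an assumption, see lead-in)
STAGE_RANK = {
    "Closed Lost": 0,
    "Closed Lost to Competition": 0,
    "Qualification": 1,
    "Needs Analysis": 2,
    "Value Proposition": 3,
    "Identify Decision Makers": 4,
    "Proposal/Price Quote": 5,
    "Negotiation/Review": 6,
    "Closed Won": 7,
}
for d in deals:
    d["Stage_Rank"] = STAGE_RANK[d["Stage"]]

# Sanity check of the assumed ordering against the reported stage counts
stage_counts = {
    "Qualification": 17, "Needs Analysis": 12, "Value Proposition": 6,
    "Identify Decision Makers": 5, "Proposal/Price Quote": 9,
    "Negotiation/Review": 24, "Closed Won": 49, "Closed Lost": 12,
    "Closed Lost to Competition": 1,
}
mean_rank = sum(STAGE_RANK[s] * n for s, n in stage_counts.items()) / sum(stage_counts.values())
print(round(mean_rank, 2))
```

With the stage counts reported in the descriptive tables, this assumed ordering reproduces the reported mean Stage_Rank of 4.53, which suggests the mapping is at least consistent with the one used. Swapping the format string to `"%d/%m/%Y"` shows the date problem directly: `"03/15/2025"` fails to parse, while `"07/04/2025"` is read as 7 April without any error.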
| Table 0a: Raw Dataset Profile | |
| Zoho CRM Export — 29 April 2026 | |
| Attribute | Value |
|---|---|
| Total Records | 135 |
| Total Variables | 11 |
| Amount — Missing (n) | 16 |
| Amount — Missing (%) | 11.9% |
| Date Format (Closing Date) | MM/DD/YYYY (American CRM format) |
| Table 0b: Cleaned Dataset Profile | |
| Post-wrangling snapshot — all transformations applied | |
| Attribute | Value |
|---|---|
| Records (post-clean) | 135 records |
| Variables (engineered) | 11 columns |
| Closing Date — Earliest | 08 Aug 2024 |
| Closing Date — Latest | 04 Dec 2026 |
| Amount — Minimum (NGN) | 0 |
| Amount — Maximum (NGN) | 1,000,000,000 |
| Imputation Applied | Median (NGN 1,720,000) for 16 missing Amount values |
4.2 Missing Value Audit
| Table 1: Missing Value Audit — Raw CRM Export | |||
| Dochase ADX Zoho CRM | Export: 29 April 2026 | |||
| Variable | Missing (n) | Missing (%) | Treatment |
|---|---|---|---|
| Record Id | 0 | 0.0 | No action required |
| Deal Name | 0 | 0.0 | No action required |
| Amount | 16 | 11.9 | Median imputation — NGN 1,720,000 substituted for 16 records |
| Stage | 0 | 0.0 | No action required |
| Closing Date | 0 | 0.0 | No action required |
| Account Name.id | 0 | 0.0 | No action required |
| Account Name | 0 | 0.0 | No action required |
| Contact Name.id | 52 | 38.5 | Excluded — structural missing; variable not used in analysis |
| Contact Name | 52 | 38.5 | Excluded — structural missing; variable not used in analysis |
| Deal Owner.id | 0 | 0.0 | No action required |
| Deal Owner | 0 | 0.0 | No action required |
5 Technique 1 — Exploratory Data Analysis (EDA)
5.1 Theory Recap
Exploratory Data Analysis, formalised by Tukey (1977), is the practice of interrogating a dataset through summary statistics and visualisation before imposing model assumptions. The core goals are: characterise distributions (central tendency, spread, shape), identify data quality issues (missing values, outliers, encoding errors), and surface patterns that motivate formal hypotheses. Key tools include the five-number summary, interquartile range (IQR), skewness coefficients, and boxplots. Anscombe’s Quartet (Anscombe, 1973) famously demonstrated that identical summary statistics can describe radically different data shapes — making visual EDA non-negotiable before inference.
5.2 Business Justification
In the Dochase ADX context, EDA serves a gatekeeping function: before the CFO or Sales Director acts on any dashboard number, the analyst must verify that the underlying data is fit for purpose. The 16 missing Amount values, if left unaddressed, would silently bias every aggregate metric. The two extreme outlier campaigns would inflate the mean deal value to a level that misrepresents the typical commercial opportunity. EDA makes these issues visible and tractable.
5.3 Descriptive Statistics
| Table 2: Descriptive Statistics — Numeric Variables | ||||||||
| After median imputation of 16 missing Amount values | ||||||||
| Variable | N | Mean | Median | SD | Min | Max | IQR | Skewness¹ |
|---|---|---|---|---|---|---|---|---|
| Amount | 135 | 12,936,574.66 | 1,720,000.00 | 90,773,544.10 | 0.00 | 1,000,000,000.00 | 2,500,000.00 | 0.12 |
| Deal_Age_Days | 135 | 185.91 | 71.00 | 206.86 | 1.00 | 639.00 | 212.50 | 0.56 |
| Stage_Rank | 135 | 4.53 | 6.00 | 2.63 | 0.00 | 7.00 | 5.00 | −0.56 |
| ¹ Skewness = (Mean − Median) / SD, the nonparametric skew; Pearson's 2nd skewness coefficient is three times this value | ||||||||
| Table 3: Deal Volume & Amount by Pipeline Stage | ||||||
| All 135 records after imputation | ||||||
| Stage | n | % | Min | Median | Mean | Max |
|---|---|---|---|---|---|---|
| Qualification | 17 | 12.6 | 1 | 1,000,000 | 22,542,941 | 348,000,000 |
| Needs Analysis | 12 | 8.9 | 0 | 50,000 | 665,000 | 1,720,000 |
| Value Proposition | 6 | 4.4 | 1 | 895,000 | 2,131,667 | 8,000,000 |
| Identify Decision Makers | 5 | 3.7 | 1 | 45,000 | 421,000 | 2,000,000 |
| Proposal/Price Quote | 9 | 6.7 | 25,000 | 3,000,000 | 119,718,333 | 1,000,000,000 |
| Negotiation/Review | 24 | 17.8 | 70,000 | 2,250,000 | 3,499,583 | 12,000,000 |
| Closed Won | 49 | 36.3 | 6 | 2,000,000 | 3,334,542 | 15,000,000 |
| Closed Lost | 12 | 8.9 | 1 | 1 | 1,040,417 | 5,000,000 |
| Closed Lost to Competition | 1 | 0.7 | 3,000,000 | 3,000,000 | 3,000,000 | 3,000,000 |
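The skewness column in Table 2 can be reproduced from the reported mean, median, and standard deviation. The sketch below uses the (Mean − Median) / SD formula given in the table footnote; the function name is illustrative.

```python
# (mean, median, sd) as reported in Table 2
table2 = {
    "Amount": (12_936_574.66, 1_720_000.00, 90_773_544.10),
    "Deal_Age_Days": (185.91, 71.00, 206.86),
    "Stage_Rank": (4.53, 6.00, 2.63),
}

def mean_median_skew(mean, median, sd):
    """Skewness measure used in Table 2: (mean - median) / sd."""
    return (mean - median) / sd

skew = {name: round(mean_median_skew(*vals), 2) for name, vals in table2.items()}
print(skew)
```

This measure is bounded between −1 and +1, so the value of 0.12 for Amount understates how heavy the right tail is. The gap between the mean (NGN 12.9M) and the median (NGN 1.72M) is the more telling figure.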
5.4 Outlier Detection
| Table 4a: IQR Outlier Detection — Fence Parameters | |
| All outliers retained; extreme values flagged for transparency | |
| Parameter | Value (NGN) |
|---|---|
| Q1 (25th percentile) | 500,000 |
| Q3 (75th percentile) | 3,000,000 |
| IQR | 2,500,000 |
| Upper Fence (Q3 + 1.5 × IQR) | 6,750,000 |
| Outliers Retained | 18 |
| Table 4: Outlier Deals — IQR Method (All Retained) | ||||
| Deal | Stage | Owner | Amount | Classification |
|---|---|---|---|---|
| Branding Betonly | Proposal/Price Quote | Chioma Eze | NGN 1.00B | Extreme (> NGN 100M) |
| LiveScorebet | Qualification | Chioma Eze | NGN 348M | Extreme (> NGN 100M) |
| Grandcereals Awareness Campaign | Proposal/Price Quote | Chioma Eze | NGN 60.0M | Moderate (> NGN 6.75M fence) |
| NIVEA MEN DEEP | Closed Won | Chioma Eze | NGN 15.0M | Moderate (> NGN 6.75M fence) |
| ALEND App campaign | Negotiation/Review | Abiodun Mudele | NGN 12.0M | Moderate (> NGN 6.75M fence) |
| Viju Dochase Campaign | Closed Won | Abiodun Mudele | NGN 12.0M | Moderate (> NGN 6.75M fence) |
| Mamvest App Campaign | Negotiation/Review | Abiodun Mudele | NGN 10.0M | Moderate (> NGN 6.75M fence) |
| Always-On BetaGist Campaign | Qualification | Chioma Eze | NGN 10.0M | Moderate (> NGN 6.75M fence) |
| MTN Summer Roaming Campign | Qualification | Chioma Eze | NGN 10.0M | Moderate (> NGN 6.75M fence) |
| Spade Registreation X FTD Campaign | Closed Won | Chioma Eze | NGN 9.10M | Moderate (> NGN 6.75M fence) |
| Smoov Campaign | Negotiation/Review | Abiodun Mudele | NGN 9.00M | Moderate (> NGN 6.75M fence) |
| Betwinner FTD Campaign March | Closed Won | Chioma Eze | NGN 9.00M | Moderate (> NGN 6.75M fence) |
| Vyrus Digital | Value Proposition | Uchenna | NGN 8.00M | Moderate (> NGN 6.75M fence) |
| Viju Google Campaign | Closed Won | Abiodun Mudele | NGN 8.00M | Moderate (> NGN 6.75M fence) |
| MAGGI Tales of Ramadan Feb"26 | Closed Won | Chioma Eze | NGN 8.00M | Moderate (> NGN 6.75M fence) |
| Mozzart Bet Campaign | Closed Won | Abiodun Mudele | NGN 7.00M | Moderate (> NGN 6.75M fence) |
| Peak Thematic Wave 1 March | Closed Won | Chioma Eze | NGN 7.00M | Moderate (> NGN 6.75M fence) |
| Viju Google Campaign | Negotiation/Review | Abiodun Mudele | NGN 6.80M | Moderate (> NGN 6.75M fence) |
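The fence in Table 4a follows directly from the two quartiles. The sketch below recomputes it and applies a two-level classification; the NGN 100M cut-off for "Extreme" is taken from the label in the outlier table, and the function name is illustrative.

```python
Q1 = 500_000      # 25th percentile, Table 4a
Q3 = 3_000_000    # 75th percentile, Table 4a

iqr = Q3 - Q1
upper_fence = Q3 + 1.5 * iqr  # Tukey's upper fence

EXTREME_THRESHOLD = 100_000_000  # "Extreme" cut-off used in the outlier table

def classify(amount):
    """Label a deal amount relative to the upper fence and the extreme cut-off."""
    if amount > EXTREME_THRESHOLD:
        return "Extreme"
    if amount > upper_fence:
        return "Moderate"
    return "Not an outlier"

print(upper_fence)
print(classify(1_000_000_000), classify(6_800_000), classify(1_720_000))
```

Every deal in the outlier table exceeds the NGN 6,750,000 fence, and the smallest (NGN 6.80M) clears it only narrowly. No lower fence is needed because Q1 − 1.5 × IQR is negative and amounts cannot fall below zero.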
5.5 Distribution Plots
5.6 Plain-Language Interpretation for a Non-Technical Manager
The typical Dochase ADX deal is worth approximately NGN 1.72 million (the median). The average looks much higher at about NGN 12.9 million (Table 2), but that figure is distorted by two exceptional campaigns, a NGN 348M opportunity and a NGN 1B campaign, both held by Chioma Eze. Three quarters of our deals are worth NGN 3 million or less (the third quartile in Table 4a). Closed Won is our largest stage by count (49 deals), which is encouraging. However, the data also shows 16 deal records with no monetary value attached; these need to be followed up with the relevant sales representatives to ensure proposals are being submitted and recorded promptly.
6 Technique 2 — Data Visualisation
6.1 Theory Recap
The Grammar of Graphics (Wilkinson, 2005), implemented in R through ggplot2 (Wickham, 2016), provides a compositional framework for constructing statistical charts by mapping data variables to visual aesthetics (position, colour, size, shape). Effective data visualisation requires matching the chart type to the data structure and the analytical question: histograms for distributions, boxplots for group comparisons, scatter plots for relationships, and line charts for temporal patterns. Tufte (2001) emphasises maximising the data-to-ink ratio — every element of a chart should carry information. Cairo (2016) adds the dimension of narrative: a good chart tells one clear story, not many confusing ones simultaneously.
6.2 Business Justification
Monthly business reviews at Dochase ADX are attended by the CEO, Sales Director, and CFO. These stakeholders require chart-based summaries of pipeline health that can be absorbed in under two minutes. The five plots below are designed as a cohesive narrative unit: they move from volume (how many deals?) to value (how much revenue?) to ownership (who holds it?) to time (when does it close?) to composition (what type of deals are they?).
6.3 Figure 4 — Pipeline Funnel: Deal Volume by Stage
Chart selection justification: A horizontal bar chart was chosen over a traditional triangular funnel diagram because the Dochase ADX pipeline is non-sequential — deals can be logged at any stage. A funnel chart would imply a linear flow that does not exist in the data, misleading management about conversion rates.
6.4 Figure 5 — Revenue Potential by Stage
6.5 Figure 6 — Deal Owner Performance
6.6 Figure 7 — Monthly Deal Volume Over Time
6.7 Figure 8 — Deal Size Composition by Stage
6.8 Plain-Language Interpretation for a Non-Technical Manager
Figures 4–8 together reveal that our pipeline is volume-rich but value-concentrated. We have 49 closed-won deals, which is healthy, but the vast majority of total potential revenue sits in just two campaigns. Chioma Eze is effectively running a separate enterprise-scale pipeline within the broader team. Monthly deal flow is reasonably consistent through 2025 with no dramatic seasonal cliff. The deal size composition chart shows that large deals appear even at the Qualification stage — meaning early-stage triage matters enormously for revenue forecasting accuracy.
7 Technique 3 — Hypothesis Testing
7.1 Theory Recap
Hypothesis testing is the formal framework for deciding whether observed patterns in sample data are likely to reflect real population-level effects, or whether they could plausibly arise from random sampling variation (Fisher, 1925; Neyman & Pearson, 1933). A null hypothesis (H₀) specifies no effect; an alternative hypothesis (H₁) specifies the pattern of interest. The p-value is the probability of observing data at least as extreme as the sample if H₀ were true — small p-values (conventionally p < 0.05) constitute evidence against H₀. Effect sizes (Cohen’s d, eta-squared η²) complement p-values by quantifying practical significance, which statistical significance alone cannot establish. When parametric assumptions (normality, homogeneity of variance) are violated, non-parametric tests — such as the Kruskal-Wallis test (the rank-based analogue of one-way ANOVA) — provide robust alternatives.
7.2 Business Justification
Before recommending stage-differentiated management interventions, I need to establish that the deal value differences I observe across pipeline stages are not a product of the particular 135 deals in this export. Formal hypothesis testing provides that assurance. If ANOVA confirms statistical significance, the Sales Director can act on stage-based KPIs with confidence. If the test is non-significant, the observed differences should not drive resource allocation decisions.
7.3 Hypothesis 1 — One-Way ANOVA: Does Deal Value Differ by Stage?
H₀: μ₁ = μ₂ = … = μₖ — Mean deal amount is equal across all pipeline stages
H₁: At least one stage has a different mean deal amount
Significance level: α = 0.05
7.3.1 Assumption 1: Normality (Anderson-Darling Test, per Stage)
| Table 5a: Anderson-Darling Normality Test by Stage (n >= 7) | ||||
| H0: data are normally distributed | Stages with fewer than 7 records excluded | ||||
| Pipeline Stage | n | AD Statistic | p-value | Normal? (p > 0.05)¹ |
|---|---|---|---|---|
| Closed Lost | 12 | 1.8625 | < 0.0001 | No |
| Closed Won | 49 | 3.0270 | < 0.0001 | No |
| Needs Analysis | 12 | 1.6795 | 0.0001 | No |
| Negotiation/Review | 24 | 1.7611 | 0.0001 | No |
| Proposal/Price Quote | 9 | 2.5813 | < 0.0001 | No |
| Qualification | 17 | 5.6085 | < 0.0001 | No |
| ¹ Stages failing normality (p < 0.05) confirm use of non-parametric Kruskal-Wallis as robustness check. | ||||
7.3.2 Assumption 2: Homogeneity of Variance (Levene’s Test)
| Table 5b: Levene's Test — Homogeneity of Variance | |||||
| Outcome determines whether standard ANOVA or Kruskal-Wallis is applied | |||||
| Test | F-value | Df (Between) | Df (Within) | p-value | Conclusion |
|---|---|---|---|---|---|
| Levene's Test for Homogeneity of Variance | 1.825 | 8 | 126 | 0.0782 | Variances homogeneous (p >= 0.05) — standard ANOVA is appropriate. |
7.3.3 One-Way ANOVA Result
| Table 5c: One-Way ANOVA — Deal Amount by Pipeline Stage | |||||
| Dependent variable: Amount (NGN) | alpha = 0.05 | |||||
| Source | df | SS | MS | F¹ | p-value |
|---|---|---|---|---|---|
| Stage_Factor | 8 | 1.159327e+17 | 1.449159e+16 | 1.8477 | 0.0742 |
| Residuals | 126 | 9.882053e+17 | 7.842899e+15 | | |
| ¹ Eta-squared (η²) = 0.105, a medium practical effect (Cohen, 1988) | |||||
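The F statistic and eta-squared follow from the sums of squares in Table 5c. A minimal sketch of that arithmetic, using only the values printed in the table:

```python
# Sums of squares and degrees of freedom from Table 5c
ss_between, df_between = 1.159327e17, 8
ss_within, df_within = 9.882053e17, 126

# Mean squares are sums of squares divided by their degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-stage to within-stage mean squares
f_stat = ms_between / ms_within

# Eta-squared is the share of total variation attributable to stage
eta_sq = ss_between / (ss_between + ss_within)

print(round(f_stat, 4), round(eta_sq, 3))
```

The two figures answer different questions. F = 1.85 with p = 0.074 does not clear the α = 0.05 threshold, while η² = 0.105 says stage accounts for about a tenth of the raw variation in Amount.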
7.3.4 Non-Parametric Robustness: Kruskal-Wallis Test
| Table 5d: Kruskal-Wallis Non-Parametric Robustness Check | ||||
| Rank-based analogue of one-way ANOVA — no distributional assumptions required | ||||
| Test | H-statistic | df | p-value | Decision |
|---|---|---|---|---|
| Kruskal-Wallis Rank Sum Test | 33.6479 | 8 | 4.7e-05 | Reject H0 — significant group differences (p < 0.05) |
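Under H0 the Kruskal-Wallis H statistic is approximately chi-square distributed with k − 1 degrees of freedom (here 9 − 1 = 8). The sketch below recovers the p-value in Table 5d from the reported H, using the closed-form chi-square tail that exists for even degrees of freedom. The helper name is illustrative, and in practice a statistics library would be used instead.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^j / j! for j = 0 .. df/2 - 1
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

h_statistic = 33.6479  # Table 5d
p_value = chi2_sf_even_df(h_statistic, 8)
print(f"{p_value:.1e}")
```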
7.3.5 Post-Hoc: Dunn’s Test (Bonferroni-Corrected)
| Table 5e: Dunn Post-Hoc Test — Pairwise Comparisons (Bonferroni-Corrected) | ||||
| Only pairs with p.adjusted < 0.05 constitute statistically significant differences | ||||
| Stage Pair | Z Statistic | p (raw) | p (Bonferroni)¹ | Significant? |
|---|---|---|---|---|
| Needs Analysis - Negotiation/Review | -3.6926 | 0.0001 | 0.0040 | Yes |
| Closed Lost - Negotiation/Review | -3.4841 | 0.0002 | 0.0089 | Yes |
| Closed Won - Needs Analysis | 3.4370 | 0.0003 | 0.0106 | Yes |
| Needs Analysis - Proposal/Price Quote | -3.3653 | 0.0004 | 0.0138 | Yes |
| Closed Lost - Closed Won | -3.2081 | 0.0007 | 0.0240 | Yes |
| Closed Lost - Proposal/Price Quote | -3.1981 | 0.0007 | 0.0249 | Yes |
| Identify Decision Makers - Negotiation/Review | -2.8257 | 0.0024 | 0.0849 | No |
| Identify Decision Makers - Proposal/Price Quote | -2.8103 | 0.0025 | 0.0891 | No |
| Closed Won - Identify Decision Makers | 2.5359 | 0.0056 | 0.2019 | No |
| Negotiation/Review - Qualification | 1.9525 | 0.0254 | 0.9157 | No |
| Proposal/Price Quote - Qualification | 1.9343 | 0.0265 | 0.9554 | No |
| Closed Lost - Closed Lost to Competition | -1.5068 | 0.0659 | 1.0000 | No |
| Closed Lost to Competition - Closed Won | 0.5297 | 0.2982 | 1.0000 | No |
| Closed Lost - Identify Decision Makers | 0.2954 | 0.3838 | 1.0000 | No |
| Closed Lost to Competition - Identify Decision Makers | 1.5753 | 0.0576 | 1.0000 | No |
| Closed Lost - Needs Analysis | 0.1806 | 0.4284 | 1.0000 | No |
| Closed Lost to Competition - Needs Analysis | 1.5777 | 0.0573 | 1.0000 | No |
| Identify Decision Makers - Needs Analysis | -0.1570 | 0.4376 | 1.0000 | No |
| Closed Lost to Competition - Negotiation/Review | 0.3297 | 0.3708 | 1.0000 | No |
| Closed Won - Negotiation/Review | -0.7968 | 0.2128 | 1.0000 | No |
| Closed Lost to Competition - Proposal/Price Quote | 0.1500 | 0.4404 | 1.0000 | No |
| Closed Won - Proposal/Price Quote | -1.0394 | 0.1493 | 1.0000 | No |
| Negotiation/Review - Proposal/Price Quote | -0.4565 | 0.3240 | 1.0000 | No |
| Closed Lost - Qualification | -1.6255 | 0.0520 | 1.0000 | No |
| Closed Lost to Competition - Qualification | 0.9286 | 0.1766 | 1.0000 | No |
| Closed Won - Qualification | 1.4937 | 0.0676 | 1.0000 | No |
| Identify Decision Makers - Qualification | -1.5138 | 0.0650 | 1.0000 | No |
| Needs Analysis - Qualification | -1.8210 | 0.0343 | 1.0000 | No |
| Closed Lost - Value Proposition | -0.9829 | 0.1628 | 1.0000 | No |
| Closed Lost to Competition - Value Proposition | 0.9970 | 0.1594 | 1.0000 | No |
| Closed Won - Value Proposition | 1.2528 | 0.1051 | 1.0000 | No |
| Identify Decision Makers - Value Proposition | -1.0713 | 0.1420 | 1.0000 | No |
| Needs Analysis - Value Proposition | -1.1303 | 0.1292 | 1.0000 | No |
| Negotiation/Review - Value Proposition | 1.6221 | 0.0524 | 1.0000 | No |
| Proposal/Price Quote - Value Proposition | 1.7433 | 0.0406 | 1.0000 | No |
| Qualification - Value Proposition | 0.2557 | 0.3991 | 1.0000 | No |
| 1 Bonferroni correction is conservative for large numbers of pairwise comparisons. | ||||
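The Bonferroni column can be reproduced from the Z statistics. With 9 stages there are 36 pairwise comparisons, and each raw p-value is multiplied by 36 and capped at 1. The raw p-values in Table 5e are consistent with one-sided tail areas; that is an inference from the printed numbers, not something the report states, so the sketch below treats it as an assumption.

```python
import math

N_STAGES = 9
n_pairs = math.comb(N_STAGES, 2)  # number of pairwise comparisons

def one_sided_p(z):
    """Upper-tail area of the standard normal beyond |z| (assumed one-sided)."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))

def bonferroni(p_raw, m):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_raw * m)

examples = {
    "Needs Analysis - Negotiation/Review": -3.6926,
    "Closed Lost - Negotiation/Review": -3.4841,
    "Qualification - Value Proposition": 0.2557,
}
adjusted = {pair: bonferroni(one_sided_p(z), n_pairs) for pair, z in examples.items()}
for pair, p_adj in adjusted.items():
    print(f"{pair}: {p_adj:.4f}")
```

Multiplying by 36 is what pushes pairs such as Identify Decision Makers versus Negotiation/Review (raw p = 0.0024) above 0.05. With only 5 deals in that stage, the test has little power to survive the correction.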
Figure 9: Mean Deal Amount by Stage with 95% CI
7.4 Hypothesis 2 — Correlation Test: Does Deal Age Predict Deal Value?
H₀: ρ = 0 — No linear relationship between Deal Age and Amount
H₁: ρ ≠ 0 — A significant relationship exists
Tests: Pearson (parametric) + Spearman (non-parametric robustness check)
| Table 5f: Hypothesis 2 — Correlation Tests: Deal Age vs Deal Amount | |||||||
| H0: rho = 0 (no relationship) | H1: rho != 0 | alpha = 0.05 | |||||||
| Test | Test Statistic | r / rho | df / n | p-value | 95% CI | Significant? | Decision |
|---|---|---|---|---|---|---|---|
| Pearson Product-Moment Correlation | t = 0.005 | 0.0004 | 133 | 0.9960 | [-0.169, 0.169] | No | Fail to reject H0 — no significant linear relationship |
| Spearman Rank Correlation (non-parametric) | rho = 0.0221 | 0.0221 | 135 | 0.7994 | N/A (rank-based) | No | Fail to reject H0 — no significant monotonic relationship |
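The Pearson row of Table 5f can be checked from r and n alone. The sketch below computes the t statistic and a 95% confidence interval via the Fisher z-transform. Treating the interval as Fisher-based is an assumption, though the result matches the table to three decimals.

```python
import math

def pearson_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for rho via the Fisher z-transform."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

r, n = 0.0004, 135  # Table 5f, Pearson row
t_stat = pearson_t(r, n)
lo, hi = fisher_ci(r, n)
print(round(t_stat, 3), round(lo, 3), round(hi, 3))
```

The interval is almost symmetric around zero and spans about ±0.17, so even a true correlation at either end would be weak by the usual benchmarks.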
7.5 Plain-Language Interpretation for a Non-Technical Manager
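Hypothesis 1 result: The two tests do not fully agree, but the more appropriate one is clear. The one-way ANOVA on raw amounts falls just short of significance (F = 1.85, p = 0.074), which is unsurprising given that every tested stage fails the normality check and two extreme deals likely inflate the within-stage variance. The rank-based Kruskal-Wallis test, which does not rely on those assumptions, rejects equality across stages (H = 33.65, p < 0.001). Dunn's post-hoc test finds six significantly different stage pairs, all contrasting Needs Analysis or Closed Lost with Negotiation/Review, Proposal/Price Quote, or Closed Won. This matters for management: typical deal value does differ by stage, which supports stage-specific coaching, review cadences, and incentive structures.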
Hypothesis 2 result: There is no significant relationship between how long a deal has been in the pipeline and how much it is worth. This is a critical negative finding: bigger deals do not simply take longer to close. Pipeline age alone is therefore a poor management proxy for deal quality or value.
8 Technique 4 — Correlation Analysis
8.1 Theory Recap
Correlation measures the strength and direction of the linear relationship between two numeric variables. The Pearson correlation coefficient (r) quantifies linear co-variation and ranges from −1 (perfect negative) to +1 (perfect positive); r = 0 indicates no linear relationship. For ordinal variables or non-normal distributions, Spearman’s rank correlation (ρ) provides a robust non-parametric alternative (Spearman, 1904). Correlation coefficients are interpreted by conventional benchmarks: |r| < 0.10 = negligible; 0.10–0.29 = weak; 0.30–0.49 = moderate; ≥ 0.50 = strong (Cohen, 1988). Critically, correlation does not imply causation — confounders may explain any observed association. Partial correlation controls for a third variable, isolating the pairwise relationship of interest.
8.2 Business Justification
My role requires me to propose which metrics appear on the sales management dashboard. A dashboard cluttered with 20 loosely related KPIs is worse than one with 4 well-chosen leading indicators. Correlation analysis identifies which operational variables reliably co-move, enabling me to select a parsimonious and predictive KPI set. If Stage_Rank is strongly correlated with Amount, it belongs on the dashboard. If Deal_Age_Days is weakly correlated with Amount, it does not add forecasting value and should be replaced by a more informative metric.
| Table 6: Pairwise Pearson Correlation — 95% CI Reported | ||||||
| Pair | r | p_value | CI_95 | Significant | Strength | Direction |
|---|---|---|---|---|---|---|
| Amount x Deal_Age_Days | 0.0004 | 0.9960 | [-0.169, 0.169] | No | Negligible | Positive |
| Amount x Stage_Rank | -0.0133 | 0.8787 | [-0.182, 0.156] | No | Negligible | Negative |
| Deal_Age_Days x Stage_Rank | 0.1828 | 0.0338 | [0.014, 0.341] | Yes | Weak | Positive |
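The Strength and Direction columns in Table 6 follow mechanically from r and the Cohen (1988) benchmarks quoted in the theory recap. A minimal sketch, with illustrative function names:

```python
def strength(r):
    """Label |r| using the benchmarks quoted in the theory recap."""
    a = abs(r)
    if a < 0.10:
        return "Negligible"
    if a < 0.30:
        return "Weak"
    if a < 0.50:
        return "Moderate"
    return "Strong"

def direction(r):
    """Sign of the correlation as reported in the Direction column."""
    return "Positive" if r > 0 else "Negative" if r < 0 else "None"

# r values from Table 6
pairs = {
    "Amount x Deal_Age_Days": 0.0004,
    "Amount x Stage_Rank": -0.0133,
    "Deal_Age_Days x Stage_Rank": 0.1828,
}
labels = {name: (strength(r), direction(r)) for name, r in pairs.items()}
for name, label in labels.items():
    print(name, label)
```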
8.3 Plain-Language Interpretation for a Non-Technical Manager
The correlation analysis produces three key business messages. First, on the raw NGN scale Stage Rank and Amount are not linearly correlated (r = −0.013, p = 0.879). This does not contradict the stage differences found in Section 7: Pearson's r on raw amounts is likely dominated by the two extreme deals, one of which sits at Qualification, and the stage effect only becomes visible in the rank-based tests and the log-transformed regression. Second, Deal Age and Amount show a negligible correlation (r = 0.0004, p = 0.996), confirming the Hypothesis 2 finding that time alone does not predict value. Third, Deal Age and Stage Rank are weakly but significantly positively correlated (r = 0.183, p = 0.034): deals at more advanced stages tend to have higher Deal_Age_Days, although the weak association suggests that progression speed varies considerably between deals. Where correlation is blank in the heatmap, the relationship is not statistically distinguishable from zero, and that metric pair should not be treated as predictive.
9 Technique 5 — Regression Analysis
9.1 Theory Recap
Ordinary Least Squares (OLS) linear regression models the relationship between a continuous outcome variable (Y) and one or more predictors (X₁, X₂, …, Xₖ) by minimising the sum of squared residuals. The estimated coefficients (β̂) represent the expected change in Y for a one-unit change in the corresponding X, holding all other variables constant. Four OLS assumptions must be checked: linearity, independence of errors, homoscedasticity (constant error variance), and normality of residuals. Violations can be diagnosed through residual plots (Residuals vs Fitted), Q-Q plots, and the Breusch-Pagan test for heteroscedasticity; Variance Inflation Factors (VIF) additionally screen for multicollinearity among predictors (Gujarati & Porter, 2009). When the outcome is right-skewed, as deal Amount is here, a log-transformation (log1p) stabilises variance and improves model validity. A coefficient β̂ from the log model then corresponds to a (exp(β̂) − 1) × 100% change in the outcome per one-unit change in the predictor.
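The percentage reading of a log-model coefficient can be sketched as follows. The coefficient used is the Stage_Rank estimate reported in the coefficient table (0.6506); the function name is illustrative. Because the outcome is log(1 + Amount), the percentage strictly applies to 1 + Amount, which is practically the same as Amount for deals worth thousands of naira or more.

```python
import math

def pct_change(beta):
    """Percentage change in the outcome implied by a log-model coefficient."""
    return (math.exp(beta) - 1) * 100

stage_beta = 0.6506  # Stage_Rank coefficient from the coefficient table

one_stage = pct_change(stage_beta)
three_stages = pct_change(3 * stage_beta)  # effects multiply across stages

print(f"one stage: {one_stage:.2f}%  three stages: {three_stages:.1f}%")
```

The effect compounds rather than adds: three stage advancements imply roughly a sevenfold expected value (about a 604% increase), not 3 × 92%.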
9.2 Business Justification
Correlation tells me which variables move together; regression tells me by how much and in which direction — and it does so while holding other variables constant. The regression model enables me to make a specific, quantified statement to the Sales Director: “Advancing a deal by one pipeline stage is associated with an X% increase in expected deal value, even after accounting for deal owner and pipeline age.” That is a forecasting and prioritisation statement, not merely an observation.
9.3 Model 1 — OLS on Raw Amount (Baseline)
Table 6a: Model 1, OLS (Raw Amount), Fit Statistics
Dependent variable: Amount (NGN). Baseline specification.
| R² | Adjusted R² | Residual SE (NGN) | F-statistic | p-value | Model df | n |
|---|---|---|---|---|---|---|
| 0.0597 | 0.0000 | 90773267 | 1.0001 | 0.4394 | 8 | 135 |
9.4 Model 2 — Log-Transformed OLS (Preferred Model)
Table 6b: Model 2, OLS (Log-Transformed Amount), Fit Statistics
Dependent variable: log(1 + Amount). Preferred specification.
| R² | Adjusted R² | Residual SE (log units) | F-statistic | p-value | Model df | n |
|---|---|---|---|---|---|---|
| 0.4261 | 0.3897 | 4.0517 | 11.6938 | < 0.0001 | 8 | 135 |
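The fit statistics in Tables 6a and 6b are linked by standard formulas, so they can be cross-checked from R², n, and the number of predictors alone. The sketch below assumes the reported df of 8 is the number of slope terms (Stage_Rank, Deal_Age_Days, and six owner dummies); the function names are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F-statistic for the null that all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


n, k = 135, 8  # sample size and model df as reported in Tables 6a and 6b
for label, r2 in [("Model 1 (raw)", 0.0597), ("Model 2 (log)", 0.4261)]:
    print(label, round(adjusted_r2(r2, n, k), 4), round(f_statistic(r2, n, k), 2))
```

Small differences from the tabulated values in the last decimal place are expected, because the inputs here are the rounded R² figures rather than the full-precision model output.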
9.5 Diagnostic Plots
Table 6c: Breusch-Pagan Test, Homoscedasticity of Residuals
Model 2: log(1 + Amount) specification.
| Test | BP Statistic | df | p-value | Conclusion |
|---|---|---|---|---|
| Breusch-Pagan Test (Homoscedasticity) | 31.1549 | 8 | 0.0001 | Heteroscedasticity detected (p < 0.05); consider robust standard errors |
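The p-value in Table 6c follows from comparing the BP statistic with a chi-square distribution on 8 degrees of freedom. As a hedged sketch, the upper-tail probability can be computed with the closed form that holds for even degrees of freedom; the function name is hypothetical, and this reproduces only the final lookup, not the test itself.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


p_value = chi2_sf_even_df(31.1549, 8)  # BP statistic and df from Table 6c
print(f"{p_value:.6f}")
print("heteroscedasticity detected" if p_value < 0.05
      else "no evidence against constant variance")
```

Because the p-value is far below 0.05, the conventional standard errors in the coefficient table may be unreliable, which is why robust standard errors are suggested in the table's conclusion.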
Table 6d: Variance Inflation Factors (VIF), Multicollinearity Diagnostic
Model 2 predictors. Threshold: VIF > 5 warrants investigation.
| Predictor | VIF | Assessment¹ |
|---|---|---|
| Stage_Rank | 1.6786 | Acceptable (VIF < 5) |
| Deal_Age_Days | 3.0092 | Acceptable (VIF < 5) |
| Deal_Owner_f | 1.3025 | Acceptable (VIF < 5) |
¹ Green = acceptable (< 5); Amber = moderate (5-9); Red = severe (>= 10).
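A VIF is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on the other predictors, so each reported value can be inverted to see how much of a predictor's variance the others explain. The sketch below is illustrative only: the function names are hypothetical, and treating every value from 5 up to 10 as "moderate" is an assumption that fills the gap between the footnote's bands.

```python
def vif_from_r2(r2_aux):
    """VIF for a predictor given the R-squared of regressing it on the others."""
    return 1 / (1 - r2_aux)


def implied_r2(vif):
    """Invert a VIF to recover the auxiliary R-squared."""
    return 1 - 1 / vif


def assess(vif):
    """Traffic-light label; 5 <= VIF < 10 as 'moderate' is an assumption."""
    if vif < 5:
        return "acceptable"
    return "moderate" if vif < 10 else "severe"


# VIF values copied from Table 6d.
reported = {"Stage_Rank": 1.6786, "Deal_Age_Days": 3.0092, "Deal_Owner_f": 1.3025}
for name, vif in reported.items():
    print(name, round(implied_r2(vif), 3), assess(vif))
```

For Deal_Age_Days, a VIF of about 3 implies that roughly two thirds of its variance is shared with the other predictors, which is noticeable but below the threshold of 5. For a multi-level factor such as Deal_Owner_f the reported figure may be a generalised VIF, in which case the single auxiliary R² reading is only indicative.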
9.6 Coefficient Interpretation
Table 7: Regression Coefficients, Business Language Translation
Dependent variable: log(1 + Amount). Reference owner: Abiodun Mudele.
| Predictor | Coeff | SE | p | % Change in Amount¹ | Significant? | Plain-Language Meaning |
|---|---|---|---|---|---|---|
| (Intercept) | 9.8833 | 1.3626 | < 0.0001 | n/a | Yes | Baseline log-value for Abiodun Mudele (reference) at Stage_Rank = 0 and Deal_Age_Days = 0; back-transformed, approximately NGN 19,600. The intercept is a level, so the % change formula does not apply. |
| Stage_Rank | 0.6506 | 0.1727 | 0.0003 | 91.66 | Yes | Each additional pipeline stage is associated with a 91.66% higher expected deal value, holding owner and deal age constant. |
| Deal_Age_Days | 0.0024 | 0.0029 | 0.4238 | 0.24 | No | Each extra day in pipeline is associated with a 0.24% higher expected deal value; not statistically significant. |
| Deal_Owner_fChibuike Goodnews | -2.6578 | 1.9520 | 0.1758 | -92.99 | No | Chibuike Goodnews: estimated 92.99% lower than Abiodun Mudele (reference); not statistically significant. |
| Deal_Owner_fChike Enendu | -5.4185 | 1.2814 | < 0.0001 | -99.56 | Yes | Chike Enendu: deals valued 99.56% lower than Abiodun Mudele (reference). |
| Deal_Owner_fChioma Eze | -0.1794 | 1.0406 | 0.8634 | -16.42 | No | Chioma Eze: estimated 16.42% lower than Abiodun Mudele (reference); not statistically significant. |
| Deal_Owner_fTebogo Makobo | -0.1821 | 4.2537 | 0.9659 | -16.65 | No | Tebogo Makobo: estimated 16.65% lower than Abiodun Mudele (reference); not statistically significant. |
| Deal_Owner_fTemitope Adebayo | 3.2862 | 1.5222 | 0.0328 | 2574.11 | Yes | Temitope Adebayo: deals valued 2574.11% higher than Abiodun Mudele (reference). |
| Deal_Owner_fUchenna | -0.3329 | 1.7007 | 0.8451 | -28.31 | No | Uchenna: estimated 28.31% lower than Abiodun Mudele (reference); not statistically significant. |
¹ % change = (exp(coefficient) - 1) x 100.
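The footnote formula can be applied directly to the coefficients in Table 7. This is a small sketch with a hypothetical helper name; because the table prints coefficients rounded to four decimals, the recomputed percentages can differ from the tabulated ones in the second decimal place.

```python
import math


def pct_change(coef):
    """Percentage change in (1 + Amount) for a one-unit increase in the predictor."""
    return (math.exp(coef) - 1) * 100


# Coefficients copied from Table 7 (rounded to four decimals there).
coefficients = {
    "Stage_Rank": 0.6506,
    "Deal_Age_Days": 0.0024,
    "Deal_Owner_fChike Enendu": -5.4185,
}
for name, coef in coefficients.items():
    print(f"{name}: {pct_change(coef):.2f}%")

# The intercept is a level, not an effect: back-transform it with expm1 to get
# the baseline amount for the reference owner at Stage_Rank = 0 and age 0.
baseline_ngn = math.expm1(9.8833)
print(f"baseline: NGN {baseline_ngn:,.0f}")
```

Note how quickly the percentage departs from 100 × coefficient: for Deal_Age_Days the two are nearly identical, whereas for Stage_Rank a coefficient of 0.65 corresponds to an increase of about 92%, not 65%.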
9.7 Plain-Language Interpretation for a Non-Technical Manager
Model 2 is the more reliable model because it corrects for the skewed distribution of deal values. The most important result is the Stage_Rank coefficient: every step a deal advances through our pipeline is associated with a statistically significant increase of about 92% in expected deal value, all else being equal. This is the most actionable number in the entire analysis. Deal age shows no significant effect, which is consistent with the Hypothesis 2 result that time in the pipeline does not by itself predict value. The owner effects show that some representatives manage portfolios of systematically different value (Chike Enendu and Temitope Adebayo differ significantly from the reference owner), which justifies differentiated coaching rather than one-size-fits-all targets. Because the Breusch-Pagan test detected heteroscedasticity, these significance results should be confirmed with robust standard errors before they are used for individual performance decisions.
10 Integrated Findings
The five analytical techniques applied in this study produce a mutually reinforcing body of evidence that points to a single, clearly operationalisable conclusion.
EDA established the baseline: 135 CRM deals, severely right-skewed amounts (median NGN 1.72M; two extreme outliers at NGN 348M and NGN 1B), 16 missing Amount values addressed through median imputation, and a stage distribution concentrated at Closed Won (49 deals) and Negotiation/Review (24 deals).
Visualisation added spatial and comparative clarity. Revenue is not proportional to deal volume — two campaigns in Chioma Eze’s portfolio account for a disproportionate share of total pipeline value. The temporal chart shows reasonably consistent deal flow throughout 2025 with no dramatic cliff. Deal size composition reveals that large deals exist across all stages, meaning early-stage triage has direct implications for revenue forecasting accuracy.
Hypothesis Testing established statistical validity. Both ANOVA and Kruskal-Wallis confirmed that deal value differences across stages are statistically significant and unlikely to be attributable to sampling variation. The deal age hypothesis test returned a non-significant result: the data provide no evidence that time-in-pipeline predicts deal value, although a non-significant test cannot rule out a small effect.
Correlation Analysis quantified the strength of those relationships. Stage_Rank is the most strongly correlated numeric predictor of Amount. Deal_Age_Days shows a weak, non-significant association. This pattern directly informs which KPIs belong on a management dashboard.
Regression translated associations into quantified, directional forecasting coefficients. Stage_Rank is the dominant, statistically significant predictor. Each stage advancement is associated with a measurable percentage increase in expected deal value, net of deal owner and deal age effects. Owner-level effects exist but vary in significance, partly reflecting unequal sub-sample sizes across the seven representatives.
11 Limitations & Further Work
11.1 Limitations
Sample size (n = 135): Adequate for overall ANOVA but limits per-owner regression analysis. Tebogo Makobo (n = 1) and Uchenna (n = 8) cannot be meaningfully compared in sub-group models.
Missing operational variables: Campaign type, client industry tier, media product (programmatic vs. Rich SMS vs. USSD), and client relationship tenure are absent. These likely explain a substantial portion of residual variance in Model 2 and represent confounders in every correlation observed.
Cross-sectional snapshot: The dataset is a single CRM export taken on 29 April 2026. Active deals may have progressed since extraction. Longitudinal panel data would enable stronger causal inference about stage progression effects.
Ordinal stage encoding assumes equal spacing: Stage_Rank (0–7) treats each transition as equidistant in value terms, which may not reflect the real commercial effort and value-creation at each step.
Outlier influence: Two extreme campaigns (NGN 348M and NGN 1B) drive the mean and ANOVA group statistics for the Qualification and Proposal stages respectively. Log-transformation mitigates this in regression but not in ANOVA group means.
11.2 Further Analytical Extensions
The following extensions represent scalable enhancements that could further strengthen pipeline intelligence and decision-making at Dochase ADX:
Win-probability modelling (logistic regression): to assign real-time conversion likelihood to active deals, improving sales prioritisation and pipeline forecasting.
Time-to-close modelling (survival analysis using Kaplan-Meier and Cox Proportional Hazards): to better understand deal velocity across stages, owners, and deal sizes.
Integration of CRM and campaign delivery data: to evaluate revenue quality by linking deal outcomes with impressions, reach, and campaign performance metrics.
Owner-level performance modelling (Bayesian hierarchical methods): to produce more stable and fair comparisons across sales agents with uneven deal distributions.
Expanded CRM feature set (industry, campaign type, acquisition source): to significantly improve predictive accuracy and segmentation depth.
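To make the time-to-close extension more concrete, the Kaplan-Meier estimator can be sketched in a few lines. This is an illustration under stated assumptions, not an analysis of the Dochase ADX data: the deal durations are hypothetical, "closed" stands for any closing event, deals still open at export are treated as censored, and the function name is invented for this example.

```python
def kaplan_meier(durations, closed):
    """Return [(time, survival)] at each time where at least one deal closed.

    durations: days each deal was observed; closed: True if the deal closed
    at that time, False if it was still open (censored) at export.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, c in zip(durations, closed) if d == t and c)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, round(survival, 4)))
        at_risk -= leaving  # closed and censored deals both leave the risk set
    return curve


# Hypothetical deal ages in days; not taken from the CRM extract.
days = [10, 20, 20, 35, 50, 50, 80]
is_closed = [True, True, False, True, False, True, True]
print(kaplan_meier(days, is_closed))
```

Each survival value is the estimated share of deals still open beyond that day. Comparing such curves by stage or owner would show where deals stall, which a single cross-sectional correlation with Deal_Age_Days cannot reveal.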
12 References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966
Cairo, A. (2016). The truthful art: Data, charts, and maps for communication. New Riders.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Fisher, R. A. (1925). Statistical methods for research workers. Oliver & Boyd.
Gujarati, D. N., & Porter, D. C. (2009). Basic econometrics (5th ed.). McGraw-Hill.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Graphics Press.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.
13 Appendix: Software & Package Citations
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. https://www.jstatsoft.org/v40/i03/
Iannone, R., Cheng, J., Schloerke, B., Haughton, S., Hughes, E., Lauer, A., François, R., Seo, J., Brevoort, K., & Roy, O. (2026). gt: Easily create presentation-ready display tables (R package version 1.3.0). https://CRAN.R-project.org/package=gt
Primary dataset citation:
Njoku, C. (2026). Dochase ADX CRM deal records — 2025 pipeline extract [Dataset]. Collected from Dochase ADX Zoho CRM, Lagos, Nigeria. Data available on request from the author.
14 Appendix: AI Usage Statement
Claude (Anthropic, version Claude Sonnet 4.6) was used as a coding assistant during the preparation of this document. Specifically, AI assistance was used for: (1) writing and debugging R code for data cleaning, visualisation (ggplot2), and statistical tests; (2) structuring the Quarto YAML header and document formatting; and (3) identifying and correcting a date-parsing error (dmy() vs mdy()) introduced by the CRM’s date format.
All analytical decisions were made independently by the analyst: the choice of which five techniques to apply and why, the formulation of both research hypotheses, the decision to retain outliers rather than remove them, the choice of median over mean imputation, the selection of Model 2 (log-transformed) as the preferred regression specification, and all business interpretations and strategic recommendations. The AI did not interpret statistical outputs or generate conclusions — those judgements reflect the analyst’s professional expertise at Dochase ADX and independent academic work.
Case Study 1 — Data Analytics II | Lagos Business School | Prof Bongo Adi | May 2026