Exploratory and Inferential Analysis of Sales Pipeline Performance — Dochase ADX

Case Study 1 — Data Analytics II | Lagos Business School

Author

Chidiebere Njoku (2025-MMBA-8-043)

Published

May 13, 2026

Executive Summary

Dochase ADX is a Nigerian B2B advertising-technology company operating across programmatic display, Rich SMS, and USSD channels. This report analyses 135 live CRM deal records extracted from Zoho CRM (January–December 2025) to answer one operational question: what drives deal value in our sales pipeline, and where should management intervene?

Five analytical techniques are applied — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — forming a progressive ladder from description to inference to quantified prediction.

Key findings: deal amounts are severely right-skewed (median NGN 1.72M; maximum NGN 1B); deal values differ significantly across pipeline stages on the rank-based Kruskal-Wallis test (p < 0.001), although the one-way ANOVA on raw amounts falls short of significance (p = 0.074); Stage Rank is the dominant predictor in the log-transformed regression (each stage advancement is associated with an increase of roughly 92% in expected deal value); deal age alone does not predict deal value.

Single recommendation: Dochase ADX management should implement a stage-prioritisation framework, with weekly executive review of Negotiation/Review deals and automated CRM alerts for deals stagnant longer than 30 days in any mid-pipeline stage. Stage management — not time elapsed — is the evidence-backed lever for revenue improvement.

0.1 Reproducibility & GitHub Repository

The full project workflow, including the Quarto report source, R scripts, Python integration, and supporting assets, is publicly available on GitHub for reproducibility and version control purposes:

https://github.com/Chidiebere-Njoku/Dochase-analysis

1 Professional Disclosure

1.1 Job Role and Organisational Context

I am the Brand Growth and Communications Manager at Dochase ADX, a technology company headquartered in Lagos, Nigeria, that operates in the programmatic advertising, Rich SMS, and USSD sectors across Sub-Saharan Africa. My role spans brand strategy, lead generation, client communications, and revenue marketing — placing me at the intersection of marketing activity and commercial pipeline outcomes.

Because I work directly with the sales organisation and have authorised access to our Zoho CRM system, I am uniquely positioned to collect, interpret, and act on pipeline data. The five analytical techniques chosen below are not selected for academic convenience — they map directly to the decisions I face in managing growth.

1.2 Technique Justification by Operational Relevance

2 Data Preparation and Validation

2.1 Python Validation of CRM Dataset

To ensure data integrity before statistical modelling, Python was used for preliminary validation. This step confirms dataset structure, identifies missing values, and visualises the distribution of deal amounts to detect skewness and potential outliers prior to formal inferential analysis.

Figure P1 (Python/matplotlib): Deal Amount Distribution — Pre-Imputation Validation

2.2 Key Observations from Python Validation

The dataset contains 135 CRM deal records with 11 variables.
Missing values are concentrated in Amount (11.9%), indicating early-stage deals with no pricing attached.
Deal amounts are heavily right-skewed, with a small number of high-value outliers.
Log transformation is therefore necessary for reliable regression modelling.

1. Exploratory Data Analysis (EDA): Before any investment in marketing spend or sales resource is justified, I need to understand the baseline state of our pipeline. EDA allows me to quantify how many deals are active, where value is concentrated, which data fields are incomplete, and whether the distribution of deal amounts is normal or skewed. This directly informs how I frame performance targets and how I report to the executive team.

2. Data Visualisation: The CFO, Sales Director, and CEO consume insights through charts, not tables of p-values. My role requires me to communicate pipeline health visually and compellingly. A well-designed set of five charts can replace a 20-page written report in a monthly business review. Visualisation is not supplementary to my analysis — it is the primary communication channel for this work.

3. Hypothesis Testing: Marketing investment decisions — campaign budgets, sales headcount allocation, stage-specific training programmes — require evidence that observed differences are real. I need to know whether the higher average deal value I observe in Negotiation/Review compared to Qualification is a genuine structural difference or a coincidence of sampling. ANOVA provides that evidence formally.

4. Correlation Analysis: I regularly propose which KPIs to track on our sales dashboard. Correlation analysis tells me which pairs of metrics move together — enabling me to select a small number of leading indicators that reliably predict the outcomes that matter. If deal age and deal value are weakly correlated, I can stop monitoring pipeline age as a performance proxy.

5. Linear Regression: Regression is the bridge from “these variables are associated” to “here is a quantified, directional forecast.” The ability to tell the Sales Director “each stage advancement is associated with an X% increase in expected deal value” is far more actionable than a correlation coefficient. It also enables scenario planning: what revenue would we expect if we advanced 10 more deals from Qualification to Proposal this quarter?

3 Data Collection & Sampling

3.1 Data Source and Collection Method

The dataset was extracted directly from Zoho CRM — Dochase ADX’s live customer relationship management system — by the analyst on 29 April 2026 using the CRM’s native CSV export function. No intermediary cleaning was applied before export; the raw file is used as-is in this analysis with all transformations documented in Section 4.

The extraction captured all deal records in the system with a scheduled closing date falling within the calendar year 2025. This boundary was chosen to reflect a complete operational year and to exclude in-flight 2026 deals whose closing data would be incomplete.

3.2 Sampling Frame and Sample Size

Attribute	Detail
Population	All commercial deal opportunities ever created in Dochase ADX’s Zoho CRM
Sampling frame	All deals with a closing date in calendar year 2025
Sample size	135 deal records
Sampling method	Census — all qualifying records extracted (no random sampling required)
Unit of observation	One CRM deal record
Time period	1 January 2025 – 31 December 2025
Extraction date	29 April 2026

Because the extraction captured all 2025 deals rather than a random sample, the dataset is effectively a census of the 2025 pipeline. Statistical inference is still valid and meaningful here: the 135 deals can be treated as a sample from the broader process that generates Dochase ADX deals over time, and the goal is to infer properties of that generative process — not merely to describe these 135 records.

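The sample of 135 exceeds the minimum requirement of 100 observations specified in the assessment brief. With 9 pipeline stages, the average group size for ANOVA is approximately 15, but group sizes are highly unequal, ranging from 1 (Closed Lost to Competition) to 49 (Closed Won). The Central Limit Theorem therefore supports inference only for the larger groups, which is why the ANOVA in Section 7 is paired with a non-parametric Kruskal-Wallis check.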

3.3 Variable Inventory

Variable	Type	Description
Record Id	Character (ID)	Unique CRM deal identifier
Deal Name	Character	Name of the campaign or deal
Amount	Numeric (NGN)	Monetary value of the deal — outcome variable
Stage	Categorical (9 levels)	Current pipeline stage
Closing Date	Date (MM/DD/YYYY)	Scheduled or actual deal close date
Account Name	Categorical	Client organisation
Contact Name	Categorical	Named client contact (52 missing — structural)
Deal Owner	Categorical (7 levels)	Assigned sales representative

The dataset contains 3 numeric variables (Amount, and two engineered: Deal_Age_Days, Stage_Rank), 2 categorical variables (Stage, Deal Owner), and 1 date variable (Closing Date) — satisfying the minimum data structure requirement.

4 Data Description

4.1 Data Loading and Preparation

4.1.1 Qualitative Rationale for Preparation Choices

Three data quality issues were identified and addressed before analysis:

Issue 1 — Missing Amount values (n = 16, 11.9%): These likely reflect early-stage deals where no formal proposal has been issued. Median imputation (NGN 1,720,000) was applied. The median was chosen over the mean because the Amount distribution is severely right-skewed — the mean (≈ NGN 14.4M) is inflated by two extreme campaigns (NGN 348M and NGN 1B) and would represent a poor substitute for a typical missing value. Row deletion was rejected as it would bias the remaining sample toward more advanced stages.
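The median-versus-mean choice can be illustrated with a minimal Python sketch. The amounts below are made up for illustration and are not the CRM export, and `impute_missing` is a hypothetical helper rather than the function used in the report:

```python
from statistics import mean, median

def impute_missing(amounts, strategy=median):
    """Replace None entries with a summary statistic of the observed values."""
    observed = [a for a in amounts if a is not None]
    fill = strategy(observed)
    return [fill if a is None else a for a in amounts], fill

# Illustrative amounts in NGN: mostly small deals, one extreme campaign, two missing.
amounts = [500_000, 1_000_000, 1_720_000, None, 3_000_000, 1_000_000_000, None]

median_filled, median_fill = impute_missing(amounts, median)
mean_filled, mean_fill = impute_missing(amounts, mean)

print(f"median fill: {median_fill:,.0f}")  # stays near the typical deal
print(f"mean fill:   {mean_fill:,.0f}")    # dragged upward by the extreme campaign
```

With a single extreme value in the list, the mean-based fill lands far above every ordinary deal, while the median-based fill stays at a typical amount.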

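Issue 2 — Date format (MM/DD/YYYY): The CRM exports dates in American format, so mdy() from lubridate must be used. Using dmy() instead would return NA for every date whose day exceeds 12 and would silently swap day and month for the rest (03/04/2025 would be read as 3 April rather than 4 March), corrupting the temporal analysis.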
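A Python analogue shows why the parser must match the export format. The report itself parses dates with lubridate in R; the `parse_date` helper and the two sample strings below are illustrative assumptions:

```python
from datetime import datetime

def parse_date(text, fmt):
    """Return a date parsed with the given format, or None if parsing fails."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

US_FORMAT = "%m/%d/%Y"   # month first, as exported by the CRM
DMY_FORMAT = "%d/%m/%Y"  # day first, the wrong assumption for this export

ambiguous = "03/04/2025"    # 4 March 2025 in the export's format
unambiguous = "12/31/2025"  # 31 December 2025

print(parse_date(ambiguous, US_FORMAT))     # 2025-03-04
print(parse_date(ambiguous, DMY_FORMAT))    # 2025-04-03 (wrong, with no error raised)
print(parse_date(unambiguous, DMY_FORMAT))  # None (there is no month 31)
```

The second case is the dangerous one: a day-first parser accepts the string and returns a plausible but incorrect date.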

Issue 3 — Stage encoding: Pipeline stages are nominal strings. A Stage_Rank variable (0–7) was engineered to enable regression and correlation: 0 = Closed Lost outcomes, 1–6 = active progression, 7 = Closed Won.
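The encoding can be sketched as a simple lookup. The report does not list which active stage receives which rank, so the ordering below is an assumption taken from the row order of Table 3; it does reproduce the mean Stage_Rank of 4.53 reported in Table 2 when combined with the Table 3 deal counts:

```python
# Assumed mapping: both lost outcomes share rank 0, active stages follow Table 3 order.
STAGE_RANK = {
    "Closed Lost": 0,
    "Closed Lost to Competition": 0,
    "Qualification": 1,
    "Needs Analysis": 2,
    "Value Proposition": 3,
    "Identify Decision Makers": 4,
    "Proposal/Price Quote": 5,
    "Negotiation/Review": 6,
    "Closed Won": 7,
}

# Deal counts per stage as reported in Table 3.
STAGE_COUNTS = {
    "Qualification": 17,
    "Needs Analysis": 12,
    "Value Proposition": 6,
    "Identify Decision Makers": 5,
    "Proposal/Price Quote": 9,
    "Negotiation/Review": 24,
    "Closed Won": 49,
    "Closed Lost": 12,
    "Closed Lost to Competition": 1,
}

total_deals = sum(STAGE_COUNTS.values())
mean_rank = sum(STAGE_RANK[s] * n for s, n in STAGE_COUNTS.items()) / total_deals
print(total_deals, round(mean_rank, 2))
```

Treating the rank as numeric assumes equal spacing between stages, which is a modelling convenience rather than a property of the pipeline.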

Table 0a: Raw Dataset Profile
Zoho CRM Export — 29 April 2026
Attribute	Value
Total Records	135
Total Variables	11
Amount — Missing (n)	16
Amount — Missing (%)	11.9%
Date Format (Closing Date)	MM/DD/YYYY (American CRM format)

Table 0b: Cleaned Dataset Profile
Post-wrangling snapshot — all transformations applied
Attribute	Value
Records (post-clean)	135 records
Variables (engineered)	11 columns
Closing Date — Earliest	08 Aug 2024
Closing Date — Latest	04 Dec 2026
Amount — Minimum (NGN)	0
Amount — Maximum (NGN)	1,000,000,000
Imputation Applied	Median (NGN 1,720,000) for 16 missing Amount values

4.2 Missing Value Audit

Table 1: Missing Value Audit — Raw CRM Export
Dochase ADX Zoho CRM | Export: 29 April 2026
Variable	Missing (n)	Missing (%)	Treatment
Record Id	0	0.0	No action required
Deal Name	0	0.0	No action required
Amount	16	11.9	Median imputation — NGN 1,720,000 substituted for 16 records
Stage	0	0.0	No action required
Closing Date	0	0.0	No action required
Account Name.id	0	0.0	No action required
Account Name	0	0.0	No action required
Contact Name.id	52	38.5	Excluded — structural missing; variable not used in analysis
Contact Name	52	38.5	Excluded — structural missing; variable not used in analysis
Deal Owner.id	0	0.0	No action required
Deal Owner	0	0.0	No action required

Figure 0: Missing Data Pattern — Raw CRM Export

5 Technique 1 — Exploratory Data Analysis (EDA)

5.1 Theory Recap

Exploratory Data Analysis, formalised by Tukey (1977), is the practice of interrogating a dataset through summary statistics and visualisation before imposing model assumptions. The core goals are: characterise distributions (central tendency, spread, shape), identify data quality issues (missing values, outliers, encoding errors), and surface patterns that motivate formal hypotheses. Key tools include the five-number summary, interquartile range (IQR), skewness coefficients, and boxplots. Anscombe’s Quartet (Anscombe, 1973) famously demonstrated that identical summary statistics can describe radically different data shapes — making visual EDA non-negotiable before inference.

5.2 Business Justification

In the Dochase ADX context, EDA serves a gatekeeping function: before the CFO or Sales Director acts on any dashboard number, the analyst must verify that the underlying data is fit for purpose. The 16 missing Amount values, if left unaddressed, would silently bias every aggregate metric. The two extreme outlier campaigns would inflate the mean deal value to a level that misrepresents the typical commercial opportunity. EDA makes these issues visible and tractable.

5.3 Descriptive Statistics

Table 2: Descriptive Statistics — Numeric Variables
After median imputation of 16 missing Amount values
Variable	N	Mean	Median	SD	Min	Max	IQR	Skewness¹
Amount	135	12,936,574.66	1,720,000.00	90,773,544.10	0.00	1,000,000,000.00	2,500,000.00	0.12
Deal_Age_Days	135	185.91	71.00	206.86	1.00	639.00	212.50	0.56
Stage_Rank	135	4.53	6.00	2.63	0.00	7.00	5.00	−0.56
¹ Skewness = (Mean - Median) / SD, the nonparametric skew. Pearson's second skewness coefficient is three times this value.
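As a quick check, the skewness column can be recomputed from the other columns of Table 2 using the formula in the table footnote. This is a minimal sketch; `mean_median_skew` is a hypothetical helper name:

```python
def mean_median_skew(mean, median, sd):
    """Skew measure from the table footnote: (mean - median) / sd."""
    return (mean - median) / sd

# (mean, median, sd) copied from Table 2.
rows = {
    "Amount": (12_936_574.66, 1_720_000.00, 90_773_544.10),
    "Deal_Age_Days": (185.91, 71.00, 206.86),
    "Stage_Rank": (4.53, 6.00, 2.63),
}

skew = {name: round(mean_median_skew(*vals), 2) for name, vals in rows.items()}
print(skew)
```

Because the measure divides by the standard deviation, the two extreme campaigns that inflate the SD of Amount also shrink its skew value, which is why Amount scores lower here than Deal_Age_Days despite its far longer right tail.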

Table 3: Deal Volume & Amount by Pipeline Stage
All 135 records after imputation
Stage	n	%	Min (NGN)	Median (NGN)	Mean (NGN)	Max (NGN)
Qualification	17	12.6	1	1,000,000	22,542,941	348,000,000
Needs Analysis	12	8.9	0	50,000	665,000	1,720,000
Value Proposition	6	4.4	1	895,000	2,131,667	8,000,000
Identify Decision Makers	5	3.7	1	45,000	421,000	2,000,000
Proposal/Price Quote	9	6.7	25,000	3,000,000	119,718,333	1,000,000,000
Negotiation/Review	24	17.8	70,000	2,250,000	3,499,583	12,000,000
Closed Won	49	36.3	6	2,000,000	3,334,542	15,000,000
Closed Lost	12	8.9	1	1	1,040,417	5,000,000
Closed Lost to Competition	1	0.7	3,000,000	3,000,000	3,000,000	3,000,000

5.4 Outlier Detection

Table 4a: IQR Outlier Detection — Fence Parameters
All outliers retained; extreme values flagged for transparency
Parameter	Value (NGN)
Q1 (25th percentile)	500,000
Q3 (75th percentile)	3,000,000
IQR	2,500,000
Upper Fence (Q3 + 1.5 × IQR)	6,750,000
Outliers Retained	18
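The fence in Table 4a follows from the quartiles by simple arithmetic. A minimal sketch, where `iqr_fences` is a hypothetical helper and the `sample` amounts are chosen only to show the boundary behaviour:

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from Table 4a (NGN).
lower, upper = iqr_fences(500_000, 3_000_000)
print(lower, upper)

# A value exactly on the fence is not flagged; only values beyond it are.
sample = [2_000_000, 6_750_000, 6_800_000, 1_000_000_000]
flagged = [a for a in sample if a < lower or a > upper]
print(flagged)
```

The lower fence is negative, so no deal can be a low-side outlier under this rule; every flagged record is a high-value deal.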

Table 4: Outlier Deals — IQR Method (All Retained)
Deal	Stage	Owner	Amount	Classification
Branding Betonly	Proposal/Price Quote	Chioma Eze	NGN 1.00e+09	Extreme (>NGN 100M)
LiveScorebet	Qualification	Chioma Eze	NGN 3.48e+08	Extreme (>NGN 100M)
Grandcereals Awareness Campaign	Proposal/Price Quote	Chioma Eze	NGN 6.00e+07	Moderate (>NGN 6.75M)
NIVEA MEN DEEP	Closed Won	Chioma Eze	NGN 1.50e+07	Moderate (>NGN 6.75M)
ALEND App campaign	Negotiation/Review	Abiodun Mudele	NGN 1.20e+07	Moderate (>NGN 6.75M)
Viju Dochase Campaign	Closed Won	Abiodun Mudele	NGN 1.20e+07	Moderate (>NGN 6.75M)
Mamvest App Campaign	Negotiation/Review	Abiodun Mudele	NGN 1.00e+07	Moderate (>NGN 6.75M)
Always-On BetaGist Campaign	Qualification	Chioma Eze	NGN 1.00e+07	Moderate (>NGN 6.75M)
MTN Summer Roaming Campign	Qualification	Chioma Eze	NGN 1.00e+07	Moderate (>NGN 6.75M)
Spade Registreation X FTD Campaign	Closed Won	Chioma Eze	NGN 9.10e+06	Moderate (>NGN 6.75M)
Smoov Campaign	Negotiation/Review	Abiodun Mudele	NGN 9.00e+06	Moderate (>NGN 6.75M)
Betwinner FTD Campaign March	Closed Won	Chioma Eze	NGN 9.00e+06	Moderate (>NGN 6.75M)
Vyrus Digital	Value Proposition	Uchenna	NGN 8.00e+06	Moderate (>NGN 6.75M)
Viju Google Campaign	Closed Won	Abiodun Mudele	NGN 8.00e+06	Moderate (>NGN 6.75M)
MAGGI Tales of Ramadan Feb"26	Closed Won	Chioma Eze	NGN 8.00e+06	Moderate (>NGN 6.75M)
Mozzart Bet Campaign	Closed Won	Abiodun Mudele	NGN 7.00e+06	Moderate (>NGN 6.75M)
Peak Thematic Wave 1 March	Closed Won	Chioma Eze	NGN 7.00e+06	Moderate (>NGN 6.75M)
Viju Google Campaign	Negotiation/Review	Abiodun Mudele	NGN 6.80e+06	Moderate (>NGN 6.75M)

5.5 Distribution Plots

Figure 1: Deal Amount — Raw vs Log-Transformed

5.6 Plain-Language Interpretation for a Non-Technical Manager

The typical Dochase ADX deal is worth approximately NGN 1.72 million (the median). The average looks much higher, at about NGN 12.9 million after imputation (about NGN 14.4 million before), but that figure is distorted by two exceptional campaigns, a NGN 348M opportunity and a NGN 1B campaign, both held by Chioma Eze. Most of our pipeline (roughly 70%) consists of deals under NGN 4.5 million. Closed Won is our largest stage by count (49 deals), which is encouraging. However, the data also shows 16 deal records with no monetary value attached; these need to be followed up with the relevant sales representatives to ensure proposals are being submitted and recorded promptly.

6 Technique 2 — Data Visualisation

6.1 Theory Recap

The Grammar of Graphics (Wilkinson, 2005), implemented in R through ggplot2 (Wickham, 2016), provides a compositional framework for constructing statistical charts by mapping data variables to visual aesthetics (position, colour, size, shape). Effective data visualisation requires matching the chart type to the data structure and the analytical question: histograms for distributions, boxplots for group comparisons, scatter plots for relationships, and line charts for temporal patterns. Tufte (2001) emphasises maximising the data-to-ink ratio — every element of a chart should carry information. Cairo (2016) adds the dimension of narrative: a good chart tells one clear story, not many confusing ones simultaneously.

6.2 Business Justification

Monthly business reviews at Dochase ADX are attended by the CEO, Sales Director, and CFO. These stakeholders require chart-based summaries of pipeline health that can be absorbed in under two minutes. The five plots below are designed as a cohesive narrative unit: they move from volume (how many deals?) to value (how much revenue?) to ownership (who holds it?) to time (when does it close?) to composition (what type of deals are they?).

6.3 Figure 4 — Pipeline Funnel: Deal Volume by Stage

Figure 4: Sales Pipeline — Deal Count by Stage

Chart selection justification: A horizontal bar chart was chosen over a traditional triangular funnel diagram because the Dochase ADX pipeline is non-sequential — deals can be logged at any stage. A funnel chart would imply a linear flow that does not exist in the data, misleading management about conversion rates.

6.4 Figure 5 — Revenue Potential by Stage

Figure 5: Total Revenue Potential by Stage

6.5 Figure 6 — Deal Owner Performance

Figure 6: Deal Amount by Sales Representative

6.6 Figure 7 — Monthly Deal Volume Over Time

Figure 7: Monthly Deal Volume by Pipeline Status

6.7 Figure 8 — Deal Size Composition by Stage

6.8 Plain-Language Interpretation for a Non-Technical Manager

Figures 4–8 together reveal that our pipeline is volume-rich but value-concentrated. We have 49 closed-won deals, which is healthy, but the vast majority of total potential revenue sits in just two campaigns. Chioma Eze is effectively running a separate enterprise-scale pipeline within the broader team. Monthly deal flow is reasonably consistent through 2025 with no dramatic seasonal cliff. The deal size composition chart shows that large deals appear even at the Qualification stage — meaning early-stage triage matters enormously for revenue forecasting accuracy.

7 Technique 3 — Hypothesis Testing

7.1 Theory Recap

Hypothesis testing is the formal framework for deciding whether observed patterns in sample data are likely to reflect real population-level effects, or whether they could plausibly arise from random sampling variation (Fisher, 1925; Neyman & Pearson, 1933). A null hypothesis (H₀) specifies no effect; an alternative hypothesis (H₁) specifies the pattern of interest. The p-value is the probability of observing data at least as extreme as the sample if H₀ were true — small p-values (conventionally p < 0.05) constitute evidence against H₀. Effect sizes (Cohen’s d, eta-squared η²) complement p-values by quantifying practical significance, which statistical significance alone cannot establish. When parametric assumptions (normality, homogeneity of variance) are violated, non-parametric tests — such as the Kruskal-Wallis test (the rank-based analogue of one-way ANOVA) — provide robust alternatives.

7.2 Business Justification

Before recommending stage-differentiated management interventions, I need to establish that the deal value differences I observe across pipeline stages are not a product of the particular 135 deals in this export. Formal hypothesis testing provides that assurance. If ANOVA confirms statistical significance, the Sales Director can act on stage-based KPIs with confidence. If the test is non-significant, the observed differences should not drive resource allocation decisions.

7.3 Hypothesis 1 — One-Way ANOVA: Does Deal Value Differ by Stage?

H₀: μ₁ = μ₂ = … = μₖ — Mean deal amount is equal across all pipeline stages
H₁: At least one stage has a different mean deal amount
Significance level: α = 0.05

7.3.1 Assumption 1: Normality (Anderson-Darling Test, per Stage)

Table 5a: Anderson-Darling Normality Test by Stage (n ≥ 7)
H₀: data are normally distributed | Stages with fewer than 7 records excluded
Pipeline Stage	n	AD Statistic	p-value	Normal? (p > 0.05)¹
Closed Lost	12	1.8625	< 0.0001	No
Closed Won	49	3.0270	< 0.0001	No
Needs Analysis	12	1.6795	0.0001	No
Negotiation/Review	24	1.7611	0.0001	No
Proposal/Price Quote	9	2.5813	< 0.0001	No
Qualification	17	5.6085	< 0.0001	No
¹ Stages failing normality (p < 0.05) confirm use of non-parametric Kruskal-Wallis as robustness check.

7.3.2 Assumption 2: Homogeneity of Variance (Levene’s Test)

Table 5b: Levene's Test — Homogeneity of Variance
Outcome determines whether standard ANOVA or Kruskal-Wallis is applied
Test	F-value	Df (Between)	Df (Within)	p-value	Conclusion
Levene's Test for Homogeneity of Variance	1.825	8	126	0.0782	Variances homogeneous (p >= 0.05) — standard ANOVA is appropriate.

7.3.3 One-Way ANOVA Result

Table 5c: One-Way ANOVA — Deal Amount by Pipeline Stage
Dependent variable: Amount (NGN) | α = 0.05
Source	df	SS	MS	F¹	p-value
Stage_Factor	8	1.159327e+17	1.449159e+16	1.8477	0.0742
Residuals	126	9.882053e+17	7.842899e+15		
¹ Eta-squared (η²) = 0.105: medium practical effect (Cohen, 1988)
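The mean squares, F-statistic, and eta-squared in Table 5c all follow from the two sums of squares and their degrees of freedom. A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Sums of squares and degrees of freedom copied from Table 5c.
ss_between, df_between = 1.159327e17, 8
ss_within, df_within = 9.882053e17, 126

ms_between = ss_between / df_between   # mean square for Stage_Factor
ms_within = ss_within / df_within      # mean square for Residuals
f_stat = ms_between / ms_within        # ANOVA F-statistic

# Eta-squared: share of total variation in Amount attributable to stage.
eta_squared = ss_between / (ss_between + ss_within)

print(round(f_stat, 4), round(eta_squared, 3))
```

Read together, the two numbers say that stage accounts for roughly a tenth of the variance in raw Amount, but the within-stage variance is so large that this share is not distinguishable from noise at α = 0.05.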

7.3.4 Non-Parametric Robustness: Kruskal-Wallis Test

Table 5d: Kruskal-Wallis Non-Parametric Robustness Check
Rank-based analogue of one-way ANOVA — no distributional assumptions required
Test	H-statistic	df	p-value	Decision
Kruskal-Wallis Rank Sum Test	33.6479	8	4.7e-05	Reject H0 — significant group differences (p < 0.05)
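To make the rank-based logic concrete, here is a minimal standard-library sketch of the Kruskal-Wallis H statistic with average ranks for ties and the usual tie correction. The helper names and the toy groups are illustrative assumptions; the report's own figure comes from its R workflow, and this sketch does not compute the p-value:

```python
from collections import Counter
from itertools import chain

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of numbers."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
    return h / correction

# Toy data: three perfectly separated groups.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because only ranks enter the statistic, a NGN 1B deal counts as "the largest" and nothing more, which is why this test is far less sensitive to the two extreme campaigns than the ANOVA on raw amounts. H is compared against a chi-squared distribution with (number of groups − 1) degrees of freedom, which is 8 for the nine stages here.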

7.3.5 Post-Hoc: Dunn’s Test (Bonferroni-Corrected)

Table 5e: Dunn Post-Hoc Test — Pairwise Comparisons (Bonferroni-Corrected)
Only pairs with p.adjusted < 0.05 constitute statistically significant differences
Stage Pair	Z Statistic	p (raw)	p (Bonferroni)¹	Significant?
Needs Analysis - Negotiation/Review	-3.6926	0.0001	0.0040	Yes
Closed Lost - Negotiation/Review	-3.4841	0.0002	0.0089	Yes
Closed Won - Needs Analysis	3.4370	0.0003	0.0106	Yes
Needs Analysis - Proposal/Price Quote	-3.3653	0.0004	0.0138	Yes
Closed Lost - Closed Won	-3.2081	0.0007	0.0240	Yes
Closed Lost - Proposal/Price Quote	-3.1981	0.0007	0.0249	Yes
Identify Decision Makers - Negotiation/Review	-2.8257	0.0024	0.0849	No
Identify Decision Makers - Proposal/Price Quote	-2.8103	0.0025	0.0891	No
Closed Won - Identify Decision Makers	2.5359	0.0056	0.2019	No
Negotiation/Review - Qualification	1.9525	0.0254	0.9157	No
Proposal/Price Quote - Qualification	1.9343	0.0265	0.9554	No
Closed Lost - Closed Lost to Competition	-1.5068	0.0659	1.0000	No
Closed Lost to Competition - Closed Won	0.5297	0.2982	1.0000	No
Closed Lost - Identify Decision Makers	0.2954	0.3838	1.0000	No
Closed Lost to Competition - Identify Decision Makers	1.5753	0.0576	1.0000	No
Closed Lost - Needs Analysis	0.1806	0.4284	1.0000	No
Closed Lost to Competition - Needs Analysis	1.5777	0.0573	1.0000	No
Identify Decision Makers - Needs Analysis	-0.1570	0.4376	1.0000	No
Closed Lost to Competition - Negotiation/Review	0.3297	0.3708	1.0000	No
Closed Won - Negotiation/Review	-0.7968	0.2128	1.0000	No
Closed Lost to Competition - Proposal/Price Quote	0.1500	0.4404	1.0000	No
Closed Won - Proposal/Price Quote	-1.0394	0.1493	1.0000	No
Negotiation/Review - Proposal/Price Quote	-0.4565	0.3240	1.0000	No
Closed Lost - Qualification	-1.6255	0.0520	1.0000	No
Closed Lost to Competition - Qualification	0.9286	0.1766	1.0000	No
Closed Won - Qualification	1.4937	0.0676	1.0000	No
Identify Decision Makers - Qualification	-1.5138	0.0650	1.0000	No
Needs Analysis - Qualification	-1.8210	0.0343	1.0000	No
Closed Lost - Value Proposition	-0.9829	0.1628	1.0000	No
Closed Lost to Competition - Value Proposition	0.9970	0.1594	1.0000	No
Closed Won - Value Proposition	1.2528	0.1051	1.0000	No
Identify Decision Makers - Value Proposition	-1.0713	0.1420	1.0000	No
Needs Analysis - Value Proposition	-1.1303	0.1292	1.0000	No
Negotiation/Review - Value Proposition	1.6221	0.0524	1.0000	No
Proposal/Price Quote - Value Proposition	1.7433	0.0406	1.0000	No
Qualification - Value Proposition	0.2557	0.3991	1.0000	No
¹ Bonferroni correction is conservative for large numbers of pairwise comparisons.
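The Bonferroni adjustment itself is a one-line rule: multiply each raw p-value by the number of comparisons and cap the result at 1. A minimal sketch, where `bonferroni` is a hypothetical helper; the adjusted values in Table 5e were presumably computed from unrounded raw p-values, so recomputing from the rounded column will not match them to the last digit:

```python
from math import comb

def bonferroni(p_raw, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_raw * n_comparisons)

n_stages = 9
n_pairs = comb(n_stages, 2)  # every unordered pair of stages
print(n_pairs)

# Rounded raw p-values taken from three rows of Table 5e.
for p_raw in (0.0001, 0.0024, 0.0659):
    print(p_raw, round(bonferroni(p_raw, n_pairs), 4))
```

With 36 comparisons, a raw p-value must fall below roughly 0.0014 to survive the correction, which explains why several pairs with raw p < 0.05 are not flagged as significant.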

Figure 9: Mean Deal Amount by Stage with 95% CI

7.4 Hypothesis 2 — Correlation Test: Does Deal Age Predict Deal Value?

H₀: ρ = 0 — No linear relationship between Deal Age and Amount
H₁: ρ ≠ 0 — A significant relationship exists
Tests: Pearson (parametric) + Spearman (non-parametric robustness check)

Table 5f: Hypothesis 2 — Correlation Tests: Deal Age vs Deal Amount
H₀: ρ = 0 (no relationship) | H₁: ρ ≠ 0 | α = 0.05
Test	Test Statistic	r / rho	df / n	p-value	95% CI	Significant?	Decision
Pearson Product-Moment Correlation	t = 0.005	0.0004	133	0.9960	[-0.169, 0.169]	No	Fail to reject H0 — no significant linear relationship
Spearman Rank Correlation (non-parametric)	rho = 0.0221	0.0221	135	0.7994	N/A (rank-based)	No	Fail to reject H0 — no significant monotonic relationship

7.5 Plain-Language Interpretation for a Non-Technical Manager

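Hypothesis 1 result: The evidence is mixed but points to real stage differences. The one-way ANOVA on raw amounts does not reject H₀ at α = 0.05 (F = 1.85, p = 0.074), most likely because the two extreme campaigns inflate the within-stage variance. The Kruskal-Wallis test, which is better suited to these non-normal distributions, rejects H₀ decisively (H = 33.65, p < 0.001), and Dunn's post-hoc test identifies six significantly different stage pairs. This matters for management: stage is associated with the typical (rank-ordered) deal value, which supports stage-specific coaching, review cadences, and incentive structures.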

Hypothesis 2 result: There is no significant relationship between how long a deal has been in the pipeline and how much it is worth. This is a critical negative finding: bigger deals do not simply take longer to close. Pipeline age alone is therefore a poor management proxy for deal quality or value.

8 Technique 4 — Correlation Analysis

8.1 Theory Recap

Correlation measures the strength and direction of the linear relationship between two numeric variables. The Pearson correlation coefficient (r) quantifies linear co-variation and ranges from −1 (perfect negative) to +1 (perfect positive); r = 0 indicates no linear relationship. For ordinal variables or non-normal distributions, Spearman’s rank correlation (ρ) provides a robust non-parametric alternative (Spearman, 1904). Correlation coefficients are interpreted by conventional benchmarks: |r| < 0.10 = negligible; 0.10–0.29 = weak; 0.30–0.49 = moderate; ≥ 0.50 = strong (Cohen, 1988). Critically, correlation does not imply causation — confounders may explain any observed association. Partial correlation controls for a third variable, isolating the pairwise relationship of interest.

8.2 Business Justification

My role requires me to propose which metrics appear on the sales management dashboard. A dashboard cluttered with 20 loosely related KPIs is worse than one with 4 well-chosen leading indicators. Correlation analysis identifies which operational variables reliably co-move, enabling me to select a parsimonious and predictive KPI set. If Stage_Rank is strongly correlated with Amount, it belongs on the dashboard. If Deal_Age_Days is weakly correlated with Amount, it does not add forecasting value and should be replaced by a more informative metric.

Figure 11: Pearson Correlation Matrix (blank = not significant at α = 0.05)

Table 6: Pairwise Pearson Correlation — 95% CI Reported
Pair	r	p-value	95% CI	Significant?	Strength	Direction
Amount x Deal_Age_Days	0.0004	0.9960	[-0.169, 0.169]	No	Negligible	Positive
Amount x Stage_Rank	-0.0133	0.8787	[-0.182, 0.156]	No	Negligible	Negative
Deal_Age_Days x Stage_Rank	0.1828	0.0338	[0.014, 0.341]	Yes	Weak	Positive
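The 95% confidence intervals in Table 6 are consistent with the standard Fisher z-transformation for a Pearson correlation. A minimal sketch that reproduces them from the reported r values and n = 135; `pearson_ci` is a hypothetical helper name:

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson r via the Fisher z-transformation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit / sqrt(n - 3)
    centre = atanh(r)
    return tanh(centre - half_width), tanh(centre + half_width)

N_DEALS = 135
pairs = {
    "Amount x Deal_Age_Days": 0.0004,
    "Amount x Stage_Rank": -0.0133,
    "Deal_Age_Days x Stage_Rank": 0.1828,
}

for name, r in pairs.items():
    lo, hi = pearson_ci(r, N_DEALS)
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

An interval that straddles zero, as the first two do, is the interval-based counterpart of a non-significant p-value.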

8.3 Plain-Language Interpretation for a Non-Technical Manager

The correlation analysis produces three key business messages. First, on the raw scale Stage Rank and Amount are not linearly correlated (r = −0.013, p = 0.879); the two extreme campaigns sit at the Qualification and Proposal/Price Quote stages and likely swamp any linear trend. The stage effect only becomes visible once Amount is log-transformed in the regression of Section 9. Second, Deal Age and Amount show a negligible correlation (r = 0.0004, p = 0.996), confirming the Hypothesis 2 finding that time alone does not predict value. Third, Stage Rank and Deal Age are weakly but significantly positively correlated (r = 0.183, p = 0.034): deals in later stages tend to be somewhat older. Where a cell is blank in the heatmap, the relationship is not statistically distinguishable from zero, and that metric pair should not be treated as predictive.

9 Technique 5 — Regression Analysis

9.1 Theory Recap

Ordinary Least Squares (OLS) linear regression models the relationship between a continuous outcome variable (Y) and one or more predictors (X₁, X₂, …, Xₖ) by minimising the sum of squared residuals. The estimated coefficients (β̂) represent the expected change in Y for a one-unit change in the corresponding X, holding all other variables constant. Four OLS assumptions must be checked: linearity, independence of errors, homoscedasticity (constant error variance), and normality of residuals. Violations can be diagnosed through residual plots (Residuals vs Fitted), Q-Q plots, the Breusch-Pagan test for heteroscedasticity, and Variance Inflation Factors (VIF) for multicollinearity (Gujarati & Porter, 2009). When the outcome is right-skewed — as deal Amount is here — a log-transformation (log1p) stabilises variance and improves model validity; coefficients from the log model are interpreted as percentage changes in the outcome.

9.2 Business Justification

Correlation tells me which variables move together; regression tells me by how much and in which direction — and it does so while holding other variables constant. The regression model enables me to make a specific, quantified statement to the Sales Director: “Advancing a deal by one pipeline stage is associated with an X% increase in expected deal value, even after accounting for deal owner and pipeline age.” That is a forecasting and prioritisation statement, not merely an observation.

9.3 Model 1 — OLS on Raw Amount (Baseline)

Table 6a: Model 1 — OLS (Raw Amount) — Fit Statistics
Dependent variable: Amount (NGN) | Baseline specification
R²	Adjusted R²	Residual SE	F-statistic	p-value	df	n
0.0597	0.0000	90773267	1.0001	0.4394	8	135

9.4 Model 2 — Log-Transformed OLS (Preferred Model)

Table 6b: Model 2 — OLS (Log-Transformed Amount) — Fit Statistics
Dependent variable: log(1 + Amount) | Preferred specification
R²	Adjusted R²	Residual SE	F-statistic	p-value	df	n
0.4261	0.3897	4.0517	11.6938	< 0.0001	8	135
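The adjusted R² and F-statistic in Tables 6a and 6b can be recovered from R², the sample size, and the number of estimated slopes. A minimal sketch of those two textbook formulas; the helper names are illustrative, and small differences from the tables reflect rounding of the reported R²:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k estimated slopes."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def overall_f(r2, n, k):
    """F-statistic for the joint test that all k slopes are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

N_OBS, N_SLOPES = 135, 8  # n and model df from Tables 6a and 6b
models = {"Model 1 (raw Amount)": 0.0597, "Model 2 (log1p Amount)": 0.4261}

for name, r2 in models.items():
    adj = adjusted_r2(r2, N_OBS, N_SLOPES)
    f_stat = overall_f(r2, N_OBS, N_SLOPES)
    print(f"{name}: adjusted R2 = {adj:.4f}, F = {f_stat:.2f}")
```

For Model 1 the penalty for eight slopes wipes out the small raw R² entirely, which is why its adjusted R² is effectively zero and its F-statistic sits at about 1.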

9.5 Diagnostic Plots

Figure 12: Regression Diagnostic Plots — Model 2

Table 6c: Breusch-Pagan Test — Homoscedasticity of Residuals
Model 2: log(1 + Amount) specification
Test	BP Statistic	df	p-value	Conclusion
Breusch-Pagan Test (Homoscedasticity)	31.1549	8	0.0001	Heteroscedasticity detected (p < 0.05) — consider robust standard errors

Table 6d: Variance Inflation Factors (VIF) — Multicollinearity Diagnostic
Model 2 predictors | Threshold: VIF > 5 warrants investigation
Predictor	VIF	Assessment¹
Stage_Rank	1.6786	Acceptable (VIF < 5)
Deal_Age_Days	3.0092	Acceptable (VIF < 5)
Deal_Owner_f	1.3025	Acceptable (VIF < 5)
¹ Green = acceptable (<5) | Amber = moderate (5-9) | Red = severe (>=10)

9.6 Coefficient Interpretation

Table 7: Regression Coefficients, Business Language Translation
Dependent variable: log(1 + Amount) | Reference owner: Abiodun Mudele
Predictor	Coeff	SE	p	% Change in Amount¹	Significant?	Plain-Language Meaning
(Intercept)	9.8833	1.3626	< 0.0001	n/a	Yes	Baseline log-value for Abiodun Mudele (reference) at Stage_Rank = 0 and Deal_Age_Days = 0, roughly NGN 19,600 after back-transformation; a percentage change does not apply to the intercept.
Stage_Rank	0.6506	0.1727	0.0003	91.66	Yes	Each additional pipeline stage is associated with a 91.66% increase in expected deal value, holding owner and deal age constant.
Deal_Age_Days	0.0024	0.0029	0.4238	0.24	No	Each extra day in pipeline is associated with a 0.24% increase in expected deal value; not statistically distinguishable from zero.
Deal_Owner_fChibuike Goodnews	-2.6578	1.9520	0.1758	-92.99	No	Chibuike Goodnews: estimated 92.99% lower than Abiodun Mudele (reference), but not statistically significant.
Deal_Owner_fChike Enendu	-5.4185	1.2814	< 0.0001	-99.56	Yes	Chike Enendu: deals valued 99.56% lower than Abiodun Mudele (reference).
Deal_Owner_fChioma Eze	-0.1794	1.0406	0.8634	-16.42	No	Chioma Eze: estimated 16.42% lower than Abiodun Mudele (reference), but not statistically significant.
Deal_Owner_fTebogo Makobo	-0.1821	4.2537	0.9659	-16.65	No	Tebogo Makobo: estimated 16.65% lower than Abiodun Mudele (reference), but not statistically significant.
Deal_Owner_fTemitope Adebayo	3.2862	1.5222	0.0328	2574.11	Yes	Temitope Adebayo: deals valued 2574.11% higher than Abiodun Mudele (reference).
Deal_Owner_fUchenna	-0.3329	1.7007	0.8451	-28.31	No	Uchenna: estimated 28.31% lower than Abiodun Mudele (reference), but not statistically significant.
¹ % change = (exp(coefficient) - 1) × 100
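
The footnote formula can be applied directly to the published coefficients, and the same transformation gives a rough uncertainty range when applied to the coefficient plus or minus about two standard errors. The sketch below is illustrative only: it uses a normal approximation (z = 1.96) and the conventional standard errors from Table 7, both of which are assumptions, and because the inputs are rounded the results match the table only to within rounding.

```python
import math


def percent_change(log_coefficient: float) -> float:
    """Convert a coefficient on the log scale into a percentage change."""
    return (math.exp(log_coefficient) - 1.0) * 100.0


def approx_interval_percent(coefficient: float, std_error: float, z: float = 1.96):
    """Approximate 95% interval, computed on the log scale and then transformed."""
    lower = percent_change(coefficient - z * std_error)
    upper = percent_change(coefficient + z * std_error)
    return lower, upper


# (coefficient, standard error) pairs taken from Table 7.
table_7 = {
    "Stage_Rank": (0.6506, 0.1727),
    "Deal_Age_Days": (0.0024, 0.0029),
    "Deal_Owner_fTemitope Adebayo": (3.2862, 1.5222),
}

results = {}
for name, (coef, se) in table_7.items():
    point = percent_change(coef)
    lower, upper = approx_interval_percent(coef, se)
    results[name] = (point, lower, upper)
    print(f"{name}: {point:.2f}% (approx. 95% interval {lower:.1f}% to {upper:.1f}%)")
```

The interval for Deal_Age_Days spans zero, which matches its non-significant p-value, while the very wide interval for Temitope Adebayo shows how imprecisely the large owner effects are estimated.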

9.7 Plain-Language Interpretation for a Non-Technical Manager

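Model 2 is the preferred model because the log transformation corrects for the skewed distribution of deal values. The most important result is the Stage_Rank coefficient: every step a deal advances through our pipeline is associated with an increase of roughly 92% in expected deal value, all else being equal (p = 0.0003). This is the most actionable number in the entire analysis. Deal age shows no significant effect (p = 0.42), which is consistent with the earlier hypothesis test: how far a deal has progressed matters more than how long it has been open. Two owner effects are statistically significant (Chike Enendu lower and Temitope Adebayo higher than the reference owner), suggesting that representatives manage portfolios of systematically different value; this supports differentiated coaching rather than one-size-fits-all targets. One caution applies: the Breusch-Pagan test detected heteroscedasticity, so the reported standard errors and p-values should be confirmed with robust standard errors before they are treated as final.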

10 Integrated Findings

The five analytical techniques applied in this study produce a mutually reinforcing body of evidence that points to a single, clearly operationalisable conclusion.

EDA established the baseline: 135 CRM deals, severely right-skewed amounts (median NGN 1.72M; two extreme outliers at NGN 348M and NGN 1B), 16 missing Amount values addressed through median imputation, and a stage distribution concentrated at Closed Won (49 deals) and Negotiation/Review (24 deals).

Visualisation added spatial and comparative clarity. Revenue is not proportional to deal volume — two campaigns in Chioma Eze’s portfolio account for a disproportionate share of total pipeline value. The temporal chart shows reasonably consistent deal flow throughout 2025 with no dramatic cliff. Deal size composition reveals that large deals exist across all stages, meaning early-stage triage has direct implications for revenue forecasting accuracy.

Hypothesis Testing established statistical validity. Both ANOVA and Kruskal-Wallis indicated that deal value differences across stages are statistically significant (p < 0.05) and unlikely to reflect sampling variation alone. The deal age hypothesis test returned a non-significant result, providing no evidence that time in pipeline predicts deal value.

Correlation Analysis quantified the strength of those relationships. Stage_Rank is the most strongly correlated numeric predictor of Amount. Deal_Age_Days shows a weak, non-significant association. This pattern directly informs which KPIs belong on a management dashboard.

Regression translated associations into quantified, directional coefficients. Stage_Rank is the dominant, statistically significant predictor: each stage advancement is associated with an increase of roughly 92% in expected deal value (coefficient 0.6506 on the log scale), net of deal owner and deal age effects. Owner-level effects exist but vary in significance, partly reflecting unequal sub-sample sizes across the seven representatives.

Single collective recommendation: Dochase ADX management should implement a stage-progression management framework:

Weekly executive review of all Negotiation/Review deals
Automated CRM alerts for deals stagnant > 30 days in any mid-pipeline stage
Stage-based performance incentives rather than volume-based targets

The data provide statistically significant evidence that how far a deal has progressed, not how long it has existed, is the strongest observed predictor of its value.
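
The alert and review rules in the recommendation can be expressed as a simple filter over CRM records. The sketch below is a minimal illustration, not the Zoho CRM configuration: the `Deal` fields, in particular `days_in_stage`, are hypothetical (the analysed dataset records overall deal age, not time in the current stage), and treating Qualification, Proposal, and Negotiation/Review as the mid-pipeline stages is an assumption.

```python
from dataclasses import dataclass

STAGNATION_THRESHOLD_DAYS = 30
# Assumption: which stages count as "mid-pipeline" is a management decision.
MID_PIPELINE_STAGES = frozenset({"Qualification", "Proposal", "Negotiation/Review"})


@dataclass(frozen=True)
class Deal:
    deal_id: str         # hypothetical identifier
    stage: str           # current pipeline stage
    days_in_stage: int   # hypothetical field: days since the last stage change


def stagnant_deals(deals, threshold_days=STAGNATION_THRESHOLD_DAYS,
                   stages=MID_PIPELINE_STAGES):
    """Return mid-pipeline deals that have not moved for more than the threshold."""
    return [d for d in deals if d.stage in stages and d.days_in_stage > threshold_days]


def weekly_review_queue(deals):
    """Return every deal currently in Negotiation/Review for executive review."""
    return [d for d in deals if d.stage == "Negotiation/Review"]


# Illustrative records only; these are not rows from the CRM extract.
example_deals = [
    Deal("D-001", "Negotiation/Review", 45),
    Deal("D-002", "Proposal", 12),
    Deal("D-003", "Closed Won", 90),
    Deal("D-004", "Qualification", 31),
]

alerts = stagnant_deals(example_deals)
review = weekly_review_queue(example_deals)
print("Stagnation alerts:", [d.deal_id for d in alerts])
print("Weekly review:", [d.deal_id for d in review])
```

Keeping the threshold and the stage set as named constants makes both easy to tune once management has agreed what counts as stagnation.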

11 Limitations & Further Work

11.1 Limitations

Sample size (n = 135): Adequate for overall ANOVA but limits per-owner regression analysis. Tebogo Makobo (n = 1) and Uchenna (n = 8) cannot be meaningfully compared in sub-group models.

Missing operational variables: Campaign type, client industry tier, media product (programmatic vs. Rich SMS vs. USSD), and client relationship tenure are absent. These likely explain a substantial portion of residual variance in Model 2 and represent confounders in every correlation observed.

Cross-sectional snapshot: The dataset is a single CRM export on 29 April 2026. Active deals may have progressed since extraction. Longitudinal panel data would enable stronger causal inference about stage progression effects.

Ordinal stage encoding assumes equal spacing: Stage_Rank (0–7) treats each transition as equidistant in value terms, which may not reflect the real commercial effort and value-creation at each step.

Outlier influence: Two extreme campaigns (NGN 348M and NGN 1B) drive the mean and ANOVA group statistics for the Qualification and Proposal stages respectively. Log-transformation mitigates this in regression but not in ANOVA group means.
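
The equal-spacing limitation noted above can be made concrete by comparing the single-slope rank coding used in Model 2 with an indicator (treatment) coding that would give each stage its own coefficient. The sketch below is illustrative: it takes the Stage_Rank coefficient from Table 7 and the 0 to 7 range stated above, while the function names are hypothetical and no alternative model has been fitted.

```python
import math

STAGE_RANK_COEFFICIENT = 0.6506  # Table 7, log scale
N_STAGE_LEVELS = 8               # Stage_Rank takes the values 0 to 7


def implied_value_ratio(rank_from: int, rank_to: int,
                        coefficient: float = STAGE_RANK_COEFFICIENT) -> float:
    """Value multiple implied by the linear rank coding between two ranks."""
    return math.exp(coefficient * (rank_to - rank_from))


def treatment_coding(rank: int, n_levels: int = N_STAGE_LEVELS) -> list:
    """Indicator columns for ranks 1..n_levels-1, with rank 0 as the reference."""
    if not 0 <= rank < n_levels:
        raise ValueError("rank is outside the defined stage range")
    return [1 if rank == level else 0 for level in range(1, n_levels)]


# Under rank coding every single-step transition implies the same multiple.
early_step = implied_value_ratio(0, 1)
late_step = implied_value_ratio(6, 7)
print(f"Rank 0 to 1: x{early_step:.2f} | Rank 6 to 7: x{late_step:.2f}")

# Under treatment coding each stage has its own column, and so its own coefficient.
print("Indicator row for rank 3:", treatment_coding(3))
```

Treatment coding would use seven stage coefficients instead of one, which relaxes the equal-spacing assumption at the cost of degrees of freedom in a sample of 135 deals.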

11.2 Further Analytical Extensions

The following extensions represent scalable enhancements that could further strengthen pipeline intelligence and decision-making at Dochase ADX:

Win-probability modelling (logistic regression): to assign real-time conversion likelihood to active deals, improving sales prioritisation and pipeline forecasting.
Time-to-close modelling (survival analysis using Kaplan-Meier and Cox Proportional Hazards): to better understand deal velocity across stages, owners, and deal sizes.
Integration of CRM and campaign delivery data: to evaluate revenue quality by linking deal outcomes with impressions, reach, and campaign performance metrics.
Owner-level performance modelling (Bayesian hierarchical methods): to produce more stable and fair comparisons across sales agents with uneven deal distributions.
Expanded CRM feature set (industry, campaign type, acquisition source): to significantly improve predictive accuracy and segmentation depth.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966

Cairo, A. (2016). The truthful art: Data, charts, and maps for communication. New Riders.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Fisher, R. A. (1925). Statistical methods for research workers. Oliver & Boyd.

Gujarati, D. N., & Porter, D. C. (2009). Basic econometrics (5th ed.). McGraw-Hill.

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.

Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Graphics Press.

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

13 Appendix: Software & Package Citations

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. https://www.jstatsoft.org/v40/i03/

Iannone, R., Cheng, J., Schloerke, B., Haughton, S., Hughes, E., Lauer, A., François, R., Seo, J., Brevoort, K., & Roy, O. (2026). gt: Easily create presentation-ready display tables (R package version 1.3.0). https://CRAN.R-project.org/package=gt

Primary dataset citation:

Njoku, C. (2026). Dochase ADX CRM deal records — 2025 pipeline extract [Dataset]. Collected from Dochase ADX Zoho CRM, Lagos, Nigeria. Data available on request from the author.

14 Appendix: AI Usage Statement

Claude (Anthropic, version Claude Sonnet 4.6) was used as a coding assistant during the preparation of this document. Specifically, AI assistance was used for: (1) writing and debugging R code for data cleaning, visualisation (ggplot2), and statistical tests; (2) structuring the Quarto YAML header and document formatting; and (3) identifying and correcting a date-parsing error (dmy() vs mdy()) introduced by the CRM’s date format.

All analytical decisions were made independently by the analyst: the choice of which five techniques to apply and why, the formulation of both research hypotheses, the decision to retain outliers rather than remove them, the choice of median over mean imputation, the selection of Model 2 (log-transformed) as the preferred regression specification, and all business interpretations and strategic recommendations. The AI did not interpret statistical outputs or generate conclusions — those judgements reflect the analyst’s professional expertise at Dochase ADX and independent academic work.

Case Study 1 — Data Analytics II | Lagos Business School | Prof Bongo Adi | May 2026
