Lagos Business School • April 2026
Capstone Case Study
Data Analytics II
Assessment Brief
Prof Bongo Adi • Lagos Business School
⚠️ SELECT ONE CASE STUDY ONLY
Three case studies are provided below. You are required to complete
exactly one (1) of them. Read all three before choosing —
select the case that best fits the data you can realistically collect
from your own professional context.
If you submit more than one case study, only the first one
submitted will be marked.
Assessment weight: This case study carries a total of
100 marks, broken into three components: 30 marks for
real-data generation and business-operations mapping, 50 marks for the
submitted analytical work, and 20 marks for a defence
held approximately one week after submission. The 30-mark practical
component stands in for the practical assignment portion of the course.
This is an individual assignment — each student must submit
and defend their own independent work.
Submission format: Live HTML published on RPubs or
Posit Connect Cloud.
Tool: Quarto in RStudio or Positron.
👥 INDIVIDUAL ASSIGNMENT
Each student must independently collect their own data, conduct their own
analysis, and submit their own Quarto document. Identical or near-identical
submissions, shared datasets presented as independent, or any form of
collusion will be treated as academic misconduct. You will be required to
defend your work in person — you must be able to explain every line
of code and every result.
| # | Theme | Five Techniques | Marks (A + B + Defence) |
| CS 1 | Exploratory & Inferential Analytics | EDA · Visualisation · Hypothesis Testing · Correlation · Regression | 100 (30 + 50 + 20) |
| CS 2 | Predictive Modelling & Segmentation | Classification · Explainability · Clustering · Dimensionality Reduction · Time Series | 100 (30 + 50 + 20) |
| CS 3 | Advanced & Operational Analytics | Text Analytics · Monte Carlo · Advanced Forecasting · Customer/People Analytics · Optimisation or Association Rules | 100 (30 + 50 + 20) |
Each submission must use real data collected from your own workplace,
professional practice, or a Nigerian/African organisation you have direct access to.
Simulated or publicly downloaded datasets may only supplement primary data
— they cannot replace it.
General Instructions & Assessment Philosophy
This assessment asks you to act as a practising data scientist tackling a real problem
in your own organisation or sector. You will collect or extract genuine data, frame an
analytical question, apply five techniques from the textbook, and communicate your
findings through a reproducible Quarto document.
1.1 The Real-Data Requirement (30 marks)
The defining feature of this assessment — and the source of the 30-mark
practical component — is the use of real data that clearly maps to
your business operations. These 30 marks are awarded for the quality,
transparency, and operational relevance of your data, not for the analyses
themselves. Each submission must include:
- A professional disclosure statement (Section 2 of your document)
explaining your job role, your organisation's domain, and why the five techniques
you chose are directly relevant to your day-to-day work.
- At least one primary dataset you collected yourself — from your
organisation's systems, a survey you designed and administered, web scraping, or
direct field observation. The dataset must contain a minimum of
100 observations and at least 5 variables.
- A data provenance section describing: how the data were collected,
the sampling frame, the time period covered, ethical approvals or consent obtained,
and any data-sharing restrictions that affect what you can publish.
- A technique justification paragraph for each of the five methods,
explaining why it is the appropriate technique given your specific business
context and data structure.
Academic integrity: Submitting simulated data presented as real data
is academic misconduct and will result in a grade of zero for the entire submission.
Prof Bongo Adi may ask you to present your data collection process during a viva voce.
30-Mark Practical Component — Breakdown
| Real-Data & Business-Operations Component | Marks |
| Primary data collection: documented methodology, source, tools used | 10 |
| Sampling justification: sample frame, sample size, period, and statistical rationale | 10 |
| Clear mapping to your business operations: each technique linked to a real decision or process in your organisation | 10 |
| SECTION A TOTAL | 30 |
1.2 Quarto Document Requirements
Your submission is a single Quarto (.qmd) document rendered to HTML. It must:
- Open with a YAML header specifying title, author, date, format: html, toc: true,
code-fold: true, self-contained: true.
- Use R and/or Python code chunks. A panel-tabset showing both languages is
strongly encouraged and attracts higher marks within the 100-mark total.
- Include all data loading, cleaning, and analysis code — the document
must be fully reproducible from top to bottom.
- Embed all visualisations inline; no external image files.
- Be published as a live HTML URL on RPubs (rpubs.com/publish from RStudio) or
Posit Connect Cloud (connect.posit.cloud from Positron).
- Optionally: push the .qmd source and data files to a public GitHub repository
and include the repo URL in your submission; this attracts up to +5 bonus marks.
A minimal YAML header:
---
title: "[Your Case Study Title]"
author: "[Your Full Name]"
date: today
format:
  html:
    theme: flatly
    toc: true
    code-fold: true
    self-contained: true
---
1.3 Standard Document Structure
| Section | Content Required |
| 1. Executive Summary | 150–200 words: business problem, data collected, key findings, and recommendation. |
| 2. Professional Disclosure | Your job title, organisation type/sector, and a paragraph for each technique explaining its operational relevance to your work. (Assessed in the 30-mark component.) |
| 3. Data Collection & Sampling | Source, collection method, sampling frame, sample size, time period covered, and ethical notes or consent statement. (Assessed in the 30-mark component.) |
| 4. Data Description | Variable names, types, and distributions produced with EDA code. |
| 5–9. Analysis (one section per technique) | For each technique: brief theory recap, business justification, code, output, and plain-language interpretation for a non-technical manager. |
| 10. Integrated Findings | How do the five analyses fit together? What single recommendation do they collectively support? |
| 11. Limitations & Further Work | What would you do differently with more data, time, or computing power? |
| References | APA format. Cite the textbook, R/Python packages (use citation("pkgname")), and data sources. |
| Appendix: AI Usage Statement | One paragraph describing which AI tools (if any) assisted with coding, and where you exercised independent analytical judgement. |
1.4 Full Marking Scheme (100 marks total)
Section A — Real-Data & Business-Operations Component (30 marks)
See Section 1.1 breakdown above.
Section B — Submitted Analytical Work (50 marks)
| Analytical Component | Marks |
| Professional disclosure quality and depth of context linkage | 5 |
| Correct and appropriate application of each of the five techniques (5 × 6 marks) | 30 |
| Depth of business interpretation per technique | 8 |
| Code quality, reproducibility, and document structure | 4 |
| Integrated conclusion and actionable recommendation | 3 |
| SECTION B TOTAL | 50 |
Section C — Defence / Viva Voce (20 marks) — held approximately one week after submission
The defence is a short individual oral examination (approx. 10–15 minutes) conducted by
Prof Bongo Adi or a designated examiner. You will be asked to explain your data collection
process, justify your analytical choices, and interpret selected outputs on the spot.
No slides are required — bring your live HTML document. The defence confirms that the
submitted work is genuinely your own.
| Defence Component | Marks |
| Ability to explain analytical decisions and justify technique selection | 8 |
| Correct interpretation of model outputs and statistical results under questioning | 8 |
| Demonstrated ownership: evidence that the data, code, and conclusions are genuinely yours | 4 |
| SECTION C TOTAL | 20 |
| Grand Summary | Marks |
| Section A — Real-data generation & business-operations mapping | 30 |
| Section B — Submitted analytical work | 50 |
| Section C — Defence / viva voce (~1 week after submission) | 20 |
| GitHub repository (bonus) | +5 |
| GRAND TOTAL (excl. bonus) | 100 |
Case Study 1 — Exploratory & Inferential Analytics
Theme: Understanding the story in your data before building models • Total marks: 100
1.1 Overview
This case study focuses on the first and most important phase of any analysis:
understanding what you have. Before fitting models, a rigorous analyst spends
significant time on exploratory data analysis, visualisation, and formal statistical
testing. You will apply these foundational techniques to data from your own professional
context and demonstrate that you can move fluently between exploratory insight and
inferential conclusion.
1.2 Required Techniques
| # | Technique | Book Reference |
| 1 | Exploratory Data Analysis (EDA) | Ch. 4: Summary stats, missing-value analysis, outlier detection, Anscombe's Quartet |
| 2 | Data Visualisation | Ch. 5: Grammar of graphics, chart selection, storytelling with data |
| 3 | Hypothesis Testing | Ch. 6: t-test, chi-squared, ANOVA, non-parametric alternatives, effect sizes |
| 4 | Correlation Analysis | Ch. 8: Pearson, Spearman, Kendall; partial correlation; correlation vs causation |
| 5 | Linear or Logistic Regression | Ch. 9 (OLS) or Ch. 13 (logistic): coefficients, diagnostics, interpretation |
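As a rough illustration of the first technique, the sketch below summarises one numeric column, counts missing values, and flags outliers with Tukey's 1.5 × IQR rule. It uses only the Python standard library; the function name `describe_column` and the loan figures are hypothetical, and in your own document you would more likely work with a data-frame library on your real data.

```python
import statistics

def describe_column(values):
    """Summarise one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "outliers": [v for v in present if v < lower or v > upper],
    }

# Hypothetical loan amounts (illustrative only); None marks a missing entry.
loan_amounts = [120, 150, 135, None, 160, 145, 980, 130, None, 155]
summary = describe_column(loan_amounts)
```

Here the single extreme value pulls the mean far above the median, which is the kind of data quality issue the EDA section is expected to surface and explain.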
1.3 Business Context Examples (illustrative only — use your own)
- Bank credit officer: Collect anonymised loan application data from
your team's portfolio. EDA the distribution of loan amounts and repayment status.
Test whether interest rates differ significantly across loan categories. Correlate
credit scores with default rates. Regress repayment probability on applicant
characteristics.
- HR manager: Survey your team (or extract from your HRIS) on
attendance, performance scores, tenure, and training hours. EDA distributions.
Test whether performance differs by department. Correlate training hours with
ratings. Regress performance on tenure and training investment.
- FMCG sales representative: Record weekly sales across your
territory by product and outlet type. EDA sales distributions and seasonal
patterns. Test whether sales differ by outlet type. Correlate promotional
spend with uplift. Regress sales on price and promotion.
1.4 Data Requirements
- Minimum 100 observations and at least 6 variables (at least 3 numeric, 2 categorical,
1 date or time variable).
- Variables must include at least one outcome/dependent variable and several
predictors that make substantive business sense.
- If your primary data has fewer than 100 rows, supplement with a second collection
period or related source — but document this clearly in Section 3.
1.5 Specific Deliverables
- An EDA section that identifies at least 2 data quality issues and explains how
you handled them (missing values, outliers, skewness).
- A visualisation narrative: at least 5 plots in a cohesive layout that tells a
single story about your dataset.
- Hypothesis testing: formulate at least 2 hypotheses, state H₀ and H₁, check
assumptions, run the test, report p-value and effect size, and interpret in
plain business language.
- A correlation matrix with heatmap; discuss the 2–3 strongest correlations and
their business implications.
- A regression model with diagnostic plots; interpret each significant coefficient
as a concrete business action.
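The hypothesis-testing deliverable asks for both a p-value and an effect size. One possible sketch, assuming two independent groups and using only the standard library, pairs a permutation test (a non-parametric alternative to the t-test) with Cohen's d. The function names and the sales figures are hypothetical, and this is not the only acceptable approach.

```python
import random
import statistics

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def permutation_p_value(a, b, n_perm=5000, seed=42):
    """Two-sided p-value for H0: both groups share the same mean."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (extreme + 1) / (n_perm + 1)

# Hypothetical weekly sales by outlet type (illustrative only)
supermarkets = [52, 48, 55, 60, 58, 53, 57, 61]
kiosks = [41, 39, 45, 47, 40, 44, 43, 46]
p_value = permutation_p_value(supermarkets, kiosks)
effect = cohens_d(supermarkets, kiosks)
```

Reporting the effect size alongside the p-value tells a manager not only whether the difference is unlikely to be chance, but whether it is large enough to act on.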
1.6 Guiding Questions
- What does the distribution of your key outcome variable tell you about the
business process that generated it?
- Which visualisation type best communicates the most important pattern in your
data, and why did you choose it over alternatives?
- What would a statistically significant result in your hypothesis test mean for
a decision your organisation faces right now?
- Which correlation in your data is most plausibly causal, and how would you
design a test to confirm or refute that causality?
- How would you translate your regression coefficient into a recommendation for
a non-technical manager?
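For the last guiding question, a minimal sketch of turning a regression slope into a plain-language statement is shown below. It fits a one-predictor least-squares line by hand; the data and the wording of `message` are hypothetical, and a real submission would use a full regression routine with diagnostics.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: training hours and performance scores (illustrative only)
training_hours = [2, 4, 6, 8, 10]
performance = [62, 64, 70, 72, 78]
slope, intercept = fit_simple_ols(training_hours, performance)
message = (f"Each extra training hour is associated with about "
           f"{slope:.1f} more performance points.")
```

Note the phrase "is associated with": a slope from observational data describes a pattern, not a guaranteed effect of the intervention.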
Case Study 2 — Predictive Modelling & Segmentation
Theme: Building models that predict outcomes and discover hidden groups • Total marks: 100
2.1 Overview
Machine learning transforms data into decisions. In this case study you will move
beyond description and inference to build predictive and segmentation models on your
own data. You will demonstrate that you understand not just how to run a model, but
how to evaluate it, explain it, and connect its output to a concrete business action.
You will also show that unsupervised techniques can reveal structure that supervised
methods assume away.
2.2 Required Techniques
| # | Technique | Book Reference |
| 1 | Classification Model | Ch. 12–15: Logistic regression, decision tree, random forest, or XGBoost |
| 2 | Model Evaluation & Explainability | Ch. 12, 16: Confusion matrix, ROC/AUC, SHAP values, LIME, feature importance |
| 3 | Customer/Entity Segmentation (Clustering) | Ch. 19–21: K-Means, hierarchical, DBSCAN; silhouette score; cluster profiling |
| 4 | Dimensionality Reduction | Ch. 22: PCA, t-SNE, or UMAP; biplot; variance explained |
| 5 | Time Series Analysis | Ch. 23–24: Decomposition, stationarity test, ARIMA or ETS forecast |
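To make the silhouette score in technique 3 concrete, the sketch below computes it from scratch for a handful of 2-D points. It assumes at least two clusters and no cluster made up entirely of identical points; the points and labels are hypothetical, and a clustering library would normally provide this metric.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 2-D observations forming two visibly separate groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
```

A score near 1 indicates tight, well-separated clusters; the deliberately scrambled labelling scores much lower, which is the comparison behind an optimal-k justification.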
2.3 Business Context Examples (illustrative only)
- Retail bank: Collect customer transaction data. Build a classifier
to predict whether a customer will default or churn in the next 90 days. Explain
the model to a credit committee with SHAP. Cluster customers into risk/value
segments. Use PCA to visualise the segment landscape. Forecast monthly transaction
volumes with ARIMA.
- Supply chain manager: Collect SKU-level demand and delivery data.
Classify deliveries as on-time or late. Explain which supplier variables drive
delays. Cluster SKUs by demand pattern. Reduce the feature space with PCA for
visualisation. Forecast demand for the top-10 SKUs for the next quarter.
- Hospital administrator: Collect patient admission records. Predict
30-day readmission. Explain the model with SHAP for clinical staff. Cluster
patients by risk profile. Use PCA on clinical indicators. Forecast weekly
bed demand.
2.4 Data Requirements
- Classification: minimum 200 observations, a binary or multi-class outcome variable,
and at least 6 predictor variables.
- Time series: minimum 24 time periods (weekly or monthly). If your data is daily,
aggregate to weekly for this exercise.
- Clustering: the same dataset as classification is acceptable — segment observations
independently of the outcome variable.
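If your series is daily, the weekly aggregation mentioned above might look like the following sketch. It groups by ISO week and sums the values; the function name and the volumes are hypothetical, and you may prefer a mean rather than a sum depending on what the variable measures.

```python
from collections import defaultdict
from datetime import date, timedelta

def aggregate_to_weekly(daily):
    """Sum (date, value) pairs into ISO weeks, keyed by (ISO year, ISO week)."""
    weekly = defaultdict(float)
    for day, value in daily:
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week)] += value
    return dict(sorted(weekly.items()))

# Hypothetical daily volumes: 14 days starting Monday 5 January 2026
start = date(2026, 1, 5)
daily = [(start + timedelta(days=i), 10.0) for i in range(14)]
weekly = aggregate_to_weekly(daily)
```

After aggregating, check that the result still meets the 24-period minimum, and drop any partial first or last week so it does not distort the series.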
2.5 Specific Deliverables
- A classification pipeline: train/test split (or cross-validation), at least two
model types compared, ROC curve, confusion matrix, and a deployment recommendation
(which model and why).
- A SHAP summary plot and waterfall plot for one representative prediction; explain
each top-5 feature in plain language.
- A cluster analysis with optimal-k justification (elbow + silhouette), cluster
profile table, and a naming/labelling exercise for each cluster.
- A PCA biplot showing where clusters sit in the reduced feature space.
- A time series decomposition plot, ACF/PACF analysis, and a 3-period forecast
with prediction intervals.
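The evaluation outputs named in the first deliverable can be sketched without any modelling library, which is useful for checking that you can explain what the library reports. The example below assumes binary labels with 1 as the positive class and a 0.5 threshold; the labels and scores are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out test set: 1 = default, 0 = repaid
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, scores)
```

Precision and recall depend on the threshold you choose, while AUC does not; a deployment recommendation should say which of these matters more for the business decision.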
2.6 Guiding Questions
- Which model architecture performed best, and does the performance difference
justify the added complexity?
- If presenting to a non-technical board, which SHAP output would you show and
how would you explain it?
- What do your clusters reveal about heterogeneity that aggregate statistics
would hide?
- How would you use cluster membership as a feature in your classification model
(target encoding, dummy variables)?
- Is your time series stationary? What transformation — if any — was required
and why does stationarity matter for ARIMA?
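As an informal companion to the stationarity question, the sketch below shows how first differencing removes a trend and how the lag-1 autocorrelation changes as a result. It is not a formal stationarity test (such as the ADF test you would run in practice), and the series is hypothetical.

```python
def difference(series, lag=1):
    """First difference: y[t] - y[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def autocorrelation(series, lag=1):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[i] - mean) * (series[i - lag] - mean)
              for i in range(lag, n))
    return num / denom

# Hypothetical monthly volumes: steady upward trend plus a small wobble
volumes = [100 + 5 * t + (3 if t % 2 else -3) for t in range(24)]
acf_raw = autocorrelation(volumes)
acf_diff = autocorrelation(difference(volumes))
```

The raw series has a high positive lag-1 autocorrelation driven by the trend; after differencing, that persistence disappears, which is why ARIMA includes a differencing step for trending data.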
Case Study 3 — Advanced & Operational Analytics
Theme: Specialised methods for text, risk, forecasting, and optimisation • Total marks: 100
3.1 Overview
The frontier of business analytics extends well beyond structured tables of numbers.
Organisations generate text — customer complaints, employee surveys, board minutes,
social media mentions. Decisions involve risk and uncertainty that can be quantified
through simulation. Operational systems need demand forecasts and optimal allocation
of constrained resources. In this case study you apply five such advanced methods
to your own organisational data.
3.2 Required Techniques
| # | Technique | Book Reference |
| 1 | Text Analytics & Sentiment Analysis | Ch. 27–28: TF-IDF, bag-of-words, VADER/AFINN sentiment, topic modelling (LDA) |
| 2 | Monte Carlo Simulation | Ch. 55: Distribution fitting, simulation workflow, P10/P50/P90, VaR, tornado chart |
| 3 | Advanced Forecasting | Ch. 25–26: Prophet, LightGBM features, walk-forward CV, or hierarchical forecasting |
| 4 | Customer / People Analytics | Ch. 40–44 or Ch. 53–54: RFM, CLV, churn, survival analysis, attrition drivers |
| 5 | Optimisation or Association Rules | Ch. 18 (Apriori / FP-Growth / market basket) or Ch. 49 (LP / EOQ / transportation) |
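To show what TF-IDF in technique 1 actually computes, here is a minimal sketch using term frequency multiplied by log(N / document frequency). This is one common weighting among several variants; the tokenisation is a plain whitespace split with no stop-word removal or stemming, and the tickets are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical complaint tickets (illustrative only)
tickets = [
    "card declined at atm",
    "atm swallowed my card",
    "mobile app login failed",
]
weights = tf_idf(tickets)
```

In the first ticket, "declined" receives a higher weight than "card" because "card" also appears in another ticket; terms that are rare across the collection are treated as more distinctive.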
3.3 Business Context Examples (illustrative only)
- Marketing manager: Collect customer reviews or NPS open-text
responses. Analyse sentiment and extract topics with LDA. Simulate the uncertainty
in your campaign ROI with Monte Carlo. Forecast next quarter's revenue with Prophet.
Compute CLV for your customer base. Apply market basket analysis to identify
cross-sell opportunities.
- Operations / logistics manager: Collect delivery records, incident
logs, and demand data. Analyse the text of incident reports. Simulate procurement
cost uncertainty. Forecast demand hierarchically. Model driver retention with
survival analysis. Solve a vehicle routing or inventory-allocation LP.
- HR / talent manager: Collect exit interview text, performance
reviews, and headcount data. Sentiment-analyse exit text. Simulate annual attrition
under different retention-intervention cost scenarios. Forecast headcount demand.
Model attrition with Cox proportional-hazards. Optimise training-budget allocation
with LP.
3.4 Data Requirements
- Text data: minimum 50 text documents (survey responses, complaint
tickets, emails, meeting notes, social media posts).
- Simulation: identify at least 3 uncertain inputs relevant to a
real financial decision. Fit probability distributions to historical data — do not
assume normality without testing.
- Forecasting: minimum 36 time periods for Prophet; minimum 52
weekly periods for LightGBM features. Justify your forecast horizon.
- Customer/People analytics: minimum 100 individuals with at least
one event variable (churn date, attrition date, last purchase).
- Optimisation: formulate the LP or association-rule problem from
actual operational constraints — your warehouse capacities, budget limits, or
product catalogue.
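The event variable required for customer/people analytics is what makes survival analysis possible. As a sketch of the idea, the Kaplan-Meier estimator below handles censored records (individuals for whom the event has not yet happened); the tenure data are hypothetical, and a dedicated survival library would add confidence intervals and plotting.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time with an event.

    durations: time until the event or until censoring, per individual.
    observed:  True if the event (e.g. attrition) occurred, False if censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical tenure in months; False = still employed at extraction date
tenure = [3, 5, 5, 8, 12, 12, 15, 20]
left = [True, True, False, True, True, False, True, False]
curve = kaplan_meier(tenure, left)
```

Censored individuals still count in the at-risk group up to the time they were last observed, so they inform the estimate without being treated as leavers.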
3.5 Specific Deliverables
- A text preprocessing pipeline (lowercasing, stop words, stemming/lemmatisation)
and a TF-IDF visualisation showing the 20 most distinctive terms.
- A sentiment trend chart (over time or by category) with business interpretation.
- An LDA topic model with coherence-based K selection; name each topic and explain
what organisational behaviour it reflects.
- A Monte Carlo simulation with 10,000 runs: histogram of outcomes, P10/P50/P90
table, and a tornado sensitivity chart.
- A forecast with walk-forward cross-validation RMSE and a 3–6 month forward
projection with confidence bands.
- A survival curve (Kaplan-Meier) or CLV table segmented by a meaningful business
dimension (product, channel, or department).
- Either: association-rules output (top rules by lift with business interpretation)
or an LP solution with shadow prices interpreted as managerial insight.
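The walk-forward cross-validation named in the forecasting deliverable can be sketched as an expanding window that always predicts one step ahead of the training data. The two baseline forecasters and the demand series below are hypothetical; in your submission the `forecast_fn` argument would wrap your actual model.

```python
import math

def walk_forward_rmse(series, forecast_fn, min_train=6):
    """Expanding-window, one-step-ahead evaluation; returns RMSE over folds."""
    squared_errors = []
    for split in range(min_train, len(series)):
        train, actual = series[:split], series[split]
        squared_errors.append((forecast_fn(train) - actual) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def naive_forecast(train):
    """Baseline: next value equals the last observed value."""
    return train[-1]

def mean_forecast(train, window=3):
    """Baseline: next value equals the mean of the last `window` values."""
    recent = train[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly demand (illustrative only)
demand = [20, 22, 21, 25, 27, 26, 30, 31, 29, 34, 36, 35]
rmse_naive = walk_forward_rmse(demand, naive_forecast)
rmse_mean = walk_forward_rmse(demand, mean_forecast)
```

Because each fold trains only on observations that precede the one being predicted, the RMSE reflects how the model would have performed in real use, and a simple baseline gives your chosen model something to beat.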
3.6 Guiding Questions
- What does the sentiment trend reveal about the trajectory of customer or employee
experience in your organisation?
- Which topics emerged from LDA that were not visible in quantitative surveys — and
what does that suggest about your measurement instruments?
- In your Monte Carlo simulation, which uncertain input has the greatest impact on
the outcome? How should that drive your risk management priorities?
- What is the probability that the project you simulated generates a negative return?
At what input value does it break even?
- Which customer segment has the highest CLV but also the highest churn risk — and
what retention action does your analytics support?
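For the two Monte Carlo questions, a minimal simulation sketch is given below. It assumes three uncertain inputs drawn from triangular distributions with made-up ranges; in your own work the distributions should be fitted to historical data as required in the data requirements, and the net-return formula will differ.

```python
import random
import statistics

def simulate_net_return(n_runs=10_000, seed=2026):
    """Simulate net return = units sold * unit margin - fixed cost."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_runs):
        # random.triangular takes (low, high, mode); ranges are hypothetical
        units = rng.triangular(800, 1500, 1100)
        margin = rng.triangular(1.5, 3.0, 2.2)
        fixed_cost = rng.triangular(1800, 2600, 2100)
        outcomes.append(units * margin - fixed_cost)
    return outcomes

outcomes = simulate_net_return()
deciles = statistics.quantiles(outcomes, n=10)  # nine cut points: P10 ... P90
p10, p50, p90 = deciles[0], deciles[4], deciles[8]
prob_loss = sum(1 for x in outcomes if x < 0) / len(outcomes)
```

The share of runs below zero estimates the probability of a negative return. Rerunning the simulation with one input fixed at its low and then its high value, while the others vary, gives the bars of a tornado chart.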
Submission, Data Privacy & Honour Code
4.1 Submission Instructions
- Render your Quarto document to HTML with self-contained: true.
- Publish to RPubs: in RStudio click Publish → RPubs. Copy the resulting URL.
- Or publish to Posit Connect Cloud from Positron: quarto publish connect.
- Submit the live URL (not a file) via the LMS by the stated deadline.
- For the GitHub bonus: create a public repo, commit your .qmd and
data files (anonymise if needed), and include the repo URL in your submission.
- Each student must submit independently. Shared datasets, identical
code, or near-identical documents will be flagged as collusion.
Deadline: As announced on the LMS. Late submissions lose 5 marks per
day. Extensions require documentation submitted at least 48 hours before the deadline.
Contact Prof Bongo Adi at badi@lbs.edu.ng.
4.2 Defence / Viva Voce (Section C — 20 marks)
Approximately one week after the submission deadline, each student
will attend a short individual defence session with Prof Bongo Adi or a designated
examiner. The session lasts approximately 10–15 minutes.
- Bring your published HTML document open in a browser — no additional slides required.
- Be prepared to: walk through your data collection process, explain why you chose each
technique, interpret any output the examiner points to, and discuss what you would
do differently.
- The defence schedule will be communicated via the LMS within 48 hours of the
submission deadline.
- Failure to attend without a documented medical or compassionate reason will result
in zero marks for Section C (20 marks).
- The defence also serves as verification of independent work. Students who cannot
credibly explain their own submission may have their overall grade reviewed.
Tip for defence preparation: For each of your five techniques, be ready
to answer: (1) Why this technique for this data? (2) What do the key numbers mean?
(3) What is the single most important business implication? (4) What assumption might be
violated and how would that affect your conclusion?
4.3 Data Privacy & Ethics
If your data contains personally identifiable information (PII) — names, employee IDs,
customer account numbers — you must anonymise before submission. Replace identifiers with
codes (Customer_001, Employee_A, etc.). Do not publish raw financial data that your
organisation treats as confidential. If in doubt, obtain written permission from your
organisation before submitting and include a copy in an appendix.
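The identifier replacement described above could be sketched as follows. The function name, field name, and account numbers are hypothetical; note that replacing identifiers with codes does not by itself guarantee anonymity if other columns can identify a person, so review the remaining variables as well.

```python
def pseudonymise(records, id_field, prefix="Customer"):
    """Replace each distinct identifier with a code such as Customer_001."""
    mapping = {}
    cleaned = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            mapping[original] = f"{prefix}_{len(mapping) + 1:03d}"
        # Copy the record so the raw data is left unmodified
        cleaned.append({**record, id_field: mapping[original]})
    return cleaned, mapping

# Hypothetical rows; the account numbers are made up for illustration
rows = [
    {"account": "0123456789", "balance": 5200},
    {"account": "9876543210", "balance": 870},
    {"account": "0123456789", "balance": 4100},
]
anonymised, key = pseudonymise(rows, "account")
```

The same original identifier always maps to the same code, so repeat customers remain linkable for analysis. Keep the returned mapping private and out of your published document and repository.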
4.4 Academic Integrity & AI Usage
You may use AI coding assistants (GitHub Copilot, Claude, ChatGPT) to help write code,
but the analytical decisions — which technique, which model, how to interpret the output,
what to recommend — must be yours. Include a brief AI usage statement
at the end of your document (one paragraph) describing what you used AI for and where you
made independent judgements. Presenting AI-generated interpretation as your own without
disclosure constitutes academic misconduct.
4.5 Useful Resources
Bibliography & Citation Guide
All submitted work must cite sources in APA 7th edition format.
The minimum required citations are listed below. Add further references as your
analysis demands.
Course Textbook (required citation in every submission)
Adi, B. (2026).
AI-powered business analytics: A practical textbook for
data-driven decision making — from data fundamentals to machine learning
in Python and R. Lagos Business School / markanalytics.online.
https://markanalytics.online
Software & Package Citations
R and Python packages must be cited. Use the commands below to retrieve the
correct citation for each package you use, then format in APA style.
| Software / Package | APA 7th Citation (or how to retrieve it) |
| R language | R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/ |
| Python | Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace. (For specific version, cite platform.python_version() output.) |
| Any R package | Run citation("packagename") in R and copy the BibTeX or text output. Convert to APA 7 format. |
| Any Python package | Cite via the package's official documentation or JOSS/PyPI entry. Include version: import pkg; pkg.__version__. |
| tidyverse | Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686 |
| scikit-learn | Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. |
| ggplot2 | Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4 |
| pandas | McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a |
| Prophet | Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080 |
| SHAP | Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (pp. 4765–4774). Curran Associates. |
| Quarto | Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048 |
How to Cite Your Data Source
Every dataset you use must be cited. Use the appropriate template below:
Primary data you collected:
[Your Name]. (2026). [Descriptive title of dataset] [Dataset]. Collected from
[Organisation/Department], [City, Nigeria]. Data available on request from the author.
Organisational records / internal report:
[Organisation Name]. (Year). [Title of report or data extract] [Internal data].
[Department], [Organisation].
Survey data:
[Your Name]. (2026). [Survey title] [Survey instrument and dataset].
Administered to [population description], [Month Year]. Ethical clearance: [details or N/A].