AI-Powered Data Analytics II

Lagos Business School • April 2026

Capstone Case Study

Data Analytics II

Assessment Brief

Prof Bongo Adi • Lagos Business School

⚠️ SELECT ONE CASE STUDY ONLY

Three case studies are provided below. You are required to complete exactly one (1) of them. Read all three before choosing — select the case that best fits the data you can realistically collect from your own professional context.

If you submit more than one case study, only the first one submitted will be marked.

Assessment weight: This case study carries a total of 100 marks in three components: 30 marks for real-data generation and business-operations mapping, 50 marks for the submitted analytical work, and 20 marks for a defence held approximately one week after submission. The 30-mark practical component stands in for the practical assignment portion of the course. This is an individual assignment: each student must submit and defend their own independent work. Submission format: live HTML published on RPubs or Posit Connect Cloud. Tool: Quarto in RStudio or Positron.

👥 INDIVIDUAL ASSIGNMENT

Each student must independently collect their own data, conduct their own analysis, and submit their own Quarto document. Identical or near-identical submissions, shared datasets presented as independently collected, or any other form of collusion will be treated as academic misconduct. You will be required to defend your work in person, so you must be able to explain every line of code and every result.

#	Theme	Five Techniques	Marks (A + B + Defence)
CS 1	Exploratory & Inferential Analytics	EDA · Visualisation · Hypothesis Testing · Correlation · Regression	100 (30 + 50 + 20)
CS 2	Predictive Modelling & Segmentation	Classification · Explainability · Clustering · Dimensionality Reduction · Time Series	100 (30 + 50 + 20)
CS 3	Advanced & Operational Analytics	Text Analytics · Monte Carlo · Advanced Forecasting · Customer/People Analytics · Optimisation or Association Rules	100 (30 + 50 + 20)

Each submission must use real data collected from your own workplace, professional practice, or a Nigerian/African organisation you have direct access to. Simulated or publicly downloaded datasets may only supplement primary data — they cannot replace it.

General Instructions & Assessment Philosophy

This assessment asks you to act as a practising data scientist tackling a real problem in your own organisation or sector. You will collect or extract genuine data, frame an analytical question, apply five techniques from the textbook, and communicate your findings through a reproducible Quarto document.

1.1 The Real-Data Requirement (30 marks)

The defining feature of this assessment — and the source of the 30-mark practical component — is the use of real data that clearly maps to your business operations. These 30 marks are awarded for the quality, transparency, and operational relevance of your data, not for the analyses themselves. Each submission must include:

A professional disclosure statement (Section 2 of your document) explaining your job role, your organisation's domain, and why the five techniques you chose are directly relevant to your day-to-day work.
At least one primary dataset you collected yourself — from your organisation's systems, a survey you designed and administered, web scraping, or direct field observation. The dataset must contain a minimum of 100 observations and at least 5 variables.
A data provenance section describing: how the data were collected, the sampling frame, the time period covered, ethical approvals or consent obtained, and any data-sharing restrictions that affect what you can publish.
A technique justification paragraph for each of the five methods, explaining why it is the appropriate technique given your specific business context and data structure.

Academic integrity: Submitting simulated data presented as real data is academic misconduct and will result in a grade of zero for the entire submission. Prof Bongo Adi may ask you to present your data collection process during a viva voce.

30-Mark Practical Component — Breakdown

Real-Data & Business-Operations Component	Marks
Primary data collection: documented methodology, source, tools used	10
Sampling justification: sample frame, sample size, period, and statistical rationale	10
Clear mapping to your business operations: each technique linked to a real decision or process in your organisation	10
SECTION A TOTAL	30

1.2 Quarto Document Requirements

Your submission is a single Quarto (.qmd) document rendered to HTML. It must:

Open with a YAML header specifying title, author, date, format: html, toc: true, code-fold: true, self-contained: true (recent Quarto releases name this last option embed-resources: true).
Use R and/or Python code chunks. A panel-tabset showing both languages is strongly encouraged and attracts higher marks within the 100-mark total.
Include all data loading, cleaning, and analysis code — the document must be fully reproducible from top to bottom.
Embed all visualisations inline; no external image files.
Be published as a live HTML URL on RPubs (rpubs.com/publish from RStudio) or Posit Connect Cloud (connect.posit.cloud from Positron).
Optionally: push the .qmd source and data files to a public GitHub repository and include the repo URL in your submission — this attracts up to +5 bonus marks.

A minimal YAML header:

---
title: "[Your Case Study Title]"
author: "[Your Full Name]"
date: today
format:
  html:
    theme: flatly
    toc: true
    code-fold: true
    self-contained: true
---

1.3 Standard Document Structure

Section	Content Required
1. Executive Summary	150–200 words: business problem, data collected, key findings, and recommendation.
2. Professional Disclosure	Your job title, organisation type/sector, and a paragraph for each technique explaining its operational relevance to your work. (Assessed in the 30-mark component.)
3. Data Collection & Sampling	Source, collection method, sampling frame, sample size, time period covered, and ethical notes or consent statement. (Assessed in the 30-mark component.)
4. Data Description	Variable names, types, and distributions produced with EDA code.
5–9. Analysis (one section per technique)	For each technique: brief theory recap, business justification, code, output, and plain-language interpretation for a non-technical manager.
10. Integrated Findings	How do the five analyses fit together? What single recommendation do they collectively support?
11. Limitations & Further Work	What would you do differently with more data, time, or computing power?
References	APA format. Cite the textbook, R/Python packages (use `citation("pkgname")`), and data sources.
Appendix: AI Usage Statement	One paragraph describing which AI tools (if any) assisted with coding, and where you exercised independent analytical judgement.

1.4 Full Marking Scheme (100 marks total)

Section A — Real-Data & Business-Operations Component (30 marks)

See Section 1.1 breakdown above.

Section B — Submitted Analytical Work (50 marks)

Analytical Component	Marks
Professional disclosure quality and depth of context linkage	5
Correct and appropriate application of each of the five techniques (5 × 6 marks)	30
Depth of business interpretation per technique	8
Code quality, reproducibility, and document structure	4
Integrated conclusion and actionable recommendation	3
SECTION B TOTAL	50

Section C — Defence / Viva Voce (20 marks) — held approximately one week after submission

The defence is a short individual oral examination (approx. 10–15 minutes) conducted by Prof Bongo Adi or a designated examiner. You will be asked to explain your data collection process, justify your analytical choices, and interpret selected outputs on the spot. No slides are required — bring your live HTML document. The defence confirms that the submitted work is genuinely your own.

Defence Component	Marks
Ability to explain analytical decisions and justify technique selection	8
Correct interpretation of model outputs and statistical results under questioning	8
Demonstrated ownership: evidence that the data, code, and conclusions are genuinely yours	4
SECTION C TOTAL	20

Grand Summary	Marks
Section A — Real-data generation & business-operations mapping	30
Section B — Submitted analytical work	50
Section C — Defence / viva voce (~1 week after submission)	20
GitHub repository (bonus)	+5
GRAND TOTAL (excl. bonus)	100

Case Study 1 — Exploratory & Inferential Analytics

Theme: Understanding the story in your data before building models • Total marks: 100

1.1 Overview

This case study focuses on the first and most important phase of any analysis: understanding what you have. Before fitting models, a rigorous analyst spends significant time on exploratory data analysis, visualisation, and formal statistical testing. You will apply these foundational techniques to data from your own professional context and demonstrate that you can move fluently between exploratory insight and inferential conclusion.

1.2 Required Techniques

#	Technique	Book Reference
1	Exploratory Data Analysis (EDA)	Ch. 4 — Summary stats, missing-value analysis, outlier detection, Anscombe's Quartet
2	Data Visualisation	Ch. 5 — Grammar of graphics, chart selection, storytelling with data
3	Hypothesis Testing	Ch. 6 — t-test, chi-squared, ANOVA, non-parametric alternatives, effect sizes
4	Correlation Analysis	Ch. 8 — Pearson, Spearman, Kendall; partial correlation; correlation vs causation
5	Linear or Logistic Regression	Ch. 9 (OLS) or Ch. 13 (logistic) — coefficients, diagnostics, interpretation

1.3 Business Context Examples (illustrative only — use your own)

Bank credit officer: Collect anonymised loan application data from your team's portfolio. Explore the distribution of loan amounts and repayment status. Test whether interest rates differ significantly across loan categories. Correlate credit scores with default rates. Regress repayment probability on applicant characteristics.
HR manager: Survey your team (or extract from your HRIS) on attendance, performance scores, tenure, and training hours. Explore the distribution of each variable. Test whether performance differs by department. Correlate training hours with ratings. Regress performance on tenure and training investment.
FMCG sales representative: Record weekly sales across your territory by product and outlet type. Explore sales distributions and seasonal patterns. Test whether sales differ by outlet type. Correlate promotional spend with uplift. Regress sales on price and promotion.

1.4 Data Requirements

Minimum 100 observations and at least 6 variables (at least 3 numeric, 2 categorical, and 1 date or time variable).
Variables must include at least one outcome/dependent variable and several predictors that make substantive business sense.
If your primary data has fewer than 100 rows, supplement with a second collection period or related source — but document this clearly in Section 3.
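The thresholds above can be checked mechanically before you start the analysis. The sketch below is a minimal illustration, assuming the data are already loaded as a list of dictionaries with string values and that dates use the ISO format YYYY-MM-DD. The function names are hypothetical, and the demo rows are made up purely to exercise the check; they are not real data.

```python
from datetime import datetime


def infer_type(values):
    """Classify a column as 'numeric', 'date', or 'categorical'."""
    non_missing = [v for v in values if v not in ("", None)]
    if not non_missing:
        return "categorical"
    try:
        for v in non_missing:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        pass
    try:
        for v in non_missing:
            datetime.strptime(v, "%Y-%m-%d")  # assumed date format
        return "date"
    except (TypeError, ValueError):
        return "categorical"


def check_cs1_requirements(rows):
    """rows: list of dicts, one per observation. Returns pass/fail per rule."""
    columns = list(rows[0].keys()) if rows else []
    types = {c: infer_type([r.get(c) for r in rows]) for c in columns}
    counts = {t: sum(1 for v in types.values() if v == t)
              for t in ("numeric", "categorical", "date")}
    return {
        "min_100_observations": len(rows) >= 100,
        "min_6_variables": len(types) >= 6,
        "min_3_numeric": counts["numeric"] >= 3,
        "min_2_categorical": counts["categorical"] >= 2,
        "min_1_date": counts["date"] >= 1,
    }


# Made-up rows for illustration only.
demo_rows = [
    {
        "date": f"2026-01-{(i % 28) + 1:02d}",
        "outlet": "kiosk" if i % 2 else "supermarket",
        "region": "north" if i % 3 else "south",
        "units": str(10 + i),
        "price": "250.0",
        "promo_spend": str(i * 5),
    }
    for i in range(120)
]
cs1_checks = check_cs1_requirements(demo_rows)
print(cs1_checks)
```

A check like this only confirms the shape of the dataset. Whether the variables make business sense still has to be argued in your data description.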

1.5 Specific Deliverables

An EDA section that identifies at least 2 data quality issues and explains how you handled them (missing values, outliers, skewness).
A visualisation narrative: at least 5 plots in a cohesive layout that tells a single story about your dataset.
Hypothesis testing: formulate at least 2 hypotheses, state H₀ and H₁, check assumptions, run the test, report p-value and effect size, and interpret in plain business language.
A correlation matrix with heatmap; discuss the 2–3 strongest correlations and their business implications.
A regression model with diagnostic plots; interpret each significant coefficient as a concrete business action.
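For the hypothesis-testing deliverable, a p-value and an effect size are reported together. The sketch below is a minimal illustration using only the Python standard library, assuming a two-group comparison of means. It uses a permutation test (one non-parametric alternative to the t-test) and Cohen's d. The function names are hypothetical and the scores are invented for illustration; a real submission would use your own data and whichever test its assumptions support.

```python
import random
import statistics


def mean(values):
    return sum(values) / len(values)


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_permutations=5000, seed=42):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Invented performance scores for two departments (illustration only).
dept_a = [60 + (i % 7) for i in range(30)]
dept_b = [55 + (i % 7) for i in range(30)]

effect_size = cohens_d(dept_a, dept_b)
p_value = permutation_p_value(dept_a, dept_b)
print(f"Cohen's d = {effect_size:.2f}, permutation p = {p_value:.4f}")
```

The p-value says whether a difference this large is plausible under H₀; the effect size says whether the difference is large enough to matter for a business decision.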

1.6 Guiding Questions

What does the distribution of your key outcome variable tell you about the business process that generated it?
Which visualisation type best communicates the most important pattern in your data, and why did you choose it over alternatives?
What would a statistically significant result in your hypothesis test mean for a decision your organisation faces right now?
Which correlation in your data is most plausibly causal, and how would you design a test to confirm or refute that causality?
How would you translate your regression coefficient into a recommendation for a non-technical manager?
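When answering the correlation question, it helps to see how Pearson and Spearman can disagree on the same data. The sketch below is a minimal standard-library illustration with hypothetical function names and invented numbers: the relationship is perfectly monotonic but not linear, so the rank-based coefficient is 1 while the linear one is lower.

```python
def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: the outcome grows as the cube of training hours.
training_hours = list(range(1, 11))
outcome = [h ** 3 for h in training_hours]

r_pearson = pearson(training_hours, outcome)
r_spearman = spearman(training_hours, outcome)
print(f"Pearson = {r_pearson:.3f}, Spearman = {r_spearman:.3f}")
```

Neither coefficient says anything about causation. That argument has to come from your knowledge of the business process and from the test design you propose.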

Case Study 2 — Predictive Modelling & Segmentation

Theme: Building models that predict outcomes and discover hidden groups • Total marks: 100

2.1 Overview

Machine learning transforms data into decisions. In this case study you will move beyond description and inference to build predictive and segmentation models on your own data. You will demonstrate that you understand not just how to run a model, but how to evaluate it, explain it, and connect its output to a concrete business action. You will also show that unsupervised techniques can reveal structure that supervised methods assume away.

2.2 Required Techniques

#	Technique	Book Reference
1	Classification Model	Ch. 12–15 — Logistic regression, decision tree, random forest, or XGBoost
2	Model Evaluation & Explainability	Ch. 12, 16 — Confusion matrix, ROC/AUC, SHAP values, LIME, feature importance
3	Customer/Entity Segmentation (Clustering)	Ch. 19–21 — K-Means, hierarchical, DBSCAN; silhouette score; cluster profiling
4	Dimensionality Reduction	Ch. 22 — PCA, t-SNE, or UMAP; biplot; variance explained
5	Time Series Analysis	Ch. 23–24 — Decomposition, stationarity test, ARIMA or ETS forecast

2.3 Business Context Examples (illustrative only)

Retail bank: Collect customer transaction data. Build a classifier to predict whether a customer will default or churn in the next 90 days. Explain the model to a credit committee with SHAP. Cluster customers into risk/value segments. Use PCA to visualise the segment landscape. Forecast monthly transaction volumes with ARIMA.
Supply chain manager: Collect SKU-level demand and delivery data. Classify deliveries as on-time or late. Explain which supplier variables drive delays. Cluster SKUs by demand pattern. Reduce the feature space with PCA for visualisation. Forecast demand for the top-10 SKUs for the next quarter.
Hospital administrator: Collect patient admission records. Predict 30-day readmission. Explain the model with SHAP for clinical staff. Cluster patients by risk profile. Use PCA on clinical indicators. Forecast weekly bed demand.

2.4 Data Requirements

Classification: minimum 200 observations, a binary or multi-class outcome variable, and at least 6 predictor variables.
Time series: minimum 24 time periods (weekly or monthly). If your data is daily, aggregate to weekly for this exercise.
Clustering: you may reuse the classification dataset, but segment the observations without using the outcome variable.

2.5 Specific Deliverables

A classification pipeline: train/test split (or cross-validation), at least two model types compared, ROC curve, confusion matrix, and a deployment recommendation (which model and why).
A SHAP summary plot and waterfall plot for one representative prediction; explain each top-5 feature in plain language.
A cluster analysis with optimal-k justification (elbow + silhouette), cluster profile table, and a naming/labelling exercise for each cluster.
A PCA biplot showing where clusters sit in the reduced feature space.
A time series decomposition plot, ACF/PACF analysis, and a 3-period forecast with prediction intervals.
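For the classification deliverable, the confusion matrix is the source of most headline metrics. The sketch below is a minimal standard-library illustration of how accuracy, precision, recall, and F1 follow from the four cell counts. The function names are hypothetical and the labels are invented; in practice you would use the metrics functions of your modelling library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == positive and p == positive),
        "fp": sum(1 for t, p in pairs if t != positive and p == positive),
        "fn": sum(1 for t, p in pairs if t == positive and p != positive),
        "tn": sum(1 for t, p in pairs if t != positive and p != positive),
    }


def classification_metrics(c):
    """Derive standard metrics from confusion-matrix counts."""
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented test-set labels (1 = default, 0 = no default).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(counts)
print(counts)
print(metrics)
```

Which metric drives the deployment recommendation depends on the relative cost of a false positive and a false negative in your own business context.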

2.6 Guiding Questions

Which model architecture performed best, and does the performance difference justify the added complexity?
If presenting to a non-technical board, which SHAP output would you show and how would you explain it?
What do your clusters reveal about heterogeneity that aggregate statistics would hide?
How would you use cluster membership as a feature in your classification model (for example, through target encoding or dummy variables)?
Is your time series stationary? What transformation — if any — was required and why does stationarity matter for ARIMA?
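On the question of using cluster membership as a feature, dummy variables are the simpler of the two encodings. The sketch below is a minimal standard-library illustration; `dummy_encode` is a hypothetical helper and the labels are invented.

```python
def dummy_encode(cluster_labels, drop_first=True):
    """Turn cluster labels into 0/1 indicator columns, one dict per row."""
    levels = sorted(set(cluster_labels))
    # Dropping one level avoids perfect collinearity with a model intercept.
    kept = levels[1:] if drop_first else levels
    return [
        {f"cluster_{level}": int(label == level) for level in kept}
        for label in cluster_labels
    ]


# Invented cluster assignments for five observations.
cluster_labels = [0, 2, 1, 0, 2]
encoded = dummy_encode(cluster_labels)
for row in encoded:
    print(row)
```

Dropping the first level matters for logistic regression and is usually unnecessary for tree-based models. If you use target encoding instead, compute it on the training data only, otherwise information about the outcome leaks into the test set.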

Case Study 3 — Advanced & Operational Analytics

Theme: Specialised methods for text, risk, forecasting, and optimisation • Total marks: 100

3.1 Overview

The frontier of business analytics extends well beyond structured tables of numbers. Organisations generate text — customer complaints, employee surveys, board minutes, social media mentions. Decisions involve risk and uncertainty that can be quantified through simulation. Operational systems need demand forecasts and optimal allocation of constrained resources. In this case study you apply five such advanced methods to your own organisational data.

3.2 Required Techniques

#	Technique	Book Reference
1	Text Analytics & Sentiment Analysis	Ch. 27–28 — TF-IDF, bag-of-words, VADER/AFINN sentiment, topic modelling (LDA)
2	Monte Carlo Simulation	Ch. 55 — Distribution fitting, simulation workflow, P10/P50/P90, VaR, tornado chart
3	Advanced Forecasting	Ch. 25–26 — Prophet, LightGBM features, walk-forward CV, or hierarchical forecasting
4	Customer / People Analytics	Ch. 40–44 or Ch. 53–54 — RFM, CLV, churn, survival analysis, attrition drivers
5	Optimisation or Association Rules	Ch. 18 (Apriori / FP-Growth / market basket) or Ch. 49 (LP / EOQ / transportation)

3.3 Business Context Examples (illustrative only)

Marketing manager: Collect customer reviews or NPS open-text responses. Analyse sentiment and extract topics with LDA. Simulate the uncertainty in your campaign ROI with Monte Carlo. Forecast next quarter's revenue with Prophet. Compute CLV for your customer base. Apply market basket analysis to identify cross-sell opportunities.
Operations / logistics manager: Collect delivery records, incident logs, and demand data. Analyse the text of incident reports. Simulate procurement cost uncertainty. Forecast demand hierarchically. Model driver retention with survival analysis. Solve a vehicle routing or inventory-allocation LP.
HR / talent manager: Collect exit interview text, performance reviews, and headcount data. Sentiment-analyse exit text. Simulate annual attrition under different retention-intervention cost scenarios. Forecast headcount demand. Model attrition with Cox proportional-hazards. Optimise training-budget allocation with LP.

3.4 Data Requirements

Text data: minimum 50 text documents (survey responses, complaint tickets, emails, meeting notes, social media posts).
Simulation: identify at least 3 uncertain inputs relevant to a real financial decision. Fit probability distributions to historical data — do not assume normality without testing.
Forecasting: minimum 36 time periods for Prophet; minimum 52 weekly periods for LightGBM features. Justify your forecast horizon.
Customer/People analytics: minimum 100 individuals with at least one event variable (churn date, attrition date, last purchase).
Optimisation: formulate the LP or association-rule problem from actual operational constraints — your warehouse capacities, budget limits, or product catalogue.

3.5 Specific Deliverables

A text preprocessing pipeline (lowercasing, stop words, stemming/lemmatisation) and a TF-IDF visualisation showing the 20 most distinctive terms.
A sentiment trend chart (over time or by category) with business interpretation.
An LDA topic model with coherence-based K selection; name each topic and explain what organisational behaviour it reflects.
A Monte Carlo simulation with 10,000 runs: histogram of outcomes, P10/P50/P90 table, and a tornado sensitivity chart.
A forecast with walk-forward cross-validation RMSE and a 3–6 month forward projection with confidence bands.
A survival curve (Kaplan-Meier) or CLV table segmented by a meaningful business dimension (product, channel, or department).
Either: association-rules output (top rules by lift with business interpretation) or an LP solution with shadow prices interpreted as managerial insight.
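The walk-forward cross-validation RMSE in the forecasting deliverable can be sketched as follows. This is a minimal illustration of an expanding-window scheme, assuming a one-step horizon and a naive last-value forecaster as a stand-in for Prophet or any other model. The function names are hypothetical and the series is invented.

```python
import math


def naive_forecast(history, horizon):
    """Placeholder model: repeat the last observed value."""
    return [history[-1]] * horizon


def walk_forward_rmse(series, initial_train, horizon=1, forecaster=naive_forecast):
    """Expanding-window walk-forward validation; RMSE over all forecast points."""
    squared_errors = []
    for end in range(initial_train, len(series) - horizon + 1):
        train = series[:end]                  # everything observed so far
        actual = series[end:end + horizon]    # the next unseen points
        predicted = forecaster(train, horizon)
        squared_errors.extend((a - p) ** 2 for a, p in zip(actual, predicted))
    if not squared_errors:
        raise ValueError("series is too short for the chosen split")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented monthly series with a steady upward trend of 2 units per month.
monthly_demand = [100 + 2 * t for t in range(36)]
rmse = walk_forward_rmse(monthly_demand, initial_train=24)
print(f"Walk-forward RMSE (naive model): {rmse:.2f}")
```

The model is only ever evaluated on observations that come after its training window, which is what separates this from an ordinary random train/test split. The naive model's RMSE also gives a baseline that your chosen forecaster should beat.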

3.6 Guiding Questions

What does the sentiment trend reveal about the trajectory of customer or employee experience in your organisation?
Which topics emerged from LDA that were not visible in quantitative surveys — and what does that suggest about your measurement instruments?
In your Monte Carlo simulation, which uncertain input has the greatest impact on the outcome? How should that drive your risk management priorities?
What is the probability that the project you simulated generates a negative return? At what input value does it break even?
Which customer segment has the highest CLV but also the highest churn risk — and what retention action does your analytics support?
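The Monte Carlo questions above (the P10/P50/P90 range and the probability of a negative return) can be answered directly from the simulated outcomes. The sketch below is a minimal standard-library illustration. The three uncertain inputs, their distributions, and every parameter value are assumptions made up for this example; in your submission the distributions must be fitted to your own historical data.

```python
import random
import statistics

N_RUNS = 10_000


def simulate_net_return(rng):
    """One simulated outcome of a hypothetical project (all inputs assumed)."""
    units_sold = rng.triangular(800, 1500, 1000)   # low, high, mode
    unit_margin = rng.gauss(120.0, 30.0)           # mean, standard deviation
    fixed_cost = rng.uniform(90_000, 130_000)
    return units_sold * unit_margin - fixed_cost


rng = random.Random(2026)  # fixed seed so the run is reproducible
outcomes = [simulate_net_return(rng) for _ in range(N_RUNS)]

# quantiles(n=10) returns nine cut points: P10, P20, ..., P90.
deciles = statistics.quantiles(outcomes, n=10)
p10, p50, p90 = deciles[0], deciles[4], deciles[8]
prob_loss = sum(1 for x in outcomes if x < 0) / N_RUNS

print(f"P10 = {p10:,.0f}  P50 = {p50:,.0f}  P90 = {p90:,.0f}")
print(f"Probability of a negative return: {prob_loss:.1%}")
```

For the tornado chart, one common approach is to rerun the simulation while holding all inputs but one at their central values, then compare the resulting output ranges.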

Submission, Data Privacy & Honour Code

4.1 Submission Instructions

Render your Quarto document to HTML with self-contained: true.
Publish to RPubs: in RStudio click Publish → RPubs. Copy the resulting URL.
Or publish to Posit Connect Cloud from Positron: quarto publish connect.
Submit the live URL (not a file) via the LMS by the stated deadline.
For the GitHub bonus: create a public repo, commit your .qmd and data files (anonymise if needed), and include the repo URL in your submission.
Each student must submit independently. Shared datasets, identical code, or near-identical documents will be flagged as collusion.

Deadline: As announced on the LMS. Late submissions lose 5 marks per day. Extensions require documentation submitted at least 48 hours before the deadline. Contact Prof Bongo Adi at badi@lbs.edu.ng.

4.2 Defence / Viva Voce (Section C — 20 marks)

Approximately one week after the submission deadline, each student will attend a short individual defence session with Prof Bongo Adi or a designated examiner. The session lasts approximately 10–15 minutes.

Bring your published HTML document open in a browser — no additional slides required.
Be prepared to: walk through your data collection process, explain why you chose each technique, interpret any output the examiner points to, and discuss what you would do differently.
The defence schedule will be communicated via the LMS within 48 hours of the submission deadline.
Failure to attend without a documented medical or compassionate reason will result in zero marks for Section C (20 marks).
The defence also serves as verification of independent work. Students who cannot credibly explain their own submission may have their overall grade reviewed.

Tip for defence preparation: For each of your five techniques, be ready to answer: (1) Why this technique for this data? (2) What do the key numbers mean? (3) What is the single most important business implication? (4) What assumption might be violated and how would that affect your conclusion?

4.3 Data Privacy & Ethics

If your data contains personally identifiable information (PII) — names, employee IDs, customer account numbers — you must anonymise before submission. Replace identifiers with codes (Customer_001, Employee_A, etc.). Do not publish raw financial data that your organisation treats as confidential. If in doubt, obtain written permission from your organisation before submitting and include a copy in an appendix.
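Replacing identifiers with codes can be sketched as follows. This is a minimal standard-library illustration, assuming the records are dictionaries. The helper name, field names, and sample records are hypothetical and invented for the example.

```python
def pseudonymise(records, id_field, prefix, drop_fields=()):
    """Replace each distinct identifier with a sequential code such as Customer_001.

    Returns the cleaned records and the mapping from raw ID to code.
    """
    mapping = {}
    cleaned = []
    for record in records:
        raw_id = record[id_field]
        if raw_id not in mapping:
            mapping[raw_id] = f"{prefix}_{len(mapping) + 1:03d}"
        row = {k: v for k, v in record.items() if k not in drop_fields}
        row[id_field] = mapping[raw_id]
        cleaned.append(row)
    return cleaned, mapping


# Invented records for illustration only.
raw_records = [
    {"account_no": "0123456789", "name": "A. Example", "balance": 52000},
    {"account_no": "9876543210", "name": "B. Example", "balance": 18000},
    {"account_no": "0123456789", "name": "A. Example", "balance": 47500},
]

clean_records, id_mapping = pseudonymise(
    raw_records, id_field="account_no", prefix="Customer", drop_fields=("name",)
)
for row in clean_records:
    print(row)
```

The same raw identifier always receives the same code, so repeat customers remain linkable in the analysis. Keep the mapping out of the published document and out of any public repository, since it reverses the anonymisation.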

4.4 Academic Integrity & AI Usage

You may use AI coding assistants (GitHub Copilot, Claude, ChatGPT) to help write code, but the analytical decisions — which technique, which model, how to interpret the output, what to recommend — must be yours. Include a brief AI usage statement at the end of your document (one paragraph) describing what you used AI for and where you made independent judgements. Presenting AI-generated interpretation as your own without disclosure constitutes academic misconduct.

4.5 Useful Resources

Resource	URL
Course textbook (see bibliography)	markanalytics.online
Quarto documentation	quarto.org/docs
RPubs publishing	rpubs.com — sign up free, publish from RStudio in one click
Posit Connect Cloud	connect.posit.cloud — free tier for students
GitHub Student Pack	education.github.com/pack — free Pro account
NBS Nigeria	nigerianstat.gov.ng — household survey & labour force data
CBN Statistics	cbn.gov.ng/documents/statbulletin.asp
Paystack Developer API	developers.paystack.com — fintech transaction data

Bibliography & Citation Guide

All submitted work must cite sources in APA 7th edition format. The minimum required citations are listed below. Add further references as your analysis demands.

Course Textbook (required citation in every submission)

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Software & Package Citations

R and Python packages must be cited. Use the commands below to retrieve the correct citation for each package you use, then format in APA style.

Software / Package	APA 7th Citation (or how to retrieve it)
R language	R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Python	Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace. (For the specific version, report the output of `platform.python_version()`.)
Any R package	Run `citation("packagename")` in R and copy the BibTeX or text output. Convert to APA 7 format.
Any Python package	Cite via the package's official documentation or JOSS/PyPI entry. Include version: `import pkg; pkg.__version__`.
tidyverse	Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
scikit-learn	Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
ggplot2	Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
pandas	McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Prophet	Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080
SHAP	Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (pp. 4765–4774). Curran Associates.
Quarto	Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

How to Cite Your Data Source

Every dataset you use must be cited. Use the appropriate template below:

Primary data you collected:
[Your Name]. (2026). [Descriptive title of dataset] [Dataset]. Collected from [Organisation/Department], [City, Nigeria]. Data available on request from the author.

Organisational records / internal report:
[Organisation Name]. (Year). [Title of report or data extract] [Internal data]. [Department], [Organisation].

Survey data:
[Your Name]. (2026). [Survey title] [Survey instrument and dataset]. Administered to [population description], [Month Year]. Ethical clearance: [details or N/A].