Topic One: Introduction to Data Analytics

Learning Objectives

By the end of this topic, students should be able to:

Define data analytics
Explain the role of data analytics in organizations
Describe the data analytics lifecycle
Differentiate between descriptive, diagnostic, predictive, and prescriptive analytics

1.1 What is Data Analytics?

Data analytics is the process of collecting, cleaning, transforming, and analyzing data to generate insights and support decision-making. It bridges the gap between raw data and actionable insight. Data analytics is one of the most transformative disciplines in modern business and science.

Main purpose:

Understand past events
Explain causes
Predict future outcomes
Recommend actions

Why organizations use it:

Helps in decision-making (data > guesswork)
Improves efficiency (find and fix problems)
Understands customers better
Detects risks & fraud
Supports planning for the future
Tracks performance (KPIs)

📝 In short:

Data analytics turns raw data → useful insights → better actions

1.2 Data Analytics in the Context of DS, ML, and AI

Relationship:

Data Analytics → extracts insights from data
Data Science → broader field combining statistics, programming, and domain knowledge
Machine Learning → builds models that learn from data
Artificial Intelligence → enables intelligent systems and automated decisions

Key idea:

Data analytics provides the foundation for ML and AI by preparing and interpreting data.

1.3 Why Data Analytics Matters in Organizations

Organizations today generate and have access to massive volumes of data. The ability to harness this data for actionable insight provides a strategic competitive advantage.

Key Reasons Organizations Invest in Data Analytics

#	Reason	Description
1	Informed Decision-Making	Replace gut-feeling decisions with evidence-based strategies
2	Cost Reduction	Identify inefficiencies, optimize supply chains, reduce waste
3	Revenue Growth	Discover new market opportunities, optimize pricing, upselling
4	Risk Management	Detect fraud, predict failures, assess credit risk
5	Customer Understanding	Segment customers, personalize experiences, predict churn
6	Operational Efficiency	Streamline processes, predictive maintenance, workforce optimization
7	Innovation	Identify trends, develop new products, explore new business models

Organizations use data analytics to:

Make better decisions
Improve efficiency
Understand customers
Measure performance
Gain competitive advantage

Example:

A retail company analyzes sales data to decide which products to restock.

1.4 Data Analytics Across Industries

Data analytics is not limited to technology companies. Its applications span virtually every sector:

Healthcare

Predicting disease outbreaks and patient readmission
Drug discovery and genomics research
Optimizing hospital resource allocation
Personalized treatment plans

Finance & Banking

Credit scoring and loan approval
Algorithmic trading
Fraud detection in real time
Anti-money laundering (AML) compliance

Retail & E-Commerce

Customer segmentation and targeting
Recommendation engines (e.g., Amazon, Netflix)
Demand forecasting and inventory management
Dynamic pricing strategies

Manufacturing

Predictive maintenance of equipment
Quality control through sensor data
Supply chain optimization
Digital twins for simulation

Government & Public Sector

Census data analysis and urban planning
Crime pattern analysis and predictive policing
Tax fraud detection
Public health monitoring

Telecommunications

Network optimization
Customer churn prediction
Sentiment analysis of customer feedback

Sports

Player performance analytics (e.g., Moneyball)
Injury prediction
Fan engagement optimization

1.5 Role of Data Analytics in Organizations

1. Better decision-making

Supports evidence-based decisions (Helps managers make informed decisions)
Reduces reliance on guesswork

📝 Example: Choosing the best product to invest in

2. Process improvement and efficiency

Identifies bottlenecks, waste, and delays
Optimizes processes

📝 Example: Reducing delivery time in logistics

3. Customer understanding

Identifies customer needs and preferences
Analyzes customer behavior
Helps in customer segmentation

📝 Example: Recommending products/services

4. Risk management

Detects fraud, errors, and unusual patterns/behavior
Supports safer operations

📝 Example: Fraud detection in banking

5. Strategic Planning

Helps in long-term planning and forecasting
Identifies market trends

📝 Example: Deciding future business expansion

6. Performance Measurement

Tracks progress using KPIs
Evaluates success of strategies

📝 Example: Monitoring sales performance

Examples of analytics use by department:

Marketing: customer segmentation, campaign analysis
Finance: forecasting, fraud detection
Operations: inventory and supply chain optimization
HR: employee performance and retention
Healthcare: patient analysis and disease prediction

1.6 The Data-Driven Organization: Key Characteristics

Research by McKinsey, MIT, and others consistently shows that data-driven organizations share common traits:

Data is treated as a strategic asset, with governance and quality standards
Decision-making at all levels is supported (not replaced) by data
A culture of experimentation exists — hypotheses are tested, not just assumed
Cross-functional data teams collaborate with domain experts
Investment in data infrastructure: pipelines, warehouses, and visualization tools

1.7 Key Roles in a Data Analytics Team

Modern analytics organizations typically include the following roles, though boundaries often overlap:

Data Analyst: Explores and summarizes data, builds dashboards and reports, answers specific business questions using SQL, Excel, and BI tools.
Data Scientist: Builds statistical and machine learning models to generate predictions, uncover patterns, and automate decisions. Requires strong coding and math skills.
Data Engineer: Designs and maintains the pipelines and infrastructure that move, store, and process data reliably at scale.
ML Engineer: Puts machine learning models into production, taking a data scientist’s prototype and deploying it into reliable, scalable systems.

1.8 Data Analytics Lifecycle

The data analytics lifecycle is the sequence of steps followed in an analytics project.

Typical stages:

Problem definition
Data collection
Data preparation
Data exploration
Analysis/modeling
Interpretation
Communication
Deployment
Monitoring

Step 1 – Problem Definition

Business Understanding

Every data project starts not with data, but with a question. This phase defines what success looks like.

Clearly articulate the business problem or opportunity
Define measurable objectives and KPIs (Key Performance Indicators)
Identify stakeholders and understand their constraints
Determine whether the problem requires analytics at all — sometimes the answer is simpler
Agree on how the output will be used and by whom

Before analyzing data, define:

What problem needs to be solved?
What is the business goal?
What decisions will depend on the analysis?
How will success be measured?

Example:

Reduce customer churn in a telecom company.

Step 2 – Data Collection

Gather relevant data from different sources:

Databases
Spreadsheets
Websites
Sensors
APIs
Surveys
Social media
Logs

Goal: Collect data relevant to the problem.

Data Understanding

Once the problem is defined, the team explores what data exists and what additional data may be needed.

Inventory available data sources (internal databases, APIs, third-party feeds)
Conduct Exploratory Data Analysis (EDA) to understand distributions, ranges, and patterns
Assess data quality: completeness, accuracy, consistency, timeliness
Identify missing values, outliers, and anomalies
Document data lineage and metadata

Step 3 – Data Cleaning & Preparation

Often the most time-consuming phase, data preparation (or ‘data wrangling’) transforms raw data into a form suitable for analysis. Industry surveys consistently report that data professionals spend 60–80% of their time on this phase.

👉 Raw data often contains problems.

Common tasks:

Data cleaning: handling missing values, correcting errors, removing duplicates, handling outliers, and standardizing formats.
Feature engineering: creating new variables from existing ones to better capture patterns.
Data transformation: normalization, encoding categorical variables, log transforms.
Data integration: joining (merging) data from multiple sources into a unified dataset.
Splitting data: creating training, validation, and test sets for modeling.

Important note:

Poor-quality data leads to poor analysis.
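
To make the cleaning tasks above concrete, here is a minimal sketch using only the Python standard library. The records, field names, and the choice of median imputation are illustrative assumptions, not a prescribed method; real projects would typically use a library such as pandas and choose an imputation strategy that suits the data.

```python
import statistics

# Hypothetical raw records; field names and values are illustrative only.
raw_rows = [
    {"customer_id": "1001", "city": " new york ", "age": "34"},
    {"customer_id": "1002", "city": "Chicago", "age": ""},   # missing age
    {"customer_id": "1002", "city": "Chicago", "age": ""},   # duplicate record
    {"customer_id": "1003", "city": "BOSTON", "age": "45"},
]

def clean(rows):
    """Standardize formats, drop duplicate customers, and fill missing ages."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["customer_id"] in seen_ids:
            continue  # remove duplicates, keeping the first occurrence
        seen_ids.add(row["customer_id"])
        cleaned.append({
            "customer_id": int(row["customer_id"]),           # enforce a numeric type
            "city": row["city"].strip().title(),              # standardize text format
            "age": int(row["age"]) if row["age"] else None,   # mark missing values
        })
    # Impute missing ages with the median of the known ages (one common choice).
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = statistics.median(known_ages)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned

cleaned_rows = clean(raw_rows)
```

After cleaning, three records remain, city names share one format, and the missing age is filled with 39.5 (the median of 34 and 45).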

Step 4 – Data Exploration

Purpose: Understand the structure and patterns in the data.

Activities:

Summary statistics
Visualizations
Correlation checks
Trend analysis
Detect anomalies

Output: Early insights and hypotheses
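
The exploration activities above can be sketched in a few lines. The example below computes summary statistics and a Pearson correlation by hand; the weekly figures and variable names are hypothetical, and a real analysis would normally add visualizations.

```python
import statistics

# Hypothetical weekly figures: advertising spend and units sold.
ad_spend = [10, 12, 15, 18, 20, 25]
units_sold = [110, 118, 131, 140, 155, 170]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Summary statistics for one variable.
summary = {
    "mean": statistics.mean(units_sold),
    "median": statistics.median(units_sold),
    "stdev": statistics.stdev(units_sold),
    "min": min(units_sold),
    "max": max(units_sold),
}

# Correlation check between the two variables.
r = pearson(ad_spend, units_sold)
```

A value of r close to 1 suggests the two series move together, which is an early hypothesis to investigate rather than evidence that advertising drives sales.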

Step 5 – Data Analysis / Modeling

Methods used:

Statistical analysis
Dashboards and reporting
Forecasting
Machine learning models
Optimization

Goal: Answer the business question using appropriate techniques.

Modeling

With clean, prepared data, analysts and data scientists build the analytical or machine learning models that address the business question.

Select appropriate modeling techniques (regression, classification, clustering, time series, etc.)
Train models on training data
Tune hyperparameters and validate model choices
Compare models using appropriate metrics (accuracy, RMSE, AUC-ROC, F1-score, etc.)
Document model assumptions and limitations
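
As an illustration of comparing models with appropriate metrics, the sketch below computes accuracy and F1-score for two sets of predictions. The labels and both "models" are hypothetical; the point is only that two models with identical accuracy can differ sharply on another metric.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical held-out labels (1 = customer churned) and two models' predictions.
y_true  = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # rarely predicts churn
model_b = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # catches every churner, with false alarms

metrics_a = classification_metrics(y_true, model_a)
metrics_b = classification_metrics(y_true, model_b)
```

Both models score 0.8 on accuracy, but model A has an F1 of 0.5 and model B an F1 of 0.75, because model A misses two of the three churners. Which metric matters depends on the business cost of each type of error.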

Step 6 – Interpretation of Results

Ask:

What do the results mean?
Are the findings reliable?
Do they answer the original question?
What actions should follow?

Key point:

Results must be linked to business value.

Step 7 – Communication and Visualization

Results should be presented clearly using:

Charts
Graphs
Dashboards
Reports
Presentations

Goal:

Ensure stakeholders understand the findings and recommendations.

Step 8 – Deployment / Action

A model that isn’t deployed creates no value. This phase operationalizes the analytical output.

Put insights into practice:

Launch a marketing campaign
Update a business policy
Automate predictions in software
Improve inventory strategy
Integrate the model or insights into business processes, systems, or dashboards
Set up API endpoints, batch scoring pipelines, or real-time inference systems
Train end users and document how to interpret and act on model outputs
Establish monitoring systems to detect performance degradation over time

Key idea:

Analytics creates value only when used in decision-making.

Step 9 – Monitoring and Improvement

After deployment:

Track outcomes (Track model performance metrics in production over time)
Monitor model performance (Conduct periodic audits for bias and fairness)
Update data and methods (Retrain or update models as new data becomes available)
Improve based on feedback (Feed lessons learned back into future projects)
Detect concept drift (when the statistical relationship between inputs and outputs changes)
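
One simple way to track model performance after deployment is to compare the error rate on recent data against the error rate measured at launch. The sketch below assumes that true outcomes eventually become available; the data, the function names, and the 0.10 tolerance are hypothetical choices.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that do not match the true outcome."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def needs_review(baseline_error, recent_error, tolerance=0.10):
    """Flag the model when recent error exceeds the baseline by more than tolerance."""
    return recent_error - baseline_error > tolerance

# Hypothetical outcomes and predictions at deployment time.
baseline_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
baseline_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Hypothetical outcomes and predictions from the most recent period.
recent_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
recent_pred = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

baseline_error = error_rate(baseline_true, baseline_pred)
recent_error = error_rate(recent_true, recent_pred)
flagged = needs_review(baseline_error, recent_error)
```

Here the error rate rises from 0.1 to 0.4, so the model is flagged. A rising error rate is consistent with concept drift but does not prove it; data quality problems upstream can produce the same symptom, so a flag should trigger investigation before retraining.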

Note:

The lifecycle is iterative and continuous.

📝 Simple flow:

Problem → Data → Clean → Explore → Model → Evaluate → Deploy → Monitor

Overview of Popular Lifecycle Models

Model	Origin	Phases
CRISP-DM	Cross-Industry Standard Process for Data Mining (1996)	6 phases
TDSP	Microsoft Team Data Science Process	5 phases
KDD	Knowledge Discovery in Databases (Fayyad et al., 1996)	5 phases
SEMMA	SAS Institute	5 phases
OSEMN	Community-driven (O’Reilly)	5 phases

CRISP-DM is a structured framework used to guide data analytics and machine learning projects from problem definition to deployment.

CRISP-DM is frequently cited as the most widely used framework in industry, although reported adoption figures vary considerably from survey to survey.

🧠 Simple Flow

Business → Data → Prepare → Model → Evaluate → Deploy

1.9 Types of Analytics

There are four major types of analytics:

Descriptive
Diagnostic
Predictive
Prescriptive

These types answer different questions.

Descriptive Analytics

Question answered:

👉 “What happened?”

Descriptive analytics is the most foundational and widely practiced form. It summarizes historical data to understand what has occurred in the past.

Goal: Provide a clear picture of past events and current state
Methods: Aggregation, summarization, pivot tables, basic visualization, dashboards, reports, KPIs
Tools: Excel, Tableau, Power BI, SQL, Looker Studio (formerly Google Data Studio)

Examples in practice:

A retailer’s monthly sales report showing revenue by product category and region
A hospital dashboard showing daily patient admissions, discharges, and average length of stay
A social media analytics report showing follower growth, engagement rates, and top-performing posts
Monthly sales report showing total revenue and top-selling products

Descriptive Analytics – Key Points

Features:

Focuses on past and current data
Provides summaries and trends
Easy to understand

Limitation:

Does not explain why something happened

📌 Note: Descriptive analytics answers ‘what’ but not ‘why’. It is the starting point for almost all analytical work and essential for data literacy across the organization. Most reporting and BI functions are primarily descriptive.
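
A monthly sales report of the kind described above boils down to grouping and summing. The following sketch uses hypothetical sales records and field names; in practice the same result is usually produced with a SQL GROUP BY or a pivot table.

```python
from collections import defaultdict

# Hypothetical sales records; values are illustrative only.
sales = [
    {"month": "2024-01", "category": "Laptops", "revenue": 1200.0},
    {"month": "2024-01", "category": "Phones", "revenue": 800.0},
    {"month": "2024-02", "category": "Laptops", "revenue": 1500.0},
    {"month": "2024-02", "category": "Phones", "revenue": 700.0},
    {"month": "2024-02", "category": "Tablets", "revenue": 300.0},
]

revenue_by_category = defaultdict(float)
revenue_by_month = defaultdict(float)
for row in sales:
    revenue_by_category[row["category"]] += row["revenue"]  # aggregate by category
    revenue_by_month[row["month"]] += row["revenue"]        # aggregate by month

# The top-selling category by total revenue.
top_category = max(revenue_by_category, key=revenue_by_category.get)
```

The output says what happened (Laptops led with 2700.0 in revenue, and February outsold January) but nothing about why.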

Diagnostic Analytics

Question answered:

👉 “Why did it happen?”

Diagnostic analytics goes deeper, investigating the causes and contributing factors behind observed outcomes.

Goal: Understand the root cause of a past event or trend (Identify causes and contributing factors)
Methods: Drill-down analysis, correlation analysis, data mining, hypothesis testing, comparisons across time or groups
Tools: SQL (with subqueries and window functions), Python/R, statistical tests

Examples in practice:

Investigating why Q3 revenue dropped — discovering a supply chain disruption coincided with competitor price cuts
Analyzing why customer satisfaction scores fell and tracing it to longer wait times in the call center after a staffing reduction
Identifying why a marketing campaign underperformed — finding that the target audience was incorrectly segmented
Why did sales drop?

Diagnostic analysis may reveal:

Price increase
Reduced advertising
Supply chain issues
Seasonal trends

Key value: Helps explain problems and opportunities

📌 Note: Diagnostic analytics often involves correlation analysis, but analysts must be careful not to confuse correlation with causation. Statistical correlation shows two variables move together; it does not prove one causes the other. Establishing causation requires controlled experiments or causal inference techniques.
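
Drill-down analysis, one of the methods listed above, can be sketched as a comparison across time and groups. The regional revenue figures below are hypothetical; the code breaks a total decline into per-region changes to locate where the drop occurred.

```python
# Hypothetical revenue by region for two consecutive quarters.
q2 = {"North": 500, "South": 420, "East": 380, "West": 450}
q3 = {"North": 495, "South": 410, "East": 250, "West": 455}

# Change per region between the two quarters.
changes = {region: q3[region] - q2[region] for region in q2}
total_change = sum(changes.values())

# The region with the largest decline, and its share of the total change.
worst_region = min(changes, key=changes.get)
worst_share = changes[worst_region] / total_change
```

In this example the East region accounts for roughly 93% of the overall decline. That narrows the investigation to one region, but it still only shows where the drop happened; explaining why requires further evidence.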

Predictive Analytics

Question answered:

👉 “What is likely to happen?”

Predictive analytics uses historical patterns to forecast future outcomes. This is where machine learning becomes central.

Goal: Forecast likely future events with quantifiable confidence (Forecast future outcomes using historical data)
Methods: Regression, classification, time series forecasting, machine learning
Tools: Python (scikit-learn, statsmodels, Prophet), R, Azure ML, AWS SageMaker

Examples in practice:

A bank predicting which customers are likely to default on loans in the next 90 days
A retailer forecasting demand for each product SKU by store location for the upcoming holiday season
A telecom company predicting which customers are at high risk of churning in the next 30 days
A manufacturer predicting when industrial equipment will fail, enabling proactive maintenance
Predict customer churn
Forecast next quarter sales
Estimate loan default risk
Predict disease risk in patients
Predicting disease outbreaks

Key concepts in predictive analytics:

Training data: Historical data used to fit (train) the model — the model learns patterns from this data.
Test data: Held-out data never seen during training, used to evaluate how well the model generalizes to new examples.
Overfitting: When a model learns the training data too well (including its noise) and fails to generalize to new data. A model that memorizes rather than learns.
Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
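
The difference between memorizing and learning can be shown with a small, deliberately contrived sketch. The data are hypothetical: the true relationship is assumed to be y = 2x, the training targets carry alternating noise of +3 and -3, and the test targets are noise-free for clarity. A nearest-neighbour "memorizer" stands in for an overfit model, and a least-squares line for a model that generalizes.

```python
import math

# Hypothetical data built around y = 2x (training targets include noise).
train_x = [0, 2, 4, 6, 8]
train_y = [3, 1, 11, 9, 19]
test_x = [0.5, 2.5, 4.5, 6.5]
test_y = [1, 5, 9, 13]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def memorizer(x):
    """Return the target of the nearest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

slope, intercept = fit_line(train_x, train_y)

def line(x):
    return slope * x + intercept

# (training RMSE, test RMSE) for each model.
results = {
    "memorizer": (rmse(train_y, [memorizer(x) for x in train_x]),
                  rmse(test_y, [memorizer(x) for x in test_x])),
    "line": (rmse(train_y, [line(x) for x in train_x]),
             rmse(test_y, [line(x) for x in test_x])),
}
```

The memorizer has zero error on the training data but a larger error on the test data than the fitted line, which is the signature of overfitting. The line has a non-zero training error because it does not chase the noise.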

Benefit: Enables proactive decision-making

Prescriptive Analytics

Question answered:

👉 “What should we do?”

The most advanced form of analytics, prescriptive analytics goes beyond prediction to recommend specific actions that optimize outcomes.

Goal: Identify the optimal action or decision given constraints and objectives (Recommend actions that improve outcomes)
Methods: Mathematical optimization, simulation, reinforcement learning, decision trees, scenario analysis
Tools: Python (SciPy, PuLP, OR-Tools), specialized optimization solvers, simulation platforms

Examples in practice:

An airline dynamically pricing seats to maximize revenue given demand forecasts, competitor prices, and remaining inventory
A logistics company routing delivery vehicles to minimize fuel costs and delivery times simultaneously
A hospital scheduling surgeries and staff to maximize throughput while meeting patient safety standards
An investment algorithm allocating portfolio weights to maximize expected return for a given risk tolerance
Recommend discounts to at-risk customers
Suggest the best delivery route
Optimize staffing levels
Recommend stock levels

Benefit: Supports action and optimization

💡 Key Insight: Prescriptive analytics is often where AI meets decision automation. When a system can not only predict what will happen but also automatically take the best action in response, without human intervention, it becomes a true AI-driven decision system. This is the cutting edge of enterprise analytics.
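
A small sketch can show how prescriptive analytics builds on a prediction. Below, a hypothetical demand forecast (the output of a predictive step) feeds a brute-force search for the stock level with the highest expected profit under a storage constraint. All prices, costs, probabilities, and names are assumptions; larger problems would use an optimization solver rather than enumeration.

```python
# Hypothetical demand forecast: probability of each daily demand level.
demand_forecast = {80: 0.2, 100: 0.5, 120: 0.3}

UNIT_COST = 4.0     # cost to stock one unit (assumed)
UNIT_PRICE = 10.0   # selling price per unit (assumed)
MAX_STOCK = 150     # storage constraint (assumed)

def expected_profit(stock):
    """Expected profit for a stock level, averaged over the demand forecast."""
    profit = 0.0
    for demand, probability in demand_forecast.items():
        units_sold = min(stock, demand)  # cannot sell more than is stocked
        profit += probability * (UNIT_PRICE * units_sold - UNIT_COST * stock)
    return profit

# Evaluate every candidate action within the constraint and pick the best one.
candidates = range(0, MAX_STOCK + 1, 10)
best_stock = max(candidates, key=expected_profit)
best_profit = expected_profit(best_stock)
```

With these numbers the recommended action is to stock 100 units, for an expected profit of about 560. The recommendation changes if the forecast, costs, or constraints change, which is why prescriptive systems depend on the quality of the predictive step.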

Comparison of the Four Types

Type	Main Question	Focus	Example
Descriptive	What happened?	Summary of past data	Sales dashboard
Diagnostic	Why did it happen?	Causes and reasons	Root cause of sales decline
Predictive	What is likely to happen?	Forecasting	Churn prediction
Prescriptive	What should we do?	Action recommendation	Best retention strategy

The Analytics Maturity Model

Organizations typically progress through these four types over time as their data capabilities mature. Most businesses today are strong in descriptive analytics and actively building predictive capabilities. True prescriptive analytics at scale remains rare and represents a significant competitive advantage.

Level 1 — Descriptive: Basic reporting, spreadsheets, reactive decision-making
Level 2 — Diagnostic: Root-cause analysis, some SQL/BI tooling, more structured data teams
Level 3 — Predictive: ML models in production, data science teams, experimentation culture
Level 4 — Prescriptive: AI-driven decision systems, real-time optimization, closed-loop automation

Key idea: Organizations often use all four types together.

1.10 Key Terms & Definitions

Data: raw facts and figures
Information: processed data with meaning
Insight: useful understanding from analysis
KPI: key performance indicator
Dashboard: visual display of metrics
Forecasting: predicting future values
Model: mathematical or computational representation
Data Analytics: The process of examining, cleaning, transforming, and modeling data to discover useful information and support decision-making.
KPI (Key Performance Indicator): A measurable value that demonstrates how effectively an organization is achieving key business objectives.
Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their main characteristics, often using visual methods, before formal modeling.
Feature Engineering: The process of using domain knowledge to create new input variables (features) from raw data to improve model performance.
Concept Drift: A phenomenon where the statistical relationship between a model’s inputs and the target variable (what the model predicts) changes over time, causing model performance to degrade.
Correlation: A statistical measure expressing the extent to which two variables are linearly related. Does not imply causation.
Causation: A relationship where one variable directly causes a change in another. Establishing causation requires controlled experiments or causal inference methods.
Overfitting: A modeling error where a model performs well on training data but poorly on new, unseen data because it has learned noise rather than signal.
ROI (Return on Investment): A performance measure used to evaluate the efficiency or profitability of an investment, often used to justify analytics initiatives.

1.11 Challenges in Adopting Data Analytics

Challenge	Description
Data Quality	Incomplete, inconsistent, or inaccurate data leads to unreliable insights (“Garbage In, Garbage Out”)
Data Silos	Data trapped in departmental systems that don’t communicate
Talent Shortage	Difficulty finding and retaining skilled data professionals
Privacy & Ethics	Compliance with GDPR, CCPA, and ethical use of data
Organizational Culture	Resistance to change from intuition-based to evidence-based decisions
Infrastructure Costs	Investment in tools, cloud computing, and storage
Interpretability	Stakeholders may not trust “black box” models

1.12 Summary

In this lesson, we learned:

Data analytics helps organizations make informed decisions
It plays a major role in business performance and innovation
The lifecycle includes problem definition through monitoring
The four types of analytics are:
- Descriptive
- Diagnostic
- Predictive
- Prescriptive

1.13 Quick Revision

Remember:

Data analytics turns raw data into useful insights
Good analytics starts with a clear problem
Data cleaning is essential
Descriptive and diagnostic focus on the past
Predictive and prescriptive focus on future decisions

1.14 Discussion / Review Questions

What is data analytics?
Why is data analytics important in organizations?
What are the stages of the data analytics lifecycle?
What is the difference between predictive and prescriptive analytics?
Give one example of each type of analytics
Describe three ways that data analytics creates competitive advantage for an organization. Provide a specific example for each.
A company notices that ice cream sales and drowning rates are positively correlated. Does this mean ice cream causes drowning? Explain your reasoning and identify the actual cause of this correlation.
A data team is asked to reduce customer churn. Walk through how you would apply each phase of the data analytics lifecycle to this problem.
Classify each of the following as descriptive, diagnostic, predictive, or prescriptive analytics, and justify your answer:

(a) A dashboard showing weekly website visitors.

(b) A model that recommends which items to reorder and in what quantity.

(c) An analysis identifying why last month’s marketing email had a low open rate.

(d) A model predicting which loan applicants are likely to default.
Why is data preparation often described as taking 60–80% of a data scientist’s time? What activities does this include, and what are the consequences of inadequate data preparation?

1.15 Further Reading & References

CRISP-DM Guide: Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a Standard Process Model for Data Mining.
Davenport, T.H., & Harris, J.G. (2007). Competing on Analytics. Harvard Business School Press.
Provost, F., & Fawcett, T. (2013). Data Science for Business. O’Reilly Media.
EMC Education Services. (2015). Data Science and Big Data Analytics. Wiley.
McKinsey Global Institute. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.

Topic Two: Data Types and Data Collection

2.1 Introduction

Before any analysis can begin, a data professional must understand the nature and structure of data they are dealing with and where it came from. The wrong assumptions about data structure can invalidate an entire analysis. This chapter covers the landscape of data types, how data is collected, the characteristics of different data sources, and the foundational concepts of data quality and governance that determine whether data can be trusted.

The type of data determines:

How it is stored (databases, file systems, data lakes)
How it is processed (SQL queries, parsing, NLP, computer vision)
What tools and technologies are appropriate
What preprocessing steps are needed before analysis
What analytical methods can be applied

Key Fact: According to IDC estimates, approximately 80–90% of all data generated today is unstructured or semi-structured. Yet most traditional analytics tools were designed for structured data. This gap is one of the primary drivers behind the growth of data science, ML, and AI.

2.2 Types of Data

2.2.1 Structured Data

Definition

Structured data is data that adheres to a predefined schema or data model, organized into rows and columns (tabular format), where each field has a defined data type and meaning.

It is the most traditional and well-understood form of data. It lives in relational databases and can be queried using SQL (Structured Query Language).

Characteristics

Property	Description
Schema	Predefined and rigid (schema-on-write)
Format	Tabular — rows (records) and columns (fields/attributes)
Data Types	Each column has a strict type (integer, float, varchar, date, boolean)
Storage	Relational Database Management Systems (RDBMS)
Query Language	SQL
Searchability	Highly searchable and filterable
Scalability	Vertical scaling; limited for massive volumes
Machine Readability	Easily consumed by analytics tools and algorithms

Examples of Structured Data

Example 1: Customer Table in a Relational Database

CustomerID	Name	Email	Age	City	SignupDate	AccountType
1001	Alice Johnson	alice@email.com	34	New York	2023-01-15	Premium
1002	Bob Smith	bob@email.com	28	Chicago	2023-03-22	Free
1003	Carol Williams	carol@email.com	45	Boston	2022-11-08	Premium
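
As a brief illustration of how such a table is defined and queried, the sketch below loads the three example customers into an in-memory SQLite database using Python's built-in sqlite3 module. The table and column names are assumptions chosen to mirror the example; SQLite is used only because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A predefined schema: every column has a declared type.
conn.execute("""
    CREATE TABLE customers (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        email        TEXT NOT NULL,
        age          INTEGER,
        city         TEXT,
        signup_date  TEXT,
        account_type TEXT
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1001, "Alice Johnson", "alice@email.com", 34, "New York", "2023-01-15", "Premium"),
        (1002, "Bob Smith", "bob@email.com", 28, "Chicago", "2023-03-22", "Free"),
        (1003, "Carol Williams", "carol@email.com", 45, "Boston", "2022-11-08", "Premium"),
    ],
)

# Filter and sort with SQL: Premium customers, earliest signup first.
premium_customers = conn.execute(
    "SELECT name, city FROM customers WHERE account_type = ? ORDER BY signup_date",
    ("Premium",),
).fetchall()

# Aggregate with SQL: number of customers per account type.
counts = dict(conn.execute(
    "SELECT account_type, COUNT(*) FROM customers GROUP BY account_type"
).fetchall())
```

Because the schema is fixed in advance, filtering, sorting, and aggregating each take a single declarative statement.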

Example 2: Financial Transaction Records

TransactionID	Date	AccountNo	Type	Amount	Currency	Status
TXN-50001	2024-01-15	ACC-2234	Debit	250.00	USD	Completed
TXN-50002	2024-01-15	ACC-1187	Credit	1200.50	USD	Pending

Other Examples:

Spreadsheets (Excel, Google Sheets)
ERP system records (inventory, orders, invoices)
CRM data (customer profiles, interactions)
Sensor readings in fixed formats (temperature, pressure at timestamps)
Census data
Stock market tick data (OHLCV — Open, High, Low, Close, Volume)

Common Storage Systems for Structured Data

System	Examples
RDBMS	MySQL, PostgreSQL, Oracle, Microsoft SQL Server, SQLite
Cloud Data Warehouses	Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse
Spreadsheets	Microsoft Excel, Google Sheets
Flat Files	CSV (Comma-Separated Values), TSV (Tab-Separated Values)

Data Types Within Structured Data

Within structured data, individual fields/variables fall into specific statistical data types that determine what analyses are valid:

Type	Subtype	Description	Examples	Valid Operations
Qualitative	Nominal	Categories with no inherent order	Gender, color, city, blood type, product category	Mode, frequency count, chi-square test
Qualitative	Ordinal	Categories with a meaningful order but unequal intervals	Education level (HS < BS < MS < PhD), satisfaction (1-5 stars), pain scale	Mode, median, rank correlation, non-parametric tests
Quantitative	Discrete	Countable values (usually integers)	Number of children, website clicks, defect count, cars owned	Mean, median, mode, standard deviation, Poisson regression
Quantitative	Continuous	Measurable, can take any value in a range (infinite precision)	Temperature, height, weight, salary, time	Mean, std dev, correlation, regression, t-test, ANOVA
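
The link between data type and valid operation can be illustrated with a short sketch using Python's statistics module. The sample values are hypothetical. Note how the ordinal variable is summarized through its rank order, not by averaging its labels.

```python
import statistics

# Nominal: categories with no order, so the mode is a valid summary.
blood_types = ["A", "O", "B", "O", "AB", "O"]
most_common_blood_type = statistics.mode(blood_types)

# Ordinal: ordered categories, so the median is valid once ranks are assigned.
education_order = ["HS", "BS", "MS", "PhD"]
education = ["BS", "HS", "MS", "BS", "PhD"]
ranks = [education_order.index(level) for level in education]
median_education = education_order[statistics.median_low(ranks)]

# Discrete and continuous: numeric values, so the mean is valid.
clicks = [3, 7, 2, 7, 5]
temperatures = [21.5, 22.1, 19.8, 23.4]
mean_clicks = statistics.mean(clicks)
mean_temperature = statistics.mean(temperatures)
```

Computing a mean of blood types is meaningless, and a mean of education ranks would assume equal spacing between levels, which ordinal data does not guarantee.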

Advantages and Limitations

Advantages	Limitations
Easy to query, filter, and aggregate (SQL)	Rigid schema — hard to modify after creation
Well-understood tools and technologies	Cannot represent complex, nested, or hierarchical data well
Strong data integrity (constraints, keys, types)	Doesn’t handle multimedia data (images, audio, video)
Efficient indexing and searching	Scaling horizontally (across servers) can be challenging
Mature ecosystem of tools	Only represents ~10-20% of all organizational data

2.2.2 Semi-Structured Data

Definition

Semi-structured data has some organizational properties (tags, markers, metadata, or hierarchical structure) but does not conform to a rigid tabular schema. It is self-describing — the structure is embedded within the data itself.

Semi-structured data sits between structured and unstructured data. It has some organization but is more flexible than a relational table.

Characteristics

Property	Description
Schema	Flexible, implicit, self-describing (schema-on-read)
Format	Hierarchical, nested, key-value pairs, tagged
Data Types	Mixed and flexible; fields can vary between records
Storage	NoSQL databases, document stores, file systems
Query Language	Varies (JSONPath, XPath, MongoDB query language, etc.)
Searchability	Searchable with appropriate tools but less straightforward than SQL
Flexibility	High — new fields can be added without changing existing structure
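
JSON is a common semi-structured format, and a short sketch shows what "self-describing" and "fields can vary between records" mean in practice. The records and field names below are hypothetical. Each record carries its own keys, and a flattening step decides at read time how to map them into fixed columns (schema-on-read).

```python
import json

# Hypothetical JSON records: nested fields, and not every record has every key.
raw = """
[
  {"id": 1, "name": "Alice", "contact": {"email": "alice@email.com"}, "tags": ["premium"]},
  {"id": 2, "name": "Bob", "contact": {"email": "bob@email.com", "phone": "555-0100"}},
  {"id": 3, "name": "Carol", "tags": ["premium", "newsletter"]}
]
"""
records = json.loads(raw)

def flatten(record):
    """Map one nested record onto a fixed set of columns, using None for gaps."""
    contact = record.get("contact", {})
    return {
        "id": record["id"],
        "name": record["name"],
        "email": contact.get("email"),
        "phone": contact.get("phone"),
        "tag_count": len(record.get("tags", [])),
    }

rows = [flatten(record) for record in records]
```

The resulting rows all share the same columns and could be loaded into a relational table, with None marking fields that a record did not supply.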

Example: Web Server Log (Semi-Structured)

192.168.1.105 - alice [15/Jan/2024:10:23:45 +0000] "GET /products/laptop HTTP/1.1" 200 5432 "https://google.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.168.1.110 - bob [15/Jan/2024:10:24:02 +0000] "POST /cart/add HTTP/1.1" 201 342 "https://example.com/products" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0)"

This data has a recognizable pattern (IP, user, timestamp, request, status code, referrer, user agent) but is not in a tabular database. Parsing is required to extract structured fields.
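
The parsing step can be sketched with a regular expression. The pattern and group names below are assumptions written to fit the two example entries; production log formats vary, and a dedicated log parser is usually the safer choice.

```python
import re

# One named group per field in the example log entries.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

log_lines = [
    '192.168.1.105 - alice [15/Jan/2024:10:23:45 +0000] "GET /products/laptop HTTP/1.1" '
    '200 5432 "https://google.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.168.1.110 - bob [15/Jan/2024:10:24:02 +0000] "POST /cart/add HTTP/1.1" '
    '201 342 "https://example.com/products" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0)"',
]

def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # numeric fields become integers
    fields["size"] = int(fields["size"])
    return fields

parsed = [parse_line(line) for line in log_lines]
```

Each parsed entry is now a record with named, typed fields, so the log can be filtered by status code or aggregated by path like any structured table.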

2.2.3 Unstructured Data

Definition

Unstructured data has no predefined data model, schema, or organizational structure. It cannot be stored in traditional row-column databases without significant transformation.

This is the most abundant type of data in the world and the hardest to analyze using traditional methods. Advances in AI, NLP, and computer vision have made it increasingly possible to extract insights from unstructured data.

Characteristics

Property	Description
Schema	None — no predefined structure
Format	Free-form text, binary files, media
Storage	File systems, object stores, data lakes, content management systems
Analysis	Requires AI/ML techniques (NLP, computer vision, speech recognition)
Volume	Comprises ~80-90% of all enterprise data
Searchability	Difficult without indexing, tagging, or AI-based extraction

Examples of Unstructured Data

Category	Examples
Text	Emails (body), social media posts, reviews, news articles, legal contracts, medical notes, chat transcripts, books, research papers
Images	Photographs, medical scans (X-rays, MRIs, CT scans), satellite imagery, product photos, handwritten documents, diagrams
Audio	Phone call recordings, podcasts, music files, voice messages, voice assistant queries
Video	Surveillance footage, YouTube videos, webinars, live streams, movie files
Other	Geospatial data, 3D models, scientific instrument output, biometric data

How AI/ML Processes Unstructured Data

The core challenge with unstructured data is that machines cannot directly analyze raw text, images, or audio. These must be converted into numerical representations (features/vectors) first:

Data Type	AI/ML Technique	What It Does
Text	Natural Language Processing (NLP)	Tokenization, sentiment analysis, topic modeling, named entity recognition, text classification, machine translation
Text → Numbers	Word Embeddings (Word2Vec, GloVe, BERT)	Converts words/sentences into dense numerical vectors that capture semantic meaning
Images	Computer Vision (CNNs)	Object detection, image classification, facial recognition, segmentation
Images → Numbers	Feature Extraction (ResNet, VGG)	Converts images into numerical feature vectors using pre-trained neural networks
Audio	Speech Recognition (ASR)	Converts speech to text (e.g., Whisper, Google Speech API)
Audio → Numbers	Spectrograms, MFCCs	Converts audio waveforms into frequency-domain representations
Video	Video Analysis (3D CNNs, RNNs)	Action recognition, object tracking, scene understanding
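
To show what "converting text into numbers" means at its simplest, the sketch below builds bag-of-words count vectors. This is a deliberately basic technique, far simpler than the embedding models named in the table, and the example sentences are hypothetical.

```python
import re
from collections import Counter

# Hypothetical short documents.
documents = [
    "The battery life is amazing",
    "The keyboard feels cheap",
    "Amazing screen, amazing battery",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# The vocabulary fixes the meaning of each position in the vector.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

def vectorize(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [vectorize(doc) for doc in documents]
```

Every document becomes a list of numbers of the same length, which standard algorithms can compare and model. Unlike embeddings, these counts ignore word order and meaning, so "amazing" and "gorgeous" would be treated as unrelated.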

Example: Turning Unstructured Text into Structured Data

Raw unstructured data (customer review):

"I absolutely love this laptop! The battery life is amazing and the screen is gorgeous. However, the keyboard feels a bit cheap and the trackpad is not very responsive. Overall, a great purchase for the price."

After NLP processing → Structured output:

Field	Extracted Value
Overall Sentiment	Positive (0.72)
Battery Sentiment	Very Positive (0.95)
Screen Sentiment	Very Positive (0.91)
Keyboard Sentiment	Negative (-0.45)
Trackpad Sentiment	Negative (-0.60)
Price Sentiment	Positive (0.65)
Named Entities	Product Type: Laptop
Key Topics	Battery, Screen, Keyboard, Trackpad, Price

This transformation from unstructured text to structured data is one of the most important applications of NLP in data science.
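
As a rough illustration of how such a transformation can work, the sketch below applies a toy, rule-based approach to the same review: it splits the text into clauses, looks for aspect keywords, and labels each aspect using small hand-written word lists. The word lists, the negation rule, and the function name are assumptions made for this example. Real NLP systems use trained models, and this sketch produces only positive/negative labels, not the numeric scores shown in the table above.

```python
import re

# Hypothetical, hand-written word lists; real systems learn these from data.
ASPECTS = ["battery", "screen", "keyboard", "trackpad", "price"]
POSITIVE = {"love", "amazing", "gorgeous", "great", "responsive"}
NEGATIVE = {"cheap", "slow", "poor"}
NEGATIONS = {"not", "never", "no"}

def aspect_sentiment(review):
    """Return {aspect: 'positive' | 'negative' | 'neutral'} for each aspect mentioned."""
    results = {}
    # Split into rough clauses on punctuation and the word "and".
    clauses = re.split(r"[.!?,]|\band\b", review)
    for clause in clauses:
        tokens = re.findall(r"[a-z]+", clause.lower())
        score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
        if any(t in NEGATIONS for t in tokens):
            score = -score  # crude negation handling: flip the clause polarity
        for aspect in ASPECTS:
            if aspect in tokens:
                if score > 0:
                    results[aspect] = "positive"
                elif score < 0:
                    results[aspect] = "negative"
                else:
                    results[aspect] = "neutral"
    return results

review = (
    "I absolutely love this laptop! The battery life is amazing and the "
    "screen is gorgeous. However, the keyboard feels a bit cheap and the "
    "trackpad is not very responsive. Overall, a great purchase for the price."
)
structured = aspect_sentiment(review)
print(structured)
```

The output is a dictionary with one entry per aspect, which can be stored as a row in a table alongside the review ID.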

Data Types Within Structured Data

Within structured data, individual fields (variables) fall into statistical data types that determine which analyses are valid:

Type	Subtype	Description	Examples	Valid Operations
Qualitative	Nominal	Categories with no inherent order	Gender, color, city, blood type, product category	Mode, frequency count, chi-square test
Qualitative	Ordinal	Categories with a meaningful order but unequal intervals	Education level (HS < BS < MS < PhD), satisfaction (1-5 stars), pain scale	Mode, median, rank correlation, non-parametric tests
Quantitative	Discrete	Countable values (usually integers)	Number of children, website clicks, defect count, cars owned	Mean, median, mode, standard deviation, Poisson regression
Quantitative	Continuous	Measurable, can take any value in a range (infinite precision)	Temperature, height, weight, salary, time	Mean, std dev, correlation, regression, t-test, ANOVA
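
The "Valid Operations" column can be tried out directly. The short sketch below uses Python's built-in statistics module on made-up sample values (all four lists are hypothetical) and applies one appropriate summary to each data type.

```python
import statistics

# Hypothetical sample values for each statistical data type.
blood_types = ["A", "O", "O", "B", "AB", "O"]   # nominal
satisfaction = [1, 3, 4, 4, 5]                  # ordinal (1-5 stars)
children = [0, 2, 1, 3, 2]                      # discrete
heights_cm = [160.5, 172.0, 168.2, 181.3]       # continuous

# Nominal: only counting makes sense, so report the most frequent category.
print("Nominal mode:", statistics.mode(blood_types))

# Ordinal: order is meaningful, so the median is a valid summary.
print("Ordinal median:", statistics.median(satisfaction))

# Quantitative: arithmetic is meaningful, so mean and standard deviation apply.
print("Discrete mean:", statistics.mean(children))
print("Continuous mean:", round(statistics.mean(heights_cm), 2))
print("Continuous std dev:", round(statistics.stdev(heights_cm), 2))
```

Computing a mean of blood types is impossible, and a mean of star ratings is questionable because the gaps between ratings are not guaranteed to be equal. This is why identifying the data type comes before choosing the analysis.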

2.2.4 Special Data Types

Type	Description	Examples
Binary	Only two possible values	Yes/No, True/False, 0/1, Male/Female
Temporal	Date, time, datetime, timestamp	2024-01-15, 14:30:00, timestamps
Geospatial	Location-based data	Latitude/longitude, GPS coordinates, polygons
Currency	Monetary values with specific precision	$1,299.99, €45.50, Kshs1500
Text/String	Character sequences (can be categorical or free-text)	Names, descriptions, comments
Boolean	Logical true/false values	is_active, has_subscription

Why This Matters for ML: Most ML algorithms require numerical input. Knowing each variable’s data type determines how to encode categorical variables (one-hot encoding for nominal, ordinal/label encoding for ordinal) and how to scale numerical variables (normalization, standardization).
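
The encoding and scaling choices mentioned above can be sketched in a few lines of plain Python. The helper names and sample values below are invented for illustration, and the helpers skip edge cases such as a column where all values are equal. In practice, libraries such as pandas (for example `pd.get_dummies`) provide ready-made versions.

```python
import statistics

def one_hot(values):
    """One-hot encode a nominal variable: one 0/1 column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def ordinal_encode(values, order):
    """Map ordered categories to integers that preserve their rank."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

def min_max_normalize(values):
    """Rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

colors = ["red", "blue", "red"]          # nominal
education = ["BS", "HS", "PhD", "MS"]    # ordinal
salaries = [30_000, 50_000, 70_000]      # continuous

print(one_hot(colors))
print(ordinal_encode(education, order=["HS", "BS", "MS", "PhD"]))
print(min_max_normalize(salaries))
print(standardize(salaries))
```

One-hot encoding avoids implying an order among colors, while ordinal encoding deliberately keeps the HS < BS < MS < PhD ranking.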

2.3 Primary vs. Secondary Data

2.3.1 Primary Data

Definition

Primary data is data collected firsthand by the researcher or organization specifically for the current research question or business problem. It is original data that did not exist before the collection effort.

Characteristics

Property	Description
Originality	Collected for the first time, directly from the source
Specificity	Tailored to the exact research question
Control	Researcher controls methodology, sampling, and variables
Recency	Typically the most current/up-to-date data available
Cost	Generally more expensive and time-consuming to collect
Ownership	The collector owns the data

Methods of Primary Data Collection

Method	Description	Best For	Example
Surveys / Questionnaires	Structured questions distributed to a sample	Gathering opinions, preferences, demographics at scale	Customer satisfaction survey, NPS score
Interviews	In-depth, one-on-one or group conversations	Deep qualitative insights, understanding “why”	User research interviews for product design
Focus Groups	Moderated group discussions (6-12 participants)	Exploring perceptions, attitudes, new concepts	Testing reactions to a new product concept
Experiments / A/B Tests	Controlled manipulation of variables to measure effect	Establishing causal relationships	Testing two website layouts to see which converts better
Observations	Systematically watching and recording behavior	Understanding behavior in natural settings	Recording how customers navigate a store
Sensor / IoT Data Collection	Deploying instruments to measure physical phenomena	Real-time monitoring, environmental data	Installing temperature sensors in a warehouse
Web Scraping (owned properties)	Automated extraction of data from your own platforms	Collecting user interaction data	Logging clickstream data on your website
Clinical Trials	Controlled medical experiments	Testing drug efficacy and safety	Pharmaceutical Phase III trial
Field Research	Collecting samples or measurements in the field	Environmental, geological, agricultural research	Soil sampling for agricultural analysis

Advantages and Disadvantages of Primary Data

Advantages	Disadvantages
Directly relevant to research question	Expensive (survey design, distribution, collection)
Researcher controls quality and methodology	Time-consuming (weeks to months)
Most current and up-to-date	Requires expertise in research design
Can target specific populations	Subject to response bias, sampling bias
Proprietary — competitive advantage	Typically smaller sample sizes than secondary data
Can collect exactly the variables needed	Ethical considerations (consent, privacy, IRB approval)

Designing Good Primary Data Collection

Key Principles for Survey/Questionnaire Design:

Define clear objectives — What exactly do you want to learn?
Choose the right question types:
- Closed-ended (multiple choice, Likert scale, yes/no) — Easy to analyze quantitatively
- Open-ended (free text) — Rich qualitative data but harder to analyze
Avoid leading questions — “Don’t you agree our product is excellent?” ❌
Avoid double-barreled questions — “Is our product affordable and high-quality?” ❌ (these are two separate questions)
Use simple, unambiguous language
Consider question order — General to specific, easy to hard
Pilot test before full deployment
Ensure proper sampling — Random sampling, stratified sampling, etc.

2.3.2 Secondary Data

Definition

Secondary data is data that was originally collected by someone else for a different purpose and is being reused for the current analysis. The researcher accesses and analyzes existing data rather than collecting new data.

Characteristics

Property	Description
Originality	Pre-existing; not collected for the current purpose
Collection	No direct collection effort needed
Cost	Generally much cheaper (often free)
Speed	Available immediately or quickly
Scale	Often much larger datasets than primary collection allows
Control	No control over how data was collected, what was measured, or quality

Sources of Secondary Data

Category	Examples
Government & Public Institutions	Census data, Bureau of Labor Statistics, World Bank, WHO, UN Data, data.gov, Eurostat
Academic & Research	Published papers, university datasets, arXiv, Google Scholar, ICPSR
Industry Reports	Gartner, McKinsey, Deloitte, PwC, Forrester, Nielsen reports
Company Internal Data	Historical sales records, CRM data, past surveys, financial records (collected for operational purposes, now reused for analytics)
Open Data Platforms	Kaggle, UCI ML Repository, Google Dataset Search, AWS Open Data, HuggingFace Datasets
Social Media & Web	Twitter/X API data, Reddit, Wikipedia, Common Crawl
Financial Data	Yahoo Finance, Bloomberg, SEC filings (EDGAR), stock exchange data
Geospatial Data	OpenStreetMap, NASA Earthdata, Google Earth Engine
Healthcare Data	MIMIC-III (clinical data), NIH databases, CDC data
Media & News	News archives, GDELT project (global events database)

Advantages and Disadvantages of Secondary Data

Advantages	Disadvantages
Significantly cheaper (often free)	May not perfectly fit your research question
Available quickly — no collection time	No control over data quality or methodology
Often very large datasets	May be outdated
Enables historical and longitudinal analysis	Definitions/categories may not match your needs
Can cover broad geographies and populations	Potential biases from original collection unknown
Peer-reviewed or government-validated	May have restrictions on use (licensing, privacy)
Good for benchmarking and comparison	May lack variables you specifically need

Evaluating Secondary Data Quality

Before using secondary data, assess it critically:

Criterion	Questions to Ask
Source credibility	Who collected it? Is the source reputable? Government? Academic?
Purpose	Why was it originally collected? Could the purpose introduce bias?
Methodology	How was it collected? What sampling method was used?
Timeliness	When was it collected? Is it still relevant?
Accuracy	Are there known errors or limitations? Has it been peer-reviewed?
Consistency	Are definitions and units consistent across time periods?
Completeness	Are there significant gaps or missing data?
Accessibility	Can you access the granularity you need? Are there licensing restrictions?

2.3.3 Primary vs. Secondary Data — Comparison

Dimension	Primary Data	Secondary Data
Collected by	Researcher/organization for current purpose	Someone else for a different purpose
Relevance	Highly relevant and specific	May not perfectly fit
Cost	High	Low (often free)
Time to obtain	Weeks to months	Hours to days
Data quality control	Full control	No control
Sample size	Usually smaller	Often very large
Recency	Most current	May be outdated
Ownership	You own it	May have usage restrictions
Bias awareness	Known (you designed the study)	Unknown or undocumented
Uniqueness	Proprietary — competitive advantage	Available to competitors too

2.3.4 When to Use Which?

Use Primary Data When…	Use Secondary Data When…
No existing data answers your question	Existing data adequately addresses your question
You need very specific variables	You need broad coverage or historical data
Data quality is paramount	Budget and time are limited
You need proprietary insights	You need a starting point for exploratory analysis
Establishing causal relationships (experiments)	Benchmarking against industry or population data
Regulatory requirements demand original data	Supplementing primary data with contextual data

Best Practice: Most data science projects use a combination of both. For example, a company might use its own customer transaction data (primary) enriched with census demographic data (secondary) and weather data (secondary) to build a predictive model.

2.4 Data Sources: Surveys, APIs, Databases, and Web Data

2.4.1 Surveys and Forms

What Are Surveys?

Surveys are systematic methods of gathering information from a defined population through a set of structured or semi-structured questions, typically for research, feedback, or data collection purposes.

Types of Surveys

Type	Description	Advantages	Limitations
Online Surveys	Web-based questionnaires distributed via email, social media, or embedded in websites	Cheap, fast, wide reach, easy analysis	Low response rates, self-selection bias, no interviewer to clarify
Telephone Surveys (CATI)	Computer-Assisted Telephone Interviewing	Higher response rates than online, can clarify questions	Expensive, declining landline usage, time-consuming
Face-to-Face Interviews	In-person structured or semi-structured interviews	Highest quality responses, non-verbal cues	Very expensive, interviewer bias, not scalable
Mail Surveys	Paper questionnaires sent and returned by mail	Reaches populations without internet	Slowest method, very low response rates
Mobile Surveys	Optimized for smartphones	High accessibility, in-the-moment capture	Screen size limits complexity
Longitudinal / Panel Surveys	Same participants surveyed repeatedly over time	Tracks changes and trends over time	Attrition of participants

Popular Survey Tools

Tool	Key Features
Google Forms	Free, simple, integrates with Google Sheets
SurveyMonkey	Professional features, templates, analytics
Typeform	Interactive, conversational UI, good UX
Qualtrics	Enterprise-grade, advanced logic, research-focused
Microsoft Forms	Integrated with Microsoft 365
LimeSurvey	Open-source, self-hosted option
REDCap	Specialized for clinical and academic research

Sampling Methods for Surveys

Method	Type	Description
Simple Random	Probability	Every member has an equal chance of selection
Stratified Random	Probability	Population divided into strata; random sample from each stratum
Cluster	Probability	Population divided into clusters; entire clusters randomly selected
Systematic	Probability	Every kth member selected from a list
Convenience	Non-Probability	Whoever is available/easiest to reach
Snowball	Non-Probability	Existing participants recruit future participants
Quota	Non-Probability	Sample selected to match known population proportions
Purposive/Judgmental	Non-Probability	Researcher selects participants based on judgment

Key Consideration: Probability sampling allows statistical inference to the broader population. Non-probability sampling is easier and cheaper but results cannot be generalized with the same confidence.
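
Three of the probability methods from the table can be sketched with Python's built-in random module. The population of 100 customers, the 10% sampling fraction, and the function names are assumptions for this example. A fixed seed is used only so the result is repeatable.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every member has an equal chance of selection."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, k, start=0):
    """Select every k-th member, beginning at index `start`."""
    return population[start::k]

def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 100 customers, 80 urban and 20 rural.
population = [{"id": i, "region": "urban" if i < 80 else "rural"} for i in range(100)]

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, k=10)
stratified = stratified_sample(population, key=lambda m: m["region"], fraction=0.1)

print(len(srs), len(systematic), len(stratified))
print(sum(m["region"] == "rural" for m in stratified), "rural members in stratified sample")
```

Stratified sampling guarantees that the rural minority is represented in proportion to its size, whereas a simple random sample of 10 could contain no rural customers at all by chance.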

Common Survey Biases

Bias	Description	Mitigation
Selection Bias	Sample not representative of the population	Use probability sampling
Response Bias	Respondents answer inaccurately (social desirability, acquiescence)	Anonymize, use neutral wording
Non-Response Bias	Those who respond differ systematically from those who don’t	Follow-up reminders, incentives, analyze non-respondents
Leading Question Bias	Questions that suggest a desired answer	Neutral question wording, pilot testing
Recall Bias	Respondents don’t accurately remember past events	Use shorter recall periods, provide reference points
Survivorship Bias	Only surveying current customers, not those who left	Include churned/former customers
Order Effects	Answer influenced by position in questionnaire	Randomize question order

2.4.2 APIs (Application Programming Interfaces)

What Is an API?

An API (Application Programming Interface) is a set of defined rules, protocols, and tools that allows different software applications to communicate with each other and exchange data in a structured, programmatic way.

For data scientists, APIs are a primary mechanism for programmatically accessing data from external services, platforms, and databases.

Popular APIs for Data Science

Category	API	Data Provided
Social Media	Twitter/X API, Reddit API, Meta Graph API	Posts, tweets, user data, engagement metrics
Finance	Alpha Vantage, Yahoo Finance, Polygon.io, Quandl	Stock prices, financial statements, crypto data
Weather	OpenWeatherMap, WeatherAPI, NOAA	Current weather, forecasts, historical weather
Maps & Location	Google Maps API, OpenStreetMap, Mapbox	Geocoding, directions, places, traffic
NLP & AI	OpenAI API (GPT), Google Cloud NLP, HuggingFace	Text generation, sentiment analysis, translation
Government	Census API, data.gov, World Bank API	Demographics, economic indicators, health data
E-commerce	Amazon Product API, Shopify API, eBay API	Product data, pricing, reviews
News	NewsAPI, GDELT, NYTimes API	News articles, headlines, events
Music	Spotify API, Last.fm API	Song data, playlists, listening history
Sports	ESPN API, SportRadar, NBA Stats API	Scores, player statistics, game data

API Authentication Methods

Method	Description	Security Level
API Key	Simple key passed as a query parameter or header	Basic
OAuth 2.0	Token-based authorization; user grants permission	High
Bearer Token	Token included in the HTTP Authorization header	Medium-High
Basic Auth	Username and password encoded in base64	Low
JWT (JSON Web Token)	Self-contained token with encoded user info	High
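
The sketch below shows how three of these methods typically appear as HTTP request headers. The header name `X-API-Key` is only a common convention and an assumption here, since each provider documents its own. The credentials are placeholders.

```python
import base64

def api_key_header(key, header_name="X-API-Key"):
    """API key sent in a custom header; the header name varies by provider."""
    return {header_name: key}

def bearer_header(token):
    """Bearer token in the Authorization header (OAuth 2.0 access tokens are usually sent this way)."""
    return {"Authorization": f"Bearer {token}"}

def basic_auth_header(username, password):
    """Basic Auth: base64 of 'username:password'. Encoding is not encryption."""
    credentials = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}

basic = basic_auth_header("user", "pass")
print(basic)

# Anyone who sees the header can decode it, which is why Basic Auth is rated "Low".
encoded = basic["Authorization"].split()[1]
print(base64.b64decode(encoded).decode("utf-8"))
```

Because base64 is trivially reversible, Basic Auth should only ever be sent over an encrypted (HTTPS) connection.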

API Best Practices for Data Collection

Practice	Description
Respect Rate Limits	Most APIs limit requests per minute/hour; implement backoff strategies
Cache Responses	Store API responses locally to avoid redundant calls
Handle Errors Gracefully	Implement try/except blocks, retry logic, and logging
Paginate Large Requests	Many APIs return data in pages; loop through all pages
Secure Your Keys	Never hardcode API keys; use environment variables or secret managers
Read Documentation	Always read the API docs thoroughly before coding
Monitor Usage	Track API consumption to avoid exceeding quotas or incurring costs
Version Awareness	APIs can change; pin to specific versions when possible

2.4.3 Databases and Data Stores

What Is a Database?

A database is an organized collection of data stored and accessed electronically, managed by a Database Management System (DBMS) that provides mechanisms for storing, retrieving, updating, and managing data.

Relational Databases (RDBMS)

Relational databases have been the backbone of structured data storage for decades. Data is organized into tables with relationships between them.

2.4.4 Core Concepts

Concept	Description
Table (Relation)	A collection of related data organized in rows and columns
Row (Record/Tuple)	A single data entry in a table
Column (Field/Attribute)	A specific property/variable in a table
Primary Key (PK)	A unique identifier for each row in a table
Foreign Key (FK)	A column that references the primary key of another table (creates relationships)
Index	A data structure that speeds up data retrieval
Schema	The blueprint/structure of the database (tables, columns, types, constraints)
View	A virtual table based on the result of a SQL query
Stored Procedure	Pre-compiled SQL code stored in the database
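
Most of these concepts can be seen in a few lines using SQLite, which ships with Python. The sketch below builds a tiny in-memory database. The table names, column names, and rows are invented for illustration. SQLite does not support stored procedures, so that concept is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# The schema: tables, columns, types, and constraints.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,          -- primary key
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    amount      REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);   -- index
CREATE VIEW customer_totals AS                              -- view
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id;
""")

# Each tuple becomes one row (record) in the table.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Amina"), (2, "Brian")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0)])

totals = conn.execute("SELECT name, total FROM customer_totals ORDER BY name").fetchall()
print(totals)

# The foreign key rejects an order that points at a customer who does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print("Foreign key enforced:", fk_enforced)
```

The rejected insert shows how a foreign key protects the relationship between the two tables at the database level.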

Popular Relational Databases

Database	Type	Best For
PostgreSQL	Open source	General purpose, advanced features, geospatial
MySQL	Open source	Web applications, WordPress, scalable reads
SQLite	Embedded	Local applications, prototyping, mobile apps
Microsoft SQL Server	Commercial	Enterprise Windows environments
Oracle Database	Commercial	Large enterprise, financial services
MariaDB	Open source (MySQL fork)	Drop-in MySQL replacement

2.4.5 Summary: Choosing the Right Data Source

Factor	Consideration
Research Question	What data do you actually need to answer your question?
Availability	Does the data exist? Is it accessible?
Quality	How reliable, complete, and accurate is the data?
Cost	What is the budget for data acquisition?
Time	How quickly do you need the data?
Legal/Ethical	Are there privacy, licensing, or ethical constraints?
Format	Can you work with the data format, or is significant transformation needed?
Scale	Is the data volume appropriate for your analysis needs?
Freshness	How current does the data need to be?

2.5 Data Quality and Data Governance Concepts

Why Data Quality Matters

“Garbage In, Garbage Out” (GIGO): the most sophisticated algorithm in the world will produce meaningless results if fed poor-quality data.

Data quality is not just a technical concern — it has real business impact:

Impact Area	Consequence of Poor Data Quality
Decision-Making	Wrong conclusions lead to wrong decisions
Financial	Gartner estimates poor data quality costs organizations an average of $12.9 million per year
Customer Experience	Wrong addresses, duplicate communications, personalization failures
Regulatory	Compliance violations (GDPR fines can reach €20 million or 4% of global annual turnover, whichever is higher)
Model Performance	ML models trained on dirty data produce unreliable predictions
Operational	Failed processes, reconciliation delays, manual workarounds
Trust	Stakeholders lose confidence in analytics and reporting
Opportunity Cost	By commonly cited estimates, data scientists spend 60-80% of their time cleaning data instead of analyzing it

2.5.1 Dimensions of Data Quality

Data quality is multidimensional. A dataset may score well on one dimension but poorly on another. The most widely recognized dimensions are:

The Six Core Dimensions (DAMA Framework)

#	Dimension	Definition	Example of Poor Quality	How to Measure
1	Accuracy	Data correctly represents the real-world entity or event it models	Customer age recorded as 250; address doesn’t match actual location	% of records matching a verified source; error rate
2	Completeness	All required data is present; no critical values are missing	30% of customer records have no email address; missing ZIP codes	% of non-null values; % of records with all required fields
3	Consistency	Data does not contradict itself across systems or within a dataset	Customer listed as “Active” in CRM but “Cancelled” in billing system; “NY” vs “New York”	Cross-system reconciliation; # of conflicting records
4	Timeliness	Data is up-to-date and available when needed	Using 2019 market data for 2024 decisions; dashboard refreshed weekly instead of hourly	Data age; refresh frequency; latency
5	Validity	Data conforms to defined formats, ranges, and business rules	Email without “@” symbol; age = -5; date format “13/25/2024”	% passing validation rules; # of constraint violations
6	Uniqueness	Each entity is represented only once (no duplicates)	Same customer appears 3 times with slightly different names	Duplicate rate; # of records after deduplication
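
The "How to Measure" column translates naturally into simple ratios. The sketch below computes completeness, validity, and uniqueness for a handful of made-up customer records. The records, the email pattern, and the 0-120 age rule are assumptions for this example.

```python
import re

# Hypothetical customer records with deliberate quality problems.
records = [
    {"customer_id": 1, "email": "ann@example.com", "age": 34},
    {"customer_id": 2, "email": None,              "age": 250},
    {"customer_id": 3, "email": "bob_at_example",  "age": 41},
    {"customer_id": 3, "email": "bob@example.com", "age": 41},
]

EMAIL_PATTERN = re.compile(r"^[\w.-]+@[\w.-]+\.\w+$")

def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def validity(rows, field, is_valid):
    """Share of non-null values that pass a validation rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(is_valid(v)) for v in values) / len(values)

def uniqueness(rows, field):
    """Share of values that are distinct (1.0 means no duplicates)."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

email_completeness = completeness(records, "email")
email_validity = validity(records, "email", EMAIL_PATTERN.match)
age_validity = validity(records, "age", lambda a: 0 <= a <= 120)
id_uniqueness = uniqueness(records, "customer_id")

print(f"Email completeness:     {email_completeness:.0%}")
print(f"Email validity:         {email_validity:.0%}")
print(f"Age validity:           {age_validity:.0%}")
print(f"customer_id uniqueness: {id_uniqueness:.0%}")
```

Note that an age of 250 passes a completeness check but fails validity, which illustrates why a dataset can score well on one dimension and poorly on another.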

Additional Quality Dimensions

Dimension	Definition	Example
Integrity	Relationships between data elements are maintained (referential integrity)	An order references a customer_id that doesn’t exist in the customer table
Relevance	Data is applicable and useful for the intended purpose	Collecting shoe size data for a financial fraud model
Precision	Level of detail/granularity is appropriate	Recording revenue as “about $1M” vs “$1,023,456.78”
Conformity	Data follows standard formats and naming conventions	Dates stored as “Jan 15, 2024”, “2024-01-15”, “15/01/2024” inconsistently
Auditability	Data lineage and changes can be traced	No log of who changed a record or when

Common Data Quality Issues

Issue	Description	Example	Impact
Missing Values	Null, blank, or absent data points	Empty phone number field; NULL income	Biased analysis; model errors
Duplicates	Same entity recorded multiple times	“John Smith” and “Jon Smith” at same address	Inflated counts; wasted marketing spend
Inconsistent Formats	Same information represented differently	“USA”, “United States”, “US”, “U.S.A.”	Grouping/aggregation errors
Outliers	Extreme values that may or may not be valid	Salary of $10,000,000 for a junior analyst	Skewed statistics; model distortion
Stale Data	Data that is no longer current	Customer address from 5 years ago	Failed deliveries; wrong analysis
Incorrect Data	Factually wrong values	Birth year 2095; negative quantities	Wrong conclusions; compliance risk
Encoding Issues	Character set or encoding problems	“CafÃ©” instead of “Café”; garbled text	Data loss; parsing failures
Schema Changes	Data structure changes without documentation	New column added; column renamed	Pipeline failures; broken queries
Unit Mismatches	Different measurement units mixed	Temperature in Celsius and Fahrenheit in same column	Mathematical errors
Selection Bias	Data doesn’t represent the target population	Only surveying English-speaking users	Biased models; unfair outcomes
Label Errors	Incorrect labels in supervised learning data	Image of a cat labeled as “dog”	Poor model training
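
The encoding issue in the table is easy to reproduce. In the sketch below, text written as UTF-8 is read back using the Latin-1 character set, which yields exactly the garbled form shown above. If the wrong decoding is known, the damage can be undone by reversing the steps.

```python
original = "Café"

# Writing as UTF-8 but reading as Latin-1 produces the garbled text.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # CafÃ©

# If the wrong decoding is known, reversing the two steps recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Café
```

The safest fix is prevention: state the encoding explicitly whenever a file is read or written, rather than relying on a system default.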

Data Quality Assessment and Profiling

Data profiling is the process of examining data to understand its structure, content, quality, and relationships. It is the first step in any data quality improvement effort.

Data Profiling Techniques

Technique	What It Reveals
Column Analysis	Data type, % null, distinct values, min/max, mean/median, distribution
Pattern Analysis	Common formats, regex patterns, unexpected characters
Frequency Analysis	Most/least common values, distribution of categories
Cross-Column Analysis	Correlations, dependencies, functional relationships
Cross-Table Analysis	Referential integrity, join quality, orphan records
Temporal Analysis	Trends over time, gaps in time series, seasonality
Rule-Based Validation	Checking against predefined business rules
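
Two of these techniques are sketched below in plain Python: pattern analysis, which reduces each value to a format "shape" so that mixed formats stand out, and a cross-table check for orphan records. The sample dates, IDs, and the `value_pattern` helper are invented for illustration.

```python
import re
from collections import Counter

def value_pattern(value):
    """Replace letters with 'A' and digits with '9' to expose a value's format."""
    pattern = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", pattern)

# Pattern analysis on a hypothetical date column with mixed formats.
dates = ["2024-01-15", "2024-02-03", "15/01/2024", "Jan 15, 2024", "2024-03-22"]
pattern_counts = Counter(value_pattern(d) for d in dates)
print(pattern_counts.most_common())

# Cross-table analysis: orders whose customer_id has no match in the customer table.
customer_ids = {1, 2, 3}
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 7}]
orphans = [o for o in orders if o["customer_id"] not in customer_ids]
print(orphans)
```

The pattern counts show at a glance that one format dominates and two others need converting, without reading every value.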

2.5.2 Data Profiling Example in Python

    import pandas as pd
    import numpy as np

    # Load dataset
    df = pd.read_csv('customer_data.csv')

    # === BASIC PROFILING ===

    # Shape and types
    print(f"Shape: {df.shape}")
    print(f"\nData Types:\n{df.dtypes}")

    # Missing values analysis
    missing = df.isnull().sum()
    missing_pct = (df.isnull().sum() / len(df)) * 100
    missing_report = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct.round(2)
    }).sort_values('Missing %', ascending=False)
    print(f"\nMissing Values:\n{missing_report}")

    # Summary statistics
    print(f"\nNumerical Summary:\n{df.describe()}")
    print(f"\nCategorical Summary:\n{df.describe(include='object')}")

    # Duplicate detection
    duplicates = df.duplicated().sum()
    print(f"\nDuplicate Rows: {duplicates} ({(duplicates/len(df)*100):.2f}%)")

    # Unique values per column
    for col in df.columns:
        n_unique = df[col].nunique()
        print(f"{col}: {n_unique} unique values ({(n_unique/len(df)*100):.1f}%)")

    # Value distribution for categorical columns
    for col in df.select_dtypes(include='object').columns:
        print(f"\n{col} - Top 10 Values:")
        print(df[col].value_counts().head(10))

    # Outlier detection using IQR
    for col in df.select_dtypes(include=[np.number]).columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)).sum()
        print(f"{col}: {outliers} outliers ({(outliers/len(df)*100):.2f}%)")


    # === AUTOMATED PROFILING TOOLS ===

    # Using ydata-profiling (formerly pandas-profiling)
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Customer Data Quality Report")
    profile.to_file("data_quality_report.html")

    # Using Great Expectations (rule-based validation)
    # Note: context.sources and the validator interface belong to the older 0.x
    # releases; newer versions use a different API, so check your installed version.
    import great_expectations as gx

    context = gx.get_context()

    # Define expectations
    validator = context.sources.pandas_default.read_dataframe(df)
    validator.expect_column_values_to_not_be_null("customer_id")
    validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
    validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
    validator.expect_column_values_to_be_in_set("status", ["Active", "Inactive", "Suspended"])
    validator.expect_column_values_to_be_unique("customer_id")

2.5.3 Data Cleaning Strategies

Issue	Strategy	Python Example
Missing Values	Drop, impute (mean/median/mode), forward/backward fill, predictive imputation	`df['age'] = df['age'].fillna(df['age'].median())`
Duplicates	Identify and remove exact and fuzzy duplicates	`df.drop_duplicates(subset=['email'], keep='first', inplace=True)`
Inconsistent Categories	Standardize and map values	`df['country'] = df['country'].replace({'US': 'United States', 'USA': 'United States'})`
Outliers	Remove, cap/floor (winsorize), or transform	`df['income'] = df['income'].clip(lower=0, upper=df['income'].quantile(0.99))`
Wrong Data Types	Cast to correct types	`df['date'] = pd.to_datetime(df['date'])`
Whitespace/Formatting	Strip and normalize	`df['name'] = df['name'].str.strip().str.title()`
Invalid Values	Validate against rules; replace or flag	`df.loc[df['age'] < 0, 'age'] = np.nan`
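
Exact duplicates are easy to drop, but fuzzy duplicates such as “John Smith” and “Jon Smith” need a similarity measure. The sketch below uses `difflib.SequenceMatcher` from the standard library. The 0.9 threshold and the sample names are assumptions for this example, and a suitable threshold depends on the data. Dedicated record-linkage tools are more robust on large datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def find_fuzzy_duplicates(names, threshold=0.9):
    """Return pairs of names whose similarity meets the threshold."""
    return [(a, b) for a, b in combinations(names, 2) if similarity(a, b) >= threshold]

# Hypothetical customer names, including a misspelling and a formatting variant.
names = ["John Smith", "Jon Smith", "Mary Jones", " john smith "]
pairs = find_fuzzy_duplicates(names)
print(pairs)
```

Flagged pairs should normally be reviewed, or confirmed with a second field such as address or email, before records are merged, because two different people can have very similar names.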

2.5.4 Data Governance

Definition

Data Governance is the overall management of the availability, usability, integrity, quality, and security of data used in an organization. It establishes the policies, processes, standards, roles, and metrics that ensure effective and efficient use of data.

Data governance is not just a technology problem — it is an organizational discipline that encompasses people, processes, and technology.

2.5.5 Data Governance Policies

Policy Area	Description	Examples
Data Classification	Categorizing data by sensitivity level	Public, Internal, Confidential, Restricted/Secret
Data Access Control	Who can access what data and under what conditions	Role-based access (RBAC), need-to-know basis
Data Retention	How long data is kept and when it is archived/deleted	Financial records retained for 7 years; logs for 90 days
Data Privacy	How personal data is collected, used, stored, and shared	Consent management, anonymization, right to deletion
Data Quality Standards	Minimum quality thresholds for data to be used	Completeness > 95%, accuracy verified quarterly
Data Sharing	Rules for sharing data internally and externally	Data sharing agreements, anonymization requirements
Acceptable Use	How data may and may not be used	No using health data for marketing without explicit consent
Master Data Management	Standards for maintaining master/reference data	Single source of truth for customer, product data

2.5.6 Data Privacy and Regulatory Compliance

Major Data Privacy Regulations

Regulation	Region	Key Requirements
GDPR (General Data Protection Regulation)	European Union (2018)	Consent, right to access/delete/port data, data protection by design, breach notification within 72 hours, DPO appointment (where required)
CCPA / CPRA (California Consumer Privacy Act / California Privacy Rights Act)	California, USA (2020/2023)	Right to know, delete, opt-out of data sale, non-discrimination
HIPAA (Health Insurance Portability and Accountability Act)	USA	Protects health information (PHI); strict security and access controls

GDPR Key Principles

Principle	Description
Lawfulness, Fairness, Transparency	Data must be processed legally, fairly, and transparently
Purpose Limitation	Data collected for specific, explicit purposes only
Data Minimization	Only collect data that is necessary for the stated purpose
Accuracy	Data must be accurate and kept up to date
Storage Limitation	Data should not be kept longer than necessary
Integrity & Confidentiality	Data must be protected against unauthorized access, loss, or damage
Accountability	Organizations must demonstrate compliance

Data Security Fundamentals

Security Measure	Description
Encryption	Encrypting data at rest (stored) and in transit (transmitted) — AES-256, TLS/SSL
Access Control	Role-Based Access Control (RBAC); principle of least privilege
Authentication	Verifying identity — passwords, MFA (multi-factor authentication), SSO
Authorization	Determining what authenticated users are allowed to do
Anonymization	Removing personally identifiable information (PII) irreversibly
Pseudonymization	Replacing identifiers with pseudonyms (reversible with a key)
Data Masking	Hiding sensitive data (e.g., showing only last 4 digits of SSN: XXX-XX-1234)
Audit Logging	Recording who accessed what data, when, and what they did
Network Security	Firewalls, VPNs, intrusion detection systems
Backup & Recovery	Regular backups with tested recovery procedures
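
Two of these measures, pseudonymization and data masking, are sketched below using the standard library. A keyed hash (HMAC-SHA256) is one possible way to generate pseudonyms. It is itself one-way, so re-identification in this sketch relies on the key holder recomputing the hash or keeping a lookup table. The secret key, identifier, and function names are placeholders invented for illustration.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from a secret manager, never from source code.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(identifier, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_ssn(ssn):
    """Show only the last four digits of an SSN formatted as NNN-NN-NNNN."""
    return "XXX-XX-" + ssn[-4:]

token = pseudonymize("customer-42")
print(token[:16], "...")          # same input and key always give the same pseudonym
print(mask_ssn("123-45-1234"))    # XXX-XX-1234
```

Because the same identifier always maps to the same pseudonym, records can still be joined across tables without exposing the original value.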

2.5.7 Data Ethics

Beyond legal compliance, data practitioners must consider ethical responsibilities:

2.5.8 Key Ethical Principles

Principle	Description
Transparency	Be clear about what data you collect, why, and how it will be used
Fairness	Ensure analyses and models do not discriminate against protected groups
Privacy	Respect individuals’ right to control their personal information
Consent	Obtain informed consent before collecting personal data
Accountability	Take responsibility for the outcomes of data-driven decisions
Beneficence	Ensure data use creates benefit and minimizes harm
Data Minimization	Only collect and retain data that is necessary
Human Oversight	Maintain human review for high-stakes automated decisions

2.6 Summary

Data comes in three structural forms:
- Structured (~10-20%): Tables with rows and columns, stored in RDBMS, queried with SQL
- Semi-Structured (~5-10%): JSON, XML, logs — some organization but flexible schema
- Unstructured (~80-90%): Text, images, audio, video — requires AI/ML to analyze
Primary data is collected firsthand for your specific purpose (high relevance, high cost). Secondary data is pre-existing data collected by others (lower cost, may not perfectly fit).
Data sources are diverse — surveys, APIs, databases, web scraping, IoT sensors, public datasets. The choice depends on the research question, cost, quality, and legal/ethical constraints.
Data quality has multiple dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Poor data quality has significant financial and operational consequences.
Data governance is the organizational framework (people, policies, processes, technology) that ensures data is managed as a strategic asset — covering quality, security, privacy, compliance, and ethics.
Privacy regulations (GDPR, CCPA, HIPAA, etc.) impose strict requirements on how personal data is collected, processed, stored, and shared. Non-compliance carries severe penalties.
Ethics in data science goes beyond legal compliance — practitioners must consider fairness, transparency, consent, bias, and the potential for harm in all data-related activities.
Data catalogs and lineage help organizations discover, understand, trust, and trace their data assets.

2.7 Review Questions

A hospital stores patient records in a relational database (name, DOB, diagnosis codes), medical images (X-rays, MRIs) in a file system, and doctor’s notes as free-text documents. Classify each data type as structured, semi-structured, or unstructured. What different tools/techniques would be needed to analyze each?
A startup wants to build a model predicting restaurant success in a new city. Propose a data collection strategy that uses both primary and secondary data. What specific sources would you recommend?
You receive a customer dataset with 2 million records. Describe the step-by-step data profiling process you would follow to assess its quality. What specific checks would you perform?
A company’s marketing team wants to purchase third-party data about consumer spending habits to enrich their customer profiles. What data governance and ethical considerations should be evaluated before proceeding?
Design a simple data governance framework for a mid-sized e-commerce company. What roles, policies, and tools would you recommend?
Explain how data lineage could help a data analyst debug a dashboard that is showing incorrect revenue figures.

Exercise 1: Data Profiling

Download the Titanic dataset from Kaggle and perform a complete data quality assessment using Python. Report on missing values, data types, duplicates, outliers, and inconsistencies.

Exercise 2: API Data Collection

Write a Python script that collects weather data for 5 cities using the OpenWeatherMap API, stores the results in a pandas DataFrame, and exports to CSV.

Exercise 3: Data Cleaning Pipeline

Given a messy dataset with missing values, duplicates, inconsistent formatting, and outliers, write a Python data cleaning pipeline that addresses all issues and produces an analysis-ready dataset.

2.8 Further Reading & References

DAMA International. (2017). DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition). Technics Publications.
Redman, T.C. (2008). Data Driven: Profiting from Your Most Important Business Asset. Harvard Business Press.
Reis, J. & Housley, M. (2022). Fundamentals of Data Engineering. O’Reilly Media.
Ladley, J. (2019). Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program (2nd Edition). Academic Press.
European Commission. GDPR Official Text: https://gdpr.eu/
Great Expectations Documentation: https://greatexpectations.io/

Topic Three: Data Cleaning and Preprocessing

3.1 Handling Missing Data

Missing values can arise from data entry errors, system issues, or unavailable information.

Common approaches

Remove rows/columns
- Use when missingness is small or the feature is not important.
Imputation
- Numerical data: mean, median, interpolation
- Categorical data: mode, “Unknown” category
- Advanced methods: KNN imputation, model-based imputation
Flag missingness
- Add a binary feature indicating whether a value was missing.
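
The approaches above can be sketched with pandas. This is a minimal illustration on a made-up table; the column names (`age`, `city`) and values are hypothetical, and the median is just one reasonable imputation choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["A", "B", None, "A", "C"],
})

# Approach 1: remove rows that contain any missing value.
dropped = raw.dropna()

# Approaches 2 and 3 work on a copy so the raw data stays unchanged.
df = raw.copy()

# Flag missingness first, so the information survives imputation.
df["age_missing"] = df["age"].isna().astype(int)

# Numerical column: impute with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with an explicit "Unknown" category.
df["city"] = df["city"].fillna("Unknown")

print(dropped)
print(df)
```

Dropping rows keeps only two of the five records here, which shows why removal is best reserved for cases where missingness is small.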

Key consideration

Understand whether data is:

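MCAR: Missing Completely at Random (missingness is unrelated to any observed or unobserved values)
MAR: Missing at Random (missingness depends only on other observed variables)
MNAR: Missing Not at Random (missingness depends on the unobserved value itself)

The mechanism determines which approach is safe: simply deleting rows is generally unbiased only under MCAR.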

3.2 Outlier Detection and Treatment

Outliers are data points that differ significantly from other observations. They can be genuine extreme cases or simple errors.

Detection methods

Statistical
- Z-Score: Identifying points that fall more than 3 standard deviations from the mean
- IQR Method: Defining outliers as points falling below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$, where $IQR = Q3 - Q1$
Visualization
- Boxplots
- Scatterplots
- Histograms
Model-based
- Isolation Forest
- DBSCAN
- Local Outlier Factor
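
The two statistical methods can be sketched in a few lines of pandas. The series below is invented for illustration, and the thresholds (3 standard deviations, 1.5 times the IQR) are the conventional defaults rather than fixed rules.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Z-score: distance from the mean in units of (population) standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

# IQR method: flag points outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Both methods flag only the value 95 in this toy series.
print(z_outliers.tolist(), iqr_outliers.tolist())
```

Note that the extreme value inflates the mean and standard deviation it is judged against, so the z-score can miss outliers in small samples; the IQR fences are based on quartiles and are less affected.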

Treatment options

Remove if clearly erroneous
Cap/winsorize extreme values
Transform data
Keep them if they are valid and meaningful
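
As a rough sketch of the first two options, the snippet below removes or caps values outside the IQR fences. The data is hypothetical, and using the IQR fences as capping limits is an assumption; percentile limits (for example the 1st and 99th) are an equally common choice.

```python
import pandas as pd

# Hypothetical measurements with one extreme value (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Compute the IQR fences used as limits.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove points outside the fences (if clearly erroneous).
removed = values[values.between(lower, upper)]

# Option 2: cap (winsorize) extreme values at the fences, keeping every row.
capped = values.clip(lower=lower, upper=upper)

print(len(removed), capped.max())
```

Capping keeps the sample size intact, which matters when the other columns in the same row are still trustworthy.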

3.3 Data Transformation and Normalization

Used to make data suitable for analysis or machine learning.

Transformation techniques

Log Transformation: compresses large values to reduce right skew, helping the data approximate a normal distribution; it requires positive values (use log(1 + x) when zeros are present).
Square root / Box-Cox / Yeo-Johnson: stabilize variance
Binning: convert continuous values into intervals
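
A small sketch of a log transformation and binning, assuming a hypothetical right-skewed `income` series; the bin edges and labels are illustrative choices, not recommended cut-offs.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes: one value is far larger than the rest.
income = pd.Series([20_000, 25_000, 30_000, 45_000, 60_000, 1_000_000])

# Log transformation: log1p computes log(1 + x), which is also defined at zero.
log_income = np.log1p(income)

# Binning: convert the continuous values into labeled intervals.
# Intervals are right-inclusive by default: (0, 30000], (30000, 100000], ...
bins = [0, 30_000, 100_000, np.inf]
labels = ["low", "medium", "high"]
income_band = pd.cut(income, bins=bins, labels=labels)

print(income.skew(), log_income.skew())
print(income_band.tolist())
```

The skewness printed for the log-transformed series is lower than for the raw series, though a single transformation rarely makes data exactly normal.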

Scaling / Normalization

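Many machine learning algorithms, especially distance-based and gradient-based ones (like SVM or K-Means), are sensitive to the scale of data. If one feature ranges from 0–1 and another from 0–10,000, the feature with the larger scale will dominate distance calculations and therefore the model. Tree-based models are largely unaffected by scaling.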

Min-Max Scaling: rescales to [0,1]
Standardization: mean = 0, std = 1
Robust Scaling: uses median and IQR; useful with outliers
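
The three scaling methods reduce to simple formulas, sketched here with NumPy on a made-up array that contains one outlier. In practice, scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` implement the same ideas and remember the fitted parameters for reuse on new data.

```python
import numpy as np

# Hypothetical feature with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-Max scaling: (x - min) / (max - min), result lies in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and std 1.
standardized = (x - x.mean()) / x.std()

# Robust scaling: (x - median) / IQR, based on quartiles instead of extremes.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max)
print(standardized)
print(robust)
```

With Min-Max scaling the outlier squeezes the four ordinary values into roughly the bottom 3% of the range, whereas robust scaling keeps them evenly spread around zero.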

3.4 Feature Creation and Encoding

Most models operate on numbers, not text, so categorical “labels” must be converted into numerical formats.
Well-designed features also improve model performance by making raw data more informative.

Feature creation

Date-based features: year, month, day, weekday
Aggregations: totals, averages, counts
Interaction terms: multiply or combine variables
Domain-specific engineered variables
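
The feature types above can be sketched with pandas. The `orders` table and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical order records.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-02", "2024-03-15"]
    ),
    "quantity": [2, 1, 5, 3, 4],
    "unit_price": [10.0, 20.0, 4.0, 6.0, 5.0],
})

# Date-based features extracted from the timestamp.
orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()

# Interaction term: combine two variables into a more informative one.
orders["order_value"] = orders["quantity"] * orders["unit_price"]

# Aggregations: totals, averages, and counts per customer.
customer_features = orders.groupby("customer_id").agg(
    total_spend=("order_value", "sum"),
    avg_order_value=("order_value", "mean"),
    order_count=("order_value", "count"),
).reset_index()

print(customer_features)
```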

Encoding categorical variables

Label Encoding: assign a unique integer to each category (e.g., Red=1, Blue=2). Because the integers imply an order, this is best for ordinal data where order matters (e.g., Small, Medium, Large); for unordered categories, prefer one-hot encoding.
One-Hot Encoding: create binary columns
Ordinal Encoding: for ordered categories
Target / Frequency Encoding: useful for high-cardinality categories
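
A minimal sketch of three of these encodings in pandas, using invented `color` and `size` columns. The explicit `size_order` mapping is an assumption about the intended order of the categories.

```python
import pandas as pd

# Hypothetical categorical data: color is unordered, size is ordered.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Green"],
    "size": ["Small", "Large", "Medium", "Small"],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: map ordered categories to integers that respect the order.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Frequency encoding: replace each category with its relative frequency.
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

print(one_hot)
print(df)
```

Frequency encoding adds a single column however many categories exist, which is why it suits high-cardinality features where one-hot encoding would create too many columns.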

3.5 Reproducible Data Workflows

Ensures preprocessing is consistent, traceable, and reusable.

Best practices

Keep raw data unchanged
Automate preprocessing with scripts/pipelines
Document assumptions and steps
Use version control (e.g., Git)
Set random seeds for reproducibility
Use notebooks carefully; move final logic into reusable code
Track data and model versions

Tools often used

Python: pandas, scikit-learn, numpy
R: tidyverse, caret, tidymodels
Workflow tools (scikit-learn): Pipeline, ColumnTransformer
Experiment/data tracking: MLflow, DVC
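
As an illustration of automating preprocessing, the sketch below combines scikit-learn's `Pipeline` and `ColumnTransformer` so that imputation, scaling, and encoding are defined once and applied consistently. The table, its column names, and the chosen strategies are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data; it is never modified in place.
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [30_000.0, 45_000.0, np.nan, 60_000.0],
    "city": ["A", "B", np.nan, "A"],
})

# Numeric columns: impute with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with the mode, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing steps.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age", "income"]),
    ("cat", categorical_steps, ["city"]),
])

# Fit on the data and return the transformed feature matrix.
features = preprocess.fit_transform(raw)
print(features.shape)
```

Because the fitted `preprocess` object stores the learned medians, means, and categories, calling `preprocess.transform` on new data applies exactly the same steps, which supports the consistency goal described above.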

3.6 Summary

Data cleaning and preprocessing typically include:

Fixing or imputing missing values
Detecting and handling outliers
Transforming and scaling data
Creating useful features and encoding categories
Building reproducible workflows for consistency