Module One: Foundations of Data Analytics

Chuka University
CDAM
Data Science
Machine Learning
AI Tools
Statistics
Author: D K. Muriithi | CDAM-Chuka University

Published: April 1, 2026

Topic One: Introduction to Data Analytics

Learning Objectives

By the end of this topic, students should be able to:

  • Define data analytics

  • Explain the role of data analytics in organizations

  • Describe the data analytics lifecycle

  • Differentiate between descriptive, diagnostic, predictive, and prescriptive analytics

1.1 What is Data Analytics?

Data analytics is the process of collecting, cleaning, transforming, and analyzing data to generate insights and support decision-making. It bridges the gap between raw data and actionable insight. Data analytics is one of the most transformative disciplines in modern business and science.

Main purpose:

  • Understand past events

  • Explain causes

  • Predict future outcomes

  • Recommend actions

Why organizations use it:

  • Helps in decision-making (data > guesswork)

  • Improves efficiency (find and fix problems)

  • Helps understand customers better

  • Detects risks & fraud

  • Supports planning for the future

  • Tracks performance (KPIs)

📝 In short:

Data analytics turns raw data → useful insights → better actions

1.2 Data Analytics in the Context of DS, ML, and AI

Relationship:

  • Data Analytics → extracts insights from data

  • Data Science → broader field combining statistics, programming, and domain knowledge

  • Machine Learning → builds models that learn from data

  • Artificial Intelligence → enables intelligent systems and automated decisions

Key idea:

Data analytics provides the foundation for ML and AI by preparing and interpreting data.

1.3 Why Data Analytics Matters in Organizations

Organizations today generate and have access to massive volumes of data. The ability to harness this data for actionable insight provides a strategic competitive advantage.

Key Reasons Organizations Invest in Data Analytics

| # | Reason | Description |
|---|---|---|
| 1 | Informed Decision-Making | Replace gut-feeling decisions with evidence-based strategies |
| 2 | Cost Reduction | Identify inefficiencies, optimize supply chains, reduce waste |
| 3 | Revenue Growth | Discover new market opportunities, optimize pricing, upselling |
| 4 | Risk Management | Detect fraud, predict failures, assess credit risk |
| 5 | Customer Understanding | Segment customers, personalize experiences, predict churn |
| 6 | Operational Efficiency | Streamline processes, predictive maintenance, workforce optimization |
| 7 | Innovation | Identify trends, develop new products, explore new business models |

Organizations use data analytics to:

  • Make better decisions

  • Improve efficiency

  • Understand customers

  • Measure performance

  • Gain competitive advantage

Example:

A retail company analyzes sales data to decide which products to restock.

1.4 Data Analytics Across Industries

Data analytics is not limited to technology companies. Its applications span virtually every sector:

Healthcare

  • Predicting disease outbreaks and patient readmission

  • Drug discovery and genomics research

  • Optimizing hospital resource allocation

  • Personalized treatment plans

Finance & Banking

  • Credit scoring and loan approval

  • Algorithmic trading

  • Fraud detection in real time

  • Anti-money laundering (AML) compliance

Retail & E-Commerce

  • Customer segmentation and targeting

  • Recommendation engines (e.g., Amazon, Netflix)

  • Demand forecasting and inventory management

  • Dynamic pricing strategies

Manufacturing

  • Predictive maintenance of equipment

  • Quality control through sensor data

  • Supply chain optimization

  • Digital twins for simulation

Government & Public Sector

  • Census data analysis and urban planning
  • Crime pattern analysis and predictive policing
  • Tax fraud detection
  • Public health monitoring

Telecommunications

  • Network optimization

  • Customer churn prediction

  • Sentiment analysis of customer feedback

Sports

  • Player performance analytics (e.g., Moneyball)

  • Injury prediction

  • Fan engagement optimization

1.5 Role of Data Analytics in Organizations

1. Better decision-making

  • Supports evidence-based decisions, helping managers make informed choices

  • Reduces reliance on guesswork

📝 Example: Choosing the best product to invest in

2. Process improvement and efficiency

  • Identifies bottlenecks, waste, and delays
  • Optimizes processes

📝 Example: Reducing delivery time in logistics

3. Customer understanding

  • Identifies customer needs, preferences, and behavior
  • Helps in customer segmentation

📝 Example: Recommending products/services

4. Risk management

  • Detects fraud, errors, and unusual patterns/behavior

  • Supports safer operations

📝 Example: Fraud detection in banking

5. Strategic Planning

  • Helps in long-term planning and forecasting

  • Identifies market trends

📝 Example: Deciding future business expansion

6. Performance Measurement

  • Tracks progress using KPIs

  • Evaluates success of strategies

📝 Example: Monitoring sales performance

Other uses of analytics by department:

  • Marketing: customer segmentation, campaign analysis

  • Finance: forecasting, fraud detection

  • Operations: inventory and supply chain optimization

  • HR: employee performance and retention

  • Healthcare: patient analysis and disease prediction

1.6 The Data-Driven Organization: Key Characteristics

Research by McKinsey, MIT, and others consistently shows that data-driven organizations share common traits:

  • Data is treated as a strategic asset, with governance and quality standards

  • Decision-making at all levels is supported (not replaced) by data

  • A culture of experimentation exists — hypotheses are tested, not just assumed

  • Cross-functional data teams collaborate with domain experts

  • Investment in data infrastructure: pipelines, warehouses, and visualization tools

1.7 Key Roles in a Data Analytics Team

Modern analytics organizations typically include the following roles, though boundaries often overlap:

  • Data Analyst: Explores and summarizes data, builds dashboards and reports, answers specific business questions using SQL, Excel, and BI tools.

  • Data Scientist: Builds statistical and machine learning models to generate predictions, uncover patterns, and automate decisions. Requires strong coding and math skills.

  • Data Engineer: Designs and maintains the pipelines and infrastructure that move, store, and process data reliably at scale.

  • ML Engineer: Puts machine learning models into production, taking a data scientist’s prototype and deploying it into reliable, scalable systems.

1.8 Data Analytics Lifecycle

The data analytics lifecycle is the sequence of steps followed in an analytics project.

Typical stages:

  1. Problem definition

  2. Data collection

  3. Data preparation

  4. Data exploration

  5. Analysis/modeling

  6. Interpretation

  7. Communication

  8. Deployment

  9. Monitoring

Step 1 – Problem Definition

Business Understanding

Every data project starts not with data, but with a question. This phase defines what success looks like.

  • Clearly articulate the business problem or opportunity

  • Define measurable objectives and KPIs (Key Performance Indicators)

  • Identify stakeholders and understand their constraints

  • Determine whether the problem requires analytics at all — sometimes the answer is simpler

  • Agree on how the output will be used and by whom

Before analyzing data, define:

  • What problem needs to be solved?

  • What is the business goal?

  • What decisions will depend on the analysis?

  • How will success be measured?

Example:

Reduce customer churn in a telecom company.
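To make "measurable objectives" concrete, the churn goal above could be tied to a KPI such as the monthly churn rate. The sketch below assumes churn rate is defined as customers lost during a period divided by customers at the start of that period; the function name, the figures, and the 2% target are hypothetical.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Share of customers at the start of a period who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical figures for illustration only.
baseline = churn_rate(customers_at_start=20_000, customers_lost=500)
target = 0.02  # assumed success criterion: churn at or below 2% per month

print(f"Baseline churn: {baseline:.1%}, target: {target:.1%}")
print("Target met" if baseline <= target else "Target not yet met")
```

Writing the success criterion down as a number at this stage gives every later phase something specific to measure against.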

Step 2 – Data Collection

Gather relevant data from different sources:

  • Databases

  • Spreadsheets

  • Websites

  • Sensors

  • APIs

  • Surveys

  • Social media

  • Logs

Goal: Collect data relevant to the problem.

Data Understanding

Once the problem is defined, the team explores what data exists and what additional data may be needed.

  • Inventory available data sources (internal databases, APIs, third-party feeds)

  • Conduct Exploratory Data Analysis (EDA) to understand distributions, ranges, and patterns

  • Assess data quality: completeness, accuracy, consistency, timeliness

  • Identify missing values, outliers, and anomalies

  • Document data lineage and metadata
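A first data quality check along these lines can be sketched with the Python standard library. The records and field names below are hypothetical, and the 1.5 × IQR rule is only one common rule of thumb for flagging possible outliers, not the required method.

```python
import statistics

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "monthly_spend": 52.0},
    {"customer_id": 2, "age": None, "monthly_spend": 48.5},
    {"customer_id": 3, "age": 45, "monthly_spend": 50.0},
    {"customer_id": 4, "age": 29, "monthly_spend": 51.5},
    {"customer_id": 5, "age": 41, "monthly_spend": 900.0},  # suspicious value
    {"customer_id": 6, "age": 38, "monthly_spend": None},
]


def completeness(rows, field):
    """Fraction of rows where the field is present (not None)."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)


def iqr_outliers(values):
    """Values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


for field in ("age", "monthly_spend"):
    print(f"{field}: {completeness(records, field):.0%} complete")

spend = [r["monthly_spend"] for r in records if r["monthly_spend"] is not None]
print("Possible outliers:", iqr_outliers(spend))
```

A flagged value is a prompt for investigation, not automatically an error: it may be a data entry mistake or a genuinely unusual customer.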

Step 3 – Data Cleaning & Preparation

Often the most time-consuming phase, data preparation (or ‘data wrangling’) transforms raw data into a form suitable for analysis. Industry surveys consistently report that data professionals spend 60–80% of their time on this phase.

👉 Raw data often contains problems.

Common tasks:

  • Data cleaning: handling missing values, correcting errors, removing duplicates, handling outliers, standardizing formats

  • Feature engineering: creating new variables from existing ones to better capture patterns

  • Data transformation: normalization, encoding categorical variables, log transforms

  • Data integration: merging data from multiple sources into a unified dataset

  • Splitting data: creating training, validation, and test sets for modeling

Important note:

Poor-quality data leads to poor analysis.
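Three of the cleaning tasks listed above (removing duplicates, standardizing formats, handling missing values) can be sketched as follows. The records are hypothetical, and median imputation is an assumption chosen for simplicity; the right treatment of missing values depends on the data and the problem.

```python
import statistics

# Hypothetical raw records with typical problems: a duplicate row,
# inconsistent formatting, and a missing value.
raw = [
    {"id": 1, "city": " nairobi ", "income": 52000},
    {"id": 2, "city": "MOMBASA", "income": None},
    {"id": 2, "city": "MOMBASA", "income": None},   # duplicate of the row above
    {"id": 3, "city": "Kisumu", "income": 48000},
    {"id": 4, "city": "nairobi", "income": 61000},
]


def clean(rows):
    # 1. Remove duplicates, keeping the first occurrence of each id.
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))  # copy so the raw data is left untouched

    # 2. Standardize formats: trim whitespace and use title case for cities.
    for row in unique:
        row["city"] = row["city"].strip().title()

    # 3. Handle missing values: impute income with the median of known values.
    known = [r["income"] for r in unique if r["income"] is not None]
    median_income = statistics.median(known)
    for row in unique:
        if row["income"] is None:
            row["income"] = median_income
    return unique


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Keeping the raw records unchanged and returning a cleaned copy makes it possible to audit exactly what the cleaning step altered.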

Step 4 – Data Exploration

Purpose: Understand the structure and patterns in the data.

Activities:

  • Summary statistics

  • Visualizations

  • Correlation checks

  • Trend analysis

  • Detect anomalies

Output: Early insights and hypotheses
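Summary statistics and a correlation check, two of the activities above, might look like this in plain Python. The figures are hypothetical, and the `pearson` helper is written out by hand only to show what the coefficient measures.

```python
import math
import statistics

# Hypothetical monthly figures: advertising spend and units sold.
ad_spend = [10, 12, 15, 18, 20, 25]
units_sold = [110, 118, 131, 140, 152, 170]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print("Mean units sold:", round(statistics.fmean(units_sold), 1))
print("Median units sold:", statistics.median(units_sold))
print("Std dev units sold:", round(statistics.stdev(units_sold), 1))
print("Correlation (ad spend vs units):", round(pearson(ad_spend, units_sold), 3))
```

A coefficient close to 1 here would suggest a hypothesis worth testing (advertising may drive sales), which is exactly the kind of output this step is meant to produce.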

Step 5 – Data Analysis / Modeling

Methods used:

  • Statistical analysis

  • Dashboards and reporting

  • Forecasting

  • Machine learning models

  • Optimization

Goal: Answer the business question using appropriate techniques.

Modeling

With clean, prepared data, analysts and data scientists build the analytical or machine learning models that address the business question.

  • Select appropriate modeling techniques (regression, classification, clustering, time series, etc.)

  • Train models on training data

  • Tune hyperparameters and validate model choices

  • Compare models using appropriate metrics (accuracy, RMSE, AUC-ROC, F1-score, etc.)

  • Document model assumptions and limitations
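The metrics named above can be computed by hand to see what each one rewards. This is a sketch with hypothetical churn labels and two made-up sets of predictions; in practice a library such as scikit-learn would normally supply these functions.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error for numeric predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical churn labels (1 = churned) and two models' predictions.
actual = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # cautious: misses one churner
model_b = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]   # aggressive: two false alarms

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, classification_metrics(actual, preds))
```

In this made-up case model A has the higher accuracy while model B has the higher recall, so the "better" model depends on whether missing a churner or raising a false alarm is more costly to the business.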

Step 6 – Interpretation of Results

Ask:

  • What do the results mean?

  • Are the findings reliable?

  • Do they answer the original question?

  • What actions should follow?

Key point:

Results must be linked to business value.

Step 7 – Communication and Visualization

Results should be presented clearly using:

  • Charts

  • Graphs

  • Dashboards

  • Reports

  • Presentations

Goal:

Ensure stakeholders understand the findings and recommendations.

Step 8 – Deployment / Action

A model that isn’t deployed creates no value. This phase operationalizes the analytical output.

Put insights into practice:

  • Launch a marketing campaign

  • Update a business policy

  • Automate predictions in software

  • Improve inventory strategy

  • Integrate the model or insights into business processes, systems, or dashboards

  • Set up API endpoints, batch scoring pipelines, or real-time inference systems

  • Train end users and document how to interpret and act on model outputs

  • Establish monitoring systems to detect performance degradation over time

Key idea:

Analytics creates value only when used in decision-making.

Step 9 – Monitoring and Improvement

After deployment:

  • Track outcomes: follow model performance metrics in production over time

  • Monitor model behavior: conduct periodic audits for bias and fairness

  • Update data and methods: retrain or update models as new data becomes available

  • Improve based on feedback: feed lessons learned back into future projects

  • Detect concept drift (when the statistical relationship between inputs and outputs changes)

Note:

The lifecycle is iterative and continuous.
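One simple way to monitor a deployed model is to compare its accuracy on each new batch of data against the accuracy measured at deployment. The sketch below assumes actual outcomes eventually become available for each batch; the batch data, the baseline of 0.85, and the tolerance of 0.10 are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual outcomes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def drift_alerts(batches, baseline, tolerance=0.10):
    """Return labels of batches whose accuracy fell more than `tolerance`
    below the baseline measured at deployment time."""
    alerts = []
    for label, y_true, y_pred in batches:
        if baseline - accuracy(y_true, y_pred) > tolerance:
            alerts.append(label)
    return alerts


# Hypothetical weekly batches of (label, actual outcomes, model predictions).
weekly = [
    ("week 1", [1, 0, 0, 1, 0], [1, 0, 0, 1, 0]),   # accuracy 1.0
    ("week 2", [0, 0, 1, 1, 0], [0, 0, 1, 0, 0]),   # accuracy 0.8
    ("week 3", [1, 1, 0, 1, 0], [0, 1, 1, 0, 0]),   # accuracy 0.4
]

print(drift_alerts(weekly, baseline=0.85))
```

A sustained drop in accuracy is a symptom that is consistent with concept drift, but it can also come from data quality problems upstream, so an alert should trigger investigation before retraining.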

📝 Simple flow:

Problem → Collect → Clean → Explore → Model → Interpret → Communicate → Deploy → Monitor

1.9 Types of Analytics

There are four major types of analytics:

  1. Descriptive

  2. Diagnostic

  3. Predictive

  4. Prescriptive

These types answer different questions.

Descriptive Analytics

Question answered:

👉 “What happened?”

Descriptive analytics is the most foundational and widely practiced form. It summarizes historical data to understand what has occurred in the past.

  • Goal: Provide a clear picture of past events and current state

  • Methods: Aggregation, summarization, pivot tables, basic visualization, dashboards, reports, KPIs

  • Tools: Excel, Tableau, Power BI, SQL, Looker Studio (formerly Google Data Studio)

Examples in practice:

  • A retailer’s monthly sales report showing revenue by product category and region

  • A hospital dashboard showing daily patient admissions, discharges, and average length of stay

  • A social media analytics report showing follower growth, engagement rates, and top-performing posts

  • Monthly sales report showing total revenue and top-selling products

Descriptive Analytics – Key Points

Features:

  • Focuses on past and current data

  • Provides summaries and trends

  • Easy to understand

Limitation:

Does not explain why something happened

📌 Note: Descriptive analytics answers ‘what’ but not ‘why’. It is the starting point for almost all analytical work and essential for data literacy across the organization. Most reporting and BI functions are primarily descriptive.
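At its core, a descriptive report is aggregation. The sketch below totals revenue by category and by region for a handful of hypothetical transactions, which is the same operation a pivot table or a SQL GROUP BY would perform.

```python
from collections import defaultdict

# Hypothetical sales transactions.
sales = [
    {"category": "Electronics", "region": "East", "revenue": 1200.0},
    {"category": "Electronics", "region": "West", "revenue": 800.0},
    {"category": "Clothing", "region": "East", "revenue": 450.0},
    {"category": "Clothing", "region": "West", "revenue": 300.0},
    {"category": "Groceries", "region": "East", "revenue": 650.0},
]


def total_by(rows, key):
    """Sum revenue for each distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["revenue"]
    return dict(totals)


by_category = total_by(sales, "category")
by_region = total_by(sales, "region")
top_category = max(by_category, key=by_category.get)

print("Revenue by category:", by_category)
print("Revenue by region:", by_region)
print("Top category:", top_category)
```

The output says which category and region earned the most, but nothing in it explains why, which is the limitation noted above.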

Diagnostic Analytics

Question answered:

👉 “Why did it happen?”

Diagnostic analytics goes deeper, investigating the causes and contributing factors behind observed outcomes.

  • Goal: Understand the root causes and contributing factors behind a past event or trend

  • Methods: Drill-down analysis, correlation analysis, data mining, hypothesis testing, comparisons across time or groups

  • Tools: SQL (with subqueries and window functions), Python/R, statistical tests

Examples in practice:

  • Investigating why Q3 revenue dropped — discovering a supply chain disruption coincided with competitor price cuts

  • Analyzing why customer satisfaction scores fell, and tracing it to longer wait times in the call center after a staffing reduction

  • Identifying why a marketing campaign underperformed — finding that the target audience was incorrectly segmented

A simple case is asking why sales dropped. Diagnostic analysis may reveal:

  • Price increase

  • Reduced advertising

  • Supply chain issues

  • Seasonal trends

Key value: Helps explain problems and opportunities

📌 Note: Diagnostic analytics often involves correlation analysis, but analysts must be careful not to confuse correlation with causation. Statistical correlation shows two variables move together; it does not prove one causes the other. Establishing causation requires controlled experiments or causal inference techniques.
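Drill-down analysis, one of the methods listed above, breaks an overall change into its segments to see where it is concentrated. The regional revenue figures below are hypothetical.

```python
# Hypothetical revenue by region for two consecutive quarters.
q2 = {"North": 500.0, "South": 420.0, "East": 610.0, "West": 380.0}
q3 = {"North": 495.0, "South": 425.0, "East": 430.0, "West": 375.0}


def change_by_segment(before, after):
    """Revenue change per segment, sorted from largest drop to largest gain."""
    changes = {seg: after[seg] - before[seg] for seg in before}
    return sorted(changes.items(), key=lambda item: item[1])


total_change = sum(q3.values()) - sum(q2.values())
ranked = change_by_segment(q2, q3)

print("Total change:", total_change)
for segment, delta in ranked:
    print(f"{segment}: {delta:+.1f}")
```

Here almost all of the decline sits in one region. That narrows the search, but it still only shows where the drop occurred; finding the cause requires further investigation of what changed in that region.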

Predictive Analytics

Question answered:

👉 “What is likely to happen?”

Predictive analytics uses historical patterns to forecast future outcomes. This is where machine learning becomes central.

  • Goal: Forecast likely future events from historical data, with quantifiable confidence

  • Methods: Regression, classification, time series forecasting, machine learning

  • Tools: Python (scikit-learn, statsmodels, Prophet), R, Azure ML, AWS SageMaker

Examples in practice:

  • A bank predicting which customers are likely to default on loans in the next 90 days

  • A retailer forecasting demand for each product SKU by store location for the upcoming holiday season

  • A telecom company predicting which customers are at high risk of churning in the next 30 days

  • A manufacturer predicting when industrial equipment will fail, enabling proactive maintenance

  • Predict customer churn

  • Forecast next quarter sales

  • Estimate loan default risk

  • Predict disease risk in patients

  • Predict disease outbreaks

Key concepts in predictive analytics:

  • Training data: Historical data used to fit (train) the model — the model learns patterns from this data.

  • Test data: Held-out data never seen during training, used to evaluate how well the model generalizes to new examples.

  • Overfitting: When a model learns the training data too well (including its noise) and fails to generalize to new data. A model that memorizes rather than learns.

  • Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.

Benefit: Enables proactive decision-making
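The four concepts above can be seen side by side in a small experiment. The sketch below uses synthetic data (an assumed linear relationship plus random noise) and three deliberately simple models written by hand: one that is too simple, a straight-line fit, and one that memorizes the training points. All names and numbers are illustrative.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic data: y depends linearly on x, plus random noise.
xs = [i / 2 for i in range(40)]
data = [(x, 3.0 + 2.0 * x + random.gauss(0, 2.0)) for x in xs]
random.shuffle(data)
train, test = data[:30], data[30:]   # test data is held out from fitting


def rmse(model, rows):
    """Root mean squared error of a model over (x, y) rows."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))


def fit_mean(rows):
    """Too simple: always predicts the average training target."""
    mean_y = statistics.fmean(y for _, y in rows)
    return lambda x: mean_y


def fit_line(rows):
    """Ordinary least squares for a single input variable."""
    mean_x = statistics.fmean(x for x, _ in rows)
    mean_y = statistics.fmean(y for _, y in rows)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return lambda x: mean_y + slope * (x - mean_x)


def fit_memorizer(rows):
    """Memorizes training points: returns the target of the nearest stored x."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


for name, fit in [("mean only", fit_mean), ("line", fit_line),
                  ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(f"{name:10s} train RMSE={rmse(model, train):.2f} "
          f"test RMSE={rmse(model, test):.2f}")
```

The memorizer reproduces its training data exactly, so its training error is zero, yet it makes errors on the held-out test set because it has also memorized the noise. The mean-only model shows the opposite problem: it is too simple to capture the trend, so it does poorly on both sets. Only the test error indicates how a model is likely to behave on new data.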

Prescriptive Analytics

Question answered:

👉 “What should we do?”

The most advanced form of analytics, prescriptive analytics goes beyond prediction to recommend specific actions that optimize outcomes.

  • Goal: Identify and recommend the optimal action or decision given constraints and objectives

  • Methods: Mathematical optimization, simulation, reinforcement learning, decision trees, scenario analysis

  • Tools: Python (SciPy, PuLP, OR-Tools), specialized optimization solvers, simulation platforms

Examples in practice:

  • An airline dynamically pricing seats to maximize revenue given demand forecasts, competitor prices, and remaining inventory

  • A logistics company routing delivery vehicles to minimize fuel costs and delivery times simultaneously

  • A hospital scheduling surgeries and staff to maximize throughput while meeting patient safety standards

  • An investment algorithm allocating portfolio weights to maximize expected return for a given risk tolerance

  • Recommend discounts to at-risk customers

  • Suggest the best delivery route

  • Optimize staffing levels

  • Recommend stock levels

Benefit: Supports action and optimization

💡 Key Insight: Prescriptive analytics is often where AI meets decision automation. When a system can not only predict what will happen but also automatically take the best action in response, without human intervention, it becomes an AI-driven decision system. This is the cutting edge of enterprise analytics.
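A prescriptive problem has three parts: decisions to choose, an objective to maximize or minimize, and constraints. The sketch below decides which at-risk customers should receive a retention discount under a fixed budget. The customers, costs, expected values, and budget are hypothetical, and exhaustive search stands in for the optimization solvers named above, which would be needed at realistic scale.

```python
from itertools import combinations

# Hypothetical at-risk customers: (name, discount cost, expected revenue retained).
offers = [
    ("C1", 40, 120),
    ("C2", 30, 100),
    ("C3", 50, 130),
    ("C4", 20, 45),
]
BUDGET = 90  # assumed total discount budget


def best_plan(candidates, budget):
    """Try every subset and keep the feasible one with the highest value.
    Exhaustive search is only practical for small inputs."""
    best, best_value = (), 0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            cost = sum(c for _, c, _ in subset)
            value = sum(v for _, _, v in subset)
            if cost <= budget and value > best_value:
                best, best_value = subset, value
    return [name for name, _, _ in best], best_value


plan, value = best_plan(offers, BUDGET)
print("Offer discounts to:", plan, "expected retained revenue:", value)
```

Note that the customer with the highest individual value (C3) is not part of the best plan: under the budget constraint, three cheaper offers together retain more revenue. In practice the expected values would come from a predictive model, which is how the two types of analytics connect.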

Comparison of the Four Types

| Type | Main Question | Focus | Example |
|---|---|---|---|
| Descriptive | What happened? | Summary of past data | Sales dashboard |
| Diagnostic | Why did it happen? | Causes and reasons | Root cause of sales decline |
| Predictive | What is likely to happen? | Forecasting | Churn prediction |
| Prescriptive | What should we do? | Action recommendation | Best retention strategy |

The Analytics Maturity Model

Organizations typically progress through these four types over time as their data capabilities mature. Most businesses today are strong in descriptive analytics and actively building predictive capabilities. True prescriptive analytics at scale remains rare and represents a significant competitive advantage.

  • Level 1 — Descriptive: Basic reporting, spreadsheets, reactive decision-making

  • Level 2 — Diagnostic: Root-cause analysis, some SQL/BI tooling, more structured data teams

  • Level 3 — Predictive: ML models in production, data science teams, experimentation culture

  • Level 4 — Prescriptive: AI-driven decision systems, real-time optimization, closed-loop automation

Key idea: Organizations often use all four types together.

1.10 Key Terms & Definitions

  • Data: raw facts and figures

  • Information: processed data with meaning

  • Insight: useful understanding from analysis

  • KPI: key performance indicator

  • Dashboard: visual display of metrics

  • Forecasting: predicting future values

  • Model: mathematical or computational representation

  • Data Analytics: The process of examining, cleaning, transforming, and modeling data to discover useful information and support decision-making.

  • KPI (Key Performance Indicator): A measurable value that demonstrates how effectively an organization is achieving key business objectives.

  • Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their main characteristics, often using visual methods, before formal modeling.

  • Feature Engineering: The process of using domain knowledge to create new input variables (features) from raw data to improve model performance.

  • Concept Drift: A phenomenon where the statistical relationship between the input variables and the target variable (what the model predicts) changes over time, causing model performance to degrade.

  • Correlation: A statistical measure expressing the extent to which two variables are linearly related. Does not imply causation.

  • Causation: A relationship where one variable directly causes a change in another. Establishing causation requires controlled experiments or causal inference methods.

  • Overfitting: A modeling error where a model performs well on training data but poorly on new, unseen data because it has learned noise rather than signal.

  • ROI (Return on Investment): A performance measure used to evaluate the efficiency or profitability of an investment, often used to justify analytics initiatives.

1.11 Challenges in Adopting Data Analytics

| Challenge | Description |
|---|---|
| Data Quality | Incomplete, inconsistent, or inaccurate data leads to unreliable insights (“Garbage In, Garbage Out”) |
| Data Silos | Data trapped in departmental systems that don’t communicate |
| Talent Shortage | Difficulty finding and retaining skilled data professionals |
| Privacy & Ethics | Compliance with GDPR, CCPA, and ethical use of data |
| Organizational Culture | Resistance to change from intuition-based to evidence-based decisions |
| Infrastructure Costs | Investment in tools, cloud computing, and storage |
| Interpretability | Stakeholders may not trust “black box” models |

1.12 Summary

In this lesson, we learned:

  • Data analytics helps organizations make informed decisions

  • It plays a major role in business performance and innovation

  • The lifecycle includes problem definition through monitoring

  • The four types of analytics are:

    • Descriptive

    • Diagnostic

    • Predictive

    • Prescriptive

1.13 Quick Revision

Remember:

  • Data analytics turns raw data into useful insights

  • Good analytics starts with a clear problem

  • Data cleaning is essential

  • Descriptive and diagnostic focus on the past

  • Predictive and prescriptive focus on future decisions

1.14 Discussion / Review Questions

  1. What is data analytics?

  2. Why is data analytics important in organizations?

  3. What are the stages of the data analytics lifecycle?

  4. What is the difference between predictive and prescriptive analytics?

  5. Give one example of each type of analytics

  6. Describe three ways that data analytics creates competitive advantage for an organization. Provide a specific example for each.

  7. A company notices that ice cream sales and drowning rates are positively correlated. Does this mean ice cream causes drowning? Explain your reasoning and identify the actual cause of this correlation.

  8. A data team is asked to reduce customer churn. Walk through how you would apply each phase of the data analytics lifecycle to this problem.

  9. Classify each of the following as descriptive, diagnostic, predictive, or prescriptive analytics, and justify your answer:

    (a) A dashboard showing weekly website visitors.

    (b) A model that recommends which items to reorder and in what quantity.

    (c) An analysis identifying why last month’s marketing email had a low open rate.

    (d) A model predicting which loan applicants are likely to default.

  10. Why is data preparation often described as taking 60–80% of a data scientist’s time? What activities does this include, and what are the consequences of inadequate data preparation?

1.15 Further Reading & References

  • CRISP-DM Guide: Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a Standard Process Model for Data Mining.

  • Davenport, T.H., & Harris, J.G. (2007). Competing on Analytics. Harvard Business School Press.

  • Provost, F., & Fawcett, T. (2013). Data Science for Business. O’Reilly Media.

  • EMC Education Services. (2015). Data Science and Big Data Analytics. Wiley.

  • McKinsey Global Institute. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.

Topic Two: Data Types and Data Collection

2.1 Introduction

Before any analysis can begin, a data professional must understand the nature and structure of the data they are dealing with and where it came from. Wrong assumptions about data structure can invalidate an entire analysis. This topic covers the landscape of data types, how data is collected, the characteristics of different data sources, and the foundational concepts of data quality and governance that determine whether data can be trusted.

The type of data determines:

  • How it is stored (databases, file systems, data lakes)

  • How it is processed (SQL queries, parsing, NLP, computer vision)

  • What tools and technologies are appropriate

  • What preprocessing steps are needed before analysis

  • What analytical methods can be applied

Key Fact: According to IDC estimates, approximately 80–90% of all data generated today is unstructured or semi-structured. Yet most traditional analytics tools were designed for structured data. This gap is one of the primary drivers behind the growth of data science, ML, and AI.

2.2 Types of Data

2.2.1 Structured Data

Definition

Structured data is data that adheres to a predefined schema or data model, organized into rows and columns (tabular format), where each field has a defined data type and meaning.

It is the most traditional and well-understood form of data. It lives in relational databases and can be queried using SQL (Structured Query Language).

Characteristics

| Property | Description |
|---|---|
| Schema | Predefined and rigid (schema-on-write) |
| Format | Tabular: rows (records) and columns (fields/attributes) |
| Data Types | Each column has a strict type (integer, float, varchar, date, boolean) |
| Storage | Relational Database Management Systems (RDBMS) |
| Query Language | SQL |
| Searchability | Highly searchable and filterable |
| Scalability | Vertical scaling; limited for massive volumes |
| Machine Readability | Easily consumed by analytics tools and algorithms |

Examples of Structured Data

Example 1: Customer Table in a Relational Database

| CustomerID | Name | Email | Age | City | SignupDate | AccountType |
|---|---|---|---|---|---|---|
| 1001 | Alice Johnson | alice@email.com | 34 | New York | 2023-01-15 | Premium |
| 1002 | Bob Smith | bob@email.com | 28 | Chicago | 2023-03-22 | Free |
| 1003 | Carol Williams | carol@email.com | 45 | Boston | 2022-11-08 | Premium |
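Because every column has a fixed type and meaning, a table like this can be filtered and aggregated directly with SQL. The sketch below loads the three rows into an in-memory SQLite database using Python's built-in `sqlite3` module; the snake_case column names and the table name `customers` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE customers (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        email        TEXT NOT NULL,
        age          INTEGER,
        city         TEXT,
        signup_date  TEXT,      -- ISO 8601 date stored as text
        account_type TEXT
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1001, "Alice Johnson", "alice@email.com", 34, "New York", "2023-01-15", "Premium"),
        (1002, "Bob Smith", "bob@email.com", 28, "Chicago", "2023-03-22", "Free"),
        (1003, "Carol Williams", "carol@email.com", 45, "Boston", "2022-11-08", "Premium"),
    ],
)

# Filter on one column and sort by another.
premium = conn.execute(
    "SELECT name, city FROM customers WHERE account_type = ? ORDER BY signup_date",
    ("Premium",),
).fetchall()

# Aggregate a numeric column.
average_age = conn.execute("SELECT AVG(age) FROM customers").fetchone()[0]

print(premium)
print(average_age)
conn.close()
```

The schema is declared before any data is stored, which is what "schema-on-write" means: a row that does not fit the declared columns cannot be inserted.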

Example 2: Financial Transaction Records

| TransactionID | Date | AccountNo | Type | Amount | Currency | Status |
|---|---|---|---|---|---|---|
| TXN-50001 | 2024-01-15 | ACC-2234 | Debit | 250.00 | USD | Completed |
| TXN-50002 | 2024-01-15 | ACC-1187 | Credit | 1200.50 | USD | Pending |

Other Examples:

  • Spreadsheets (Excel, Google Sheets)

  • ERP system records (inventory, orders, invoices)

  • CRM data (customer profiles, interactions)

  • Sensor readings in fixed formats (temperature, pressure at timestamps)

  • Census data

  • Stock market tick data (OHLCV — Open, High, Low, Close, Volume)

Common Storage Systems for Structured Data

| System | Examples |
|---|---|
| RDBMS | MySQL, PostgreSQL, Oracle, Microsoft SQL Server, SQLite |
| Cloud Data Warehouses | Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse |
| Spreadsheets | Microsoft Excel, Google Sheets |
| Flat Files | CSV (Comma-Separated Values), TSV (Tab-Separated Values) |

Data Types Within Structured Data

Within structured data, individual fields/variables fall into specific statistical data types that determine what analyses are valid.

| Type | Subtype | Description | Examples | Valid Operations |
|---|---|---|---|---|
| Qualitative | Nominal | Categories with no inherent order | Gender, color, city, blood type, product category | Mode, frequency count, chi-square test |
| Qualitative | Ordinal | Categories with a meaningful order but unequal intervals | Education level (HS < BS < MS < PhD), satisfaction (1-5 stars), pain scale | Mode, median, rank correlation, non-parametric tests |
| Quantitative | Discrete | Countable values (usually integers) | Number of children, website clicks, defect count, cars owned | Mean, median, mode, standard deviation, Poisson regression |
| Quantitative | Continuous | Measurable; can take any value within a range | Temperature, height, weight, salary, time | Mean, std dev, correlation, regression, t-test, ANOVA |
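The "valid operations" column can be illustrated with one summary per data type. The survey values below are hypothetical, and the sketch uses only Python's `statistics` module.

```python
import statistics

# Hypothetical survey responses, one list per variable.
blood_type = ["O", "A", "O", "B", "O", "AB", "A"]               # nominal
satisfaction = [4, 5, 3, 4, 2, 4, 5]                            # ordinal (1-5 stars)
children = [0, 2, 1, 3, 2, 0, 2]                                # discrete
height_cm = [162.5, 175.0, 168.2, 180.1, 171.4, 158.9, 177.3]   # continuous

# Nominal: only counting makes sense, so report the most frequent category.
print("Most common blood type:", statistics.mode(blood_type))

# Ordinal: order is meaningful but gaps between levels are not, so use the median.
print("Median satisfaction:", statistics.median(satisfaction))

# Discrete and continuous: arithmetic is meaningful, so mean and std dev apply.
print("Mean children:", round(statistics.fmean(children), 2))
print("Mean height:", round(statistics.fmean(height_cm), 1))
print("Std dev height:", round(statistics.stdev(height_cm), 1))
```

Software will happily compute a mean of star ratings or of numerically coded categories; knowing the data type is what tells the analyst whether that number means anything.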

Advantages and Limitations

| Advantages | Limitations |
|---|---|
| Easy to query, filter, and aggregate (SQL) | Rigid schema that is hard to modify after creation |
| Well-understood tools and technologies | Cannot represent complex, nested, or hierarchical data well |
| Strong data integrity (constraints, keys, types) | Doesn’t handle multimedia data (images, audio, video) |
| Efficient indexing and searching | Scaling horizontally (across servers) can be challenging |
| Mature ecosystem of tools | Only represents ~10-20% of all organizational data |

2.2.2 Semi-Structured Data

Definition

Semi-structured data has some organizational properties (tags, markers, metadata, or hierarchical structure) but does not conform to a rigid tabular schema. It is self-describing — the structure is embedded within the data itself.

Semi-structured data sits between structured and unstructured data. It has some organization but is more flexible than a relational table.

Characteristics

| Property | Description |
|---|---|
| Schema | Flexible, implicit, self-describing (schema-on-read) |
| Format | Hierarchical, nested, key-value pairs, tagged |
| Data Types | Mixed and flexible; fields can vary between records |
| Storage | NoSQL databases, document stores, file systems |
| Query Language | Varies (JSONPath, XPath, MongoDB query language, etc.) |
| Searchability | Searchable with appropriate tools but less straightforward than SQL |
| Flexibility | High: new fields can be added without changing existing structure |

Example: Web Server Log (Semi-Structured)

192.168.1.105 - alice [15/Jan/2024:10:23:45 +0000] "GET /products/laptop HTTP/1.1" 200 5432 "https://google.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.168.1.110 - bob [15/Jan/2024:10:24:02 +0000] "POST /cart/add HTTP/1.1" 201 342 "https://example.com/products" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0)"

This data has a recognizable pattern (IP, user, timestamp, request, status code, referrer, user agent) but is not in a tabular database. Parsing is required to extract structured fields.
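The parsing step can be sketched with a regular expression that names each field in the pattern described above. The expression is written for the two sample lines only and is an assumption about the layout; real server logs vary (for example, the size field may be a dash), so a production parser would need to be more tolerant.

```python
import re

# Pattern for the layout shown above: IP, identity, user, timestamp,
# request line, status, response size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])   # convert numeric fields from text
    row["size"] = int(row["size"])
    return row


lines = [
    '192.168.1.105 - alice [15/Jan/2024:10:23:45 +0000] "GET /products/laptop HTTP/1.1" '
    '200 5432 "https://google.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.168.1.110 - bob [15/Jan/2024:10:24:02 +0000] "POST /cart/add HTTP/1.1" '
    '201 342 "https://example.com/products" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0)"',
]

rows = [parse_line(line) for line in lines]
for row in rows:
    print(row["ip"], row["user"], row["method"], row["path"], row["status"])
```

Each parsed line becomes a record with the same named fields, so the result can be loaded into a table and analyzed like any other structured data.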

2.2.3 Unstructured Data

Definition

Unstructured data has no predefined data model, schema, or organizational structure. It cannot be stored in traditional row-column databases without significant transformation.

This is the most abundant type of data in the world and the hardest to analyze using traditional methods. Advances in AI, NLP, and computer vision have made it increasingly possible to extract insights from unstructured data.

Characteristics

| Property | Description |
|---|---|
| Schema | None: no predefined structure |
| Format | Free-form text, binary files, media |
| Storage | File systems, object stores, data lakes, content management systems |
| Analysis | Requires AI/ML techniques (NLP, computer vision, speech recognition) |
| Volume | Comprises ~80-90% of all enterprise data |
| Searchability | Difficult without indexing, tagging, or AI-based extraction |

| Category | Examples |
|---|---|
| Text | Emails (body), social media posts, reviews, news articles, legal contracts, medical notes, chat transcripts, books, research papers |
| Images | Photographs, medical scans (X-rays, MRIs, CT scans), satellite imagery, product photos, handwritten documents, diagrams |
| Audio | Phone call recordings, podcasts, music files, voice messages, voice assistant queries |
| Video | Surveillance footage, YouTube videos, webinars, live streams, movie files |
| Other | Geospatial data, 3D models, scientific instrument output, biometric data |

How AI/ML Processes Unstructured Data

The core challenge with unstructured data is that machines cannot directly analyze raw text, images, or audio. These must be converted into numerical representations (features/vectors) first:

Data Type AI/ML Technique What It Does
Text Natural Language Processing (NLP) Tokenization, sentiment analysis, topic modeling, named entity recognition, text classification, machine translation
Text → Numbers Word Embeddings (Word2Vec, GloVe, BERT) Converts words/sentences into dense numerical vectors that capture semantic meaning
Images Computer Vision (CNNs) Object detection, image classification, facial recognition, segmentation
Images → Numbers Feature Extraction (ResNet, VGG) Converts images into numerical feature vectors using pre-trained neural networks
Audio Speech Recognition (ASR) Converts speech to text (e.g., Whisper, Google Speech API)
Audio → Numbers Spectrograms, MFCCs Converts audio waveforms into frequency-domain representations
Video Video Analysis (3D CNNs, RNNs) Action recognition, object tracking, scene understanding
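
To make the "text → numbers" idea concrete, here is a minimal sketch using bag-of-words counts. This is a deliberately simpler technique than the learned embeddings (Word2Vec, GloVe, BERT) listed in the table: it only counts words and does not capture semantic meaning. The function names and sample sentences are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(texts):
    """Return (vocabulary, vectors): one word-count vector per input text."""
    tokenized = [tokenize(t) for t in texts]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "The battery life is amazing",
    "Amazing battery and a gorgeous screen",
    "The keyboard feels cheap",
]
vocabulary, vectors = bag_of_words(docs)
sim_battery = cosine_similarity(vectors[0], vectors[1])
sim_keyboard = cosine_similarity(vectors[0], vectors[2])
print(f"doc0 vs doc1: {sim_battery:.3f}, doc0 vs doc2: {sim_keyboard:.3f}")
```

Once every document is a vector of the same length, standard numerical tools (distance measures, clustering, classifiers) can be applied to it.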

Example: Turning Unstructured Text into Structured Data

Raw unstructured data (customer review):

"I absolutely love this laptop! The battery life is amazing and the screen is gorgeous. However, the keyboard feels a bit cheap and the trackpad is not very responsive. Overall, a great purchase for the price."

After NLP processing → Structured output:

Field Extracted Value
Overall Sentiment Positive (0.72)
Battery Sentiment Very Positive (0.95)
Screen Sentiment Very Positive (0.91)
Keyboard Sentiment Negative (-0.45)
Trackpad Sentiment Negative (-0.60)
Price Sentiment Positive (0.65)
Named Entities Product Type: Laptop
Key Topics Battery, Screen, Keyboard, Trackpad, Price

This transformation from unstructured text to structured data is one of the most important applications of NLP in data science.
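
The idea can be sketched with a toy, lexicon-based approach. This is only an illustration under strong assumptions: the word lists, the aspect list, and the clause-splitting rule are hand-picked for this one review, and the output is a coarse positive/negative label rather than the numeric scores in the table above. Production systems typically rely on trained NLP models instead.

```python
import re

# Tiny hand-made lexicons (assumptions for this example only).
POSITIVE = {"love", "amazing", "gorgeous", "great", "responsive"}
NEGATIVE = {"cheap", "slow", "poor"}
NEGATIONS = {"not", "never", "no"}
ASPECTS = ["battery", "screen", "keyboard", "trackpad", "price"]


def clause_score(words):
    """Count positive minus negative words; flip the sign if the clause is negated."""
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if any(w in NEGATIONS for w in words):
        score = -score
    return score


def aspect_sentiment(review):
    """Map each aspect mentioned in the review to a coarse sentiment label."""
    # Split into clauses at sentence punctuation and at the word "and".
    clauses = re.split(r"[.!?]|\band\b", review.lower())
    totals = {}
    for clause in clauses:
        words = re.findall(r"[a-z]+", clause)
        score = clause_score(words)
        for aspect in ASPECTS:
            if aspect in words:
                totals[aspect] = totals.get(aspect, 0) + score
    return {
        aspect: "positive" if s > 0 else "negative" if s < 0 else "neutral"
        for aspect, s in totals.items()
    }


review = (
    "I absolutely love this laptop! The battery life is amazing and the "
    "screen is gorgeous. However, the keyboard feels a bit cheap and the "
    "trackpad is not very responsive. Overall, a great purchase for the price."
)
structured = aspect_sentiment(review)
print(structured)
```

The result is one row of structured data per review (aspect → label), which can then be stored in a table and aggregated across thousands of reviews.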

Data Types Within Structured Data

Within structured data, individual fields/variables fall into specific statistical data types that determine which analyses are valid:

Type Subtype Description Examples Valid Operations
Qualitative Nominal Categories with no inherent order Gender, color, city, blood type, product category Mode, frequency count, chi-square test
Qualitative Ordinal Categories with a meaningful order but unequal intervals Education level (HS < BS < MS < PhD), satisfaction (1-5 stars), pain scale Mode, median, rank correlation, non-parametric tests
Quantitative Discrete Countable values (usually integers) Number of children, website clicks, defect count, cars owned Mean, median, mode, standard deviation, Poisson regression
Quantitative Continuous Measurable; can take any value within a range Temperature, height, weight, salary, time Mean, std dev, correlation, regression, t-test, ANOVA

2.2.4 Special Data Types

Type Description Examples
Binary Only two possible values Yes/No, True/False, 0/1, Male/Female
Temporal Date, time, datetime, timestamp 2024-01-15, 14:30:00, timestamps
Geospatial Location-based data Latitude/longitude, GPS coordinates, polygons
Currency Monetary values with specific precision $1,299.99, €45.50, Kshs1500
Text/String Character sequences (can be categorical or free-text) Names, descriptions, comments
Boolean Logical true/false values is_active, has_subscription

Why This Matters for ML: Most algorithms require numerical input. Understanding data types determines how to encode categorical variables (one-hot encoding for nominal, order-preserving integer encoding for ordinal) and how to scale numerical variables (normalization, standardization).
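
A short sketch of these encoding and scaling choices with pandas. The DataFrame, its column names, and the `education_order` mapping are made-up assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Nairobi", "Chuka", "Nairobi", "Meru"],   # nominal
    "education": ["HS", "MS", "BS", "PhD"],            # ordinal
    "income": [30000.0, 52000.0, 41000.0, 75000.0],    # continuous
})

# One-hot encoding for a nominal variable: one 0/1 column per category.
city_dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Ordinal encoding: map each level to its rank so the order is preserved.
education_order = {"HS": 0, "BS": 1, "MS": 2, "PhD": 3}
df["education_code"] = df["education"].map(education_order)

# Standardization (z-score): mean 0 and standard deviation 1.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Min-max normalization: rescale values to the range [0, 1].
income_range = df["income"].max() - df["income"].min()
df["income_minmax"] = (df["income"] - df["income"].min()) / income_range

encoded = pd.concat([df, city_dummies], axis=1)
print(encoded)
```

One-hot encoding avoids implying an order between cities, while the integer codes for education deliberately keep the HS < BS < MS < PhD ordering.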

2.3 Primary vs. Secondary Data

2.3.1 Primary Data

Definition

Primary data is data collected firsthand by the researcher or organization specifically for the current research question or business problem. It is original data that did not exist before the collection effort.

Characteristics

Property Description
Originality Collected for the first time, directly from the source
Specificity Tailored to the exact research question
Control Researcher controls methodology, sampling, and variables
Recency Typically the most current/up-to-date data available
Cost Generally more expensive and time-consuming to collect
Ownership The collector owns the data

Methods of Primary Data Collection

Method Description Best For Example
Surveys / Questionnaires Structured questions distributed to a sample Gathering opinions, preferences, demographics at scale Customer satisfaction survey, NPS score
Interviews In-depth, one-on-one or group conversations Deep qualitative insights, understanding “why” User research interviews for product design
Focus Groups Moderated group discussions (6-12 participants) Exploring perceptions, attitudes, new concepts Testing reactions to a new product concept
Experiments / A/B Tests Controlled manipulation of variables to measure effect Establishing causal relationships Testing two website layouts to see which converts better
Observations Systematically watching and recording behavior Understanding behavior in natural settings Recording how customers navigate a store
Sensor / IoT Data Collection Deploying instruments to measure physical phenomena Real-time monitoring, environmental data Installing temperature sensors in a warehouse
Web Scraping (owned properties) Automated extraction of data from your own platforms Collecting user interaction data Logging clickstream data on your website
Clinical Trials Controlled medical experiments Testing drug efficacy and safety Pharmaceutical Phase III trial
Field Research Collecting samples or measurements in the field Environmental, geological, agricultural research Soil sampling for agricultural analysis

Advantages and Disadvantages of Primary Data

Advantages Disadvantages
Directly relevant to research question Expensive (survey design, distribution, collection)
Researcher controls quality and methodology Time-consuming (weeks to months)
Most current and up-to-date Requires expertise in research design
Can target specific populations Subject to response bias, sampling bias
Proprietary — competitive advantage Typically smaller sample sizes than secondary data
Can collect exactly the variables needed Ethical considerations (consent, privacy, IRB approval)

Designing Good Primary Data Collection

Key Principles for Survey/Questionnaire Design:

  1. Define clear objectives — What exactly do you want to learn?

  2. Choose the right question types:

    • Closed-ended (multiple choice, Likert scale, yes/no) — Easy to analyze quantitatively

    • Open-ended (free text) — Rich qualitative data but harder to analyze

  3. Avoid leading questions — “Don’t you agree our product is excellent?” ❌

  4. Avoid double-barreled questions — “Is our product affordable and high-quality?” ❌ (these are two separate questions)

  5. Use simple, unambiguous language

  6. Consider question order — General to specific, easy to hard

  7. Pilot test before full deployment

  8. Ensure proper sampling — Random sampling, stratified sampling, etc.

2.3.2 Secondary Data

Definition

Secondary data is data that was originally collected by someone else for a different purpose and is being reused for the current analysis. The researcher accesses and analyzes existing data rather than collecting new data.

Characteristics

Property Description
Originality Pre-existing; not collected for the current purpose
Collection No direct collection effort needed
Cost Generally much cheaper (often free)
Speed Available immediately or quickly
Scale Often much larger datasets than primary collection allows
Control No control over how data was collected, what was measured, or quality

Sources of Secondary Data

Category Examples
Government & Public Institutions Census data, Bureau of Labor Statistics, World Bank, WHO, UN Data, data.gov, Eurostat
Academic & Research Published papers, university datasets, arXiv, Google Scholar, ICPSR
Industry Reports Gartner, McKinsey, Deloitte, PwC, Forrester, Nielsen reports
Company Internal Data Historical sales records, CRM data, past surveys, financial records (collected for operational purposes, now reused for analytics)
Open Data Platforms Kaggle, UCI ML Repository, Google Dataset Search, AWS Open Data, HuggingFace Datasets
Social Media & Web Twitter/X API data, Reddit, Wikipedia, Common Crawl
Financial Data Yahoo Finance, Bloomberg, SEC filings (EDGAR), stock exchange data
Geospatial Data OpenStreetMap, NASA Earthdata, Google Earth Engine
Healthcare Data MIMIC-III (clinical data), NIH databases, CDC data
Media & News News archives, GDELT project (global events database)

Advantages and Disadvantages of Secondary Data

Advantages Disadvantages
Significantly cheaper (often free) May not perfectly fit your research question
Available quickly — no collection time No control over data quality or methodology
Often very large datasets May be outdated
Enables historical and longitudinal analysis Definitions/categories may not match your needs
Can cover broad geographies and populations Potential biases from original collection unknown
Peer-reviewed or government-validated May have restrictions on use (licensing, privacy)
Good for benchmarking and comparison May lack variables you specifically need

Evaluating Secondary Data Quality

Before using secondary data, assess it critically:

Criterion Questions to Ask
Source credibility Who collected it? Is the source reputable? Government? Academic?
Purpose Why was it originally collected? Could the purpose introduce bias?
Methodology How was it collected? What sampling method was used?
Timeliness When was it collected? Is it still relevant?
Accuracy Are there known errors or limitations? Has it been peer-reviewed?
Consistency Are definitions and units consistent across time periods?
Completeness Are there significant gaps or missing data?
Accessibility Can you access the granularity you need? Are there licensing restrictions?

2.3.3 Primary vs. Secondary Data — Comparison

Dimension Primary Data Secondary Data
Collected by Researcher/organization for current purpose Someone else for a different purpose
Relevance Highly relevant and specific May not perfectly fit
Cost High Low (often free)
Time to obtain Weeks to months Hours to days
Data quality control Full control No control
Sample size Usually smaller Often very large
Recency Most current May be outdated
Ownership You own it May have usage restrictions
Bias awareness Known (you designed the study) Unknown or undocumented
Uniqueness Proprietary — competitive advantage Available to competitors too

2.3.4 When to Use Which?

Use Primary Data When… Use Secondary Data When…
No existing data answers your question Existing data adequately addresses your question
You need very specific variables You need broad coverage or historical data
Data quality is paramount Budget and time are limited
You need proprietary insights You need a starting point for exploratory analysis
Establishing causal relationships (experiments) Benchmarking against industry or population data
Regulatory requirements demand original data Supplementing primary data with contextual data

Best Practice: Most data science projects use a combination of both. For example, a company might use its own customer transaction data (primary) enriched with census demographic data (secondary) and weather data (secondary) to build a predictive model.
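
The enrichment step described above can be sketched as a left join in pandas. Both tables, their column names, and their values are hypothetical; `indicator=True` is used to reveal rows that found no match in the secondary source.

```python
import pandas as pd

# Hypothetical company data: its own customer transactions.
transactions = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["North", "South", "North", "East"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# Hypothetical secondary data: regional demographics from a public source.
demographics = pd.DataFrame({
    "region": ["North", "South", "West"],
    "median_income": [52000, 41000, 47000],
})

# A left join keeps every transaction and adds demographics where available.
enriched = transactions.merge(
    demographics, on="region", how="left", indicator=True
)

# Rows marked "left_only" had no matching region in the secondary data.
unmatched = enriched[enriched["_merge"] == "left_only"]
print(enriched)
print(f"Transactions without demographic data: {len(unmatched)}")
```

Checking the unmatched rows is worthwhile because mismatched definitions or categories are a typical weakness of secondary data.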

2.4 Data Sources: Surveys, APIs, Databases, and Web Data

2.4.1 Surveys and Forms

What Are Surveys?

Surveys are systematic methods of gathering information from a defined population through a set of structured or semi-structured questions, typically for research, feedback, or data collection purposes.

Types of Surveys

Type Description Advantages Limitations
Online Surveys Web-based questionnaires distributed via email, social media, or embedded in websites Cheap, fast, wide reach, easy analysis Low response rates, self-selection bias, no interviewer to clarify
Telephone Surveys (CATI) Computer-Assisted Telephone Interviewing Higher response rates than online, can clarify questions Expensive, declining landline usage, time-consuming
Face-to-Face Interviews In-person structured or semi-structured interviews Highest quality responses, non-verbal cues Very expensive, interviewer bias, not scalable
Mail Surveys Paper questionnaires sent and returned by mail Reaches populations without internet Slowest method, very low response rates
Mobile Surveys Optimized for smartphones High accessibility, in-the-moment capture Screen size limits complexity
Longitudinal / Panel Surveys Same participants surveyed repeatedly over time Tracks changes and trends over time Attrition of participants

Popular Survey Tools

Tool Key Features
Google Forms Free, simple, integrates with Google Sheets
SurveyMonkey Professional features, templates, analytics
Typeform Interactive, conversational UI, good UX
Qualtrics Enterprise-grade, advanced logic, research-focused
Microsoft Forms Integrated with Microsoft 365
LimeSurvey Open-source, self-hosted option
REDCap Specialized for clinical and academic research

Sampling Methods for Surveys

Method Type Description
Simple Random Probability Every member has an equal chance of selection
Stratified Random Probability Population divided into strata; random sample from each stratum
Cluster Probability Population divided into clusters; entire clusters randomly selected
Systematic Probability Every kth member selected from a list
Convenience Non-Probability Whoever is available/easiest to reach
Snowball Non-Probability Existing participants recruit future participants
Quota Non-Probability Sample selected to match known population proportions
Purposive/Judgmental Non-Probability Researcher selects participants based on judgment

Key Consideration: Probability sampling allows statistical inference to the broader population. Non-probability sampling is easier and cheaper but results cannot be generalized with the same confidence.
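
Three of the probability methods in the table can be sketched with pandas. The sampling frame below (100 people in two strata) is a made-up assumption, and the fixed `random_state` and `start` values are there only to make the example reproducible.

```python
import pandas as pd

# Hypothetical sampling frame: 100 people, 70 urban and 30 rural.
population = pd.DataFrame({
    "person_id": range(1, 101),
    "stratum": ["urban"] * 70 + ["rural"] * 30,
})

# Simple random sample: every member has the same chance of selection.
simple_random = population.sample(n=10, random_state=42)

# Stratified random sample: draw 10% at random from each stratum.
stratified = population.groupby("stratum").sample(frac=0.10, random_state=42)

# Systematic sample: every k-th member of the list from a starting position.
k = 10
start = 3  # in practice, pick the start at random from 0..k-1
systematic = population.iloc[start::k]

print(stratified["stratum"].value_counts())
print(systematic["person_id"].tolist())
```

The stratified sample keeps the 70/30 urban-to-rural proportion by construction, whereas a simple random sample of 10 only matches it on average.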

Common Survey Biases

Bias Description Mitigation
Selection Bias Sample not representative of the population Use probability sampling
Response Bias Respondents answer inaccurately (social desirability, acquiescence) Anonymize, use neutral wording
Non-Response Bias Those who respond differ systematically from those who don’t Follow-up reminders, incentives, analyze non-respondents
Leading Question Bias Questions that suggest a desired answer Neutral question wording, pilot testing
Recall Bias Respondents don’t accurately remember past events Use shorter recall periods, provide reference points
Survivorship Bias Only surveying current customers, not those who left Include churned/former customers
Order Effects Answer influenced by position in questionnaire Randomize question order

2.4.2 APIs (Application Programming Interfaces)

What Is an API?

An API (Application Programming Interface) is a set of defined rules, protocols, and tools that allows different software applications to communicate with each other and exchange data in a structured, programmatic way.

For data scientists, APIs are a primary mechanism for programmatically accessing data from external services, platforms, and databases.

Popular APIs for Data Science

Category API Data Provided
Social Media Twitter/X API, Reddit API, Meta Graph API Posts, tweets, user data, engagement metrics
Finance Alpha Vantage, Yahoo Finance, Polygon.io, Quandl (now Nasdaq Data Link) Stock prices, financial statements, crypto data
Weather OpenWeatherMap, WeatherAPI, NOAA Current weather, forecasts, historical weather
Maps & Location Google Maps API, OpenStreetMap, Mapbox Geocoding, directions, places, traffic
NLP & AI OpenAI API (GPT), Google Cloud NLP, HuggingFace Text generation, sentiment analysis, translation
Government Census API, data.gov, World Bank API Demographics, economic indicators, health data
E-commerce Amazon Product API, Shopify API, eBay API Product data, pricing, reviews
News NewsAPI, GDELT, NYTimes API News articles, headlines, events
Music Spotify API, Last.fm API Song data, playlists, listening history
Sports ESPN API, SportRadar, NBA Stats API Scores, player statistics, game data

API Authentication Methods

Method Description Security Level
API Key Simple key passed as a query parameter or header Basic
OAuth 2.0 Token-based authorization; user grants permission High
Bearer Token Token included in the HTTP Authorization header Medium-High
Basic Auth Username and password encoded in base64 Low
JWT (JSON Web Token) Self-contained token with encoded user info High
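
As a rough sketch, here is how three of these methods look as HTTP request headers. The helper names, the `X-API-Key` header name, and the `EXAMPLE_API_KEY` environment variable are assumptions; each real API documents its own header names and token formats.

```python
import base64
import os


def api_key_header(key, header_name="X-API-Key"):
    """API key sent in a custom header (the header name varies by provider)."""
    return {header_name: key}


def bearer_header(token):
    """Bearer token in the standard Authorization header."""
    return {"Authorization": f"Bearer {token}"}


def basic_auth_header(username, password):
    """Basic Auth: 'username:password' encoded (not encrypted) in base64."""
    credentials = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Read secrets from the environment instead of hardcoding them.
api_key = os.environ.get("EXAMPLE_API_KEY", "")

basic = basic_auth_header("Aladdin", "open sesame")
print(basic)
print(bearer_header("example-token"))
```

Base64 is trivially reversible, which is why Basic Auth is rated "Low" in the table and should only be sent over an encrypted (HTTPS) connection.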

API Best Practices for Data Collection

Practice Description
Respect Rate Limits Most APIs limit requests per minute/hour; implement backoff strategies
Cache Responses Store API responses locally to avoid redundant calls
Handle Errors Gracefully Implement try/except blocks, retry logic, and logging
Paginate Large Requests Many APIs return data in pages; loop through all pages
Secure Your Keys Never hardcode API keys; use environment variables or secret managers
Read Documentation Always read the API docs thoroughly before coding
Monitor Usage Track API consumption to avoid exceeding quotas or incurring costs
Version Awareness APIs can change; pin to specific versions when possible

2.4.3 Databases and Data Stores

What Is a Database?

A database is an organized collection of data stored and accessed electronically, managed by a Database Management System (DBMS) that provides mechanisms for storing, retrieving, updating, and managing data.

Relational Databases (RDBMS)

The backbone of structured data storage for decades. Data is organized into tables with relationships between them.

2.4.4 Core Concepts

Concept Description
Table (Relation) A collection of related data organized in rows and columns
Row (Record/Tuple) A single data entry in a table
Column (Field/Attribute) A specific property/variable in a table
Primary Key (PK) A unique identifier for each row in a table
Foreign Key (FK) A column that references the primary key of another table (creates relationships)
Index A data structure that speeds up data retrieval
Schema The blueprint/structure of the database (tables, columns, types, constraints)
View A virtual table based on the result of a SQL query
Stored Procedure Pre-compiled SQL code stored in the database
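
Several of these concepts can be seen together in a small sketch using Python's built-in `sqlite3` module. The table names, columns, and rows are made up for illustration; stored procedures are omitted because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    amount      REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);  -- index
CREATE VIEW customer_totals AS                            -- view
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id;
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0)])

totals = conn.execute(
    "SELECT name, total_spent FROM customer_totals ORDER BY name"
).fetchall()
print(totals)

# An order for a customer that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("Foreign key violation rejected:", fk_rejected)
```

The foreign key is what lets the database itself refuse an order that points to a missing customer, rather than leaving that check to application code.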

Popular Relational Databases

Database Type Best For
PostgreSQL Open source General purpose, advanced features, geospatial
MySQL Open source Web applications, WordPress, scalable reads
SQLite Embedded Local applications, prototyping, mobile apps
Microsoft SQL Server Commercial Enterprise Windows environments
Oracle Database Commercial Large enterprise, financial services
MariaDB Open source (MySQL fork) Drop-in MySQL replacement

2.4.5 Summary: Choosing the Right Data Source

Factor Consideration
Research Question What data do you actually need to answer your question?
Availability Does the data exist? Is it accessible?
Quality How reliable, complete, and accurate is the data?
Cost What is the budget for data acquisition?
Time How quickly do you need the data?
Legal/Ethical Are there privacy, licensing, or ethical constraints?
Format Can you work with the data format, or is significant transformation needed?
Scale Is the data volume appropriate for your analysis needs?
Freshness How current does the data need to be?

2.5 Data Quality and Data Governance Concepts

Why Data Quality Matters

“Garbage In, Garbage Out” (GIGO): The most sophisticated algorithm in the world will produce meaningless results if fed poor-quality data.

Data quality is not just a technical concern — it has real business impact:

Impact Area Consequence of Poor Data Quality
Decision-Making Wrong conclusions lead to wrong decisions
Financial Gartner estimates poor data quality costs organizations an average of $12.9 million per year
Customer Experience Wrong addresses, duplicate communications, personalization failures
Regulatory Compliance violations (fines under GDPR can reach €20M or 4% of global annual turnover, whichever is higher)
Model Performance ML models trained on dirty data produce unreliable predictions
Operational Failed processes, reconciliation delays, manual workarounds
Trust Stakeholders lose confidence in analytics and reporting
Opportunity Cost Data scientists are often reported to spend 60-80% of their time cleaning data instead of analyzing it

2.5.1 Dimensions of Data Quality

Data quality is multidimensional. A dataset may score well on one dimension but poorly on another. The most widely recognized dimensions are:

The Six Core Dimensions (DAMA Framework)

# Dimension Definition Example of Poor Quality How to Measure
1 Accuracy Data correctly represents the real-world entity or event it models Customer age recorded as 250; address doesn’t match actual location % of records matching a verified source; error rate
2 Completeness All required data is present; no critical values are missing 30% of customer records have no email address; missing ZIP codes % of non-null values; % of records with all required fields
3 Consistency Data does not contradict itself across systems or within a dataset Customer listed as “Active” in CRM but “Cancelled” in billing system; “NY” vs “New York” Cross-system reconciliation; # of conflicting records
4 Timeliness Data is up-to-date and available when needed Using 2019 market data for 2024 decisions; dashboard refreshed weekly instead of hourly Data age; refresh frequency; latency
5 Validity Data conforms to defined formats, ranges, and business rules Email without “@” symbol; age = -5; date format “13/25/2024” % passing validation rules; # of constraint violations
6 Uniqueness Each entity is represented only once (no duplicates) Same customer appears 3 times with slightly different names Duplicate rate; # of records after deduplication

Additional Quality Dimensions

Dimension Definition Example
Integrity Relationships between data elements are maintained (referential integrity) An order references a customer_id that doesn’t exist in the customer table
Relevance Data is applicable and useful for the intended purpose Collecting shoe size data for a financial fraud model
Precision Level of detail/granularity is appropriate Recording revenue as “about $1M” vs “$1,023,456.78”
Conformity Data follows standard formats and naming conventions Dates stored as “Jan 15, 2024”, “2024-01-15”, “15/01/2024” inconsistently
Auditability Data lineage and changes can be traced No log of who changed a record or when
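
The integrity example in the table (an order pointing to a non-existent customer) can be checked with a few lines of pandas. The two small tables are invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "customer_id": [1, 2, 2, 9],  # customer 9 is not in the customer table
})

# Orphan records: orders whose customer_id has no match in customers.
is_orphan = ~orders["customer_id"].isin(customers["customer_id"])
orphans = orders[is_orphan]
orphan_rate = is_orphan.mean()

print(orphans)
print(f"Orphan rate: {orphan_rate:.0%}")
```

The orphan rate gives a single number that can be tracked over time as an integrity metric.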

Common Data Quality Issues

Issue Description Example Impact
Missing Values Null, blank, or absent data points Empty phone number field; NULL income Biased analysis; model errors
Duplicates Same entity recorded multiple times “John Smith” and “Jon Smith” at same address Inflated counts; wasted marketing spend
Inconsistent Formats Same information represented differently “USA”, “United States”, “US”, “U.S.A.” Grouping/aggregation errors
Outliers Extreme values that may or may not be valid Salary of $10,000,000 for a junior analyst Skewed statistics; model distortion
Stale Data Data that is no longer current Customer address from 5 years ago Failed deliveries; wrong analysis
Incorrect Data Factually wrong values Birth year 2095; negative quantities Wrong conclusions; compliance risk
Encoding Issues Character set or encoding problems “CafÃ©” instead of “Café”; garbled text Data loss; parsing failures
Schema Changes Data structure changes without documentation New column added; column renamed Pipeline failures; broken queries
Unit Mismatches Different measurement units mixed Temperature in Celsius and Fahrenheit in same column Mathematical errors
Selection Bias Data doesn’t represent the target population Only surveying English-speaking users Biased models; unfair outcomes
Label Errors Incorrect labels in supervised learning data Image of a cat labeled as “dog” Poor model training

Data Quality Assessment and Profiling

Data profiling is the process of examining data to understand its structure, content, quality, and relationships. It is the first step in any data quality improvement effort.

Data Profiling Techniques

Technique What It Reveals
Column Analysis Data type, % null, distinct values, min/max, mean/median, distribution
Pattern Analysis Common formats, regex patterns, unexpected characters
Frequency Analysis Most/least common values, distribution of categories
Cross-Column Analysis Correlations, dependencies, functional relationships
Cross-Table Analysis Referential integrity, join quality, orphan records
Temporal Analysis Trends over time, gaps in time series, seasonality
Rule-Based Validation Checking against predefined business rules

2.5.2 Data Profiling Example in Python

import numpy as np
import pandas as pd

# Load dataset
df = pd.read_csv('customer_data.csv')

# === BASIC PROFILING ===

# Shape and types
print(f"Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")

# Missing values analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_report = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct.round(2),
}).sort_values('Missing %', ascending=False)
print(f"\nMissing Values:\n{missing_report}")

# Summary statistics
print(f"\nNumerical Summary:\n{df.describe()}")
print(f"\nCategorical Summary:\n{df.describe(include='object')}")

# Duplicate detection
duplicates = df.duplicated().sum()
print(f"\nDuplicate Rows: {duplicates} ({duplicates / len(df) * 100:.2f}%)")

# Unique values per column
for col in df.columns:
    n_unique = df[col].nunique()
    print(f"{col}: {n_unique} unique values ({n_unique / len(df) * 100:.1f}%)")

# Value distribution for categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col} - Top 10 Values:")
    print(df[col].value_counts().head(10))

# Outlier detection using IQR
for col in df.select_dtypes(include=[np.number]).columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = ((df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)).sum()
    print(f"{col}: {outliers} outliers ({outliers / len(df) * 100:.2f}%)")


# === AUTOMATED PROFILING TOOLS ===
# Assumes `df` is the customer DataFrame loaded in the profiling example.

# Using ydata-profiling (formerly pandas-profiling)
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Customer Data Quality Report")
profile.to_file("data_quality_report.html")

# Using Great Expectations (rule-based validation)
# Note: this uses the `context.sources` fluent API from pre-1.0 releases of
# Great Expectations. The API changed in 1.0, so check the documentation
# for your installed version.
import great_expectations as gx

context = gx.get_context()

# Define expectations
validator = context.sources.pandas_default.read_dataframe(df)
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
validator.expect_column_values_to_be_in_set("status", ["Active", "Inactive", "Suspended"])
validator.expect_column_values_to_be_unique("customer_id")

2.5.3 Data Cleaning Strategies

Issue Strategy Python Example
Missing Values Drop, impute (mean/median/mode), forward/backward fill, predictive imputation df['age'] = df['age'].fillna(df['age'].median())
Duplicates Identify and remove exact and fuzzy duplicates df = df.drop_duplicates(subset=['email'], keep='first')
Inconsistent Categories Standardize and map values df['country'] = df['country'].replace({'US':'United States', 'USA':'United States'})
Outliers Remove, cap/floor (winsorize), or transform df['income'] = df['income'].clip(lower=0, upper=df['income'].quantile(0.99))
Wrong Data Types Cast to correct types df['date'] = pd.to_datetime(df['date'])
Whitespace/Formatting Strip and normalize df['name'] = df['name'].str.strip().str.title()
Invalid Values Validate against rules; replace or flag df.loc[df['age'] < 0, 'age'] = np.nan
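
The individual strategies above can be chained into one small cleaning pass. This is a sketch on a made-up DataFrame; the column names and values are assumptions, and the right strategy for each issue depends on the dataset.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "name": ["  alice smith ", "BOB JONES", "bob jones", "carol wu", "dan lee"],
    "email": ["alice@example.com", "bob@example.com", "bob@example.com",
              "carol@example.com", "dan@example.com"],
    "country": ["US", "USA", "USA", "United States", "US"],
    "age": [34.0, -5.0, -5.0, np.nan, 40.0],
    "signup_date": ["2024-01-15", "2024-02-01", "2024-02-01",
                    "2024-03-10", "2024-03-12"],
})

clean = raw.copy()

# Whitespace/formatting: strip spaces and normalize capitalization.
clean["name"] = clean["name"].str.strip().str.title()

# Inconsistent categories: map variants to one standard value.
clean["country"] = clean["country"].replace(
    {"US": "United States", "USA": "United States"}
)

# Invalid values: flag impossible ages as missing BEFORE imputing,
# so they do not distort the median.
clean.loc[clean["age"] < 0, "age"] = np.nan

# Missing values: impute with the median of the valid ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Wrong data types: parse date strings into datetimes.
clean["signup_date"] = pd.to_datetime(clean["signup_date"])

# Duplicates: keep the first record per email address.
clean = clean.drop_duplicates(subset=["email"], keep="first").reset_index(drop=True)

print(clean)
```

The order of the steps matters: invalid values are converted to missing before imputation, and names are standardized before deduplication so that near-identical records are easier to spot.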

2.5.4 Data Governance

Definition

Data Governance is the overall management of the availability, usability, integrity, quality, and security of data used in an organization. It establishes the policies, processes, standards, roles, and metrics that ensure effective and efficient use of data.

Data governance is not just a technology problem — it is an organizational discipline that encompasses people, processes, and technology.

2.5.5 Data Governance Policies

Policy Area Description Examples
Data Classification Categorizing data by sensitivity level Public, Internal, Confidential, Restricted/Secret
Data Access Control Who can access what data and under what conditions Role-based access (RBAC), need-to-know basis
Data Retention How long data is kept and when it is archived/deleted Financial records retained for 7 years; logs for 90 days
Data Privacy How personal data is collected, used, stored, and shared Consent management, anonymization, right to deletion
Data Quality Standards Minimum quality thresholds for data to be used Completeness > 95%, accuracy verified quarterly
Data Sharing Rules for sharing data internally and externally Data sharing agreements, anonymization requirements
Acceptable Use How data may and may not be used No using health data for marketing without explicit consent
Master Data Management Standards for maintaining master/reference data Single source of truth for customer, product data

2.5.6 Data Privacy and Regulatory Compliance

Major Data Privacy Regulations

Regulation Region Key Requirements
GDPR (General Data Protection Regulation) European Union (2018) Lawful basis for processing (e.g., consent), right to access/delete/port data, data protection by design, breach notification within 72 hours, DPO appointment where required
CCPA / CPRA (California Consumer Privacy Act / California Privacy Rights Act) California, USA (2020/2023) Right to know, delete, opt-out of data sale, non-discrimination
HIPAA (Health Insurance Portability and Accountability Act) USA Protects health information (PHI); strict security and access controls

GDPR Key Principles

Principle Description
Lawfulness, Fairness, Transparency Data must be processed legally, fairly, and transparently
Purpose Limitation Data collected for specific, explicit purposes only
Data Minimization Only collect data that is necessary for the stated purpose
Accuracy Data must be accurate and kept up to date
Storage Limitation Data should not be kept longer than necessary
Integrity & Confidentiality Data must be protected against unauthorized access, loss, or damage
Accountability Organizations must demonstrate compliance

Data Security Fundamentals

Security Measure Description
Encryption Encrypting data at rest (stored) and in transit (transmitted) — AES-256, TLS/SSL
Access Control Role-Based Access Control (RBAC); principle of least privilege
Authentication Verifying identity — passwords, MFA (multi-factor authentication), SSO
Authorization Determining what authenticated users are allowed to do
Anonymization Removing personally identifiable information (PII) irreversibly
Pseudonymization Replacing identifiers with pseudonyms (reversible with a key)
Data Masking Hiding sensitive data (e.g., showing only last 4 digits of SSN: XXX-XX-1234)
Audit Logging Recording who accessed what data, when, and what they did
Network Security Firewalls, VPNs, intrusion detection systems
Backup & Recovery Regular backups with tested recovery procedures
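
A minimal sketch of two measures from the table: data masking and pseudonymization. The function and class names are hypothetical. Here the "key" for reversing pseudonyms is an in-memory mapping table; a real system would store that mapping securely and separately from the pseudonymized data.

```python
import secrets


def mask_ssn(ssn):
    """Data masking: show only the last four digits, e.g. XXX-XX-1234."""
    digits = [ch for ch in ssn if ch.isdigit()]
    return "XXX-XX-" + "".join(digits[-4:])


class Pseudonymizer:
    """Replace identifiers with random tokens; the mapping acts as the key."""

    def __init__(self):
        self._forward = {}  # identifier -> token
        self._reverse = {}  # token -> identifier

    def pseudonymize(self, identifier):
        """Return a stable random token for this identifier."""
        if identifier not in self._forward:
            token = "user_" + secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, token):
        """Reverse the pseudonym; only possible with access to the mapping."""
        return self._reverse[token]


masked = mask_ssn("123-45-1234")
pseudonymizer = Pseudonymizer()
token = pseudonymizer.pseudonymize("alice@example.com")
print(masked, token)
```

Because the same identifier always maps to the same token, records can still be joined and counted per person without exposing who that person is. Discarding the mapping removes the direct way back to the identifier, though true anonymization also requires that the remaining fields cannot be combined to re-identify individuals.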

2.5.7 Data Ethics

Beyond legal compliance, data practitioners must consider ethical responsibilities:

2.5.8 Key Ethical Principles

Principle Description
Transparency Be clear about what data you collect, why, and how it will be used
Fairness Ensure analyses and models do not discriminate against protected groups
Privacy Respect individuals’ right to control their personal information
Consent Obtain informed consent before collecting personal data
Accountability Take responsibility for the outcomes of data-driven decisions
Beneficence Ensure data use creates benefit and minimizes harm
Data Minimization Only collect and retain data that is necessary
Human Oversight Maintain human review for high-stakes automated decisions

2.6 Summary

  1. Data comes in three structural forms:

    • Structured (~10-20%): Tables with rows and columns, stored in RDBMS, queried with SQL

    • Semi-Structured (~5-10%): JSON, XML, logs — some organization but flexible schema

    • Unstructured (~80-90%): Text, images, audio, video — requires AI/ML to analyze

  2. Primary data is collected firsthand for your specific purpose (high relevance, high cost). Secondary data is pre-existing data collected by others (lower cost, may not perfectly fit).

  3. Data sources are diverse — surveys, APIs, databases, web scraping, IoT sensors, public datasets. The choice depends on the research question, cost, quality, and legal/ethical constraints.

  4. Data quality has multiple dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Poor data quality has significant financial and operational consequences.

  5. Data governance is the organizational framework (people, policies, processes, technology) that ensures data is managed as a strategic asset — covering quality, security, privacy, compliance, and ethics.

  6. Privacy regulations (GDPR, CCPA, HIPAA, etc.) impose strict requirements on how personal data is collected, processed, stored, and shared. Non-compliance carries severe penalties.

  7. Ethics in data science goes beyond legal compliance — practitioners must consider fairness, transparency, consent, bias, and the potential for harm in all data-related activities.

  8. Data catalogs and lineage help organizations discover, understand, trust, and trace their data assets.

2.7 Review Questions

  1. A hospital stores patient records in a relational database (name, DOB, diagnosis codes), medical images (X-rays, MRIs) in a file system, and doctor’s notes as free-text documents. Classify each data type as structured, semi-structured, or unstructured. What different tools/techniques would be needed to analyze each?

  2. A startup wants to build a model predicting restaurant success in a new city. Propose a data collection strategy that uses both primary and secondary data. What specific sources would you recommend?

  3. You receive a customer dataset with 2 million records. Describe the step-by-step data profiling process you would follow to assess its quality. What specific checks would you perform?

  4. A company’s marketing team wants to purchase third-party data about consumer spending habits to enrich their customer profiles. What data governance and ethical considerations should be evaluated before proceeding?

  5. Design a simple data governance framework for a mid-sized e-commerce company. What roles, policies, and tools would you recommend?

  6. Explain how data lineage could help a data analyst debug a dashboard that is showing incorrect revenue figures.

Exercise 1: Data Profiling

Download the Titanic dataset from Kaggle and perform a complete data quality assessment using Python. Report on missing values, data types, duplicates, outliers, and inconsistencies.

Exercise 2: API Data Collection

Write a Python script that collects weather data for 5 cities using the OpenWeatherMap API, stores the results in a pandas DataFrame, and exports to CSV.

Exercise 3: Data Cleaning Pipeline

Given a messy dataset with missing values, duplicates, inconsistent formatting, and outliers, write a Python data cleaning pipeline that addresses all issues and produces an analysis-ready dataset.

2.8 Further Reading & References

  • DAMA International. (2017). DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition). Technics Publications.

  • Redman, T.C. (2008). Data Driven: Profiting from Your Most Important Business Asset. Harvard Business Press.

  • Reis, J. & Housley, M. (2022). Fundamentals of Data Engineering. O’Reilly Media.

  • Ladley, J. (2019). Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program (2nd Edition). Academic Press.

  • European Commission. GDPR Official Text: https://gdpr.eu/

  • Great Expectations Documentation: https://greatexpectations.io/

Topic Three: Data Cleaning and Preprocessing

3.1 Handling Missing Data

Missing values can arise from data entry errors, system issues, or unavailable information.

Common approaches

  • Remove rows/columns

    • Use when only a small share of values is missing or the feature is not important.

  • Imputation

    • Numerical data: mean, median, interpolation

    • Categorical data: mode, “Unknown” category

    • Advanced methods: KNN imputation, model-based imputation

  • Flag missingness

    • Add a binary feature indicating whether a value was missing.
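
The approaches above can be sketched with pandas as follows. The small DataFrame and its column names are made up for illustration, and the choice of median, interpolation, and an "Unknown" category is one reasonable combination, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing values in every column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["Nairobi", "Chuka", None, "Chuka", "Meru"],
    "income": [50.0, 60.0, np.nan, 80.0, 90.0],
})
raw = df.copy()  # keep the raw data unchanged

# Option 1: remove rows that contain any missing value
complete_rows = raw.dropna()

# Option 2: flag missingness BEFORE imputing, so the information is not lost
df["age_missing"] = df["age"].isna().astype(int)

# Numerical data: impute with the median (robust to extreme values)
df["age"] = df["age"].fillna(df["age"].median())

# Numerical data with a meaningful row order: linear interpolation
df["income"] = df["income"].interpolate()

# Categorical data: add an explicit "Unknown" category
df["city"] = df["city"].fillna("Unknown")

print(complete_rows)
print(df)
```

Dropping rows leaves only two complete records here, which shows how quickly deletion can shrink a small dataset.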

Key consideration

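Understand why the data is missing, because the mechanism determines which approach is safe:

  • MCAR (Missing Completely at Random): missingness is unrelated to any variable, observed or not. Removing rows loses data but does not bias results.

  • MAR (Missing at Random): missingness depends only on observed variables. Imputation that uses those variables can correct for it.

  • MNAR (Missing Not at Random): missingness depends on the unobserved value itself. Simple deletion or imputation can bias results.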

3.2 Outlier Detection and Treatment

Outliers are data points that differ significantly from other observations. They can be genuine extreme cases or simple errors.

Detection methods

  • Statistical

    • Z-Score: flag points that fall more than 3 standard deviations from the mean

    • IQR Method: flag points that fall below \(Q1 - 1.5 \times IQR\) or above \(Q3 + 1.5 \times IQR\), where \(IQR = Q3 - Q1\)

  • Visualization

    • Boxplots

    • Scatterplots

    • Histograms

  • Model-based

    • Isolation Forest

    • DBSCAN

    • Local Outlier Factor
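
The two statistical methods can be sketched with pandas as shown here. The data is a made-up series of 20 typical values plus one extreme value; the thresholds (3 standard deviations and 1.5 × IQR) are the conventional defaults and may need adjusting for a given dataset.

```python
import pandas as pd

# Hypothetical data: 20 typical values plus one extreme value
values = pd.Series([48, 49, 50, 51, 52] * 4 + [200], name="amount")

# Z-score method: distance from the mean in standard deviations
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR method: fences at 1.5 * IQR below Q1 and above Q3
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("Z-score outliers:", z_outliers.tolist())
print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
```

Note that the z-score method relies on the mean and standard deviation, which are themselves pulled by extreme values, so in very small samples it can fail to flag an obvious outlier. The IQR method is less affected.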

Treatment options

  • Remove if clearly erroneous

  • Cap/winsorize extreme values

  • Transform data

  • Keep them if they are valid and meaningful
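
As a rough sketch of two of these options, the snippet below caps values at the IQR fences and, alternatively, removes them. The data is the same kind of made-up series used for illustration, and capping at the IQR fences is an assumption; percentile-based caps (for example the 1st and 99th percentiles) are also common.

```python
import pandas as pd

# Hypothetical data: 20 typical values plus one extreme value
values = pd.Series([48, 49, 50, 51, 52] * 4 + [200], name="amount")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: cap (winsorize) values at the fences; the row is kept
capped = values.clip(lower=lower, upper=upper)

# Option 2: remove rows outside the fences (only if clearly erroneous)
removed = values[values.between(lower, upper)]

print("Max before:", values.max(), "| max after capping:", capped.max())
print("Rows before:", len(values), "| rows after removal:", len(removed))
```

Capping keeps the sample size intact, which is often preferable when the extreme value is plausible but would otherwise dominate the analysis.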

3.3 Data Transformation and Normalization

Transformation and normalization make data suitable for analysis or machine learning.

Transformation techniques

  • Log Transformation: compresses large values to reduce right skew, often bringing the distribution closer to normal. It requires positive values; use log(1 + x) when zeros are present.

  • Square root / Box-Cox / Yeo-Johnson: stabilize variance

  • Binning: convert continuous values into intervals
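
A short sketch of these transformations with NumPy and pandas follows. The income values and the bin edges are invented for illustration; in practice the bin edges should come from domain knowledge or from quantiles of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable
income = pd.Series([10, 20, 30, 50, 1000], name="income")

# Log transformation: log(1 + x) also handles zeros safely
log_income = np.log1p(income)

# Square root transformation: a milder correction than the log
sqrt_income = np.sqrt(income)

# Binning: convert continuous values into labelled intervals
# Intervals are (0, 25], (25, 100], (100, inf)
income_band = pd.cut(
    income,
    bins=[0, 25, 100, np.inf],
    labels=["low", "medium", "high"],
)

print("Skewness before:", round(income.skew(), 2))
print("Skewness after log:", round(log_income.skew(), 2))
print(income_band.tolist())
```

The skewness drops after the log transformation but does not reach zero, which is typical: the transformation reduces skew rather than guaranteeing normality.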

Scaling / Normalization

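Many machine learning algorithms, especially distance-based ones (like SVM or K-Means), are sensitive to the scale of the data. If one feature ranges from 0–1 and another from 0–10,000, the feature with the larger scale will dominate distance calculations and therefore the model. Tree-based models are largely unaffected by scale.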

  • Min-Max Scaling: rescales to [0,1]

  • Standardization: mean = 0, std = 1

  • Robust Scaling: uses median and IQR; useful with outliers
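
The three scalers can be compared side by side with scikit-learn. This is a minimal sketch on a made-up array whose two columns have very different ranges; in a real project the scaler should be fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical features on very different scales
X = np.array([
    [0.1, 1000.0],
    [0.5, 5000.0],
    [0.9, 10000.0],
    [0.3, 2000.0],
])

# Min-Max scaling: each column rescaled to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column has mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Robust scaling: subtract the median, divide by the IQR
X_robust = RobustScaler().fit_transform(X)

print("Min-Max range:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("Standardized mean:", X_std.mean(axis=0).round(6))
print("Robust-scaled median:", np.median(X_robust, axis=0).round(6))
```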

3.4 Feature Creation and Encoding

  • Encoding: computers process numbers, not text, so categorical labels must be converted into numerical formats.

  • Feature creation: improves model performance by making raw data more informative.

Feature creation

  • Date-based features: year, month, day, weekday

  • Aggregations: totals, averages, counts

  • Interaction terms: multiply or combine variables

  • Domain-specific engineered variables
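
The first three kinds of feature can be sketched with pandas as follows. The orders table, its column names, and the derived feature names are all hypothetical.

```python
import pandas as pd

# Hypothetical order-level data
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2026-01-05", "2026-02-14", "2026-01-20", "2026-03-01", "2026-03-15"]
    ),
    "quantity": [2, 1, 5, 3, 1],
    "unit_price": [10.0, 25.0, 4.0, 10.0, 30.0],
})

# Date-based features
orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()

# Interaction term: combine two variables into a more informative one
orders["order_value"] = orders["quantity"] * orders["unit_price"]

# Aggregations: totals, averages, and counts per customer
customer_features = (
    orders.groupby("customer_id")
    .agg(
        total_spent=("order_value", "sum"),
        avg_order_value=("order_value", "mean"),
        order_count=("order_value", "count"),
    )
    .reset_index()
)

print(orders)
print(customer_features)
```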

Encoding categorical variables

  • Label Encoding: assign a unique integer to each category (e.g., Red=1, Blue=2). Because integers imply an order, it is best suited to ordinal data where order matters (e.g., Small, Medium, Large); for unordered categories, prefer one-hot encoding.

  • One-Hot Encoding: create binary columns

  • Ordinal Encoding: for ordered categories

  • Target / Frequency Encoding: useful for high-cardinality categories
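
Three of these encodings can be sketched with pandas alone. The DataFrame and the explicit size ordering are made up for illustration; scikit-learn offers equivalent encoders that fit more naturally into a pipeline.

```python
import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],
    "color": ["Red", "Blue", "Red", "Green"],
    "city": ["Chuka", "Nairobi", "Chuka", "Chuka"],
})

# Ordinal encoding: the order is stated explicitly, not left to chance
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one binary column per category (no implied order)
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Frequency encoding: replace each category with its relative frequency
city_share = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(city_share)

encoded = pd.concat([df, color_dummies], axis=1)
print(encoded)
```

Target encoding is omitted here because it uses the outcome variable and needs care (for example, fitting on training folds only) to avoid leaking information into the model.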

3.5 Reproducible Data Workflows

Reproducible workflows ensure that preprocessing is consistent, traceable, and reusable.

Best practices

  • Keep raw data unchanged

  • Automate preprocessing with scripts/pipelines

  • Document assumptions and steps

  • Use version control (e.g., Git)

  • Set random seeds for reproducibility

  • Use notebooks carefully; move final logic into reusable code

  • Track data and model versions

Tools often used

  • Python: pandas, scikit-learn, numpy

  • R: tidyverse, caret, tidymodels

  • Workflow tools (scikit-learn): Pipeline, ColumnTransformer

  • Experiment/data tracking: MLflow, DVC
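
To show how these tools fit together, here is a minimal sketch of a preprocessing workflow built with scikit-learn's Pipeline and ColumnTransformer. The dataset, column names, and step names are hypothetical, and the chosen strategies (median imputation, standardization, most-frequent imputation, one-hot encoding) are one plausible configuration rather than a recommended default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data; it is never modified in place
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [50.0, 60.0, np.nan, 80.0],
    "city": ["Nairobi", "Chuka", np.nan, "Chuka"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: impute with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with the mode, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each set of steps to its own columns
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X_ready = preprocess.fit_transform(df)
print(X_ready.shape)
```

Because every step lives in one object, the same `preprocess` can be fitted on training data and reused unchanged on new data via `preprocess.transform(...)`, which is what makes the workflow repeatable.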

3.6 Summary

Data cleaning and preprocessing typically include:

  1. Fixing or imputing missing values

  2. Detecting and handling outliers

  3. Transforming and scaling data

  4. Creating useful features and encoding categories

  5. Building reproducible workflows for consistency