Topic One: Introduction to Data Analytics
Learning Objectives
By the end of this topic, students should be able to:
Define data analytics
Explain the role of data analytics in organizations
Describe the data analytics lifecycle
Differentiate between descriptive, diagnostic, predictive, and prescriptive analytics
1.1 What is Data Analytics?
Data analytics is the process of collecting, cleaning, transforming, and analyzing data to generate insights and support decision-making. It bridges the gap between raw data and actionable insight. Data analytics is one of the most transformative disciplines in modern business and science.
Main purpose:
Understand past events
Explain causes
Predict future outcomes
Recommend actions
Why organizations use it:
Helps in decision-making (data > guesswork)
Improves efficiency (find and fix problems)
Understands customers better
Detects risks & fraud
Supports planning for the future
Tracks performance (KPIs)
📝 In short:
Data analytics turns raw data → useful insights → better actions
1.2 Data Analytics in the Context of DS, ML, and AI
Relationship:
Data Analytics → extracts insights from data
Data Science → broader field combining statistics, programming, and domain knowledge
Machine Learning → builds models that learn from data
Artificial Intelligence → enables intelligent systems and automated decisions
Key idea:
Data analytics provides the foundation for ML and AI by preparing and interpreting data.
1.3 Why Data Analytics Matters in Organizations
Organizations today generate and have access to massive volumes of data. The ability to harness this data for actionable insight provides a strategic competitive advantage.
Key Reasons Organizations Invest in Data Analytics
| # | Reason | Description |
|---|---|---|
| 1 | Informed Decision-Making | Replace gut-feeling decisions with evidence-based strategies |
| 2 | Cost Reduction | Identify inefficiencies, optimize supply chains, reduce waste |
| 3 | Revenue Growth | Discover new market opportunities, optimize pricing, increase upselling |
| 4 | Risk Management | Detect fraud, predict failures, assess credit risk |
| 5 | Customer Understanding | Segment customers, personalize experiences, predict churn |
| 6 | Operational Efficiency | Streamline processes, enable predictive maintenance, optimize the workforce |
| 7 | Innovation | Identify trends, develop new products, explore new business models |
Organizations use data analytics to:
Make better decisions
Improve efficiency
Understand customers
Measure performance
Gain competitive advantage
Example:
A retail company analyzes sales data to decide which products to restock.
1.4 Data Analytics Across Industries
Data analytics is not limited to technology companies. Its applications span virtually every sector:
Healthcare
Predicting disease outbreaks and patient readmission
Drug discovery and genomics research
Optimizing hospital resource allocation
Personalized treatment plans
Finance & Banking
Credit scoring and loan approval
Algorithmic trading
Fraud detection in real time
Anti-money laundering (AML) compliance
Retail & E-Commerce
Customer segmentation and targeting
Recommendation engines (e.g., Amazon, Netflix)
Demand forecasting and inventory management
Dynamic pricing strategies
Manufacturing
Predictive maintenance of equipment
Quality control through sensor data
Supply chain optimization
Digital twins for simulation
Government & Public Sector
Census data analysis and urban planning
Crime pattern analysis and predictive policing
Tax fraud detection
Public health monitoring
Telecommunications
Network optimization
Customer churn prediction
Sentiment analysis of customer feedback
Sports
Player performance analytics (e.g., Moneyball)
Injury prediction
Fan engagement optimization
1.5 Role of Data Analytics in Organizations
1. Better decision-making
Supports evidence-based decisions, helping managers make informed choices
Reduces reliance on guesswork
📝 Example: Choosing the best product to invest in
2. Process improvement and efficiency
Identifies waste, delays, and bottlenecks
Optimizes processes
📝 Example: Reducing delivery time in logistics
3. Customer understanding
Identifies customer needs, behavior, and preferences
Helps in customer segmentation
📝 Example: Recommending products/services
4. Risk management
Detects fraud, errors, and unusual patterns/behavior
Supports safer operations
📝 Example: Fraud detection in banking
5. Strategic Planning
Helps in long-term planning and forecasting
Identifies market trends
📝 Example: Deciding future business expansion
6. Performance Measurement
Tracks progress using KPIs
Evaluates success of strategies
📝 Example: Monitoring sales performance
Organizational uses of analytics by department:
Marketing: customer segmentation, campaign analysis
Finance: forecasting, fraud detection
Operations: inventory and supply chain optimization
HR: employee performance and retention
Healthcare: patient analysis and disease prediction
1.6 The Data-Driven Organization: Key Characteristics
Research by McKinsey, MIT, and others consistently shows that data-driven organizations share common traits:
Data is treated as a strategic asset, with governance and quality standards
Decision-making at all levels is supported (not replaced) by data
A culture of experimentation exists — hypotheses are tested, not just assumed
Cross-functional data teams collaborate with domain experts
Investment in data infrastructure: pipelines, warehouses, and visualization tools
1.7 Key Roles in a Data Analytics Team
Modern analytics organizations typically include the following roles, though boundaries often overlap:
Data Analyst: Explores and summarizes data, builds dashboards and reports, answers specific business questions using SQL, Excel, and BI tools.
Data Scientist: Builds statistical and machine learning models to generate predictions, uncover patterns, and automate decisions. Requires strong coding and math skills.
Data Engineer: Designs and maintains the pipelines and infrastructure that move, store, and process data reliably at scale.
ML Engineer: Productionizes machine learning models, taking a data scientist’s prototype and deploying it into reliable, scalable systems.
1.8 Data Analytics Lifecycle
The data analytics lifecycle is the sequence of steps followed in an analytics project.
Typical stages:
Problem definition
Data collection
Data preparation
Data exploration
Analysis/modeling
Interpretation
Communication
Deployment
Monitoring
Step 1 – Problem Definition
Business Understanding
Every data project starts not with data, but with a question. This phase defines what success looks like.
Clearly articulate the business problem or opportunity
Define measurable objectives and KPIs (Key Performance Indicators)
Identify stakeholders and understand their constraints
Determine whether the problem requires analytics at all — sometimes the answer is simpler
Agree on how the output will be used and by whom
Before analyzing data, define:
What problem needs to be solved?
What is the business goal?
What decisions will depend on the analysis?
How will success be measured?
Example:
Reduce customer churn in a telecom company.
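A measurable objective for this example could be the monthly churn rate. The sketch below assumes one common definition (customers lost during the month divided by customers at the start of the month); the function name, the figures, and the 2% target are hypothetical, and real definitions vary by organization.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the share of customers lost during a period (0.0 to 1.0)."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical figures for one month
baseline = churn_rate(customers_at_start=20_000, customers_lost=500)
target = 0.02  # assumed success criterion: churn at or below 2% per month

print(f"Current churn: {baseline:.1%}, target: {target:.1%}")
print("Target met" if baseline <= target else "Target not yet met")
```

Agreeing on a definition like this before any modeling makes it possible to say later whether the project succeeded.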
Step 2 – Data Collection
Gather relevant data from different sources:
Databases
Spreadsheets
Websites
Sensors
APIs
Surveys
Social media
Logs
Goal: Collect data relevant to the problem.
Data Understanding
Once the problem is defined, the team explores what data exists and what additional data may be needed.
Inventory available data sources (internal databases, APIs, third-party feeds)
Conduct Exploratory Data Analysis (EDA) to understand distributions, ranges, and patterns
Assess data quality: completeness, accuracy, consistency, timeliness
Identify missing values, outliers, and anomalies
Document data lineage and metadata
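The quality checks above can be sketched with the Python standard library. The records are made up, `None` is assumed to mark a missing value, and the 1.5 × IQR rule is only one of several conventions for flagging outliers.

```python
from statistics import quantiles

# Hypothetical records; None marks a missing value
records = [
    {"customer_id": 1, "age": 34, "monthly_spend": 52.0},
    {"customer_id": 2, "age": None, "monthly_spend": 48.5},
    {"customer_id": 3, "age": 45, "monthly_spend": 50.0},
    {"customer_id": 4, "age": 29, "monthly_spend": None},
    {"customer_id": 5, "age": 38, "monthly_spend": 51.0},
    {"customer_id": 6, "age": 41, "monthly_spend": 950.0},  # suspicious value
    {"customer_id": 7, "age": 33, "monthly_spend": 49.0},
    {"customer_id": 8, "age": 27, "monthly_spend": 50.5},
]


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


spend = [r["monthly_spend"] for r in records if r["monthly_spend"] is not None]
missing = missing_counts(records)
outliers = iqr_outliers(spend)
print(missing)
print(outliers)
```

A flagged value is a prompt for investigation, not proof of an error; a very large spend may be a genuine customer.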
Step 3 – Data Cleaning & Preparation
Often the most time-consuming phase, data preparation (or ‘data wrangling’) transforms raw data into a form suitable for analysis. Industry surveys consistently report that data professionals spend 60–80% of their time on this phase.
👉 Raw data often contains problems.
Common tasks:
Data cleaning: handling missing values, correcting errors, removing duplicates, handling outliers, and standardizing formats.
Feature engineering: creating new variables from existing ones to better capture patterns.
Data transformation: normalization, encoding categorical variables, log transforms.
Data integration: merging data from multiple sources into a unified dataset.
Splitting data: creating training, validation, and test sets for modeling.
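A minimal sketch of the cleaning and splitting tasks, using only the standard library. The records, the choice of median imputation, and the 25% test fraction are illustrative assumptions; in practice these choices depend on the data and the problem.

```python
import random
from statistics import median

# Hypothetical raw records with a duplicate, a missing age, and messy city names
raw = [
    {"id": 1, "city": " new york ", "age": 34},
    {"id": 2, "city": "Chicago", "age": None},
    {"id": 2, "city": "Chicago", "age": None},  # duplicate of the previous row
    {"id": 3, "city": "BOSTON", "age": 45},
    {"id": 4, "city": "chicago", "age": 28},
]


def clean(rows):
    """Remove duplicate ids, standardize city names, impute missing ages."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))  # copy so the raw data stays untouched
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for row in unique:
        row["city"] = row["city"].strip().title()
        if row["age"] is None:
            row["age"] = fill
    return unique


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle reproducibly and hold out a fraction of rows for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cleaned = clean(raw)
train, test = train_test_split(cleaned)
print(cleaned)
print(len(train), len(test))
```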
Important note:
Poor-quality data leads to poor analysis.
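The transformation tasks named in this step (normalization, encoding categorical variables, log transforms) can be illustrated as follows. The helper names and the sample values are hypothetical, and min-max scaling is only one of several normalization schemes.

```python
import math


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(categories):
    """Encode each category as a 0/1 vector; columns are the sorted category names."""
    columns = sorted(set(categories))
    return columns, [[int(c == col) for col in columns] for c in categories]


# Hypothetical values
incomes = [20_000, 40_000, 60_000, 120_000]
plans = ["Free", "Premium", "Free", "Basic"]

scaled = min_max_scale(incomes)              # normalization
columns, encoded = one_hot(plans)            # categorical encoding
logged = [math.log10(v) for v in incomes]    # log transform compresses large values

print(scaled)
print(columns, encoded)
print([round(v, 2) for v in logged])
```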
Step 4 – Data Exploration
Purpose: Understand the structure and patterns in the data.
Activities:
Summary statistics
Visualizations
Correlation checks
Trend analysis
Detect anomalies
Output: Early insights and hypotheses
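As a small illustration of summary statistics and a correlation check, the sketch below computes both for two made-up series (advertising spend and sales). The Pearson coefficient is written out by hand so the calculation is visible.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical monthly figures, in thousands
ad_spend = [10, 12, 15, 17, 20, 22]
sales = [110, 120, 135, 150, 160, 195]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


summary = {
    "mean": mean(sales),
    "median": median(sales),
    "stdev": stdev(sales),
}
r = pearson(ad_spend, sales)

print(summary)
print(f"Correlation between ad spend and sales: {r:.2f}")
```

A strong positive coefficient here would be an early hypothesis to investigate, not evidence that advertising caused the sales.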
Step 5 – Data Analysis / Modeling
Methods used:
Statistical analysis
Dashboards and reporting
Forecasting
Machine learning models
Optimization
Goal: Answer the business question using appropriate techniques.
Modeling
With clean, prepared data, analysts and data scientists build the analytical or machine learning models that address the business question.
Select appropriate modeling techniques (regression, classification, clustering, time series, etc.)
Train models on training data
Tune hyperparameters and validate model choices
Compare models using appropriate metrics (accuracy, RMSE, AUC-ROC, F1-score, etc.)
Document model assumptions and limitations
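Comparing models with appropriate metrics can be sketched as below. The labels and the two "models" are hypothetical (1 = churned, 0 = stayed); the point is that accuracy alone can mislead when one class is rare, which is why F1 is often reported alongside it.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical held-out labels: only 2 of 10 customers churned
actual = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
model_a = [0] * 10                         # always predicts "stays"
model_b = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # finds both churners, one false alarm

metrics_a = classification_metrics(actual, model_a)
metrics_b = classification_metrics(actual, model_b)
print(metrics_a)
print(metrics_b)
```

Model A reaches 80% accuracy without identifying a single churner, so its F1 is zero; Model B is the more useful model for this business question.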
Step 6 – Interpretation of Results
Ask:
What do the results mean?
Are the findings reliable?
Do they answer the original question?
What actions should follow?
Key point:
Results must be linked to business value.
Step 7 – Communication and Visualization
Results should be presented clearly using:
Charts
Graphs
Dashboards
Reports
Presentations
Goal:
Ensure stakeholders understand the findings and recommendations.
Step 8 – Deployment / Action
A model that isn’t deployed creates no value. This phase operationalizes the analytical output.
Put insights into practice:
Launch a marketing campaign
Update a business policy
Automate predictions in software
Improve inventory strategy
Integrate the model or insights into business processes, systems, or dashboards
Set up API endpoints, batch scoring pipelines, or real-time inference systems
Train end users and document how to interpret and act on model outputs
Establish monitoring systems to detect performance degradation over time
Key idea:
Analytics creates value only when used in decision-making.
Step 9 – Monitoring and Improvement
After deployment:
Track outcomes and model performance metrics in production over time
Conduct periodic audits for bias and fairness
Retrain or update models and methods as new data becomes available
Feed lessons learned back into future projects
Detect concept drift (when the statistical relationship between inputs and outputs changes)
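One simple way to monitor a deployed classifier is to compare its recent error rate with the rate measured at validation time. The sketch below assumes hypothetical labels and an arbitrary 5-percentage-point tolerance; a rising error rate is consistent with concept drift but does not prove it, so an alert should trigger investigation rather than automatic retraining.

```python
def error_rate(predictions, actuals):
    """Share of predictions that did not match the observed outcome."""
    wrong = sum(p != a for p, a in zip(predictions, actuals))
    return wrong / len(actuals)


def drift_alert(baseline_rate, recent_rate, tolerance=0.05):
    """Flag when the recent error rate exceeds the baseline by more than tolerance."""
    return (recent_rate - baseline_rate) > tolerance


# Hypothetical labels: 1 = churned, 0 = stayed
validation_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
validation_true = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
recent_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
recent_true = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]

baseline_rate = error_rate(validation_pred, validation_true)
recent_rate = error_rate(recent_pred, recent_true)
alert = drift_alert(baseline_rate, recent_rate)

print(f"Baseline error: {baseline_rate:.0%}, recent error: {recent_rate:.0%}")
print("Investigate possible drift" if alert else "Performance within tolerance")
```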
Note:
The lifecycle is iterative and continuous.
📝 Simple flow:
Problem → Data → Clean → Explore → Model → Evaluate → Deploy → Monitor
Overview of Popular Lifecycle Models
| Model | Origin | Phases |
|---|---|---|
| CRISP-DM | Cross-Industry Standard Process for Data Mining (1996) | 6 phases |
| TDSP | Microsoft Team Data Science Process | 5 phases |
| KDD | Knowledge Discovery in Databases (Fayyad et al., 1996) | 5 phases |
| SEMMA | SAS Institute | 5 phases |
| OSEMN | Community-driven (Mason & Wiggins, 2010) | 5 phases |
CRISP-DM is a structured framework used to guide data analytics and machine learning projects from problem definition to deployment.
CRISP-DM remains one of the most widely used frameworks in industry; practitioner polls have repeatedly ranked it as the most commonly adopted process model for analytics projects, although reported adoption figures vary by survey.
🧠 Simple Flow
Business → Data → Prepare → Model → Evaluate → Deploy
1.9 Types of Analytics
There are four major types of analytics:
Descriptive
Diagnostic
Predictive
Prescriptive
These types answer different questions.
Descriptive Analytics
Question answered:
👉 “What happened?”
Descriptive analytics is the most foundational and widely practiced form. It summarizes historical data to understand what has occurred in the past.
Goal: Provide a clear picture of past events and current state
Methods: Aggregation, summarization, pivot tables, basic visualization, dashboards, reports, KPIs
Tools: Excel, Tableau, Power BI, SQL, Looker Studio (formerly Google Data Studio)
Examples in practice:
A retailer’s monthly sales report showing revenue by product category and region
A hospital dashboard showing daily patient admissions, discharges, and average length of stay
A social media analytics report showing follower growth, engagement rates, and top-performing posts
Monthly sales report showing total revenue and top-selling products
Descriptive Analytics – Key Points
Features:
Focuses on past and current data
Provides summaries and trends
Easy to understand
Limitation:
Does not explain why something happened
📌 Note: Descriptive analytics answers ‘what’ but not ‘why’. It is the starting point for almost all analytical work and essential for data literacy across the organization. Most reporting and BI functions are primarily descriptive.
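A descriptive summary such as the retailer's sales report boils down to aggregation. The sketch below totals made-up revenue figures by category and by region; the field names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records for one month
sales = [
    {"category": "Laptops", "region": "East", "revenue": 1200.0},
    {"category": "Laptops", "region": "West", "revenue": 800.0},
    {"category": "Phones", "region": "East", "revenue": 500.0},
    {"category": "Phones", "region": "West", "revenue": 700.0},
    {"category": "Tablets", "region": "East", "revenue": 300.0},
]


def total_by(rows, key):
    """Sum revenue for each distinct value of the given field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["revenue"]
    return dict(totals)


by_category = total_by(sales, "category")
by_region = total_by(sales, "region")
top_category = max(by_category, key=by_category.get)

print(by_category)
print(by_region)
print("Top-selling category:", top_category)
```

The output says what happened (which category sold most) but offers no explanation, which is exactly the limitation noted above.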
Diagnostic Analytics
Question answered:
👉 “Why did it happen?”
Diagnostic analytics goes deeper, investigating the causes and contributing factors behind observed outcomes.
Goal: Identify the root causes and contributing factors behind a past event or trend
Methods: Drill-down analysis, correlation analysis, data mining, hypothesis testing, comparisons across time or groups
Tools: SQL (with subqueries and window functions), Python/R, statistical tests
Examples in practice:
Investigating why Q3 revenue dropped, and discovering that a supply chain disruption coincided with competitor price cuts
Analyzing why customer satisfaction scores fell, and tracing it to longer wait times in the call center after a staffing reduction
Identifying why a marketing campaign underperformed, and finding that the target audience was incorrectly segmented
Why did sales drop?
Diagnostic analysis may reveal:
Price increase
Reduced advertising
Supply chain issues
Seasonal trends
Key value: Helps explain problems and opportunities
📌 Note: Diagnostic analytics often involves correlation analysis, but analysts must be careful not to confuse correlation with causation. Statistical correlation shows two variables move together; it does not prove one causes the other. Establishing causation requires controlled experiments or causal inference techniques.
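Drill-down analysis can be sketched as a comparison across time and groups. The quarterly revenue figures below are hypothetical; the code locates where the decline occurred, which narrows the search for a cause but does not by itself establish one.

```python
# Hypothetical revenue by region (in thousands) for two consecutive quarters
q2 = {"North": 400, "South": 350, "East": 300, "West": 450}
q3 = {"North": 405, "South": 345, "East": 180, "West": 455}

# Change per region between the two quarters
changes = {region: q3[region] - q2[region] for region in q2}
total_change = sum(changes.values())
worst_region = min(changes, key=changes.get)

print(f"Total change: {total_change}")
for region, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{region}: {delta:+d}")
print("Largest decline:", worst_region)
```

Here nearly all of the overall drop comes from one region, so the next step would be to examine what changed there (pricing, advertising, supply, seasonality).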
Predictive Analytics
Question answered:
👉 “What is likely to happen?”
Predictive analytics uses historical patterns to forecast future outcomes. This is where machine learning becomes central.
Goal: Forecast likely future outcomes from historical data, with quantifiable confidence
Methods: Regression, classification, time series forecasting, machine learning
Tools: Python (scikit-learn, statsmodels, Prophet), R, Azure ML, AWS SageMaker
Examples in practice:
A bank predicting which customers are likely to default on loans in the next 90 days
A retailer forecasting demand for each product SKU by store location for the upcoming holiday season
A telecom company predicting which customers are at high risk of churning in the next 30 days
A manufacturer predicting when industrial equipment will fail, enabling proactive maintenance
Predict customer churn
Forecast next quarter sales
Estimate loan default risk
Predict disease risk in patients
Predict disease outbreaks
Key concepts in predictive analytics:
Training data: Historical data used to fit (train) the model — the model learns patterns from this data.
Test data: Held-out data never seen during training, used to evaluate how well the model generalizes to new examples.
Overfitting: When a model learns the training data too well (including its noise) and fails to generalize to new data. A model that memorizes rather than learns.
Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
Benefit: Enables proactive decision-making
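The difference between memorizing and learning can be seen in a toy comparison. The sketch below uses made-up data that roughly follows y = 2x with noise, a nearest-neighbor "memorizer" that repeats the closest training value, and a least-squares line; both models and all figures are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

# Hypothetical training data (noisy) and held-out test data
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.5, 3.5, 6.8, 7.4, 10.9, 11.6]
test_x = [1.5, 2.5, 3.5, 4.5, 5.5]
test_y = [3.0, 5.0, 7.0, 9.0, 11.0]


def memorizer(xs, ys):
    """Predict by copying the target of the nearest training point."""
    def predict(x):
        i = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
        return ys[i]
    return predict


def fitted_line(xs, ys):
    """Predict with an ordinary least-squares straight line."""
    mx, my = mean(xs), mean(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def rmse(model, xs, ys):
    """Root mean squared error of a model on the given data."""
    return sqrt(mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)]))


memo, line = memorizer(train_x, train_y), fitted_line(train_x, train_y)
results = {
    "memorizer": (rmse(memo, train_x, train_y), rmse(memo, test_x, test_y)),
    "line": (rmse(line, train_x, train_y), rmse(line, test_x, test_y)),
}
for name, (train_err, test_err) in results.items():
    print(f"{name}: train RMSE {train_err:.2f}, test RMSE {test_err:.2f}")
```

The memorizer has zero training error but a much larger test error than the line, which is the signature of overfitting.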
Prescriptive Analytics
Question answered:
👉 “What should we do?”
The most advanced form of analytics, prescriptive analytics goes beyond prediction to recommend specific actions that optimize outcomes.
Goal: Recommend the optimal action or decision given constraints and objectives
Methods: Mathematical optimization, simulation, reinforcement learning, decision trees, scenario analysis
Tools: Python (SciPy, PuLP, OR-Tools), specialized optimization solvers, simulation platforms
Examples in practice:
An airline dynamically pricing seats to maximize revenue given demand forecasts, competitor prices, and remaining inventory
A logistics company routing delivery vehicles to minimize fuel costs and delivery times simultaneously
A hospital scheduling surgeries and staff to maximize throughput while meeting patient safety standards
An investment algorithm allocating portfolio weights to maximize expected return for a given risk tolerance
Recommend discounts to at-risk customers
Suggest the best delivery route
Optimize staffing levels
Recommend stock levels
Benefit: Supports action and optimization
💡 Key Insight: Prescriptive analytics is often where AI meets decision automation. When a system can not only predict what will happen but also automatically take the best action in response, without human intervention, it becomes a true AI-driven decision system. This is the cutting edge of enterprise analytics.
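A prescriptive question such as "which at-risk customers should receive a discount?" can be framed as a small optimization problem. The sketch below assumes hypothetical offer costs, expected retained revenue, and a fixed budget, and simply enumerates every combination; real problems of any size would use an optimization solver instead of brute force.

```python
from itertools import combinations

# Hypothetical at-risk customers: cost of the offer and expected revenue retained
offers = {
    "cust_a": {"cost": 30, "expected_value": 120},
    "cust_b": {"cost": 50, "expected_value": 150},
    "cust_c": {"cost": 20, "expected_value": 40},
    "cust_d": {"cost": 40, "expected_value": 130},
}
budget = 100


def best_offer_set(offers, budget):
    """Return the affordable set of offers with the highest expected value."""
    best_value, best_set = 0, ()
    names = sorted(offers)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(offers[n]["cost"] for n in subset)
            value = sum(offers[n]["expected_value"] for n in subset)
            if cost <= budget and value > best_value:
                best_value, best_set = value, subset
    return best_set, best_value


best_set, best_value = best_offer_set(offers, budget)
print("Recommend offers to:", best_set)
print("Expected value retained:", best_value)
```

The expected values would normally come from a predictive model, which is how predictive and prescriptive analytics connect.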
Comparison of the Four Types
| Type | Main Question | Focus | Example |
|---|---|---|---|
| Descriptive | What happened? | Summary of past data | Sales dashboard |
| Diagnostic | Why did it happen? | Causes and reasons | Root cause of sales decline |
| Predictive | What is likely to happen? | Forecasting | Churn prediction |
| Prescriptive | What should we do? | Action recommendation | Best retention strategy |
The Analytics Maturity Model
Organizations typically progress through these four types over time as their data capabilities mature. Most businesses today are strong in descriptive analytics and actively building predictive capabilities. True prescriptive analytics at scale remains rare and represents a significant competitive advantage.
Level 1 — Descriptive: Basic reporting, spreadsheets, reactive decision-making
Level 2 — Diagnostic: Root-cause analysis, some SQL/BI tooling, more structured data teams
Level 3 — Predictive: ML models in production, data science teams, experimentation culture
Level 4 — Prescriptive: AI-driven decision systems, real-time optimization, closed-loop automation
Key idea: Organizations often use all four types together.
1.10 Key Terms & Definitions
Data: raw facts and figures
Information: processed data with meaning
Insight: useful understanding from analysis
KPI: key performance indicator
Dashboard: visual display of metrics
Forecasting: predicting future values
Model: mathematical or computational representation
Data Analytics: The process of examining, cleaning, transforming, and modeling data to discover useful information and support decision-making.
KPI (Key Performance Indicator): A measurable value that demonstrates how effectively an organization is achieving key business objectives.
Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their main characteristics, often using visual methods, before formal modeling.
Feature Engineering: The process of using domain knowledge to create new input variables (features) from raw data to improve model performance.
Concept Drift: A phenomenon where the statistical relationship between the input variables and the target variable (what the model predicts) changes over time, causing model performance to degrade.
Correlation: A statistical measure expressing the extent to which two variables are linearly related. Does not imply causation.
Causation: A relationship where one variable directly causes a change in another. Establishing causation requires controlled experiments or causal inference methods.
Overfitting: A modeling error where a model performs well on training data but poorly on new, unseen data because it has learned noise rather than signal.
ROI (Return on Investment): A performance measure used to evaluate the efficiency or profitability of an investment, often used to justify analytics initiatives.
1.11 Challenges in Adopting Data Analytics
| Challenge | Description |
|---|---|
| Data Quality | Incomplete, inconsistent, or inaccurate data leads to unreliable insights (“Garbage In, Garbage Out”) |
| Data Silos | Data trapped in departmental systems that don’t communicate |
| Talent Shortage | Difficulty finding and retaining skilled data professionals |
| Privacy & Ethics | Compliance with GDPR, CCPA, and ethical use of data |
| Organizational Culture | Resistance to change from intuition-based to evidence-based decisions |
| Infrastructure Costs | Investment in tools, cloud computing, and storage |
| Interpretability | Stakeholders may not trust “black box” models |
1.12 Summary
In this lesson, we learned:
Data analytics helps organizations make informed decisions
It plays a major role in business performance and innovation
The lifecycle includes problem definition through monitoring
The four types of analytics are:
Descriptive
Diagnostic
Predictive
Prescriptive
1.13 Quick Revision
Remember:
Data analytics turns raw data into useful insights
Good analytics starts with a clear problem
Data cleaning is essential
Descriptive and diagnostic focus on the past
Predictive and prescriptive focus on future decisions
1.14 Discussion / Review Questions
What is data analytics?
Why is data analytics important in organizations?
What are the stages of the data analytics lifecycle?
What is the difference between predictive and prescriptive analytics?
Give one example of each type of analytics
Describe three ways that data analytics creates competitive advantage for an organization. Provide a specific example for each.
A company notices that ice cream sales and drowning rates are positively correlated. Does this mean ice cream causes drowning? Explain your reasoning and identify the actual cause of this correlation.
A data team is asked to reduce customer churn. Walk through how you would apply each phase of the data analytics lifecycle to this problem.
Classify each of the following as descriptive, diagnostic, predictive, or prescriptive analytics, and justify your answer:
(a) A dashboard showing weekly website visitors.
(b) A model that recommends which items to reorder and in what quantity.
(c) An analysis identifying why last month’s marketing email had a low open rate.
(d) A model predicting which loan applicants are likely to default.
Why is data preparation often described as taking 60–80% of a data scientist’s time? What activities does this include, and what are the consequences of inadequate data preparation?
1.15 Further Reading & References
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a Standard Process Model for Data Mining.
Davenport, T.H., & Harris, J.G. (2007). Competing on Analytics. Harvard Business School Press.
Provost, F., & Fawcett, T. (2013). Data Science for Business. O’Reilly Media.
EMC Education Services. (2015). Data Science and Big Data Analytics. Wiley.
McKinsey Global Institute. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.
Topic Two: Data Types and Data Collection
2.1 Introduction
Before any analysis can begin, a data professional must understand the nature and structure of data they are dealing with and where it came from. The wrong assumptions about data structure can invalidate an entire analysis. This chapter covers the landscape of data types, how data is collected, the characteristics of different data sources, and the foundational concepts of data quality and governance that determine whether data can be trusted.
The type of data determines:
How it is stored (databases, file systems, data lakes)
How it is processed (SQL queries, parsing, NLP, computer vision)
What tools and technologies are appropriate
What preprocessing steps are needed before analysis
What analytical methods can be applied
Key Fact: According to IDC estimates, approximately 80–90% of all data generated today is unstructured or semi-structured. Yet most traditional analytics tools were designed for structured data. This gap is one of the primary drivers behind the growth of data science, ML, and AI.
2.2 Types of Data
2.2.1 Structured Data
Definition
Structured data is data that adheres to a predefined schema or data model, organized into rows and columns (tabular format), where each field has a defined data type and meaning.
It is the most traditional and well-understood form of data. It lives in relational databases and can be queried using SQL (Structured Query Language).
Characteristics
| Property | Description |
|---|---|
| Schema | Predefined and rigid (schema-on-write) |
| Format | Tabular — rows (records) and columns (fields/attributes) |
| Data Types | Each column has a strict type (integer, float, varchar, date, boolean) |
| Storage | Relational Database Management Systems (RDBMS) |
| Query Language | SQL |
| Searchability | Highly searchable and filterable |
| Scalability | Vertical scaling; limited for massive volumes |
| Machine Readability | Easily consumed by analytics tools and algorithms |
Examples of Structured Data
Example 1: Customer Table in a Relational Database
| CustomerID | Name | Email | Age | City | SignupDate | AccountType |
|---|---|---|---|---|---|---|
| 1001 | Alice Johnson | alice@email.com | 34 | New York | 2023-01-15 | Premium |
| 1002 | Bob Smith | bob@email.com | 28 | Chicago | 2023-03-22 | Free |
| 1003 | Carol Williams | carol@email.com | 45 | Boston | 2022-11-08 | Premium |
Example 2: Financial Transaction Records
| TransactionID | Date | AccountNo | Type | Amount | Currency | Status |
|---|---|---|---|---|---|---|
| TXN-50001 | 2024-01-15 | ACC-2234 | Debit | 250.00 | USD | Completed |
| TXN-50002 | 2024-01-15 | ACC-1187 | Credit | 1200.50 | USD | Pending |
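Because structured data follows a fixed schema, it can be loaded into a relational table and queried with SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column names are adapted from the table above, and the third row is a hypothetical addition so the aggregation has something to group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.execute(
    """CREATE TABLE transactions (
           transaction_id TEXT PRIMARY KEY,
           date TEXT, account_no TEXT, type TEXT,
           amount REAL, currency TEXT, status TEXT)"""
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("TXN-50001", "2024-01-15", "ACC-2234", "Debit", 250.00, "USD", "Completed"),
        ("TXN-50002", "2024-01-15", "ACC-1187", "Credit", 1200.50, "USD", "Pending"),
        # Hypothetical extra row, not part of the example table
        ("TXN-50003", "2024-01-16", "ACC-2234", "Debit", 75.25, "USD", "Completed"),
    ],
)

# Count transactions and total their amounts for each status
rows = conn.execute(
    "SELECT status, COUNT(*), SUM(amount) FROM transactions "
    "GROUP BY status ORDER BY status"
).fetchall()
conn.close()

for status, count, total in rows:
    print(f"{status}: {count} transaction(s), total {total:.2f}")
```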
Other Examples:
Spreadsheets (Excel, Google Sheets)
ERP system records (inventory, orders, invoices)
CRM data (customer profiles, interactions)
Sensor readings in fixed formats (temperature, pressure at timestamps)
Census data
Stock market tick data (OHLCV — Open, High, Low, Close, Volume)
Common Storage Systems for Structured Data
| System | Examples |
|---|---|
| RDBMS | MySQL, PostgreSQL, Oracle, Microsoft SQL Server, SQLite |
| Cloud Data Warehouses | Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse |
| Spreadsheets | Microsoft Excel, Google Sheets |
| Flat Files | CSV (Comma-Separated Values), TSV (Tab-Separated Values) |
Data Types Within Structured Data
Within structured data, individual fields/variables fall into specific statistical data types that determine what analyses are valid.
| Type | Subtype | Description | Examples | Valid Operations |
|---|---|---|---|---|
| Qualitative | Nominal | Categories with no inherent order | Gender, color, city, blood type, product category | Mode, frequency count, chi-square test |
| Qualitative | Ordinal | Categories with a meaningful order but unequal intervals | Education level (HS < BS < MS < PhD), satisfaction (1-5 stars), pain scale | Mode, median, rank correlation, non-parametric tests |
| Quantitative | Discrete | Countable, finite values (usually integers) | Number of children, website clicks, defect count, cars owned | Mean, median, mode, standard deviation, Poisson regression |
| Quantitative | Continuous | Measurable, can take any value in a range (infinite precision) | Temperature, height, weight, salary, time | Mean, std dev, correlation, regression, t-test, ANOVA |
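The "Valid Operations" column can be illustrated with Python's standard `statistics` module. The sample values are made up; the point is that the summary chosen should match the data type (a mean of blood types, for example, is meaningless).

```python
from statistics import mean, median, mode

# Hypothetical samples, one per data type
blood_type = ["O", "A", "O", "B", "O", "AB"]   # nominal: categories, no order
satisfaction = [5, 3, 4, 4, 2, 5, 4]           # ordinal: ordered 1-5 star ratings
clicks = [3, 0, 7, 2, 3]                       # discrete: counts
height_cm = [170.2, 165.0, 180.5, 175.3]       # continuous: measurements

most_common_blood_type = mode(blood_type)      # mode is valid for nominal data
typical_rating = median(satisfaction)          # median respects order only
average_clicks = mean(clicks)                  # mean is valid for quantitative data
average_height = mean(height_cm)

print(most_common_blood_type, typical_rating, average_clicks, round(average_height, 2))
```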
Advantages and Limitations
| Advantages | Limitations |
|---|---|
| Easy to query, filter, and aggregate (SQL) | Rigid schema — hard to modify after creation |
| Well-understood tools and technologies | Cannot represent complex, nested, or hierarchical data well |
| Strong data integrity (constraints, keys, types) | Doesn’t handle multimedia data (images, audio, video) |
| Efficient indexing and searching | Scaling horizontally (across servers) can be challenging |
| Mature ecosystem of tools | Only represents ~10-20% of all organizational data |
2.2.2 Semi-Structured Data
Definition
Semi-structured data has some organizational properties (tags, markers, metadata, or hierarchical structure) but does not conform to a rigid tabular schema. It is self-describing — the structure is embedded within the data itself.
Semi-structured data sits between structured and unstructured data. It has some organization but is more flexible than a relational table.
Characteristics
| Property | Description |
|---|---|
| Schema | Flexible, implicit, self-describing (schema-on-read) |
| Format | Hierarchical, nested, key-value pairs, tagged |
| Data Types | Mixed and flexible; fields can vary between records |
| Storage | NoSQL databases, document stores, file systems |
| Query Language | Varies (JSONPath, XPath, MongoDB query language, etc.) |
| Searchability | Searchable with appropriate tools but less straightforward than SQL |
| Flexibility | High — new fields can be added without changing existing structure |
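JSON is a common semi-structured format. The sketch below uses hypothetical records whose fields vary from one record to the next, and flattens them into a table-like form; the dotted column names are a convention chosen for this example.

```python
import json

# Hypothetical records: each one carries its own structure
raw = """[
  {"id": 1, "name": "Alice", "contact": {"email": "alice@email.com"}},
  {"id": 2, "name": "Bob", "tags": ["trial", "mobile"]},
  {"id": 3, "name": "Carol", "contact": {"email": "carol@email.com", "phone": "555-0100"}}
]"""


def flatten(record, prefix=""):
    """Turn nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


records = [flatten(r) for r in json.loads(raw)]
columns = sorted({key for r in records for key in r})
table = [[r.get(c) for c in columns] for r in records]  # None where a field is absent

print(columns)
for row in table:
    print(row)
```

The schema is discovered while reading the data (schema-on-read), and missing fields show up as gaps rather than errors.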
Example: Web Server Log (Semi-Structured)
192.168.1.105 - alice [15/Jan/2024:10:23:45 +0000] "GET /products/laptop HTTP/1.1" 200 5432 "https://google.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
192.168.1.110 - bob [15/Jan/2024:10:24:02 +0000] "POST /cart/add HTTP/1.1" 201 342 "https://example.com/products" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0)"
This data has a recognizable pattern (IP, user, timestamp, request, status code, referrer, user agent) but is not in a tabular database. Parsing is required to extract structured fields.
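The parsing step can be sketched with a regular expression. The pattern below is written for the two log lines shown above and is an assumption about their layout; production log formats vary, so a real parser would need to be checked against the server's configuration.

```python
import re

# Named groups mirror the fields listed above
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

lines = [
    '192.168.1.105 - alice [15/Jan/2024:10:23:45 +0000] "GET /products/laptop HTTP/1.1" '
    '200 5432 "https://google.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.168.1.110 - bob [15/Jan/2024:10:24:02 +0000] "POST /cart/add HTTP/1.1" '
    '201 342 "https://example.com/products" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0)"',
]


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])  # convert numeric fields from text
    row["size"] = int(row["size"])
    return row


parsed = [parse_line(line) for line in lines]
for row in parsed:
    print(row["ip"], row["user"], row["method"], row["path"], row["status"], row["size"])
```

Once parsed, each line becomes a row with typed columns and can be stored and queried like any other structured data.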
2.2.3 Unstructured Data
Definition
Unstructured data has no predefined data model, schema, or organizational structure. It cannot be stored in traditional row-column databases without significant transformation.
This is the most abundant type of data in the world and the hardest to analyze using traditional methods. Advances in AI, NLP, and computer vision have made it increasingly possible to extract insights from unstructured data.
Characteristics
| Property | Description |
|---|---|
| Schema | None — no predefined structure |
| Format | Free-form text, binary files, media |
| Storage | File systems, object stores, data lakes, content management systems |
| Analysis | Requires AI/ML techniques (NLP, computer vision, speech recognition) |
| Volume | Comprises ~80-90% of all enterprise data |
| Searchability | Difficult without indexing, tagging, or AI-based extraction |
Examples of Unstructured Data
| Category | Examples |
|---|---|
| Text | Emails (body), social media posts, reviews, news articles, legal contracts, medical notes, chat transcripts, books, research papers |
| Images | Photographs, medical scans (X-rays, MRIs, CT scans), satellite imagery, product photos, handwritten documents, diagrams |
| Audio | Phone call recordings, podcasts, music files, voice messages, voice assistant queries |
| Video | Surveillance footage, YouTube videos, webinars, live streams, movie files |
| Other | Geospatial data, 3D models, scientific instrument output, biometric data |
How AI/ML Processes Unstructured Data
The core challenge with unstructured data is that machines cannot directly analyze raw text, images, or audio. These must be converted into numerical representations (features/vectors) first:
| Data Type | AI/ML Technique | What It Does |
|---|---|---|
| Text | Natural Language Processing (NLP) | Tokenization, sentiment analysis, topic modeling, named entity recognition, text classification, machine translation |
| Text → Numbers | Word Embeddings (Word2Vec, GloVe, BERT) | Converts words/sentences into dense numerical vectors that capture semantic meaning |
| Images | Computer Vision (CNNs) | Object detection, image classification, facial recognition, segmentation |
| Images → Numbers | Feature Extraction (ResNet, VGG) | Converts images into numerical feature vectors using pre-trained neural networks |
| Audio | Speech Recognition (ASR) | Converts speech to text (e.g., Whisper, Google Speech API) |
| Audio → Numbers | Spectrograms, MFCCs | Converts audio waveforms into frequency-domain representations |
| Video | Video Analysis (3D CNNs, RNNs) | Action recognition, object tracking, scene understanding |
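The idea of converting raw text into numbers can be sketched with the simplest possible representation, a bag-of-words count vector. This is a deliberately basic stand-in for the embedding models named in the table (Word2Vec, GloVe, BERT), not an implementation of them; the helper names `tokenize` and `bag_of_words` and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One number per vocabulary word: how often it occurs in this document
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


docs = ["The battery is great", "The screen is great great"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Count vectors ignore word order and meaning, which is the gap that dense embeddings are designed to close.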
Example: Turning Unstructured Text into Structured Data
Raw unstructured data (customer review):
"I absolutely love this laptop! The battery life is amazing and the screen is gorgeous. However, the keyboard feels a bit cheap and the trackpad is not very responsive. Overall, a great purchase for the price."
After NLP processing → Structured output:
| Field | Extracted Value |
|---|---|
| Overall Sentiment | Positive (0.72) |
| Battery Sentiment | Very Positive (0.95) |
| Screen Sentiment | Very Positive (0.91) |
| Keyboard Sentiment | Negative (-0.45) |
| Trackpad Sentiment | Negative (-0.60) |
| Price Sentiment | Positive (0.65) |
| Named Entities | Product Type: Laptop |
| Key Topics | Battery, Screen, Keyboard, Trackpad, Price |
This transformation from unstructured text to structured data is one of the most important applications of NLP in data science.
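As a rough illustration of how such a structured record could be produced, the sketch below applies a tiny rule-based, aspect-level sentiment pass to the review above. It is a toy under stated assumptions: the word lists, the clause-splitting rule, and the two-word negation window are all invented for this example, and it yields coarse integer polarities rather than the decimal scores in the table, which would come from a trained NLP model.

```python
import re

# Hypothetical mini-lexicons, chosen only to cover the sample review
POSITIVE = {"love", "amazing", "gorgeous", "great", "responsive"}
NEGATIVE = {"cheap"}
ASPECTS = ["battery", "screen", "keyboard", "trackpad", "price"]


def clause_polarity(tokens: list[str]) -> int:
    """Sum word polarities in a clause, flipping a word preceded by 'not'."""
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            polarity = 1
        elif tok in NEGATIVE:
            polarity = -1
        else:
            continue
        # Simple negation rule: 'not' within the two preceding tokens
        if "not" in tokens[max(0, i - 2):i]:
            polarity = -polarity
        score += polarity
    return score


def aspect_sentiment(review: str) -> dict[str, int]:
    """Map each mentioned aspect to the polarity of the clause it appears in."""
    result: dict[str, int] = {}
    # Split into clauses on sentence punctuation and the word 'and'
    for clause in re.split(r"[.!?]|\band\b", review.lower()):
        tokens = re.findall(r"[a-z']+", clause)
        score = clause_polarity(tokens)
        for aspect in ASPECTS:
            if aspect in tokens:
                result[aspect] = result.get(aspect, 0) + score
    return result


review = (
    "I absolutely love this laptop! The battery life is amazing and the "
    "screen is gorgeous. However, the keyboard feels a bit cheap and the "
    "trackpad is not very responsive. Overall, a great purchase for the price."
)
structured = aspect_sentiment(review)
print(structured)
```

The resulting dictionary can be stored as one row of a table, which is what makes the review queryable alongside structured data.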
Data Types Within Structured Data
Within structured data, individual fields/variables fall into specific statistical data types that determine what analyses are valid:
| Type | Subtype | Description | Examples | Valid Operations |
|---|---|---|---|---|
| Qualitative | Nominal | Categories with no inherent order | Gender, color, city, blood type, product category | Mode, frequency count, chi-square test |
| Qualitative | Ordinal | Categories with a meaningful order but unequal intervals | Education level (HS < BS < MS < PhD), satisfaction (1-5 stars), pain scale | Mode, median, rank correlation, non-parametric tests |
| Quantitative | Discrete | Countable values (usually integers) | Number of children, website clicks, defect count, cars owned | Mean, median, mode, standard deviation, Poisson regression |
| Quantitative | Continuous | Measurable, can take any value in a range (limited only by measurement precision) | Temperature, height, weight, salary, time | Mean, std dev, correlation, regression, t-test, ANOVA |
2.2.4 Special Data Types
| Type | Description | Examples |
|---|---|---|
| Binary | Only two possible values | Yes/No, True/False, 0/1, Male/Female |
| Temporal | Date, time, datetime, timestamp | 2024-01-15, 14:30:00, timestamps |
| Geospatial | Location-based data | Latitude/longitude, GPS coordinates, polygons |
| Currency | Monetary values with specific precision | $1,299.99, €45.50, Kshs1500 |
| Text/String | Character sequences (can be categorical or free-text) | Names, descriptions, comments |
| Boolean | Logical true/false values | is_active, has_subscription |
Why This Matters for ML: Most algorithms require numerical input. Understanding data types determines how to encode categorical variables (one-hot encoding for nominal, order-preserving label encoding for ordinal) and how to scale numerical variables (normalization, standardization).
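The encoding and scaling choices described above can be sketched with pandas. The column names and values are invented for illustration, and the education ordering follows the HS < BS < MS < PhD example from the data types table; treat this as one reasonable approach rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Nairobi", "Mombasa", "Kisumu", "Nairobi"],   # nominal
    "education": ["BS", "HS", "PhD", "MS"],                # ordinal
    "income": [50_000.0, 20_000.0, 90_000.0, 60_000.0],    # continuous
})

# Nominal: one-hot encoding creates one indicator column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal: integer codes that preserve the stated order
order = ["HS", "BS", "MS", "PhD"]
df["education_code"] = pd.Categorical(
    df["education"], categories=order, ordered=True
).codes

# Continuous: min-max normalization rescales values to the range [0, 1]
income = df["income"]
df["income_minmax"] = (income - income.min()) / (income.max() - income.min())

# Continuous: standardization gives mean 0 and standard deviation 1
df["income_z"] = (income - income.mean()) / income.std(ddof=0)

encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

One-hot encoding is used for `city` because assigning integers to cities would imply an order that does not exist.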
2.3 Primary vs. Secondary Data
2.3.1 Primary Data
Definition
Primary data is data collected firsthand by the researcher or organization specifically for the current research question or business problem. It is original data that did not exist before the collection effort.
Characteristics
| Property | Description |
|---|---|
| Originality | Collected for the first time, directly from the source |
| Specificity | Tailored to the exact research question |
| Control | Researcher controls methodology, sampling, and variables |
| Recency | Typically the most current/up-to-date data available |
| Cost | Generally more expensive and time-consuming to collect |
| Ownership | The collector owns the data |
Methods of Primary Data Collection
| Method | Description | Best For | Example |
|---|---|---|---|
| Surveys / Questionnaires | Structured questions distributed to a sample | Gathering opinions, preferences, demographics at scale | Customer satisfaction survey, NPS score |
| Interviews | In-depth, one-on-one or group conversations | Deep qualitative insights, understanding “why” | User research interviews for product design |
| Focus Groups | Moderated group discussions (6-12 participants) | Exploring perceptions, attitudes, new concepts | Testing reactions to a new product concept |
| Experiments / A/B Tests | Controlled manipulation of variables to measure effect | Establishing causal relationships | Testing two website layouts to see which converts better |
| Observations | Systematically watching and recording behavior | Understanding behavior in natural settings | Recording how customers navigate a store |
| Sensor / IoT Data Collection | Deploying instruments to measure physical phenomena | Real-time monitoring, environmental data | Installing temperature sensors in a warehouse |
| Web Scraping / Event Logging (owned properties) | Automated extraction or capture of data from your own platforms | Collecting user interaction data | Logging clickstream data on your website |
| Clinical Trials | Controlled medical experiments | Testing drug efficacy and safety | Pharmaceutical Phase III trial |
| Field Research | Collecting samples or measurements in the field | Environmental, geological, agricultural research | Soil sampling for agricultural analysis |
Advantages and Disadvantages of Primary Data
| Advantages | Disadvantages |
|---|---|
| Directly relevant to research question | Expensive (survey design, distribution, collection) |
| Researcher controls quality and methodology | Time-consuming (weeks to months) |
| Most current and up-to-date | Requires expertise in research design |
| Can target specific populations | Subject to response bias, sampling bias |
| Proprietary — competitive advantage | Typically smaller sample sizes than secondary data |
| Can collect exactly the variables needed | Ethical considerations (consent, privacy, IRB approval) |
Designing Good Primary Data Collection
Key Principles for Survey/Questionnaire Design:
Define clear objectives — What exactly do you want to learn?
Choose the right question types:
Closed-ended (multiple choice, Likert scale, yes/no) — Easy to analyze quantitatively
Open-ended (free text) — Rich qualitative data but harder to analyze
Avoid leading questions — “Don’t you agree our product is excellent?” ❌
Avoid double-barreled questions — “Is our product affordable and high-quality?” ❌ (these are two separate questions)
Use simple, unambiguous language
Consider question order — General to specific, easy to hard
Pilot test before full deployment
Ensure proper sampling — Random sampling, stratified sampling, etc.
2.3.2 Secondary Data
Definition
Secondary data is data that was originally collected by someone else for a different purpose and is being reused for the current analysis. The researcher accesses and analyzes existing data rather than collecting new data.
Characteristics
| Property | Description |
|---|---|
| Originality | Pre-existing; not collected for the current purpose |
| Collection | No direct collection effort needed |
| Cost | Generally much cheaper (often free) |
| Speed | Available immediately or quickly |
| Scale | Often much larger datasets than primary collection allows |
| Control | No control over how data was collected, what was measured, or quality |
Sources of Secondary Data
| Category | Examples |
|---|---|
| Government & Public Institutions | Census data, Bureau of Labor Statistics, World Bank, WHO, UN Data, data.gov, Eurostat |
| Academic & Research | Published papers, university datasets, arXiv, Google Scholar, ICPSR |
| Industry Reports | Gartner, McKinsey, Deloitte, PwC, Forrester, Nielsen reports |
| Company Internal Data | Historical sales records, CRM data, past surveys, financial records (collected for operational purposes, now reused for analytics) |
| Open Data Platforms | Kaggle, UCI ML Repository, Google Dataset Search, AWS Open Data, HuggingFace Datasets |
| Social Media & Web | Twitter/X API data, Reddit, Wikipedia, Common Crawl |
| Financial Data | Yahoo Finance, Bloomberg, SEC filings (EDGAR), stock exchange data |
| Geospatial Data | OpenStreetMap, NASA Earthdata, Google Earth Engine |
| Healthcare Data | MIMIC-III (clinical data), NIH databases, CDC data |
| Media & News | News archives, GDELT project (global events database) |
Advantages and Disadvantages of Secondary Data
| Advantages | Disadvantages |
|---|---|
| Significantly cheaper (often free) | May not perfectly fit your research question |
| Available quickly — no collection time | No control over data quality or methodology |
| Often very large datasets | May be outdated |
| Enables historical and longitudinal analysis | Definitions/categories may not match your needs |
| Can cover broad geographies and populations | Potential biases from original collection unknown |
| Often peer-reviewed or government-validated | May have restrictions on use (licensing, privacy) |
| Good for benchmarking and comparison | May lack variables you specifically need |
Evaluating Secondary Data Quality
Before using secondary data, assess it critically:
| Criterion | Questions to Ask |
|---|---|
| Source credibility | Who collected it? Is the source reputable? Government? Academic? |
| Purpose | Why was it originally collected? Could the purpose introduce bias? |
| Methodology | How was it collected? What sampling method was used? |
| Timeliness | When was it collected? Is it still relevant? |
| Accuracy | Are there known errors or limitations? Has it been peer-reviewed? |
| Consistency | Are definitions and units consistent across time periods? |
| Completeness | Are there significant gaps or missing data? |
| Accessibility | Can you access the granularity you need? Are there licensing restrictions? |
2.3.3 Primary vs. Secondary Data — Comparison
| Dimension | Primary Data | Secondary Data |
|---|---|---|
| Collected by | Researcher/organization for current purpose | Someone else for a different purpose |
| Relevance | Highly relevant and specific | May not perfectly fit |
| Cost | High | Low (often free) |
| Time to obtain | Weeks to months | Hours to days |
| Data quality control | Full control | No control |
| Sample size | Usually smaller | Often very large |
| Recency | Most current | May be outdated |
| Ownership | You own it | May have usage restrictions |
| Bias awareness | Known (you designed the study) | Unknown or undocumented |
| Uniqueness | Proprietary — competitive advantage | Available to competitors too |
2.3.4 When to Use Which?
| Use Primary Data When… | Use Secondary Data When… |
|---|---|
| No existing data answers your question | Existing data adequately addresses your question |
| You need very specific variables | You need broad coverage or historical data |
| Data quality is paramount | Budget and time are limited |
| You need proprietary insights | You need a starting point for exploratory analysis |
| Establishing causal relationships (experiments) | Benchmarking against industry or population data |
| Regulatory requirements demand original data | Supplementing primary data with contextual data |
Best Practice: Most data science projects use a combination of both. For example, a company might combine its own customer transaction data (first-party data, treated here as primary) with census demographic data (secondary) and weather data (secondary) to build a predictive model.
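Combining the two kinds of data usually comes down to a join on a shared key. The sketch below enriches a small in-house transactions table with an external demographic table; the tables, column names, and values are all hypothetical.

```python
import pandas as pd

# In-house data (hypothetical transactions)
transactions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "zip_code": ["00100", "80100", "00100"],
    "amount": [120.0, 80.0, 45.5],
})

# External data (hypothetical census extract, one row per zip code)
census = pd.DataFrame({
    "zip_code": ["00100", "80100"],
    "median_income": [54000, 38000],
})

# Left join keeps every transaction; validate guards against
# duplicate zip codes in the census table silently multiplying rows
enriched = transactions.merge(
    census, on="zip_code", how="left", validate="many_to_one"
)
print(enriched)
```

A left join is used so that transactions without a matching external record are kept (with missing values) rather than dropped.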
2.4 Data Sources: Surveys, APIs, Databases, and Web Data
2.4.1 Surveys and Forms
What Are Surveys?
Surveys are systematic methods of gathering information from a defined population through a set of structured or semi-structured questions, typically for research, feedback, or data collection purposes.
Types of Surveys
| Type | Description | Advantages | Limitations |
|---|---|---|---|
| Online Surveys | Web-based questionnaires distributed via email, social media, or embedded in websites | Cheap, fast, wide reach, easy analysis | Low response rates, self-selection bias, no interviewer to clarify |
| Telephone Surveys (CATI) | Computer-Assisted Telephone Interviewing | Higher response rates than online, can clarify questions | Expensive, declining landline usage, time-consuming |
| Face-to-Face Interviews | In-person structured or semi-structured interviews | Highest quality responses, non-verbal cues | Very expensive, interviewer bias, not scalable |
| Mail Surveys | Paper questionnaires sent and returned by mail | Reaches populations without internet | Slowest method, very low response rates |
| Mobile Surveys | Optimized for smartphones | High accessibility, in-the-moment capture | Screen size limits complexity |
| Longitudinal / Panel Surveys | Same participants surveyed repeatedly over time | Tracks changes and trends over time | Attrition of participants |
Popular Survey Tools
| Tool | Key Features |
|---|---|
| Google Forms | Free, simple, integrates with Google Sheets |
| SurveyMonkey | Professional features, templates, analytics |
| Typeform | Interactive, conversational UI, good UX |
| Qualtrics | Enterprise-grade, advanced logic, research-focused |
| Microsoft Forms | Integrated with Microsoft 365 |
| LimeSurvey | Open-source, self-hosted option |
| REDCap | Specialized for clinical and academic research |
Sampling Methods for Surveys
| Method | Type | Description |
|---|---|---|
| Simple Random | Probability | Every member has an equal chance of selection |
| Stratified Random | Probability | Population divided into strata; random sample from each stratum |
| Cluster | Probability | Population divided into clusters; entire clusters randomly selected |
| Systematic | Probability | Every kth member selected from a list |
| Convenience | Non-Probability | Whoever is available/easiest to reach |
| Snowball | Non-Probability | Existing participants recruit future participants |
| Quota | Non-Probability | Sample selected to match known population proportions |
| Purposive/Judgmental | Non-Probability | Researcher selects participants based on judgment |
Key Consideration: Probability sampling allows statistical inference to the broader population. Non-probability sampling is easier and cheaper but results cannot be generalized with the same confidence.
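Three of the probability sampling methods in the table can be sketched with pandas as follows. The population of 100 people, the `region` strata, and the sample sizes are invented for illustration.

```python
import random

import pandas as pd

# Hypothetical sampling frame: 100 people in two regions
population = pd.DataFrame({
    "id": range(1, 101),
    "region": ["North"] * 60 + ["South"] * 40,
})

# Simple random sampling: every member has the same chance of selection
simple = population.sample(n=10, random_state=42)

# Stratified random sampling: draw 10% at random from each stratum
stratified = population.groupby("region").sample(frac=0.1, random_state=42)

# Systematic sampling: random start, then every k-th member of the list
k = 10
start = random.Random(42).randrange(k)
systematic = population.iloc[start::k]

print(stratified["region"].value_counts())
```

The stratified sample mirrors the 60/40 regional split exactly, which a simple random sample of the same size does not guarantee.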
Common Survey Biases
| Bias | Description | Mitigation |
|---|---|---|
| Selection Bias | Sample not representative of the population | Use probability sampling |
| Response Bias | Respondents answer inaccurately (social desirability, acquiescence) | Anonymize, use neutral wording |
| Non-Response Bias | Those who respond differ systematically from those who don’t | Follow-up reminders, incentives, analyze non-respondents |
| Leading Question Bias | Questions that suggest a desired answer | Neutral question wording, pilot testing |
| Recall Bias | Respondents don’t accurately remember past events | Use shorter recall periods, provide reference points |
| Survivorship Bias | Only surveying current customers, not those who left | Include churned/former customers |
| Order Effects | Answer influenced by position in questionnaire | Randomize question order |
2.4.2 APIs (Application Programming Interfaces)
What Is an API?
An API (Application Programming Interface) is a set of defined rules, protocols, and tools that allows different software applications to communicate with each other and exchange data in a structured, programmatic way.
For data scientists, APIs are a primary mechanism for programmatically accessing data from external services, platforms, and databases.
Popular APIs for Data Science
| Category | API | Data Provided |
|---|---|---|
| Social Media | Twitter/X API, Reddit API, Meta Graph API | Posts, tweets, user data, engagement metrics |
| Finance | Alpha Vantage, Yahoo Finance, Polygon.io, Quandl | Stock prices, financial statements, crypto data |
| Weather | OpenWeatherMap, WeatherAPI, NOAA | Current weather, forecasts, historical weather |
| Maps & Location | Google Maps API, OpenStreetMap, Mapbox | Geocoding, directions, places, traffic |
| NLP & AI | OpenAI API (GPT), Google Cloud NLP, HuggingFace | Text generation, sentiment analysis, translation |
| Government | Census API, data.gov, World Bank API | Demographics, economic indicators, health data |
| E-commerce | Amazon Product API, Shopify API, eBay API | Product data, pricing, reviews |
| News | NewsAPI, GDELT, NYTimes API | News articles, headlines, events |
| Music | Spotify API, Last.fm API | Song data, playlists, listening history |
| Sports | ESPN API, SportRadar, NBA Stats API | Scores, player statistics, game data |
API Authentication Methods
| Method | Description | Security Level |
|---|---|---|
| API Key | Simple key passed as a query parameter or header | Basic |
| OAuth 2.0 | Token-based authorization; user grants permission | High |
| Bearer Token | Token included in the HTTP Authorization header | Medium-High |
| Basic Auth | Username and password encoded in base64 (an encoding, not encryption; use only over HTTPS) | Low |
| JWT (JSON Web Token) | Self-contained token with encoded user info | High |
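To make the table concrete, the sketch below builds the HTTP headers for three of these methods. The `Authorization` header formats for Bearer and Basic are standard; the `X-API-Key` header name is only an assumption, since each API documents its own header or query parameter, and the credentials shown are placeholders.

```python
import base64


def api_key_headers(key: str) -> dict[str, str]:
    """API key in a custom header (the header name varies by provider)."""
    return {"X-API-Key": key}


def bearer_headers(token: str) -> dict[str, str]:
    """Bearer token in the standard Authorization header."""
    return {"Authorization": f"Bearer {token}"}


def basic_auth_headers(username: str, password: str) -> dict[str, str]:
    """Basic Auth: 'username:password' encoded (not encrypted) in base64."""
    raw = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


headers = basic_auth_headers("user", "pass")
print(headers)

# Anyone who sees the header can decode it, hence the low security rating
recovered = base64.b64decode(headers["Authorization"].split(" ", 1)[1])
print(recovered)
```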
API Best Practices for Data Collection
| Practice | Description |
|---|---|
| Respect Rate Limits | Most APIs limit requests per minute/hour; implement backoff strategies |
| Cache Responses | Store API responses locally to avoid redundant calls |
| Handle Errors Gracefully | Implement try/except blocks, retry logic, and logging |
| Paginate Large Requests | Many APIs return data in pages; loop through all pages |
| Secure Your Keys | Never hardcode API keys; use environment variables or secret managers |
| Read Documentation | Always read the API docs thoroughly before coding |
| Monitor Usage | Track API consumption to avoid exceeding quotas or incurring costs |
| Version Awareness | APIs can change; pin to specific versions when possible |
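Several of these practices (reading the key from the environment, paginating, retrying with exponential backoff) can be combined in one small loop. The sketch below is deliberately independent of any real API: `fetch_page`, `TransientError`, the `EXAMPLE_API_KEY` variable, and the `items`/`next_page` response shape are all hypothetical, and a fake in-memory API stands in for the network so the example runs offline.

```python
import os
import time

# Read the key from the environment instead of hardcoding it
# (EXAMPLE_API_KEY is a made-up variable name; unused by the fake API below)
API_KEY = os.environ.get("EXAMPLE_API_KEY", "")


class TransientError(Exception):
    """Stands in for a retryable failure such as HTTP 429 or 503."""


def fetch_all(fetch_page, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Collect items from every page, retrying with exponential backoff."""
    items = []
    page = 1
    while page is not None:
        for attempt in range(max_retries + 1):
            try:
                response = fetch_page(page)
                break
            except TransientError:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        items.extend(response["items"])
        page = response.get("next_page")  # None signals the last page
    return items


def make_fake_api():
    """Fake two-page API whose very first call fails once."""
    pages = {
        1: {"items": [1, 2], "next_page": 2},
        2: {"items": [3], "next_page": None},
    }
    calls = {"count": 0}

    def fetch_page(page):
        calls["count"] += 1
        if calls["count"] == 1:
            raise TransientError("rate limited")
        return pages[page]

    return fetch_page


delays = []  # record requested delays instead of actually sleeping
records = fetch_all(make_fake_api(), sleep=delays.append)
print(records, delays)
```

Passing `sleep` as a parameter keeps the backoff logic testable without real waiting; in production the default `time.sleep` applies.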
2.4.3 Databases and Data Stores
What Is a Database?
A database is an organized collection of data stored and accessed electronically, managed by a Database Management System (DBMS) that provides mechanisms for storing, retrieving, updating, and managing data.
Relational Databases (RDBMS)
The backbone of structured data storage for decades. Data is organized into tables with relationships between them.
2.4.4 Core Concepts
| Concept | Description |
|---|---|
| Table (Relation) | A collection of related data organized in rows and columns |
| Row (Record/Tuple) | A single data entry in a table |
| Column (Field/Attribute) | A specific property/variable in a table |
| Primary Key (PK) | A unique identifier for each row in a table |
| Foreign Key (FK) | A column that references the primary key of another table (creates relationships) |
| Index | A data structure that speeds up data retrieval |
| Schema | The blueprint/structure of the database (tables, columns, types, constraints) |
| View | A virtual table based on the result of a SQL query |
| Stored Procedure | Pre-compiled SQL code stored in the database |
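Most of these concepts can be seen together in a few lines using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for illustration; stored procedures are omitted because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Schema: two tables, a primary key on each, a foreign key, an index, a view
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE VIEW customer_totals AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id;
""")

# Rows (records)
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Amina"), (2, "Brian")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0)])

# Querying the view behaves like querying a table
totals = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY name"
).fetchall()
print(totals)

# The foreign key rejects an order for a customer that does not exist
try:
    conn.execute("INSERT INTO orders VALUES (4, 99, 10.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```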
Popular Relational Databases
| Database | Type | Best For |
|---|---|---|
| PostgreSQL | Open source | General purpose, advanced features, geospatial |
| MySQL | Open source | Web applications, WordPress, scalable reads |
| SQLite | Embedded | Local applications, prototyping, mobile apps |
| Microsoft SQL Server | Commercial | Enterprise Windows environments |
| Oracle Database | Commercial | Large enterprise, financial services |
| MariaDB | Open source (MySQL fork) | Drop-in MySQL replacement |
2.4.5 Summary: Choosing the Right Data Source
| Factor | Consideration |
|---|---|
| Research Question | What data do you actually need to answer your question? |
| Availability | Does the data exist? Is it accessible? |
| Quality | How reliable, complete, and accurate is the data? |
| Cost | What is the budget for data acquisition? |
| Time | How quickly do you need the data? |
| Legal/Ethical | Are there privacy, licensing, or ethical constraints? |
| Format | Can you work with the data format, or is significant transformation needed? |
| Scale | Is the data volume appropriate for your analysis needs? |
| Freshness | How current does the data need to be? |
2.5 Data Quality and Data Governance Concepts
Why Data Quality Matters
“Garbage In, Garbage Out” (GIGO): The most sophisticated algorithm in the world will produce meaningless results if fed poor-quality data.
Data quality is not just a technical concern — it has real business impact:
| Impact Area | Consequence of Poor Data Quality |
|---|---|
| Decision-Making | Wrong conclusions lead to wrong decisions |
| Financial | Gartner estimates poor data quality costs organizations an average of $12.9 million per year |
| Customer Experience | Wrong addresses, duplicate communications, personalization failures |
| Regulatory | Compliance violations (fines under GDPR can reach €20M or 4% of global annual turnover, whichever is higher) |
| Model Performance | ML models trained on dirty data produce unreliable predictions |
| Operational | Failed processes, reconciliation delays, manual workarounds |
| Trust | Stakeholders lose confidence in analytics and reporting |
| Opportunity Cost | Data scientists are often reported to spend 60-80% of their time cleaning data instead of analyzing it |
2.5.1 Dimensions of Data Quality
Data quality is multidimensional. A dataset may score well on one dimension but poorly on another. The most widely recognized dimensions are:
The Six Core Dimensions (DAMA Framework)
| # | Dimension | Definition | Example of Poor Quality | How to Measure |
|---|---|---|---|---|
| 1 | Accuracy | Data correctly represents the real-world entity or event it models | Customer age recorded as 250; address doesn’t match actual location | % of records matching a verified source; error rate |
| 2 | Completeness | All required data is present; no critical values are missing | 30% of customer records have no email address; missing ZIP codes | % of non-null values; % of records with all required fields |
| 3 | Consistency | Data does not contradict itself across systems or within a dataset | Customer listed as “Active” in CRM but “Cancelled” in billing system; “NY” vs “New York” | Cross-system reconciliation; # of conflicting records |
| 4 | Timeliness | Data is up-to-date and available when needed | Using 2019 market data for 2024 decisions; dashboard refreshed weekly instead of hourly | Data age; refresh frequency; latency |
| 5 | Validity | Data conforms to defined formats, ranges, and business rules | Email without “@” symbol; age = -5; date format “13/25/2024” | % passing validation rules; # of constraint violations |
| 6 | Uniqueness | Each entity is represented only once (no duplicates) | Same customer appears 3 times with slightly different names | Duplicate rate; # of records after deduplication |
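Several of the measures in the "How to Measure" column reduce to simple proportions. The sketch below computes completeness, uniqueness, and validity for a tiny, made-up customer table; the 0-120 age range and the email pattern are example business rules, not fixed standards.

```python
import pandas as pd

# Hypothetical records containing deliberate quality problems
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "age": [34, 250, 41, -5],
})

# Completeness: share of non-null values in a required field
completeness = df["email"].notna().mean()

# Uniqueness: share of rows that do not repeat an earlier customer_id
uniqueness = 1 - df["customer_id"].duplicated().mean()

# Validity: share of values that pass a range rule or a format rule
valid_age = df["age"].between(0, 120).mean()
valid_email = df["email"].str.contains(
    r"^[\w.-]+@[\w.-]+\.\w+$", na=False
).mean()

print(f"Email completeness: {completeness:.0%}")
print(f"ID uniqueness:      {uniqueness:.0%}")
print(f"Age validity:       {valid_age:.0%}")
print(f"Email validity:     {valid_email:.0%}")
```

Accuracy is not computed here because it requires comparison against a verified external source, which validation rules alone cannot provide.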
Additional Quality Dimensions
| Dimension | Definition | Example |
|---|---|---|
| Integrity | Relationships between data elements are maintained (referential integrity) | An order references a customer_id that doesn’t exist in the customer table |
| Relevance | Data is applicable and useful for the intended purpose | Collecting shoe size data for a financial fraud model |
| Precision | Level of detail/granularity is appropriate | Recording revenue as “about $1M” vs “$1,023,456.78” |
| Conformity | Data follows standard formats and naming conventions | Dates stored as “Jan 15, 2024”, “2024-01-15”, “15/01/2024” inconsistently |
| Auditability | Data lineage and changes can be traced | No log of who changed a record or when |
Common Data Quality Issues
| Issue | Description | Example | Impact |
|---|---|---|---|
| Missing Values | Null, blank, or absent data points | Empty phone number field; NULL income | Biased analysis; model errors |
| Duplicates | Same entity recorded multiple times | “John Smith” and “Jon Smith” at same address | Inflated counts; wasted marketing spend |
| Inconsistent Formats | Same information represented differently | “USA”, “United States”, “US”, “U.S.A.” | Grouping/aggregation errors |
| Outliers | Extreme values that may or may not be valid | Salary of $10,000,000 for a junior analyst | Skewed statistics; model distortion |
| Stale Data | Data that is no longer current | Customer address from 5 years ago | Failed deliveries; wrong analysis |
| Incorrect Data | Factually wrong values | Birth year 2095; negative quantities | Wrong conclusions; compliance risk |
| Encoding Issues | Character set or encoding problems | “CafÃ©” instead of “Café”; garbled text | Data loss; parsing failures |
| Schema Changes | Data structure changes without documentation | New column added; column renamed | Pipeline failures; broken queries |
| Unit Mismatches | Different measurement units mixed | Temperature in Celsius and Fahrenheit in same column | Mathematical errors |
| Selection Bias | Data doesn’t represent the target population | Only surveying English-speaking users | Biased models; unfair outcomes |
| Label Errors | Incorrect labels in supervised learning data | Image of a cat labeled as “dog” | Poor model training |
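The encoding issue listed in the table typically arises when text saved as UTF-8 is read with a single-byte codec such as Latin-1. The short sketch below reproduces the problem and reverses it; the repair shown works only when the wrong codec is known and no bytes were lost along the way.

```python
original = "Café"

# Write as UTF-8, then read the same bytes back with the wrong codec
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Undo the mistake by reversing the two steps
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

In practice the more reliable fix is to specify the correct encoding when the file is first read, for example through the `encoding` argument of `pd.read_csv`.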
Data Quality Assessment and Profiling
Data profiling is the process of examining data to understand its structure, content, quality, and relationships. It is the first step in any data quality improvement effort.
Data Profiling Techniques
| Technique | What It Reveals |
|---|---|
| Column Analysis | Data type, % null, distinct values, min/max, mean/median, distribution |
| Pattern Analysis | Common formats, regex patterns, unexpected characters |
| Frequency Analysis | Most/least common values, distribution of categories |
| Cross-Column Analysis | Correlations, dependencies, functional relationships |
| Cross-Table Analysis | Referential integrity, join quality, orphan records |
| Temporal Analysis | Trends over time, gaps in time series, seasonality |
| Rule-Based Validation | Checking against predefined business rules |
2.5.2 Data Profiling Example in Python

    import pandas as pd
    import numpy as np

    # Load dataset
    df = pd.read_csv('customer_data.csv')

    # === BASIC PROFILING ===

    # Shape and types
    print(f"Shape: {df.shape}")
    print(f"\nData Types:\n{df.dtypes}")

    # Missing values analysis
    missing = df.isnull().sum()
    missing_pct = (df.isnull().sum() / len(df)) * 100
    missing_report = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct.round(2)
    }).sort_values('Missing %', ascending=False)
    print(f"\nMissing Values:\n{missing_report}")

    # Summary statistics
    print(f"\nNumerical Summary:\n{df.describe()}")
    print(f"\nCategorical Summary:\n{df.describe(include='object')}")

    # Duplicate detection
    duplicates = df.duplicated().sum()
    print(f"\nDuplicate Rows: {duplicates} ({(duplicates/len(df)*100):.2f}%)")

    # Unique values per column
    for col in df.columns:
        n_unique = df[col].nunique()
        print(f"{col}: {n_unique} unique values ({(n_unique/len(df)*100):.1f}%)")

    # Value distribution for categorical columns
    for col in df.select_dtypes(include='object').columns:
        print(f"\n{col} - Top 10 Values:")
        print(df[col].value_counts().head(10))

    # Outlier detection using IQR
    for col in df.select_dtypes(include=[np.number]).columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)).sum()
        print(f"{col}: {outliers} outliers ({(outliers/len(df)*100):.2f}%)")


    # === AUTOMATED PROFILING TOOLS ===

    # Using ydata-profiling (formerly pandas-profiling)
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Customer Data Quality Report")
    profile.to_file("data_quality_report.html")

    # Using Great Expectations (rule-based validation)
    # Note: context.sources.pandas_default belongs to the pre-1.0 fluent API;
    # later releases restructured it, so check the docs for your installed version.
    import great_expectations as gx

    context = gx.get_context()
    validator = context.sources.pandas_default.read_dataframe(df)

    # Define expectations
    validator.expect_column_values_to_not_be_null("customer_id")
    validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
    validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
    validator.expect_column_values_to_be_in_set("status", ["Active", "Inactive", "Suspended"])
    validator.expect_column_values_to_be_unique("customer_id")

2.5.3 Data Cleaning Strategies
| Issue | Strategy | Python Example |
|---|---|---|
| Missing Values | Drop, impute (mean/median/mode), forward/backward fill, predictive imputation | df['age'] = df['age'].fillna(df['age'].median()) |
| Duplicates | Identify and remove exact and fuzzy duplicates | df = df.drop_duplicates(subset=['email'], keep='first') |
| Inconsistent Categories | Standardize and map values | df['country'] = df['country'].replace({'US':'United States', 'USA':'United States'}) |
| Outliers | Remove, cap/floor (winsorize), or transform | df['income'] = df['income'].clip(lower=0, upper=df['income'].quantile(0.99)) |
| Wrong Data Types | Cast to correct types | df['date'] = pd.to_datetime(df['date']) |
| Whitespace/Formatting | Strip and normalize | df['name'] = df['name'].str.strip().str.title() |
| Invalid Values | Validate against rules; replace or flag | df.loc[df['age'] < 0, 'age'] = np.nan |
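The individual strategies in the table are usually chained into one cleaning step. The sketch below applies several of them to a small, made-up dataset; it assigns results back to columns rather than using `inplace=True` on a selected column, which recent pandas versions discourage.

```python
import pandas as pd

# Hypothetical raw data with duplicates, messy text, and an invalid age
raw = pd.DataFrame({
    "name": ["  john smith ", "MARY WANJIRU", "  john smith "],
    "email": ["john@example.com", "mary@example.com", "john@example.com"],
    "country": ["US", "USA", "US"],
    "age": [34.0, -5.0, 34.0],
    "signup": ["2024-01-15", "2024-02-01", "2024-01-15"],
})

clean = (
    raw.drop_duplicates(subset=["email"], keep="first")        # duplicates
    .assign(
        name=lambda d: d["name"].str.strip().str.title(),      # whitespace/case
        country=lambda d: d["country"].replace(
            {"US": "United States", "USA": "United States"}    # categories
        ),
        age=lambda d: d["age"].where(d["age"] >= 0),            # invalid -> NaN
        signup=lambda d: pd.to_datetime(d["signup"]),           # data types
    )
)

# Missing values: impute with the median of the remaining valid ages
clean["age"] = clean["age"].fillna(clean["age"].median())
print(clean)
```

Invalid ages are converted to missing values first so that they do not distort the median used for imputation.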
2.5.4 Data Governance
Definition
Data Governance is the overall management of the availability, usability, integrity, quality, and security of data used in an organization. It establishes the policies, processes, standards, roles, and metrics that ensure effective and efficient use of data.
Data governance is not just a technology problem — it is an organizational discipline that encompasses people, processes, and technology.
2.5.5 Data Governance Policies
| Policy Area | Description | Examples |
|---|---|---|
| Data Classification | Categorizing data by sensitivity level | Public, Internal, Confidential, Restricted/Secret |
| Data Access Control | Who can access what data and under what conditions | Role-based access (RBAC), need-to-know basis |
| Data Retention | How long data is kept and when it is archived/deleted | Financial records retained for 7 years; logs for 90 days |
| Data Privacy | How personal data is collected, used, stored, and shared | Consent management, anonymization, right to deletion |
| Data Quality Standards | Minimum quality thresholds for data to be used | Completeness > 95%, accuracy verified quarterly |
| Data Sharing | Rules for sharing data internally and externally | Data sharing agreements, anonymization requirements |
| Acceptable Use | How data may and may not be used | No using health data for marketing without explicit consent |
| Master Data Management | Standards for maintaining master/reference data | Single source of truth for customer, product data |
2.5.6 Data Privacy and Regulatory Compliance
Major Data Privacy Regulations
| Regulation | Region | Key Requirements |
|---|---|---|
| GDPR (General Data Protection Regulation) | European Union (2018) | Consent, right to access/delete/port data, data protection by design, breach notification within 72 hours, DPO appointment where required |
| CCPA / CPRA (California Consumer Privacy Act / California Privacy Rights Act) | California, USA (2020/2023) | Right to know, delete, opt-out of data sale, non-discrimination |
| HIPAA (Health Insurance Portability and Accountability Act) | USA | Protects health information (PHI); strict security and access controls |
GDPR Key Principles
| Principle | Description |
|---|---|
| Lawfulness, Fairness, Transparency | Data must be processed legally, fairly, and transparently |
| Purpose Limitation | Data collected for specific, explicit purposes only |
| Data Minimization | Only collect data that is necessary for the stated purpose |
| Accuracy | Data must be accurate and kept up to date |
| Storage Limitation | Data should not be kept longer than necessary |
| Integrity & Confidentiality | Data must be protected against unauthorized access, loss, or damage |
| Accountability | Organizations must demonstrate compliance |
Data Security Fundamentals
| Security Measure | Description |
|---|---|
| Encryption | Encrypting data at rest (stored) and in transit (transmitted) — AES-256, TLS/SSL |
| Access Control | Role-Based Access Control (RBAC); principle of least privilege |
| Authentication | Verifying identity — passwords, MFA (multi-factor authentication), SSO |
| Authorization | Determining what authenticated users are allowed to do |
| Anonymization | Removing personally identifiable information (PII) irreversibly |
| Pseudonymization | Replacing identifiers with pseudonyms (reversible with a key) |
| Data Masking | Hiding sensitive data (e.g., showing only last 4 digits of SSN: XXX-XX-1234) |
| Audit Logging | Recording who accessed what data, when, and what they did |
| Network Security | Firewalls, VPNs, intrusion detection systems |
| Backup & Recovery | Regular backups with tested recovery procedures |
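The masking and pseudonymization rows above can be illustrated with a short sketch that uses only Python's standard library. The function names `mask_ssn` and `pseudonymize` and the sample key are hypothetical; a real system would load the key from a secrets manager rather than hard-code it. Keyed hashing (HMAC) is used here as one common pseudonymization technique: whoever holds the key can recompute the pseudonym for a known identifier (or keep a lookup table) to link records back.

```python
import hashlib
import hmac


def mask_ssn(ssn: str) -> str:
    """Data masking: show only the last 4 digits of a 9-digit SSN."""
    digits = [c for c in ssn if c.isdigit()]
    if len(digits) != 9:
        raise ValueError("expected a 9-digit SSN")
    return "XXX-XX-" + "".join(digits[-4:])


def pseudonymize(identifier: str, key: bytes) -> str:
    """Pseudonymization: replace an identifier with a keyed hash (HMAC-SHA256).

    The same identifier and key always produce the same pseudonym,
    so records can still be joined without exposing the identifier.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustration only: never hard-code a real key.
demo_key = b"demo-key-for-illustration-only"
print(mask_ssn("123-45-1234"))
print(pseudonymize("alice@example.com", demo_key))
```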
2.5.7 Data Ethics
Beyond legal compliance, data practitioners must consider ethical responsibilities:
Key Ethical Principles
| Principle | Description |
|---|---|
| Transparency | Be clear about what data you collect, why, and how it will be used |
| Fairness | Ensure analyses and models do not discriminate against protected groups |
| Privacy | Respect individuals’ right to control their personal information |
| Consent | Obtain informed consent before collecting personal data |
| Accountability | Take responsibility for the outcomes of data-driven decisions |
| Beneficence | Ensure data use creates benefit and minimizes harm |
| Data Minimization | Only collect and retain data that is necessary |
| Human Oversight | Maintain human review for high-stakes automated decisions |
2.6 Summary
Data comes in three structural forms:
Structured (~10-20%): Tables with rows and columns, stored in RDBMS, queried with SQL
Semi-Structured (~5-10%): JSON, XML, logs — some organization but flexible schema
Unstructured (~80-90%): Text, images, audio, video — requires AI/ML to analyze
Primary data is collected firsthand for your specific purpose (high relevance, high cost). Secondary data is pre-existing data collected by others (lower cost, may not perfectly fit).
Data sources are diverse — surveys, APIs, databases, web scraping, IoT sensors, public datasets. The choice depends on the research question, cost, quality, and legal/ethical constraints.
Data quality has multiple dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Poor data quality has significant financial and operational consequences.
Data governance is the organizational framework (people, policies, processes, technology) that ensures data is managed as a strategic asset — covering quality, security, privacy, compliance, and ethics.
Privacy regulations (GDPR, CCPA, HIPAA, etc.) impose strict requirements on how personal data is collected, processed, stored, and shared. Non-compliance carries severe penalties.
Ethics in data science goes beyond legal compliance — practitioners must consider fairness, transparency, consent, bias, and the potential for harm in all data-related activities.
Data catalogs and lineage help organizations discover, understand, trust, and trace their data assets.
2.7 Review Questions
A hospital stores patient records in a relational database (name, DOB, diagnosis codes), medical images (X-rays, MRIs) in a file system, and doctor’s notes as free-text documents. Classify each data type as structured, semi-structured, or unstructured. What different tools/techniques would be needed to analyze each?
A startup wants to build a model predicting restaurant success in a new city. Propose a data collection strategy that uses both primary and secondary data. What specific sources would you recommend?
You receive a customer dataset with 2 million records. Describe the step-by-step data profiling process you would follow to assess its quality. What specific checks would you perform?
A company’s marketing team wants to purchase third-party data about consumer spending habits to enrich their customer profiles. What data governance and ethical considerations should be evaluated before proceeding?
Design a simple data governance framework for a mid-sized e-commerce company. What roles, policies, and tools would you recommend?
Explain how data lineage could help a data analyst debug a dashboard that is showing incorrect revenue figures.
Exercise 1: Data Profiling
Download the Titanic dataset from Kaggle and perform a complete data quality assessment using Python. Report on missing values, data types, duplicates, outliers, and inconsistencies.
Exercise 2: API Data Collection
Write a Python script that collects weather data for 5 cities using the OpenWeatherMap API, stores the results in a pandas DataFrame, and exports to CSV.
Exercise 3: Data Cleaning Pipeline
Given a messy dataset with missing values, duplicates, inconsistent formatting, and outliers, write a Python data cleaning pipeline that addresses all issues and produces an analysis-ready dataset.
2.8 Further Reading & References
DAMA International. (2017). DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition). Technics Publications.
Redman, T.C. (2008). Data Driven: Profiting from Your Most Important Business Asset. Harvard Business Press.
Reis, J. & Housley, M. (2022). Fundamentals of Data Engineering. O’Reilly Media.
Ladley, J. (2019). Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program (2nd Edition). Academic Press.
European Commission. GDPR Official Text: https://gdpr.eu/
Great Expectations Documentation: https://greatexpectations.io/
Topic Three: Data Cleaning and Preprocessing
3.1 Handling Missing Data
Missing values can arise from data entry errors, system issues, or unavailable information.
Common approaches
Remove rows/columns
- Use when missingness is small or the feature is not important.
Imputation
- Numerical data: mean, median, interpolation
- Categorical data: mode, “Unknown” category
- Advanced methods: KNN imputation, model-based imputation
Flag missingness
- Add a binary feature indicating whether a value was missing.
Key consideration
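Understand why data is missing, because the mechanism determines which approach is safe:
MCAR (Missing Completely at Random): missingness is unrelated to any observed or unobserved values; dropping rows reduces sample size but does not bias results.
MAR (Missing at Random): missingness depends only on other observed variables; imputation that uses those variables is appropriate.
MNAR (Missing Not at Random): missingness depends on the missing value itself (e.g., high earners not reporting income); simple deletion or imputation can bias results.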
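The common approaches above can be sketched with pandas as follows. The toy DataFrame and its column names are made up for illustration, and the median/"Unknown" choices are just one reasonable option, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["Nairobi", None, "Mombasa", "Nairobi", "Nairobi"],
})

# 1. Inspect the fraction of missing values per column.
missing_share = df.isna().mean()

# 2. Flag missingness before imputing, so the information is not lost.
df["age_missing"] = df["age"].isna().astype(int)

# 3. Impute: median for the numeric column, "Unknown" for the categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")

# Alternative: drop rows with any missing value
# (only sensible when few rows are affected).
# df = df.dropna()

print(missing_share)
print(df)
```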
3.2 Outlier Detection and Treatment
Outliers are data points that differ significantly from other observations. They can be genuine extreme cases or simple errors.
Detection methods
Statistical
Z-Score: identify points that fall more than 3 standard deviations from the mean
IQR Method: define outliers as points falling below \(Q1 - 1.5 \times IQR\) or above \(Q3 + 1.5 \times IQR\)
Visualization
Boxplots
Scatterplots
Histograms
Model-based
Isolation Forest
DBSCAN
Local Outlier Factor
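The two statistical rules can be sketched with pandas as shown here. The sample values are made up for illustration (twenty typical readings around 50 plus one extreme value), and the cut-offs of 3 standard deviations and 1.5 × IQR are the conventional defaults rather than fixed requirements.

```python
import pandas as pd

# Hypothetical sample: 20 typical values around 50 plus one extreme value.
values = pd.Series(
    [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
     50, 48, 52, 50, 49, 51, 50, 47, 53, 50, 120],
    dtype=float,
)

# Z-score rule: flag points more than 3 standard deviations from the mean.
# pandas uses the sample standard deviation (ddof=1) by default.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("Z-score outliers:", z_outliers.tolist())
print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
```

Both rules flag only the value 120 here. In very small samples a single extreme value inflates the standard deviation, so the z-score rule can miss outliers that the IQR rule catches.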
Treatment options
Remove if clearly erroneous
Cap/winsorize extreme values
Transform data
Keep them if they are valid and meaningful
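As a minimal sketch of two of these options, the example below caps values at the IQR fences and, alternatively, removes them. Capping at the IQR fences is one common variant; classic winsorizing caps at chosen percentiles instead. The data are hypothetical.

```python
import pandas as pd

# Hypothetical values with one extreme observation.
values = pd.Series([47.0, 49.0, 50.0, 50.0, 51.0, 53.0, 120.0])

# Compute the IQR fences.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: cap values outside the fences instead of dropping them.
capped = values.clip(lower=lower, upper=upper)

# Option 2: remove values outside the fences
# (appropriate when they are clearly erroneous).
removed = values[values.between(lower, upper)]

print(capped.tolist())
print(removed.tolist())
```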
3.3 Data Transformation and Normalization
Used to make data suitable for analysis or machine learning.
Transformation techniques
Log Transformation: compresses large values in right-skewed, positive data so that the distribution is closer to normal
Square root / Box-Cox / Yeo-Johnson: stabilize variance
Binning: convert continuous values into intervals
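The transformation techniques above can be sketched with NumPy and pandas. The `income` values and the band boundaries are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable: most incomes are modest, one is very large.
income = pd.Series([20_000, 35_000, 50_000, 80_000, 1_000_000], dtype=float)

# Log transformation: log1p computes log(1 + x), so zeros are handled safely.
log_income = np.log1p(income)

# Square-root transformation: a milder way to compress large values.
sqrt_income = np.sqrt(income)

# Binning: convert the continuous values into labelled intervals.
# Intervals are right-inclusive by default: (0, 30000], (30000, 60000], (60000, inf).
bands = pd.cut(
    income,
    bins=[0, 30_000, 60_000, np.inf],
    labels=["low", "mid", "high"],
)

print("Skewness before:", income.skew())
print("Skewness after log:", log_income.skew())
print(bands.tolist())
```

The skewness drops after the log transformation, although with only five points the result is illustrative rather than conclusive.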
Scaling / Normalization
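Many machine learning algorithms, especially distance-based or gradient-based ones such as SVM or K-Means, are sensitive to the scale of the data. If one feature ranges from 0–1 and another from 0–10,000, the larger-scale feature will dominate distance calculations and therefore the model. Tree-based models are largely unaffected by feature scale.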
Min-Max Scaling: rescales to [0,1]
Standardization: mean = 0, std = 1
Robust Scaling: uses median and IQR; useful with outliers
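The three scalers can be written out directly from their formulas, which makes the differences visible. This is a sketch on made-up values; scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` implement the same ideas for use in pipelines.

```python
import pandas as pd

# Hypothetical feature with one large outlier.
x = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and std 1.
# ddof=0 uses the population standard deviation, as StandardScaler does.
standardized = (x - x.mean()) / x.std(ddof=0)

# Robust scaling: (x - median) / IQR, less affected by the outlier at 1000.
iqr = x.quantile(0.75) - x.quantile(0.25)
robust = (x - x.median()) / iqr

print(pd.DataFrame({
    "min_max": min_max,
    "standardized": standardized,
    "robust": robust,
}))
```

With min-max scaling the outlier squeezes the four typical values into a narrow band near 0, while robust scaling keeps them evenly spread. In a modelling workflow the scaling parameters should be computed on the training data only and then applied to the test data.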
3.4 Feature Creation and Encoding
Feature creation makes raw data more informative, which can improve model performance. Most models also require numeric input, so categorical labels must be converted into numerical formats.
Feature creation
Date-based features: year, month, day, weekday
Aggregations: totals, averages, counts
Interaction terms: multiply or combine variables
Domain-specific engineered variables
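The feature types listed above can be sketched on a small, hypothetical orders table; all column names are illustrative.

```python
import pandas as pd

# Hypothetical order data for two customers.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-02", "2024-03-03"]
    ),
    "quantity": [2, 1, 5, 3, 1],
    "unit_price": [10.0, 25.0, 4.0, 10.0, 8.0],
})

# Date-based features.
orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()

# Interaction term: combine two variables into a more informative one.
orders["order_value"] = orders["quantity"] * orders["unit_price"]

# Aggregations: totals, averages, and counts per customer.
customer_features = orders.groupby("customer_id").agg(
    total_spent=("order_value", "sum"),
    avg_order_value=("order_value", "mean"),
    order_count=("order_value", "count"),
).reset_index()

print(orders)
print(customer_features)
```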
Encoding categorical variables
Label Encoding: assign a unique integer to each category (e.g., Red=1, Blue=2). Because integers imply an order, this is best suited to ordinal data where order matters (e.g., Small, Medium, Large)
One-Hot Encoding: create binary columns
Ordinal Encoding: for ordered categories
Target / Frequency Encoding: useful for high-cardinality categories
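The encodings above can be sketched with pandas on a small, made-up table. The `size_order` mapping is an assumption that the analyst supplies from domain knowledge.

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal column.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Green"],
    "size": ["Small", "Large", "Medium", "Small"],
})

# One-hot encoding: one binary column per category (no order implied).
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: map ordered categories to integers that respect the order.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Frequency encoding: replace each category with its relative frequency.
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

print(one_hot)
print(df)
```

Target encoding works similarly to frequency encoding but uses the mean of the target variable per category; it should be computed on training data only to avoid leaking the target into the features.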
3.5 Reproducible Data Workflows
Ensures preprocessing is consistent, traceable, and reusable.
Best practices
Keep raw data unchanged
Automate preprocessing with scripts/pipelines
Document assumptions and steps
Use version control (e.g., Git)
Set random seeds for reproducibility
Use notebooks carefully; move final logic into reusable code
Track data and model versions
Tools often used
Python: pandas, scikit-learn, numpy
R: tidyverse, caret, tidymodels
Workflow tools: Pipeline, ColumnTransformer
Experiment/data tracking: MLflow, DVC
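As a minimal sketch of these practices, the example below bundles imputation, scaling, encoding, and a model into a single scikit-learn `Pipeline` with a `ColumnTransformer`, so the same steps are applied identically every time. The dataset, column names, and the choice of logistic regression are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data; it is never modified in place.
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, 52.0, 29.0],
    "plan": ["basic", "pro", np.nan, "basic", "pro", "basic"],
    "churned": [0, 1, 0, 0, 1, 0],
})
X, y = raw[["age", "plan"]], raw["churned"]

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with the mode, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column to the right preprocessing steps.
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["plan"]),
])

# A fixed random_state keeps results repeatable across runs.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(random_state=42)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because preprocessing lives inside the pipeline, calling `model.predict` on new data reuses the medians, scaling parameters, and categories learned during `fit`, which avoids inconsistencies between training and deployment.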
3.6 Summary
Data cleaning and preprocessing typically include:
Fixing or imputing missing values
Detecting and handling outliers
Transforming and scaling data
Creating useful features and encoding categories
Building reproducible workflows for consistency