Study Guide: Data Mining and Data Types


1. Introduction to Data Mining

2. Data Mining Process

3. Data Mining Tasks

a. Classification

  • Definition: Predicting the class of unseen records based on known attributes.
  • Example: Predicting whether a consumer will buy a product based on demographic information.
  • Applications: Marketing, astronomy (e.g., classifying galaxies).

b. Regression

  • Definition: Predicting a continuous value rather than a class.
  • Example: Predicting deep adipose tissue based on height and weight.

c. Clustering

  • Definition: Grouping similar data points into clusters, separating distinct clusters.
  • Example: Document clustering, such as categorizing news articles by topic.

d. Association Rule Discovery

  • Definition: Discovering rules that show relationships between items in a dataset.
  • Example: Identifying that customers who buy diapers and milk are also likely to buy beer.

4. Understanding Data and Attributes

5. Types of Attributes

6. Types of Data Sets


Self-Test Questions

  1. Who is considered the founder of modern data mining?
  2. What distinguishes data mining from machine learning?
  3. Describe the difference between classification and regression tasks.
  4. What is the significance of clustering in data mining?
  5. Give an example of an association rule that might be discovered in a grocery store data set.
  6. Define what constitutes an object and an attribute in a data set.
  7. Differentiate between nominal and ordinal attributes.
  8. Explain the difference between interval and ratio attributes.
  9. What is transaction data and how does it differ from record data?
  10. Why is ordered data important in the context of sensor data?

Self-Test Answers:

  1. Founder of Modern Data Mining: Gregory Piatetsky-Shapiro. He is a well-known data scientist and the founder of KDnuggets, a popular website and newsletter on data science and machine learning. He conducted the first KDD workshops in the 1980s and 1990s.

    1. Piatetsky-Shapiro has made significant contributions to the field of data mining, including:

      1. Co-founding the KDD (Knowledge Discovery and Data Mining) conference series: This conference series is one of the most prestigious and influential in the field of data mining and knowledge discovery.

      2. Developing the concept of knowledge discovery in databases (KDD): Piatetsky-Shapiro’s work on KDD helped establish data mining as a distinct field of research and practice.

      3. Publishing numerous papers and books on data mining: His publications have helped shape the field of data mining and have been widely cited and influential.

  2. What distinguishes data mining from machine learning?

    • Data Mining: The process of discovering patterns, relationships, and knowledge in large amounts of data, often using automated methods. The aim is to extract valuable information from the data, usually without a specific prediction or decision-making goal in mind. Data sources can include databases, data warehouses, the internet, etc.
    • Machine Learning: A subset of artificial intelligence that develops algorithms and models that allow computers to learn from data and make predictions or decisions. It supplies many of the tools and techniques that data mining uses to analyze and interpret data.
    • In short: data mining is concerned with extracting patterns and insights from data, while machine learning is concerned with using those insights to make predictions and/or decisions.
  3. Classification vs. Regression Tasks:

    • Classification: Involves predicting a categorical label or class that an instance belongs to, based on its features. Classification tasks typically involve a fixed set of classes, and the goal is to assign each instance to one of those classes. Examples include spam vs. non-spam emails, cancer diagnosis (malignant vs. benign), and product recommendation (e.g., recommending a product based on user behavior).
    • Regression: Involves predicting a continuous value or numerical outcome based on the input features. Regression tasks aim to predict a specific value, such as a house price, stock price, or energy consumption.

    To illustrate the difference:

    • Classification: “Is this email spam or not?” (categorical label)
    • Regression: “What is the predicted house price based on its features?” (continuous value)
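
    The contrast can be sketched in a few lines of Python. This is a minimal illustration using k-nearest neighbors, which is one possible method among many; the toy data, labels, and function names are assumptions made up for this example. The same neighbors yield a category when their labels are voted on and a number when their target values are averaged.

```python
from collections import Counter

# Hypothetical toy data: (height_cm, weight_kg), a class label, a numeric target.
train = [
    ((150.0, 50.0), "small", 10.0),
    ((155.0, 55.0), "small", 12.0),
    ((180.0, 85.0), "large", 30.0),
    ((185.0, 90.0), "large", 34.0),
]

def nearest(query, k):
    """Return the k training rows closest to query (squared Euclidean distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], query))
    return sorted(train, key=dist)[:k]

def classify(query, k=3):
    """Classification: majority vote over neighbor labels gives a category."""
    labels = [label for _, label, _ in nearest(query, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(query, k=2):
    """Regression: mean of neighbor targets gives a continuous value."""
    values = [value for _, _, value in nearest(query, k)]
    return sum(values) / len(values)

print(classify((182.0, 88.0)))  # a label
print(regress((182.0, 88.0)))   # a number
```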

  4. Significance of Clustering in Data Mining:

    Clustering is an unsupervised learning technique that groups similar instances or data points into clusters based on their features. It identifies natural groupings and discovers structure in data without prior labels. Its significance lies in its ability to:

    • Identify patterns and structures: Clustering helps reveal hidden patterns and relationships in the data, which can be useful for understanding the underlying structure of the data.
    • Segment data: Clustering enables the segmentation of data into distinct groups, which can be useful for targeted marketing, customer profiling, or identifying outliers.
    • Detect anomalies: Clustering can help identify unusual data points that do not fit into any cluster, which can be useful for detecting errors, fraud, or unusual behavior.
    • Reduce data: Clustering can reduce the volume of data by representing each cluster with a single point or centroid, making the data easier to analyze and visualize.

    Typical applications include market segmentation, social network analysis, and the organization of computing clusters.

  5. Example of an Association Rule: In a grocery store dataset, an example of an association rule might be: “If a customer buys bread, they are 70% likely to also buy butter.” This rule helps in understanding customer buying patterns.

  6. Object and Attribute in a Data Set:

    • Object: An object is an entity that contains data. In a dataset, each row typically represents an object. For example, in a dataset of students, each student is an object.
    • Attribute: An attribute is a property, feature or characteristic of an object. In the student dataset, attributes might include name, age, grade, etc.
  7. Nominal vs. Ordinal Attributes:

    • Nominal Attributes: These are categorical attributes without any order. For example, colors like red, blue, green.
    • Ordinal Attributes: These are categorical attributes with a meaningful order but no fixed interval between them. For example, rankings like first, second, third.
  8. Interval vs. Ratio Attributes:

    • Interval Attributes: These have meaningful intervals between values but no true zero point. For example, temperature in Celsius.
    • Ratio Attributes: These have both meaningful intervals and a true zero point, allowing for the calculation of ratios. For example, weight or height.
  9. Transaction Data vs. Record Data:

    • Transaction Data: This refers to data that captures transactions, typically involving a time component and multiple items. For example, a sales transaction in a store.
    • Record Data: This is structured data where each record is a fixed set of fields. For example, a database table where each row is a record.
  10. Importance of Ordered Data in Sensor Data: Ordered data is crucial in sensor data because it often involves time-series data where the sequence of data points is important. This order can reveal trends, patterns, and anomalies over time, which are essential for monitoring and analysis in applications like weather forecasting, health monitoring, and industrial automation.

Version 2

Study Guide: Understanding Data Mining and Data Concepts

1. Introduction to Data Mining

  • Definition: Data mining is the non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets.
  • Historical Background:
    • Foundations: Data mining is built upon three disciplines: Statistics, Artificial Intelligence (AI), and Machine Learning (ML).
    • Key Figure: John Tukey, considered a founder of modern data mining, contributed to data exploration, visualization, and pre-processing techniques.
    • Evolution: The term “data mining” emerged in the 1990s, and the field has grown with the rise of “data science,” a term coined to differentiate between research-focused roles and practical data analysis roles in companies.

2. Data Mining Processes and Tasks

  • Pipeline:
    1. Data Selection: Choosing relevant data for analysis.
    2. Pre-processing: Cleaning and preparing the data.
    3. Transformation: Converting data into a suitable format.
    4. Data Mining: Applying algorithms to extract patterns.
    5. Interpretation & Evaluation: Making the results understandable and actionable for humans.
  • Key Distinctions:
    • Data Mining vs. Machine Learning: Data mining focuses on making patterns interpretable by humans, while ML is more about automated analysis.
  • Categories of Data Mining Tasks:
    1. Prediction Tasks: Involves predicting unknown or future values (e.g., classification, regression).
    2. Descriptive Tasks: Focuses on making data human-interpretable (e.g., clustering, association rule discovery).
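
The five pipeline stages can be sketched as small functions chained together. This is only a toy illustration: the records, column names, and the "pattern" being mined (a per-class average) are assumptions chosen to keep the example short, not a prescribed method.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 30000, "bought": "no"},
    {"id": 2, "age": 40, "income": None, "bought": "yes"},
    {"id": 3, "age": 35, "income": 60000, "bought": "yes"},
    {"id": 4, "age": 22, "income": 20000, "bought": "no"},
]

def select(rows, columns):
    """1. Data selection: keep only the relevant columns."""
    return [{c: r[c] for c in columns} for r in rows]

def preprocess(rows):
    """2. Pre-processing: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows, column):
    """3. Transformation: min-max scale one numeric column to [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    return [{**r, column: (r[column] - lo) / (hi - lo)} for r in rows]

def mine(rows, column, target):
    """4. Data mining: mean of a column per target class (a very simple pattern)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[target], []).append(r[column])
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def interpret(pattern, column, target):
    """5. Interpretation: turn the pattern into human-readable statements."""
    return [f"{target}={label}: mean scaled {column} = {mean:.3f}"
            for label, mean in sorted(pattern.items())]

rows = transform(preprocess(select(raw, ["income", "bought"])), "income")
pattern = mine(rows, "income", "bought")
for line in interpret(pattern, "income", "bought"):
    print(line)
```

In a real project each stage is usually far more involved, and the stages are revisited iteratively rather than run once in a straight line.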

3. Examples of Data Mining Techniques

  • Classification:
    • Purpose: Predicting the class of previously unseen records.
    • Example: Predicting consumer behavior based on demographics or classifying galaxies by their formation stage.
  • Regression:
    • Purpose: Predicting a continuous value rather than a class.
    • Example: Estimating deep adipose tissue based on height and weight.
  • Clustering:
    • Purpose: Grouping similar data points into clusters.
    • Example: Document clustering for categorizing news articles into genres.
  • Association Rule Discovery:
    • Purpose: Finding rules that associate different attributes in a data set.
    • Example: Market basket analysis, like discovering that buying diapers is often associated with buying beer.
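
For association rule discovery, the two standard measures of a rule are support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a "diapers -> beer" rule; the baskets are invented for illustration.

```python
# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(f"diapers -> beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items (support 0.6), and three of the four baskets with diapers also contain beer (confidence 0.75).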

4. Understanding Data

  • Data Structure:
    • Objects and Attributes:
      • An object is a collection of attributes (e.g., a row in a table).
      • Attributes (also known as variables, fields, features) represent the properties of the data (e.g., columns in a table).
  • Terminology:
    • Object Synonyms: Records, points, samples, instances, cases, entities.
    • Attribute Synonyms: Variables, fields, characteristics, features.

5. Types of Attributes

  • Discrete Attributes:
    • Nominal: Distinct categories without order (e.g., eye color).
    • Ordinal: Categories with a specific order (e.g., movie ratings).
  • Continuous Attributes:
    • Interval: Ordered with equal intervals, no true zero (e.g., temperature in Celsius).
    • Ratio: Ordered with a true zero, supports multiplication (e.g., height).
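
One practical way to tell the four attribute types apart is to ask which summary operations are meaningful on them. The following sketch uses made-up values and the standard library `statistics` module; the variable names are assumptions for illustration only.

```python
from statistics import mean, median, mode

eye_color = ["brown", "blue", "brown", "green"]          # nominal
rating = ["poor", "good", "fair", "good", "excellent"]   # ordinal
temp_c = [10.0, 20.0, 30.0]                              # interval
height_cm = [150.0, 180.0]                               # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
most_common_color = mode(eye_color)

# Ordinal: order is meaningful, so map categories to ranks and take the median.
order = ["poor", "fair", "good", "excellent"]
ranks = sorted(order.index(r) for r in rating)
median_rating = order[median(ranks)]

# Interval: differences and means are meaningful, ratios are not.
# 20 C / 10 C equals 2, yet the same temperatures in Kelvin give a ratio near 1.04.
mean_temp = mean(temp_c)
kelvin = [t + 273.15 for t in temp_c]
celsius_ratio = temp_c[1] / temp_c[0]
kelvin_ratio = kelvin[1] / kelvin[0]

# Ratio: a true zero makes ratios meaningful (180 cm is 1.2 times 150 cm).
height_ratio = height_cm[1] / height_cm[0]

print(most_common_color, median_rating, mean_temp)
print(celsius_ratio, round(kelvin_ratio, 3), height_ratio)
```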

6. Types of Data Sets

  • Record Data:
    • Structure: Data organized in rows and columns, similar to relational databases.
    • Transaction Data: A subset where not every column needs a value, akin to document-style databases.
  • Graph Data:
    • Structure: Data represented as nodes and edges, often used for relationships (e.g., hyperlinks on the web).
  • Sequential Data:
    • Structure: Data with a preserved order, often including timestamps (e.g., sensor data).
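
The data set types above map naturally onto different in-memory structures. The sketch below shows one plausible Python representation of each, with invented values, plus one question that each structure answers easily.

```python
# Record data: every object has the same fixed set of attributes.
records = [
    {"name": "Ana", "age": 34, "height_cm": 165.0},
    {"name": "Ben", "age": 29, "height_cm": 180.0},
]

# Transaction data: each object is a variable-size set of items.
transactions = {
    "t1": {"bread", "butter"},
    "t2": {"diapers", "beer", "milk"},
}

# Graph data: nodes and edges, here an adjacency list of page links.
links = {
    "home": ["about", "blog"],
    "blog": ["home"],
    "about": [],
}

# Sequential data: order is part of the data, here (minute, reading) pairs.
sensor = [(0, 20.1), (1, 20.3), (2, 20.2), (3, 24.9)]

# Record: do all rows share the same fields?
same_fields = all(r.keys() == records[0].keys() for r in records)
# Transaction: how many items does each transaction hold?
basket_sizes = {tid: len(items) for tid, items in transactions.items()}
# Graph: how many pages link to each page?
in_degree = {page: sum(page in targets for targets in links.values()) for page in links}
# Sequential: how much does the reading change between consecutive minutes?
jumps = [round(b[1] - a[1], 1) for a, b in zip(sensor, sensor[1:])]

print(same_fields, basket_sizes, in_degree, jumps)
```

Shuffling the rows of `records` loses nothing, whereas shuffling `sensor` destroys the consecutive differences, which is why order matters for sequential data.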

7. Self-Test Questions

  • Reflect on the differences between data mining tasks (e.g., classification vs. clustering) and the types of attributes (e.g., nominal vs. interval).
  • Consider examples of different data set types and how they are structured.

Self-Test Answers:

Reflecting on the differences between data mining tasks and types of attributes involves understanding how each task utilizes data attributes and how datasets are structured to facilitate these tasks.

Data Mining Tasks

  1. Classification:
    • Purpose: Assigns items to predefined categories or classes.
    • Attributes: Often involves nominal or ordinal attributes as the target variable (e.g., class labels like ‘spam’ or ‘not spam’).
    • Example: Email classification where the dataset includes attributes like email content, sender, and time, with the target attribute being the class label (spam/not spam).
  2. Clustering:
    • Purpose: Groups similar items together without predefined labels.
    • Attributes: Can involve any type of attribute, but often uses numerical (interval or ratio) attributes for calculating distances or similarities.
    • Example: Customer segmentation where the dataset includes attributes like purchase history, age, and location, and the goal is to group customers into segments based on similarity.
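
To make the "no predefined labels" point concrete, here is a minimal one-dimensional k-means sketch (Lloyd's algorithm) for customer segmentation. The spend values and the starting centroids are assumptions chosen for illustration; real implementations typically use many attributes and a more careful initialization.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied starting centroids.

    Returns the final centroids and the clusters from the last assignment step.
    """
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical yearly spend per customer; no class labels are provided.
spend = [100.0, 120.0, 110.0, 900.0, 950.0, 1000.0]
centroids, clusters = kmeans_1d(spend, centroids=[100.0, 1000.0])
print(centroids)
print(clusters)
```

Unlike the classification example, nothing in the input says which segment a customer belongs to; the two groups emerge only from the distances between values.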

Types of Attributes

  1. Nominal Attributes:
    • Characteristics: Categorical with no inherent order.
    • Example: Colors (red, blue, green) or product categories (electronics, clothing, groceries).
  2. Ordinal Attributes:
    • Characteristics: Categorical with a meaningful order but no fixed interval.
    • Example: Customer satisfaction ratings (poor, fair, good, excellent).
  3. Interval Attributes:
    • Characteristics: Numerical with meaningful intervals but no true zero.
    • Example: Temperature in Celsius or Fahrenheit.
  4. Ratio Attributes:
    • Characteristics: Numerical with meaningful intervals and a true zero.
    • Example: Height, weight, or age.

Examples of Different Data Set Types

  1. Transactional Data Set:
    • Structure: Each record represents a transaction, often with a timestamp and multiple items.
    • Example: A retail sales dataset where each row includes transaction ID, date, items purchased, and total amount.
  2. Time-Series Data Set:
    • Structure: Ordered data points indexed by time.
    • Example: Stock prices dataset where each row includes date, opening price, closing price, and volume.
  3. Spatial Data Set:
    • Structure: Data with spatial attributes, often including coordinates.
    • Example: Geographic information system (GIS) data where each record includes location coordinates and attributes like elevation or land use type.
  4. Textual Data Set:
    • Structure: Unstructured or semi-structured data, often requiring preprocessing.
    • Example: A collection of documents or social media posts where each entry includes text content and metadata like author and timestamp.

Questions from Module 1 Asynch:

  1. If predicting from other customers, dividing up customers by potential profitability is:
    • Classification and regression (Correct)
    • Dividing up customers by potential profitability can be classification (assigning customers to profitability groups, a categorical outcome) or regression (predicting a profitability value, a continuous outcome).
  2. Extracting frequency of sound is:
    • Not data mining (Correct)
    • Extracting frequency of sound is Not data mining because it is a signal processing task, not a data mining task.
  3. Finding someone’s adipose tissue measure from waist circumference is:
    • Regression (Correct)
    • Finding someone’s adipose tissue measure from waist circumference is regression because it predicts a continuous value (the adipose tissue measure) from a single input variable (waist circumference).
  4. Deciding if a person has diabetes based upon his or her history and diet is:
    • Classification (Correct)
    • This is classification because it assigns a person to one of two categories: diabetes present or diabetes absent.
  5. Depending on the methods used, deciding if someone will like a movie on Netflix given ratings of other users is:
    • Answers A through D (Correct)
    • Depending on the method, this can be framed as association (e.g., collaborative filtering), classification (e.g., predicting like or dislike), clustering (e.g., grouping similar users), or regression (e.g., predicting a user’s numeric rating).
  6. Depending on the methods used, finding the genre of an online article based on the words in it is:
    • Classification or clustering (Correct)
    • Finding the genre of an online article based on the words in it is Classification or Clustering because it can be approached using either Classification (e.g., predicting a specific genre) or Clustering (e.g., grouping similar articles).
  7. Depending on the methods used, finding which Google searches are likely to follow one another is:
    • Classification or clustering (Correct)
    • It could involve classification (predicting the next search) or clustering (grouping similar searches).
  8. Is the brightness from a light meter interval or ratio?
    • Ratio (Correct)
    • Brightness from a light meter is ratio because it has a true zero point (complete darkness), so ratios between readings are meaningful.
  9. Is an angle measured from 0–360 degrees interval or ratio?
    • Could be either interval or ratio (Correct)
    • Depending on context, angles can be treated as interval (when 0 degrees is an arbitrary reference direction, as in a compass heading) or ratio (when 0 degrees means no rotation, so one angle can meaningfully be twice another). The unit, degrees or radians, does not change the scale type.
  10. Is the height above sea level interval or ratio?
    • Ratio (Correct)
    • Height above sea level has an absolute zero (sea level), and can be measured in a continuous range, making it a Ratio Measurement.
  11. Is military ranking ordinal, nominal, or binary?
    • Ordinal (Correct)
    • Military ranking is ordinal because it has a specific order (e.g., private, sergeant, lieutenant), but the differences between ranks are not necessarily equal.
  12. Is a coat check number ordinal, nominal, or binary?
    • Nominal (Correct)
    • A coat check number is a label with no inherent order or numerical value, making it nominal.
  13. Is time as a.m. or p.m. ordinal, nominal, or binary?
    • Binary (Correct)
    • Time as a.m. or p.m. is Binary because it represents a simple dichotomy (2 categories, i.e., morning or afternoon).
  14. Is the brightness, as given by a human, ordinal, nominal, or binary?
    • Ordinal (Correct)
    • Human-perceived brightness is Ordinal because it represents a subjective ranking (e.g., dim, medium, bright), but the differences between ranks are not necessarily equal.
  15. What best describes a database of movies and actors who played in them?
    • Either transaction or graph (Correct)
    • A database of movies and actors who played in them can be represented as either a set of transactions (e.g., actor X played in movie Y) or a graph (e.g., actors connected to movies).
  16. What best describes a database of someone’s Amazon purchase history?
    • Either transaction or sequential (Correct)
    • A database of someone’s Amazon purchase history can be represented as either a set of transactions (e.g., purchase X on date Y) or a sequence of purchases over time.
  17. What best describes a data set of people’s attributes with diabetes?
    • Record (Correct)
    • A data set of people’s attributes with diabetes is Record because it represents a collection of individual records or cases, each with various attributes (e.g., age, blood pressure, etc.).
  18. What best describes a database of your movements while sleeping, recorded each minute?
    • Sequential (Correct)
    • Sequential, because the data is a sequence of measurements ordered in time.


