Study Guide: Data Mining and Data Types
1. Introduction to Data Mining
- Data Mining: The process of extracting implicit,
previously unknown, and potentially useful information from large data sets.
- Origins:
- John Tukey: A key figure in data mining, famous for
coining the term “bit” and co-inventing the Fast Fourier Transform. His
work laid the foundation for modern data mining, particularly in data
exploration and visualization.
- Gregory Piatetsky-Shapiro: Started the Knowledge
Discovery in Databases (KDD) workshop in 1989, which became foundational
for modern data mining.
- Data Science: Evolved from the need for companies
to hire research scientists without the “research” baggage, focusing on
analyzing data rather than academic research.
2. Data Mining Process
- Pipeline:
- Data Selection: Identifying the relevant data.
- Data Preprocessing: Cleaning and organizing the
data.
- Data Transformation: Converting data into a
suitable format for mining.
- Application of Algorithms: Implementing data mining
techniques.
- Interpretation and Evaluation: Analyzing and
evaluating the results. This step, making the results interpretable by
humans, is what differentiates data mining from machine learning.
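The pipeline above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record fields (age, spend), the function names, and the toy "mining" step (average spend per age group) are invented for the example and are not part of the course material.

```python
from collections import defaultdict

# Hypothetical raw records; the field names are assumptions for illustration.
raw = [
    {"id": 1, "age": 25, "spend": 120.0, "note": "x"},
    {"id": 2, "age": 41, "spend": None, "note": "y"},
    {"id": 3, "age": 37, "spend": 80.0, "note": "z"},
    {"id": 4, "age": 29, "spend": 100.0, "note": "w"},
]

def select(records):
    """Data selection: keep only the fields relevant to the question."""
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r["spend"] is not None]

def transform(records):
    """Transformation: bin age into decades so groups can be compared."""
    return [{"age_group": (r["age"] // 10) * 10, "spend": r["spend"]} for r in records]

def mine(records):
    """Toy algorithm step: average spend per age group."""
    groups = defaultdict(list)
    for r in records:
        groups[r["age_group"]].append(r["spend"])
    return {g: sum(v) / len(v) for g, v in groups.items()}

def interpret(pattern):
    """Interpretation: turn the pattern into human-readable statements."""
    return [
        f"Customers in their {g}s spend {avg:.2f} on average"
        for g, avg in sorted(pattern.items())
    ]

result = mine(transform(preprocess(select(raw))))
summary = interpret(result)
for line in summary:
    print(line)
```

Each function corresponds to one pipeline stage, so a stage can be swapped out (for example, a different cleaning rule) without touching the others.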
3. Data Mining Tasks
- Categories:
- Prediction Tasks: Predicting unknown or future
values (e.g., classification, regression, outlier detection).
- Descriptive Tasks: Making data human-interpretable
(e.g., clustering, association rule discovery, sequential pattern
discovery).
a. Classification
- Definition: Predicting the class of unseen records
based on known attributes.
- Example: Predicting whether a consumer will buy a
product based on demographic information.
- Applications: Marketing, astronomy (e.g.,
classifying galaxies).
b. Regression
- Definition: Predicting a continuous value rather
than a class.
- Example: Predicting deep adipose tissue based on
height and weight.
c. Clustering
- Definition: Grouping similar data points into
clusters, separating distinct clusters.
- Example: Document clustering, such as categorizing
news articles by topic.
d. Association Rule Discovery
- Definition: Discovering rules that show
relationships between items in a dataset.
- Example: Identifying that customers who buy diapers
and milk are also likely to buy beer.
4. Understanding Data and Attributes
- Data Structure:
- Object: A collection of attributes (a row in a
table).
- Attributes: Properties of the data (columns in a
table).
- Attribute Terminology:
- Attributes: Also known as variables, fields,
characteristics, or features.
- Objects: Also referred to as records, points,
samples, instances, cases, or entities.
5. Types of Attributes
- Categories:
- Discrete Attributes: Nominal and ordinal.
- Nominal: Distinct categories with no order (e.g.,
eye color).
- Ordinal: Ordered categories (e.g., movie
ratings).
- Continuous Attributes: Interval and ratio.
- Interval: Ordered with equal intervals but no true
zero (e.g., temperature in Celsius).
- Ratio: Ordered with a true zero (e.g., temperature
in Kelvin).
- Transformations:
- Nominal: Any permutation is valid.
- Ordinal: Monotonic transformations are valid.
- Interval: Linear transformations are valid (e.g.,
Celsius to Fahrenheit).
- Ratio: Only multiplication by a constant is
valid.
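The interval and ratio rules above can be checked numerically. The sketch below uses made-up temperatures: it shows that a linear transformation (Celsius to Fahrenheit) keeps the relative size of differences but not the ratio of values, while multiplying a ratio-scale value (Kelvin) by a constant keeps ratios intact. The variable names are assumptions for illustration.

```python
def c_to_f(c):
    """Linear transformation a*x + b: valid for interval attributes."""
    return c * 9 / 5 + 32

def c_to_k(c):
    """Shift Celsius onto the Kelvin scale, which has a true zero."""
    return c + 273.15

a_c, b_c = 10.0, 20.0

# Interval scale: ratios of values are not preserved by the conversion.
ratio_c = b_c / a_c                     # 20 / 10
ratio_f = c_to_f(b_c) / c_to_f(a_c)     # 68 / 50, no longer 2

# Interval scale: relative sizes of differences are preserved.
diff_ratio_c = (30.0 - 20.0) / (20.0 - 10.0)
diff_ratio_f = (c_to_f(30.0) - c_to_f(20.0)) / (c_to_f(20.0) - c_to_f(10.0))

# Ratio scale: multiplying by a constant (Kelvin to Rankine is x1.8)
# leaves ratios of values unchanged.
a_k, b_k = c_to_k(a_c), c_to_k(b_c)
ratio_k = b_k / a_k
ratio_scaled = (b_k * 1.8) / (a_k * 1.8)

print(ratio_c, ratio_f)
print(diff_ratio_c, diff_ratio_f)
print(ratio_k, ratio_scaled)
```

This is why "20 degrees Celsius is twice as hot as 10 degrees Celsius" is not a meaningful statement, while the same comparison in Kelvin is.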
6. Types of Data Sets
- Record Data: Data represented in a table format
(e.g., relational databases).
- Transaction Data: A subset of record data where not
every column needs a value, resembling document-style databases.
- Graph Data: Represents data points and their
relationships (e.g., hyperlinks on the web).
- Ordered Data: Preserves the temporal or sequential
nature of data (e.g., sensor data with timestamps).
Self-Test Questions
- Who is considered the founder of modern data mining?
- What distinguishes data mining from machine learning?
- Describe the difference between classification and regression
tasks.
- What is the significance of clustering in data mining?
- Give an example of an association rule that might be discovered in a
grocery store data set.
- Define what constitutes an object and an attribute in a data
set.
- Differentiate between nominal and ordinal attributes.
- Explain the difference between interval and ratio attributes.
- What is transaction data and how does it differ from record
data?
- Why is ordered data important in the context of sensor data?
Self-Test Answers:
Founder of Modern Data Mining: Gregory
Piatetsky-Shapiro. He is a well-known data scientist and the founder of
KDnuggets, a popular website and newsletter on data science and machine
learning. He conducted the first KDD workshops in the 1980s and 1990s.
Piatetsky-Shapiro has made significant contributions to the field
of data mining, including:
Co-founding the KDD (Knowledge Discovery and Data Mining)
conference series: This conference series is one of the most
prestigious and influential in the field of data mining and knowledge
discovery.
Developing the concept of knowledge discovery in
databases (KDD): Piatetsky-Shapiro’s work on KDD helped
establish data mining as a distinct field of research and
practice.
Publishing numerous papers and books on data
mining: His publications have helped shape the field of data
mining and have been widely cited and influential.
What distinguishes data mining from machine
learning?
- Data Mining: Focuses on discovering patterns,
relationships, and insights from large datasets, often using automated
methods. Data mining aims to extract valuable information or knowledge
from data, usually without a specific prediction or decision-making goal
in mind. The data sources can include databases, data
warehouses, the internet, etc.
- Machine Learning: Focuses on developing algorithms
and models that enable computers to learn from data and make predictions
or decisions. Machine learning is a key aspect of data mining, as it
provides the tools and techniques for analyzing and interpreting data.
Machine learning is a subset of artificial intelligence that involves
the development of algorithms that allow computers to learn from and
make predictions or decisions based on data.
- While data mining is concerned with extracting patterns and
insights from data, machine learning is concerned with using those
insights to make predictions and/or decisions.
Classification vs. Regression Tasks:
- Classification: Involves predicting a categorical
label or class that an instance belongs to, based on its features.
Classification tasks typically involve a fixed set of classes, and the
goal is to assign each instance to one of those classes. Examples
include spam vs. non-spam emails, cancer diagnosis (malignant
vs. benign), and product recommendation (e.g., recommending a product
based on user behavior).
- Regression: Involves predicting a continuous value
or numerical outcome based on the input features. Regression tasks aim
to predict a specific value, such as a house price, stock price, or
energy consumption.
To illustrate the difference:
- Classification: “Is this email spam or not?” (categorical
label)
- Regression: “What is the predicted house price based on its
features?” (continuous value)
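A small sketch may make the contrast concrete: the same input feature is used once to predict a category and once to predict a number. The data, the size labels, and the two helper functions are made up for illustration. A 1-nearest-neighbor rule stands in for "a classifier" and a least-squares line stands in for "a regressor", and neither is claimed to be the method used in the course.

```python
# Made-up training data: house size in square metres.
sizes = [50.0, 60.0, 80.0, 100.0, 120.0]
labels = ["small", "small", "medium", "large", "large"]   # categorical target
prices = [150.0, 180.0, 240.0, 300.0, 360.0]              # continuous target

def classify_1nn(x):
    """Classification: return the label of the closest training example."""
    nearest = min(range(len(sizes)), key=lambda i: abs(sizes[i] - x))
    return labels[nearest]

def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(sizes, prices)

new_size = 95.0
predicted_label = classify_1nn(new_size)           # one of a fixed set of classes
predicted_price = slope * new_size + intercept     # any real number

print(predicted_label, predicted_price)
```

The classifier can only ever return one of the labels it has seen, while the regressor can return values that never appeared in the training data.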
What is the significance of clustering in data
mining?
Clustering is a type of unsupervised learning technique in data
mining that involves grouping similar instances or data points into
clusters, based on their features. The significance of clustering lies
in its ability to:
- Identify patterns and structures: Clustering helps
reveal hidden patterns and relationships in the data, which can be
useful for understanding the underlying structure of the data.
- Segment data: Clustering enables the segmentation
of data into distinct groups, which can be useful for targeted
marketing, customer profiling, or identifying outliers.
- Anomaly detection: Clustering can help identify
unusual or anomalous data points that do not fit into any cluster, which
can be useful for detecting errors, fraud, or unusual behavior.
- Data reduction: Clustering can reduce the
dimensionality of the data by representing each cluster with a single
point or centroid, making it easier to analyze and visualize the
data.
Overall, clustering is a powerful technique in data mining that can
help uncover insights, identify patterns, and support
decision-making.
In short, clustering is significant because it helps in identifying natural
groupings within data. It is used to discover structure in data without
having prior labels. This can be useful in market segmentation, social
network analysis, organization of computing clusters, etc.
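The idea of grouping unlabeled points and summarizing each group by a centroid can be sketched with a one-dimensional k-means. This is a minimal illustration: the points, the starting centroids, and the function name kmeans_1d are assumptions for the example, and real use would normally rely on a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied start centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # no labels are given
centroids, clusters = kmeans_1d(points, centroids=[0.0, 20.0])

print(centroids)
print(clusters)
```

The two centroids act as a compact summary of six points, which is the data reduction use mentioned above.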
Example of an Association Rule: In a grocery
store dataset, an example of an association rule might be: “If a
customer buys bread, they are 70% likely to also buy butter.” This rule
helps in understanding customer buying patterns.
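A figure like "70% likely" in a rule is its confidence, usually reported together with its support. The sketch below computes both for a bread-to-butter rule on a made-up list of five baskets. The basket contents are invented, so the resulting numbers are illustrative only.

```python
# Made-up grocery baskets; each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "butter"})           # 3 of the 5 baskets
rule_confidence = confidence({"bread"}, {"butter"})   # 3 of the 4 bread baskets

print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Support says how often the items appear together at all, and confidence says how often the rule holds when its left-hand side is present.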
Object and Attribute in a Data Set:
- Object: An object is an entity that contains data.
In a dataset, each row typically represents an object. For example, in a
dataset of students, each student is an object.
- Attribute: An attribute is a property, feature or
characteristic of an object. In the student dataset, attributes might
include name, age, grade, etc.
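In code, the student example might look like the following sketch, where each dictionary is one object (row) and each key is one attribute (column). The names and values are made up.

```python
# Each dict is one object (row); each key is one attribute (column).
students = [
    {"name": "Ana", "age": 20, "grade": "A"},
    {"name": "Ben", "age": 22, "grade": "B"},
    {"name": "Cho", "age": 21, "grade": "A"},
]

objects = students                     # three objects
attributes = list(students[0])         # attribute names shared by all objects
ages = [s["age"] for s in students]    # one attribute read across all objects

print(len(objects), attributes, ages)
```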
Nominal vs. Ordinal Attributes:
- Nominal Attributes: These are categorical
attributes without any order. For example, colors like red, blue,
green.
- Ordinal Attributes: These are categorical
attributes with a meaningful order but no fixed interval between them.
For example, rankings like first, second, third.
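The practical difference shows up in which operations make sense. In the sketch below (made-up values), the nominal attribute supports only equality tests and counting, while the ordinal attribute also supports sorting once an explicit order is supplied. The numeric codes in rank_order are an assumption used only to encode order; their differences carry no meaning.

```python
from collections import Counter

colors = ["red", "blue", "green", "blue"]        # nominal
ranks = ["third", "first", "second", "first"]    # ordinal

# Nominal: only equality and counting are meaningful (e.g., the mode).
most_common_color = Counter(colors).most_common(1)[0][0]

# Ordinal: an explicit order makes sorting and min/max meaningful.
rank_order = {"first": 1, "second": 2, "third": 3}
sorted_ranks = sorted(ranks, key=rank_order.get)
best_rank = min(ranks, key=rank_order.get)

print(most_common_color, sorted_ranks, best_rank)
```

Sorting the colors alphabetically would run without error, but the resulting order would say nothing about the data.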
Interval vs. Ratio Attributes:
- Interval Attributes: These have meaningful
intervals between values but no true zero point. For example,
temperature in Celsius.
- Ratio Attributes: These have both meaningful
intervals and a true zero point, allowing for the calculation of ratios.
For example, weight or height.
Transaction Data vs. Record Data:
- Transaction Data: This refers to data that captures
transactions, typically involving a time component and multiple items.
For example, a sales transaction in a store.
- Record Data: This is structured data where each
record is a fixed set of fields. For example, a database table where
each row is a record.
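One way to picture the difference is sketched below: record data as objects with the same fixed fields, and transaction data as variable-size sets of items. The class name, field names, and items are assumptions for illustration. The last step shows how transactions can still be viewed as a (sparse) record table with one column per item.

```python
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    """Record data: every object has the same fixed set of fields."""
    customer_id: int
    age: int
    city: str

records = [CustomerRecord(1, 34, "Austin"), CustomerRecord(2, 51, "Boston")]

# Transaction data: each transaction holds a variable-size set of items.
transactions = {
    101: {"bread", "milk"},
    102: {"diapers", "beer", "milk", "eggs"},
}

# Record-style view of the transactions: one 0/1 column per item.
# Most cells are 0, which is why not every column "needs a value".
all_items = sorted(set().union(*transactions.values()))
as_table = {
    tid: [int(item in items) for item in all_items]
    for tid, items in transactions.items()
}

print(all_items)
print(as_table)
```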
Importance of Ordered Data in Sensor Data:
Ordered data is crucial in sensor data because it often involves
time-series data where the sequence of data points is important. This
order can reveal trends, patterns, and anomalies over time, which are
essential for monitoring and analysis in applications like weather
forecasting, health monitoring, and industrial automation.
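A short sketch of why order matters: the check below compares each reading with the mean of the readings that came just before it, which only makes sense if the sequence is preserved. The readings, the window size, and the threshold are made-up values, and flag_spikes is a hypothetical helper rather than a standard method.

```python
# (minute, temperature) pairs in time order; values are made up.
readings = [
    (0, 20.1), (1, 20.3), (2, 20.2), (3, 20.4),
    (4, 27.9), (5, 20.5), (6, 20.6),
]

def flag_spikes(series, window=3, threshold=3.0):
    """Flag timestamps whose value differs from the mean of the previous
    `window` readings by more than `threshold`."""
    flagged = []
    for i in range(window, len(series)):
        recent = [value for _, value in series[i - window:i]]
        baseline = sum(recent) / window
        timestamp, value = series[i]
        if abs(value - baseline) > threshold:
            flagged.append(timestamp)
    return flagged

spikes = flag_spikes(readings)
print(spikes)
```

If the same readings were shuffled, "the previous three readings" would no longer describe the recent past, and the baseline would be meaningless.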
Version 2
Study Guide: Understanding Data Mining and Data
Concepts
1. Introduction to Data Mining
- Definition: Data mining is the non-trivial
extraction of implicit, previously unknown, and potentially useful information from
large data sets.
- Historical Background:
- Foundations: Data mining is built upon three
disciplines: Statistics, Artificial Intelligence (AI), and Machine
Learning (ML).
- Key Figure: John Tukey, considered a founder of
modern data mining, contributed to data exploration, visualization, and
pre-processing techniques.
- Evolution: The term “data mining” emerged in the
1990s, and the field has grown with the rise of “data science,” a term
coined to differentiate between research-focused roles and practical
data analysis roles in companies.
2. Data Mining Processes and Tasks
- Pipeline:
- Data Selection: Choosing relevant data for
analysis.
- Pre-processing: Cleaning and preparing the
data.
- Transformation: Converting data into a suitable
format.
- Data Mining: Applying algorithms to extract
patterns.
- Interpretation & Evaluation: Making the results
understandable and actionable for humans.
- Key Distinctions:
- Data Mining vs. Machine Learning: Data mining
focuses on making patterns interpretable by humans, while ML is more
about automated analysis.
- Categories of Data Mining Tasks:
- Prediction Tasks: Involves predicting unknown or
future values (e.g., classification, regression).
- Descriptive Tasks: Focuses on making data
human-interpretable (e.g., clustering, association rule discovery).
3. Examples of Data Mining Techniques
- Classification:
- Purpose: Predicting the class of previously unseen
records.
- Example: Predicting consumer behavior based on
demographics or classifying galaxies by their formation stage.
- Regression:
- Purpose: Predicting a continuous value rather than
a class.
- Example: Estimating deep adipose tissue based on
height and weight.
- Clustering:
- Purpose: Grouping similar data points into
clusters.
- Example: Document clustering for categorizing news
articles into genres.
- Association Rule Discovery:
- Purpose: Finding rules that associate different
attributes in a data set.
- Example: Market basket analysis, like discovering
that buying diapers is often associated with buying beer.
4. Understanding Data
- Data Structure:
- Objects and Attributes:
- An object is a collection of attributes (e.g., a
row in a table).
- Attributes (also known as variables, fields,
features) represent the properties of the data (e.g., columns in a
table).
- Terminology:
- Object Synonyms: Records, points, samples,
instances, cases, entities.
- Attribute Synonyms: Variables, fields,
characteristics, features.
5. Types of Attributes
- Discrete Attributes:
- Nominal: Distinct categories without order (e.g.,
eye color).
- Ordinal: Categories with a specific order (e.g.,
movie ratings).
- Continuous Attributes:
- Interval: Ordered with equal intervals, no true
zero (e.g., temperature in Celsius).
- Ratio: Ordered with a true zero, supports
multiplication (e.g., height).
6. Types of Data Sets
- Record Data:
- Structure: Data organized in rows and columns,
similar to relational databases.
- Transaction Data: A subset where not every column
needs a value, akin to document-style databases.
- Graph Data:
- Structure: Data represented as nodes and edges,
often used for relationships (e.g., hyperlinks on the web).
- Sequential Data:
- Structure: Data with a preserved order, often
including timestamps (e.g., sensor data).
7. Self-Test Questions
- Reflect on the differences between data mining tasks (e.g.,
classification vs. clustering) and the types of attributes (e.g.,
nominal vs. interval).
- Consider examples of different data set types and how they are
structured.
Self-Test Answers:
Reflecting on the differences between data mining tasks and
types of attributes involves understanding how each task
utilizes data attributes and how datasets are structured to facilitate
these tasks.
Data Mining Tasks
- Classification:
- Purpose: Assigns items to predefined categories or
classes.
- Attributes: Often involves nominal or ordinal
attributes as the target variable (e.g., class labels like ‘spam’ or
‘not spam’).
- Example: Email classification where the dataset
includes attributes like email content, sender, and time, with the
target attribute being the class label (spam/not spam).
- Clustering:
- Purpose: Groups similar items together without
predefined labels.
- Attributes: Can involve any type of attribute, but
often uses numerical (interval or ratio) attributes for calculating
distances or similarities.
- Example: Customer segmentation where the dataset
includes attributes like purchase history, age, and location, and the
goal is to group customers into segments based on similarity.
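Distance calculations on numerical attributes are sensitive to units, so attributes are often rescaled first. The sketch below uses three made-up customers with age and yearly spend. Min-max scaling is one common choice and is an assumption here, not something prescribed by the course. It uses math.dist, available in Python 3.8 and later.

```python
import math

# Hypothetical customers: (age in years, yearly spend in dollars).
customers = {"a": (25, 1200.0), "b": (27, 5200.0), "c": (60, 1300.0)}

def min_max_scale(rows):
    """Rescale each attribute to [0, 1] so no single attribute dominates.
    Assumes every attribute takes at least two distinct values."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

scaled = dict(zip(customers, min_max_scale(list(customers.values()))))

# Raw distances: spend (in dollars) swamps age (in years).
raw_ab = math.dist(customers["a"], customers["b"])
raw_ac = math.dist(customers["a"], customers["c"])

# Scaled distances: a large age gap now counts as much as a large spend gap.
scaled_ab = math.dist(scaled["a"], scaled["b"])
scaled_ac = math.dist(scaled["a"], scaled["c"])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On the raw values, customer c looks far closer to a than b does, purely because dollars are larger numbers than years. After scaling, the two distances are comparable.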
Types of Attributes
- Nominal Attributes:
- Characteristics: Categorical with no inherent
order.
- Example: Colors (red, blue, green) or product
categories (electronics, clothing, groceries).
- Ordinal Attributes:
- Characteristics: Categorical with a meaningful
order but no fixed interval.
- Example: Customer satisfaction ratings (poor, fair,
good, excellent).
- Interval Attributes:
- Characteristics: Numerical with meaningful
intervals but no true zero.
- Example: Temperature in Celsius or Fahrenheit.
- Ratio Attributes:
- Characteristics: Numerical with meaningful
intervals and a true zero.
- Example: Height, weight, or age.
Examples of Different Data Set Types
- Transactional Data Set:
- Structure: Each record represents a transaction,
often with a timestamp and multiple items.
- Example: A retail sales dataset where each row
includes transaction ID, date, items purchased, and total amount.
- Time-Series Data Set:
- Structure: Ordered data points indexed by
time.
- Example: Stock prices dataset where each row
includes date, opening price, closing price, and volume.
- Spatial Data Set:
- Structure: Data with spatial attributes, often
including coordinates.
- Example: Geographic information system (GIS) data
where each record includes location coordinates and attributes like
elevation or land use type.
- Textual Data Set:
- Structure: Unstructured or semi-structured data,
often requiring preprocessing.
- Example: A collection of documents or social media
posts where each entry includes text content and metadata like author
and timestamp.
Questions from Module 1 Asynch:
- If predicting from other customers, dividing up customers by
potential profitability is:
- Classification and regression (Correct)
- Dividing up customers by potential profitability is
Classification and regression because it can involve categorizing
customers into groups based on their potential profitability
(classification) or predicting a continuous profitability value
(regression).
- Extracting frequency of sound is:
- Not data mining (Correct)
- Extracting frequency of sound is Not data mining
because it is a signal processing task, not a data mining task.
- Finding someone’s adipose tissue measure from waist
circumference is:
- Regression (Correct)
- Finding someone’s adipose tissue measure from waist circumference is
Regression because it involves predicting a continuous
value (adipose tissue measure) from a single input
variable (waist circumference).
- Deciding if a person has diabetes based upon his or her
history and diet is:
- Classification (Correct)
- This is a Classification task because it involves
categorizing a person as either having diabetes or not (presence or
absence of the condition).
- Depending on the methods used, deciding if someone will like
a movie on Netflix given ratings of other users is:
- Answers A through D (Correct)
- Deciding if someone will like a movie on Netflix given ratings of
other users is Answers A through D because it can be
approached using various methods, including Association (e.g.,
collaborative filtering), Classification (e.g., predicting like or
dislike), Clustering (e.g., grouping similar users), or Regression (e.g.,
predicting a user’s numeric rating).
- Depending on the methods used, finding the genre of an
online article based on the words in it is:
- Classification or clustering (Correct)
- Finding the genre of an online article based on the words in it is
Classification or Clustering because it can be
approached using either Classification (e.g., predicting a specific
genre) or Clustering (e.g., grouping similar articles).
- Depending on the methods used, finding which Google searches
are likely to follow one another is:
- Classification or clustering (Correct)
- It could involve classification (predicting the next search) or
clustering (grouping similar searches).
- Is the brightness from a light meter interval or
ratio?
- Ratio (Correct)
- Brightness from a light meter is Ratio because it
has a true zero point (i.e., complete darkness) and can be measured in a
continuous range.
- Is an angle measured from 0–360 degrees interval or
ratio?
- Could be either interval or ratio (Correct)
- Depending on context, an angle can be treated as interval (if the
zero direction is an arbitrary reference, such as a compass bearing) or
ratio (if 0 degrees means no rotation at all, so that 90 degrees is
twice 45 degrees). The unit, degrees or radians, does not change the
scale type.
- Is the height above sea level interval or ratio?
- Ratio (Correct)
- Height above sea level has an absolute zero (sea level), and can be
measured in a continuous range, making it a Ratio
Measurement.
- Is military ranking ordinal, nominal, or binary?
- Ordinal (Correct)
- Military ranking is Ordinal because it represents a
specific ranked order (e.g., private, sergeant, lieutenant), but the
differences between ranks are not necessarily equal.
- Is a coat check number ordinal, nominal, or binary?
- Nominal (Correct)
- A coat check number is a label with no inherent order or numerical
value, making it nominal.
- Is time as a.m. or p.m. ordinal, nominal, or
binary?
- Binary (Correct)
- Time as a.m. or p.m. is Binary because it
represents a simple dichotomy (2 categories, i.e., morning or
afternoon).
- Is the brightness, as given by a human, ordinal, nominal, or
binary?
- Ordinal (Correct)
- Human-perceived brightness is Ordinal because it
represents a subjective ranking (e.g., dim, medium, bright), but the
differences between ranks are not necessarily equal.
- What best describes a database of movies and actors who
played in them?
- Either transaction or graph (Correct)
- A database of movies and actors who played in them is Either
transaction or graph because it can be represented as either a
set of transactions (e.g., actor X played in movie Y) or a graph (e.g.,
actors connected to movies).
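Both views of the same data can be sketched from one list of credits. The movie titles and actor names below are placeholders, and co_stars is a hypothetical helper that shows a question the graph view answers naturally.

```python
from collections import defaultdict

# Made-up credits; titles and names are placeholders.
credits = [
    ("Movie A", "Actor 1"), ("Movie A", "Actor 2"),
    ("Movie B", "Actor 2"), ("Movie B", "Actor 3"),
]

# Transaction view: each movie is a transaction whose items are its actors.
cast_by_movie = defaultdict(set)
for movie, actor in credits:
    cast_by_movie[movie].add(actor)

# Graph view: movies and actors are nodes; an edge means "played in".
graph = defaultdict(set)
for movie, actor in credits:
    graph[movie].add(actor)
    graph[actor].add(movie)

def co_stars(actor):
    """Actors two hops away in the graph: they share at least one movie."""
    return {a for m in graph[actor] for a in graph[m]} - {actor}

costars_of_2 = co_stars("Actor 2")
print(dict(cast_by_movie))
print(costars_of_2)
```

The transaction view suits questions such as "which actors appear together often", while the graph view suits questions about paths and connections.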
- What best describes a database of someone’s Amazon purchase
history?
- Either transaction or sequential (Correct)
- A database of someone’s Amazon purchase history is Either
transaction or sequential because it can be represented as
either a set of transactions (e.g., purchase X on date Y) or a sequence
of purchases over time.
- What best describes a data set of people’s attributes with
diabetes?
- Record (Correct)
- A data set of people’s attributes with diabetes is
Record because it represents a collection of individual
records or cases, each with various attributes (e.g., age, blood
pressure, etc.).
- What best describes a database of your movements while
sleeping, recorded each minute?
- Sequential (Correct)
- The data is Sequential because it represents a sequence of
measurements over time.
