IT408 / IT408 SC:
Data Mining

Unit 1: Introduction

R Batzinger

2026-06-08

Course Details

INSTRUCTOR: A.Dr. Robert P. Batzinger
COURSE DESCRIPTION: Fundamental data mining concepts and techniques, followed by more advanced concepts and algorithms. (R-based)
VENUE: PC301 Mon / Thu 9-12
MIDTERM: 2 Jul 2026 TIME 10:00 - 12:00
FINAL: 24 Jul 2026 TIME 13:00 - 16:00 (C)
There will be a 1-hr lab test in which students will be expected to create an accurate assessment of a dataset and its potential.

Course Goals:

This course emphasizes the development of fundamental programming skills, critical analysis, and problem-solving abilities resulting in the creation of well-documented, readable, and maintainable code.
Students are expected to understand the code they submit and be able to explain their design choices and implementation details.
Oral reassessment takes precedence in grading.

Course Textbooks:

Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Edition, O’Reilly Press

Garrett Grolemund, 2014. Hands-On Programming with R. O’Reilly Press (https://rstudio-education.github.io/hopr/)

IT408 SC vs IT408

Characteristic	IT 408 SC	IT 408
Midterm & Finals	Same	Same
3 Wed Sessions on Sentiment analysis	(Optional)	Required
General Lab test	Required	Required
Sentiment analysis lab test	(Extra credit)	Required
Course hours	30	45

Schedule

June

Su	Mn	Tu	We	Th	Fr	Sa
	1	2	3	4	5	6
7	[8]	9	10	[11]	12	13
14	[15]	16	17	[18]	19	20
21	[22]	23	24	[25]	26	27
28	[29]	30

July

Su	Mn	Tu	We	Th	Fr	Sa
			(1)*	[[2]]	3	4
5	[6]	7	(8)*	[9]	10	11
12	[13]	14	(15)*	[16]L	17	18
19	[20]	21	22	[23]	[[24]]	25
26	27	28	29	30	31

* IT408 Special Studies; L - Lab test

Assessment

Class Participation 5%
Assignments, quizzes 15%
Midterm Exam 40%
Final Exam 40%

Grading

\[\matrix{ \text{Score } s & \text{Grade} \cr \hline 80 \le s \le 100 & A\ \ \cr 75 \le s \lt 80 & B^+ \cr 70 \le s \lt 75 & B\ \ \cr 65 \le s \lt 70 & C^+ \cr 60 \le s \lt 65 & C\ \ \cr 55 \le s \lt 60 & D^+ \cr 50 \le s \lt 55 & D\ \ \cr 0 \le s \lt 50 & F \ \ \cr }\]

Late penalty

\[\matrix{ \text{Days late} & \text{Grade reduced} \cr \hline 0 & 0\% \cr 1 & 25\% \cr 2 & 50\% \cr 3 & 75\% \cr \gt 3 & 100\%\cr}\]

Code Assessment

Ignite Code Presentation: a 5-minute talk with 20 automatically advancing slides that explains key aspects of the program to the class:
- Describe what the program was designed to do
- Describe the algorithm and key implementation choices
- Highlights of the software
- Describe any novel approaches taken
- Sample input and output
- Current limitations of the software
A dynamic research notebook that provides the background information for a research paper:
- Introduction: Capture the rationale, key goals and objectives of the project
- Methodology: Descriptions and citations of the data sources and methodologies used, research steps, and working code broken into documented fragments
- Results: Comparison of research outcomes to expectations
- Discussion: Interpretation of the results and possible future research directions

Red Flags for AI-Generated Code

Automatic point deductions may apply for:

Code that the student cannot explain or modify
Sophisticated techniques not covered in class without explanation
Inconsistent coding style within the same assignment
Comments that don’t match the actual code
Over-engineering for simple problems
Generic variable names throughout (e.g., data, result, output)
Perfect code with no evidence of iteration or debugging
Ignite presentation that runs over 5 minutes or fails to cover the essentials

Ideal Software

Correctness & Functionality

Program is fully functional, robust, and correctly implements all specified requirements, including edge cases.
Handles unexpected inputs gracefully.
Output is accurate and consistently matches expected results.

Code Quality & Readability

Code is exceptionally clean, well-organized, and easy to understand.
Follows established coding conventions (e.g., naming, indentation, spacing) consistently.
Uses appropriate data structures and algorithms.
Functions/methods are cohesive and have clear responsibilities.
Minimal duplication.

Documentation & Comments

Comprehensive and clear documentation.
Program-level documentation (e.g., header comments explaining purpose, author, date, usage) is present and informative.
Functions/methods are well-documented, explaining parameters, return values, and purpose.
In-line comments explain complex logic effectively where necessary, avoiding redundancy.
Notebook file is thorough.

Problem-Solving & Design

Demonstrates excellent understanding of the problem.
Solution is elegant, efficient, and well-structured, reflecting thoughtful design choices.
Breaks down the problem into logical, manageable components.

Originality & Understanding

Demonstrates clear individual effort and a deep understanding of the submitted code.
Can articulate design decisions, explain specific lines of code, and debug effectively.
Code contains unique elements or a distinctive approach that strongly suggests student authorship.

Note: Professional understanding and integrity are assumed to accompany authorship and will be constantly tested throughout this course.

Software Installation

R 4.6.0
- Download: https://cran.r-project.org/bin/
RStudio
- Download: https://posit.co/download/rstudio-desktop

Characteristics of Data

Volume - amount of data
Velocity - speed at which data is generated and processed
Variety - different forms of data (e.g., structured vs unstructured, text vs numerical)
Veracity - quality/accuracy of data
Value - worth of the data

An Example: Twin River Dancers

Is this real?

https://www.facebook.com/share/r/16ZzjQbE7D/

Clues for the answer

Different heights
Timing is different
Precise time measurements identify 2 distinct sets of clicks
Stereo analysis shows that the clicks match the direction
Blooper clip:

https://www.facebook.com/reel/1097601701609628

Another example

https://www.economist.com/interactive/trump-approval-tracker?utm_campaign=a.the-economist-today&utm_medium=email.internal-newsletter.np&utm_source=salesforce-marketing-cloud&utm_term=6/3/2026&utm_id=2191707

Concepts

Data
Fact
Opinion
Belief
Hypothesis
Theory
Truth

Definitions:

Data: Factual information (such as measurements, statistics, or observations) used as a basis for reasoning, discussion, or calculation.
Fact: A piece of information presented as having objective reality; an event or item of information that can be verified through objective evidence, empirical observation, or rigorous proof.
Opinion: A view, judgment, or appraisal formed in the mind about a particular matter; a belief stronger than impression but less strong than positive knowledge, which cannot be conclusively proven or verified by objective evidence.
Belief: A perception that something seems true, genuine, or real, often on the basis of emotional conviction, authority, or faith, rather than on absolute empirical proof or demonstration.
Hypothesis: A tentative, testable assumption or proposition advanced to explain a phenomenon or relationship. It serves as a starting point for further investigation, experimentation, or data collection, and has not yet been thoroughly proven.
Theory: A well-substantiated, structurally sound explanation of some aspect of the natural or digital world. It incorporates facts, laws, inferences, and tested hypotheses, and is widely accepted within a field because it reliably predicts outcomes and survives repeated testing.
Truth: The property of being in accord with fact or reality. In logic and philosophy, a statement is considered a truth if it accurately describes the actual state of affairs or conforms to an established, verifiable reality.

Discussion of conceptions

Truth
Insight
Information
Data
Mistake (glitch)
Error
Variation
Outlier
Falsehood
Misrepresentation

Key Disciplines

Data Mining: The process of discovering patterns, insights, and anomalies from large datasets. The information extracted can be used for dataset development, decision-making, prediction, and/or understanding.
Data Science: An interdisciplinary field that combines statistical methods, data manipulation and analysis, and domain expertise to extract knowledge and insights from data. Data scientists are involved in the entire data lifecycle, from collection and cleaning to analysis, modeling, and communication of results.
Data Analytics: A multidisciplinary field focused on analyzing raw data to extract meaningful insights, identify patterns, and draw actionable conclusions. It combines elements of mathematics, statistics, computer programming, and business intelligence to transform disorganized, complex data streams into structured information that guides strategic decision-making.
Artificial Intelligence (AI): A broad field of computer science that focuses on creating intelligent agents that assess a situation and take actions to advance toward defined goals. The goal of AI is to enable machines to simulate human intelligence.
Machine Learning (ML): A subfield of Artificial Intelligence that focuses on developing algorithms that allow computers to “learn” from data without being explicitly programmed. Instead of following rigid rules, ML models learn patterns and make predictions or decisions based on the data they are trained on. ML approaches include supervised, unsupervised, and reinforcement learning.
Deep Learning (DL): A subfield of Machine Learning that uses artificial neural networks with multiple layers. Deep learning excels at learning complex patterns from large datasets, used in areas like image recognition, speech recognition, and natural language processing.

Goals of data analytics:

Descriptive (What happened?)
Diagnostic (Why did it happen?)
Predictive (What is likely to happen?)
Prescriptive (How should we respond?)

Classification & Prediction

These concepts are fundamental to supervised learning, where the model learns from labeled historical data to predict outcomes for new, unseen data.

Training vs. Testing Sets

Training Set: A partition of the dataset used to build and train the data mining model. The algorithm searches for patterns, rules, or mathematical functions within this data.
Testing Set: A separate partition of the dataset, completely independent of the training set, used exclusively to evaluate the final model’s performance and accuracy.
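The partition described above can be sketched in a few lines. This is a minimal illustration in Python (the course itself works in R); the function name `split_train_test`, the 25% test fraction, and the seed are assumptions chosen for the example, not values from the course material.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the partition is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(20))  # stand-in for 20 labeled rows
train, test = split_train_test(records)
print(len(train), len(test))
```

Shuffling before splitting guards against any ordering in the original file, and the two lists share no rows, which keeps the testing set independent of the training set.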

Decision Tree:

A flowchart-like tree structure used for classification or regression.
Each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or final decision.
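One way to picture this structure is as nested nodes. The sketch below is in Python (the course itself works in R) and uses a hypothetical hand-written tree about playing outdoors; the attribute names and class labels are invented for illustration, and no tree is learned from data here.

```python
# Hypothetical tree: internal nodes test an attribute, branches are the test
# outcomes, and leaves (plain strings) hold the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "stay in", "normal": "play"}},
        "overcast": "play",
        "rainy": {"attribute": "windy",
                  "branches": {True: "stay in", False: "play"}},
    },
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": False}))
```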

Measuring errors

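Variance: A measure of how widely values are spread around their mean, calculated as the average squared deviation from the mean. In modeling, high variance means that a model’s predictions change substantially with the particular training sample, which is characteristic of overfitting.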
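As a small worked example, the variance of a list of numbers can be computed as the average squared deviation from the mean. The sketch is in Python (the course itself works in R), and the eight values are hypothetical.

```python
import statistics

# Hypothetical sample of eight measurements.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)

# Population variance: average squared deviation from the mean.
population_variance = sum((v - mean) ** 2 for v in values) / len(values)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(values)

print(mean, population_variance, round(sample_variance, 3))
```

The hand-written population variance agrees with `statistics.pvariance(values)`; the sample variance is slightly larger because of the smaller divisor.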

Overfitting: A modeling error that occurs when an algorithm captures the noise or random fluctuations in the training dataset rather than the underlying data distribution. An overfitted model performs exceptionally well on the training data but fails to generalize to new, unseen testing data.
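A toy illustration of this effect, sketched in Python (the course itself works in R). All data are hypothetical: the underlying rule is assumed to be "label is 1 when x >= 5", and two training labels are deliberately flipped to act as noise. The `memorizer` is a 1-nearest-neighbour lookup; the `simple_rule` threshold is written by hand rather than learned.

```python
# Hypothetical 1-D data: the underlying rule is label = 1 when x >= 5.
# Two training labels (x = 1 and x = 8) are flipped on purpose to act as noise.
training = [(0, 0), (1, 1), (2, 0), (3, 0), (4, 0),
            (5, 1), (6, 1), (7, 1), (8, 0), (9, 1)]
testing = [(1.2, 0), (3.4, 0), (4.4, 0), (5.6, 1), (7.8, 1), (8.3, 1)]

def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest_x, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def simple_rule(x):
    """A single threshold that ignores the noisy labels."""
    return 1 if x >= 5 else 0

def accuracy(model, data):
    """Fraction of (x, label) pairs that the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, accuracy(model, training), accuracy(model, testing))
```

The memorizer reproduces every training label, noise included, so it scores perfectly on the training data but only 0.5 on the testing data. The simple rule misses the two noisy training points (0.8) yet classifies every testing point correctly.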

Confusion Matrix:

A tabular layout used to visualize and evaluate the performance of a supervised learning algorithm.

It maps actual class labels against predicted class labels, breaking down results into:

	Actual True	Actual False
Predicted True	True Positives (TP)	False Positives (FP) (Type I Error)
Predicted False	False Negatives (FN) (Type II Error)	True Negatives (TN)
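The four cells can be counted directly from paired actual and predicted labels. The sketch below is in Python (the course itself works in R) and uses ten hypothetical observations. Accuracy, precision, and recall are standard summaries derived from the four counts and are included here as an addition to the table above.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for two equal-length lists of booleans."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1
        elif p and not a:
            counts["FP"] += 1   # Type I error
        elif a:
            counts["FN"] += 1   # Type II error
        else:
            counts["TN"] += 1
    return counts

# Hypothetical labels for ten observations.
actual    = [True, True, True, False, False, False, True, False, True, False]
predicted = [True, False, True, False, True, False, True, False, False, False]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / len(actual)      # share of all predictions that are correct
precision = c["TP"] / (c["TP"] + c["FP"])         # share of predicted positives that are real
recall = c["TP"] / (c["TP"] + c["FN"])            # share of real positives that were found
print(c, accuracy, precision, recall)
```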

Market Basket Transaction Associations

Association Rule: An implication expression of the form \(X \rightarrow Y\), where \(X\) and \(Y\) are disjoint itemsets. It signifies that if the items in set \(X\) are present in a transaction, the items in set \(Y\) are also likely to be present.
Support: A metric that measures how frequently an itemset appears in the entire database. Mathematically, for a rule \(X \rightarrow Y\):

\[\text{Support}(X \rightarrow Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}\]

Confidence: A metric that measures how often items in \(Y\) appear in transactions that already contain \(X\). It assesses the reliability of the inference made by the rule:

\[\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}\]

Lift: A metric used to measure the strength of an association rule by comparing the co-occurrence of \(X\) and \(Y\) against what would be expected if they were completely independent.

\[\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}\]

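A lift value \(> 1\) indicates that \(X\) and \(Y\) are positively associated: the presence of \(X\) increases the likelihood of \(Y\) occurring. A lift of \(1\) indicates that \(X\) and \(Y\) are independent, and a lift \(\lt 1\) indicates that the presence of \(X\) makes \(Y\) less likely.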
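The three metrics can be checked by hand on a small example. The sketch below is in Python (the course itself works in R); the five transactions and the rule {diapers} → {beer} are hypothetical, and the functions follow the formulas given above.

```python
# Five hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y):
    """Support(X u Y) / Support(X)."""
    return support(x | y) / support(x)

def lift(x, y):
    """Confidence(X -> Y) / Support(Y)."""
    return confidence(x, y) / support(y)

x, y = {"diapers"}, {"beer"}
print(round(support(x | y), 3), round(confidence(x, y), 3), round(lift(x, y), 3))
```

Here 3 of 5 transactions contain both items (support 0.6), 3 of the 4 diaper transactions also contain beer (confidence 0.75), and beer alone appears in 3 of 5 transactions, giving a lift of 0.75 / 0.6 = 1.25.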

Data Processing Steps

Common Data Mining Goals

Description
Estimation
Classification
Clustering
Prediction
Association

What’s the BIGGEST challenge in doing real-world ML projects?

Getting quality data - 59%
Model not generalizing well - 20%
Deployment issues - 12%
Stakeholder alignment - 8%

R vs Python

Similarities

Package manager to facilitate loading and updating software libraries
Extensive collection of modules and packages for a wide range of functions (maps, data manipulation, etc.)
Active support and continued development from academic and corporate user communities
Integrated Development Environment and Data Workbook

Differences

Feature	R	Python
Overview	R is a language and environment for statistical programming, including statistical computing and data graphics.	Python is a general-purpose programming language for data analysis, scientific computing and application development. It aims to reduce program complexity through common approaches.
Design Objective	Designed by statisticians for data analysis, modelling and representation for both batch computation and interactive websites. Designed to simplify complex mathematics and statistics.	Designed by engineers and computer scientists to develop GUI, web and embedded hardware applications
Key applications	Forecasting, Data Visualization, Machine Learning	Data collection, Computer Vision, Machine Learning

More differences

[ R vs Python ]

Domain Dominance

Popularity (TIOBE)

https://www.tiobe.com/tiobe-index/

R Notebook

Chiang Mai weather

Source: api.open-meteo.com
- Current temp:

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true

- Past week:

https://api.open-meteo.com/v1/forecast?latitude=18.7706&longitude=98.9626&hourly=temperature_2m&past_days=7&forecast_days=0&timezone=Asia%2FBangkok
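A minimal sketch of how the past-week request could be built and summarized, written in Python (the course itself works in R). The helper names are hypothetical, and the assumed response layout (`hourly` → `temperature_2m`) should be checked against the Open-Meteo documentation. The sample payload is invented so that the example runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.open-meteo.com/v1/forecast"

def past_week_url(latitude=18.7706, longitude=98.9626):
    """Build the past-week hourly temperature URL shown above."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m",
        "past_days": 7,
        "forecast_days": 0,
        "timezone": "Asia/Bangkok",
    }
    return f"{BASE_URL}?{urlencode(params)}"

def fetch_json(url, timeout=10):
    """Download and decode a JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

def mean_temperature(payload):
    """Average the hourly temperatures, skipping missing (None) readings.

    Assumes the layout payload["hourly"]["temperature_2m"].
    """
    temps = [t for t in payload["hourly"]["temperature_2m"] if t is not None]
    return sum(temps) / len(temps)

# Invented payload in the assumed layout, used instead of a live request.
sample = {"hourly": {"time": ["t0", "t1", "t2", "t3"],
                     "temperature_2m": [24.0, 26.0, None, 31.0]}}

print(past_week_url())
print(mean_temperature(sample))
# Live use: mean_temperature(fetch_json(past_week_url()))
```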

R
