IT408 / IT408 SC:
Data Mining

Unit 1: Introduction

R Batzinger

2026-06-08


Course Details

  • INSTRUCTOR: A.Dr. Robert P. Batzinger

  • COURSE DESCRIPTION: Fundamental data mining concepts and techniques, followed by more advanced concepts and algorithms. (R-based)

  • VENUE: PC301 Mon / Thu 9-12

  • MIDTERM: 2 Jul 2026 TIME 10:00 - 12:00

  • FINAL: 24 Jul 2026 TIME 13:00 - 16:00 (C)

  • There will be a 1-hr lab test in which students are expected to produce an accurate assessment of a dataset and its potential.

Course Goals:

  • This course emphasizes the development of fundamental programming skills, critical analysis, and problem-solving abilities resulting in the creation of well-documented, readable, and maintainable code.
  • Students are expected to understand the code they submit and be able to explain their design choices and implementation details.
  • Oral reassessment takes precedence in grading.

Course Textbooks:

Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, 2nd Edition. O’Reilly Press

Garrett Grolemund, 2014. Hands-On Programming with R. O’Reilly Press (https://rstudio-education.github.io/hopr/)

IT408 SC vs IT408

Characteristic                       | IT 408 SC      | IT 408
Midterm & Finals                     | Same           | Same
3 Wed Sessions on Sentiment analysis | (Optional)     | Required
General Lab test                     | Required       | Required
Sentiment analysis lab test          | (Extra credit) | Required
Course hours                         | 30             | 45

Schedule

  • June
Su      Mn      Tu      We      Th      Fr      Sa
        1       2       3       4       5       6
7       [8]     9       10      [11]    12      13
14      [15]    16      17      [18]    19      20
21      [22]    23      24      [25]    26      27
28      [29]    30
  • July
Su      Mn      Tu      We      Th      Fr      Sa
                        (1)*    [[2]]   3       4
5       [6]     7       (8)*    [9]     10      11
12      [13]    14      (15)*   [16]L   17      18
19      [20]    21      22      [23]    [[24]]  25
26      27      28      29      30      31
[ ] class session; [[ ]] exam; * IT408 Special Studies; L - Lab test

Assessment

  • Class Participation 5%
  • Assignments, quizzes 15%
  • Midterm Exam 40%
  • Final Exam 40%

Grading

\[\matrix{ \text{Score } (s) & \text{Grade} \cr \hline 80 \le s \le 100 & A\ \ \cr 75 \le s \lt 80 & B^+ \cr 70 \le s \lt 75 & B\ \ \cr 65 \le s \lt 70 & C^+ \cr 60 \le s \lt 65 & C\ \ \cr 55 \le s \lt 60 & D^+ \cr 50 \le s \lt 55 & D\ \ \cr 0 \le s \lt 50 & F \ \ \cr }\]

Late penalty

\[\matrix{ \text{Days late} & \text{Grade reduced} \cr \hline 0 & 0\% \cr 1 & 25\% \cr 2 & 50\% \cr 3 & 75\% \cr \gt 3 & 100\%\cr}\]

Code Assessment

  • Ignite Code Presentation: A 5-minute presentation to the class, with 20 slides that advance automatically, explaining key aspects of the program:
    • Describe what the program was designed to do
    • Describe the algorithm and key implementation choices
    • Highlights of the software
    • Describe any novel approaches taken
    • Sample input and output
    • Current limitations of the software
  • A dynamic research notebook to provide the background information for a research paper
    • Introduction: Capture the rationale, key goals and objectives of the project
    • Methodology: Descriptions and citations of the data sources and methodologies used, research steps, and working code broken into documented fragments
    • Results: Comparison of research outcomes to expectations
    • Discussion: Interpretation of the results and possible future research directions

Red Flags for AI-Generated Code

Automatic point deductions may apply for:

  • Code that the student cannot explain or modify
  • Sophisticated techniques not covered in class without explanation
  • Inconsistent coding style within the same assignment
  • Comments that don’t match the actual code
  • Over-engineering for simple problems
  • Generic variable names throughout (e.g., data, result, output)
  • Perfect code with no evidence of iteration or debugging
  • Ignite presentation that runs over 5 minutes or fails to cover the essentials

Ideal Software

  1. Correctness & Functionality
    • Program is fully functional, robust, and correctly implements all specified requirements, including edge cases.
    • Handles unexpected inputs gracefully.
    • Output is accurate and consistently matches expected results.
  2. Code Quality & Readability
    • Code is exceptionally clean, well-organized, and easy to understand.
    • Follows established coding conventions (e.g., naming, indentation, spacing) consistently.
    • Uses appropriate data structures and algorithms.
    • Functions/methods are cohesive and have clear responsibilities.
    • Minimal duplication.
  3. Documentation & Comments
    • Comprehensive and clear documentation.
    • Program-level documentation (e.g., header comments explaining purpose, author, date, usage) is present and informative.
    • Functions/methods are well-documented, explaining parameters, return values, and purpose.
    • In-line comments explain complex logic effectively where necessary, avoiding redundancy.
    • Notebook file is thorough.
  4. Problem-Solving & Design
    • Demonstrates excellent understanding of the problem.
    • Solution is elegant, efficient, and well-structured, reflecting thoughtful design choices.
    • Breaks down the problem into logical, manageable components.
  5. Originality & Understanding
    • Demonstrates clear individual effort and a deep understanding of the submitted code.
    • Can articulate design decisions, explain specific lines of code, and debug effectively.
    • Code contains unique elements or a distinctive approach that strongly suggests student authorship.

Note: Professional understanding and integrity are assumed to accompany authorship and will be constantly tested throughout this course.

Software Installation

  • R 4.6.0
    • Download: https://cran.r-project.org/bin/
  • RStudio
    • Download: https://posit.co/download/rstudio-desktop

Characteristics of Data

  • Volume - amount of data
  • Velocity - speed at which data is generated and processed
  • Variety - different forms of data (e.g., structured vs. unstructured, text vs. numerical)
  • Veracity - quality/accuracy of data
  • Value - worth of the data

An Example: Twin River Dancers

Is this real?

https://www.facebook.com/share/r/16ZzjQbE7D/

Clues for the answer

  • Different heights
  • Timing is different
  • Precise time measurements identify 2 distinct sets of clicks
  • Stereo analysis shows that the clicks match the direction
  • Blooper clip:

https://www.facebook.com/reel/1097601701609628

Another example

https://www.economist.com/interactive/trump-approval-tracker?utm_campaign=a.the-economist-today&utm_medium=email.internal-newsletter.np&utm_source=salesforce-marketing-cloud&utm_term=6/3/2026&utm_id=2191707

Concepts

  • Data
  • Fact
  • Opinion
  • Belief
  • Hypothesis
  • Theory
  • Truth

Definitions:

  • Data: Factual information (such as measurements, statistics, or observations) used as a basis for reasoning, discussion, or calculation.

  • Fact: A piece of information presented as having objective reality; an event or item of information that can be verified through objective evidence, empirical observation, or rigorous proof.

  • Opinion: A view, judgment, or appraisal formed in the mind about a particular matter; a belief stronger than impression but less strong than positive knowledge, which cannot be conclusively proven or verified by objective evidence.

  • Belief: A perception that something seems true, genuine, or real, often on the basis of emotional conviction, authority, or faith, rather than on absolute empirical proof or demonstration.

  • Hypothesis: A tentative, testable assumption or proposition advanced to explain a phenomenon or relationship. It serves as a starting point for further investigation, experimentation, or data collection, and has not yet been thoroughly proven.

  • Theory: A well-substantiated, structurally sound explanation of some aspect of the natural or digital world. It incorporates facts, laws, inferences, and tested hypotheses, and is widely accepted within a field because it reliably predicts outcomes and survives repeated testing.

  • Truth: The property of being in accord with fact or reality. In logic and philosophy, a statement is considered a truth if it accurately describes the actual state of affairs or conforms to an established, verifiable reality.

Discussion of conceptions

  • Truth
  • Insight
  • Information
  • Data
  • Mistake (glitch)
  • Error
  • Variation
  • Outlier
  • Falsehood
  • Misrepresentation

Key Disciplines

  • Data Mining: The process of discovering patterns, insights, and anomalies from large datasets. The information extracted can be used for dataset development, decision-making, prediction, and/or understanding.

  • Data Science: An interdisciplinary field that combines statistical methods, data manipulation and analysis, and domain expertise to extract knowledge and insights from data. Data scientists are involved in the entire data lifecycle, from collection and cleaning to analysis, modeling, and communication of results.

  • Data Analytics: A multidisciplinary field focused on analyzing raw data to extract meaningful insights, identify patterns, and draw actionable conclusions. It combines elements of mathematics, statistics, computer programming, and business intelligence to transform disorganized, complex data streams into structured information that guides strategic decision-making.

  • Artificial Intelligence (AI): A broad field of computer science that focuses on creating intelligent agents that assess their situation and take actions to advance toward defined goals. The goal of AI is to enable machines to simulate human intelligence.

  • Machine Learning (ML): A subfield of Artificial Intelligence that focuses on developing algorithms that allow computers to “learn” from data without being explicitly programmed. Instead of following rigid rules, ML models learn patterns and make predictions or decisions based on the data they are trained on. ML approaches include supervised, unsupervised, and reinforcement learning.

  • Deep Learning (DL): A subfield of Machine Learning that uses artificial neural networks with multiple layers. Deep learning excels at learning complex patterns from large datasets, used in areas like image recognition, speech recognition, and natural language processing.

Goals of data analytics:

  • Descriptive (What happened?)
  • Diagnostic (Why did it happen?)
  • Predictive (What is likely to happen?)
  • Prescriptive (How should we respond?)

Classification & Prediction

These concepts are fundamental to supervised learning, where the model learns from labeled historical data to predict outcomes for new, unseen data.

Training vs. Testing Sets

  • Training Set: A partition of the dataset used to build and train the data mining model. The algorithm searches for patterns, rules, or mathematical functions within this data.

  • Testing Set: A separate partition of the dataset, completely independent of the training set, used exclusively to evaluate the final model’s performance and accuracy.
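The partitioning described above can be sketched in a few lines. This is a minimal illustration in Python (the course itself is R-based); the function name `train_test_split`, the 30% test fraction, and the seed are assumptions chosen for the example, not values prescribed by the course.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    # The two slices never overlap, which keeps the testing set independent.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(1, 11))                # ten hypothetical record IDs
train, test = train_test_split(records)
print(len(train), len(test))                # prints: 7 3
```

Shuffling before splitting avoids a testing set made up only of the last records collected, which could differ systematically from the rest.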

Decision Tree:

  • A flowchart-like tree structure used for classification or regression.

  • Each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or final decision.
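The node, branch, and leaf structure above can be made concrete with a small hand-built tree. This is a minimal sketch in Python (the course itself is R-based); the tree, its attribute names, and the "play outside?" labels are hypothetical and are written by hand rather than learned from data.

```python
# Internal nodes are dicts holding the attribute to test and one branch per outcome.
# Leaves are plain strings holding the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy",
                  "branches": {"true": "no", "false": "yes"}},
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while isinstance(node, dict):            # internal node: test an attribute
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]     # follow the branch for that outcome
    return node                              # leaf: the class label

day = {"outlook": "sunny", "humidity": "high", "windy": "false"}
print(classify(tree, day))                   # prints: no
```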

Measuring errors

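  • Variance: A measure of how widely values are spread around their mean, calculated as the average squared deviation from the mean. For a model, high variance means its predictions change markedly with small changes in the training data.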
  • Overfitting: A modeling error that occurs when an algorithm captures the noise or random fluctuations in the training dataset rather than the underlying data distribution. An overfitted model performs exceptionally well on the training data but fails to generalize to new, unseen testing data.
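Both ideas can be illustrated on toy data. The sketch below is in Python (the course itself is R-based) and everything in it is an assumption made for the example: the data are invented, variance is computed with the standard population formula, and the "memorizer" is a deliberately overfitted nearest-neighbour lookup compared against a single cut-off rule.

```python
def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical training data: the true rule is "high" when x >= 5,
# but two labels (x = 3 and x = 7) are deliberately noisy.
train = [(1, "low"), (2, "low"), (3, "high"), (4, "low"),
         (6, "high"), (7, "low"), (8, "high"), (9, "high")]
# Hypothetical unseen data that follows the true rule without noise.
test = [(1.5, "low"), (2.8, "low"), (3.2, "low"), (4.5, "low"),
        (5.5, "high"), (6.8, "high"), (7.2, "high"), (8.5, "high")]

def memorizer(x):
    """Overfitted model: copy the label of the closest training point, noise included."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def threshold_rule(x):
    """Simpler model: a single cut-off that ignores the noisy labels."""
    return "high" if x >= 5 else "low"

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)

print(variance([2, 4, 4, 4, 5, 5, 7, 9]))                                # prints: 4.0
print(accuracy(memorizer, train), accuracy(memorizer, test))            # prints: 1.0 0.5
print(accuracy(threshold_rule, train), accuracy(threshold_rule, test))  # prints: 0.75 1.0
```

The memorizer is perfect on the training data because it reproduces the noise, and that same noise is what misleads it on the unseen data.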

Confusion Matrix:

A tabular layout used to visualize and evaluate the performance of a supervised learning algorithm.

It maps actual class labels against predicted class labels, breaking down results into:

                | Actual True                          | Actual False
Predicted True  | True Positives (TP)                  | False Positives (FP) (Type I Error)
Predicted False | False Negatives (FN) (Type II Error) | True Negatives (TN)
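The four cells can be counted directly from paired lists of actual and predicted labels. This is a minimal sketch in Python (the course itself is R-based); the label lists are invented, and accuracy, precision, and recall are common derived metrics added here for illustration rather than taken from the slides.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for two equal-length lists of booleans."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1
        elif p and not a:
            counts["FP"] += 1    # Type I error: predicted true, actually false
        elif a:
            counts["FN"] += 1    # Type II error: predicted false, actually true
        else:
            counts["TN"] += 1
    return counts

# Hypothetical labels for eight records.
actual    = [True, True,  True, False, False, False, False, True]
predicted = [True, False, True, False, True,  False, False, True]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / len(actual)   # share of all predictions that are right
precision = c["TP"] / (c["TP"] + c["FP"])      # share of predicted positives that are right
recall = c["TP"] / (c["TP"] + c["FN"])         # share of actual positives that were found
print(c)                                       # prints: {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
print(accuracy, precision, recall)             # prints: 0.75 0.75 0.75
```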

Market Basket Transaction Associations

  • Association Rule: An implication expression of the form \(X \rightarrow Y\), where \(X\) and \(Y\) are disjoint itemsets. It signifies that if the items in set \(X\) are present in a transaction, the items in set \(Y\) are also likely to be present.

  • Support: A metric that measures how frequently an itemset appears in the entire database. Mathematically, for a rule \(X \rightarrow Y\):

\[\text{Support}(X \rightarrow Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}\]

  • Confidence: A metric that measures how often items in \(Y\) appear in transactions that already contain \(X\). It assesses the reliability of the inference made by the rule:

\[\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}\]

  • Lift: A metric used to measure the strength of an association rule by comparing the co-occurrence of \(X\) and \(Y\) against what would be expected if they were completely independent.

\[\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}\]

  • A lift value \(> 1\) indicates that \(X\) and \(Y\) are positively associated: the presence of \(X\) increases the likelihood of \(Y\) occurring. A value of \(1\) is what independence would give, and a value \(< 1\) indicates a negative association.
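The three metrics above can be checked by hand on a tiny basket dataset. This is a minimal sketch in Python (the course itself is R-based); the five transactions and the rule {diapers} → {beer} are invented for illustration, and the functions follow the formulas given above.

```python
# Five hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Support of X and Y together, divided by the support of X alone."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(x, y) / support(y)

x, y = {"diapers"}, {"beer"}
print(round(support(x | y), 2))    # prints: 0.6  (3 of 5 baskets hold both)
print(round(confidence(x, y), 2))  # prints: 0.75 (3 of the 4 diaper baskets hold beer)
print(round(lift(x, y), 2))        # prints: 1.25 (above 1, so a positive association)
```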

Data Processing Steps

Common Data Mining Goals

  • Description
  • Estimation
  • Classification
  • Clustering
  • Prediction
  • Association

What’s the BIGGEST challenge in doing real-world ML projects?

  • Getting quality data - 59%
  • Model not generalizing well - 20%
  • Deployment issues - 12%
  • Stakeholder alignment - 8%

R vs Python

Similarities

  • Package manager to facilitate loading and updating software libraries

  • Extensive collection of modules and packages for a wide range of functions (maps, data manipulation, etc.)

  • Active support and continued development from a community of academic and corporate users

  • Integrated Development Environment and Data Workbook

Differences

Feature | R | Python
Overview | R is a language and environment for statistical programming which includes statistical computing and data graphics. | Python is a general-purpose programming language for data analysis, scientific computing and application development. Simplify program complexity using common approaches.
Design Objective | Designed by statisticians for data analysis, modelling and representation for both batch computation and interactive websites. Designed for simplifying complex mathematics and statistics. | Designed by engineers and computer scientists to develop GUI, web and embedded hardware applications
Key applications | Forecasting, Data Visualization, Machine Learning | Data collection, Computer Vision, Machine Learning

More differences

[R vs Python]

Domain Dominance

Popularity (TIOBE)

https://www.tiobe.com/tiobe-index/

R Notebook

Chiang Mai weather

  • Source: api.open-meteo.com

    • Current temperature:

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true

    • Past week:

https://api.open-meteo.com/v1/forecast?latitude=18.7706&longitude=98.9626&hourly=temperature_2m&past_days=7&forecast_days=0&timezone=Asia%2FBangkok
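The past-week request above can be assembled and its response summarized as follows. This is a minimal sketch in Python (the course itself is R-based) using only the standard library. The helper names are hypothetical, the response layout (`hourly` containing parallel `time` and `temperature_2m` lists) is an assumption about the JSON this endpoint returns, and the sample payload is invented so the example runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.open-meteo.com/v1/forecast"

def build_url(latitude, longitude, **params):
    """Assemble a forecast URL from coordinates and extra query parameters."""
    query = {"latitude": latitude, "longitude": longitude, **params}
    return f"{BASE_URL}?{urlencode(query)}"

def fetch_json(url, timeout=10):
    """Download and decode a JSON response (needs network access; not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

def mean_hourly_temperature(payload):
    """Average the hourly temperatures, skipping missing (null) readings."""
    temps = [t for t in payload["hourly"]["temperature_2m"] if t is not None]
    return sum(temps) / len(temps)

past_week_url = build_url(18.7706, 98.9626, hourly="temperature_2m",
                          past_days=7, forecast_days=0, timezone="Asia/Bangkok")
print(past_week_url)

# Invented payload in the assumed layout, used in place of a live request.
sample = {"hourly": {"time": ["2026-06-01T00:00", "2026-06-01T01:00", "2026-06-01T02:00"],
                     "temperature_2m": [26.0, 25.0, None]}}
print(mean_hourly_temperature(sample))       # prints: 25.5
```

To work with live data, pass `fetch_json(past_week_url)` to `mean_hourly_temperature` in place of `sample`.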
