IT408 Data Mining - Short Course (Python)

Unit 1: Introduction

R Batzinger

2025-07-12


## Course Details {.scrollable}

  • INSTRUCTOR: A.Dr. Robert P. Batzinger
  • MIDTERM: 27 Jun 2025 TIME 13:00 - 15:00 (C)
  • FINALS: 25 Jul 2025 TIME 09:00 - 12:00 (C)
  • VENUE: PC301 Mon / Thu 9-12 + 4 online sessions
  • COURSE DESCRIPTION: Fundamental data mining concepts and techniques, followed by more advanced concepts and algorithms (Python-based).
  • COURSE GOALS: This course emphasizes the development of fundamental programming skills, problem-solving abilities, and the creation of well-documented, readable, and maintainable code. Students are expected to understand the code they submit and be able to explain their design choices and implementation details. There will be 2 lab tests in which students will be expected to produce an accurate assessment of a dataset and its potential.

Schedule

  • June
Su Mn Tu We Th Fr Sa
1 h h 4 [5] 6 7
8 [9] 10 11 12 13 [14]+
15 [16] 17 18 [19]+ 20 21
22 [23]+ 24 25 26 [[27]] 28
29 [30]+
  • July
Su Mn Tu We Th Fr Sa
1 2
6 h h
13 18 [19]+
20 [21]+ 22 23 24 [[25]] 26
27 28 29 30 31
  • h = holiday
  • + = Extra session
  • – = Workshop in Singapore

Study plan

  • Jun 5: Intro to course, Install Python
  • Jun 9: Intro to Python
  • Jun 14: Quiz 1 (Python), Load Datasets into Python
  • Jun 14+: Data Cleansing, Data Visualization
  • Jun 16: Quiz 2 (Pandas and Graphics), Using notebooks
  • Jun 19+: Intro to Data Modeling
  • Jun 21: Linear Regression
  • Jun 23: Intro to Model testing
  • Jun 26: Other Modeling Techniques, Review
  • Jun 27: MidTerm
  • Jun 28: Quiz 3, Project Proposal
  • Jun 30+: Project Data Acquisition, Data assessment, Lab test
  • Jul 19+: Project presentations dress rehearsal
  • Jul 21+: Project presentations, report submission, Review
  • Jul 25: Exam

Assessment

  • Quizzes - 25%
  • Midterm - 25%
  • Final - 25%
  • Project Presentation/Report - 15%
  • Assignments - 10%

Grading

\[\small\begin{matrix}{}^{100}&&{}^{80}&&{}^{75} && {}^{70} && {}^{65} && {}^{60} && {}^{55} && {}^{50} && {}^{0}\\ |&A&|&B^+&|&B&|&C^+&|&C&|&D^+&|&D&|& F&|\\ \end{matrix}\]
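
As a minimal sketch, the weights in the Assessment list and the cut-offs in the scale above can be combined as follows. The names `weighted_score` and `letter_grade` are illustrative, the sample marks are made up, and the rule that a score exactly on a boundary earns the higher grade (for example, exactly 80 earns an A) is an assumption that the scale itself does not state.

```python
# Component weights from the Assessment list (fractions of the final score).
WEIGHTS = {
    "quizzes": 0.25,
    "midterm": 0.25,
    "final": 0.25,
    "project": 0.15,
    "assignments": 0.10,
}

# Lower cut-offs from the grading scale, highest first.
# Assumption: a score exactly on a boundary earns the higher grade.
CUTOFFS = [(80, "A"), (75, "B+"), (70, "B"), (65, "C+"),
           (60, "C"), (55, "D+"), (50, "D"), (0, "F")]


def weighted_score(marks):
    """Combine per-component marks (each 0-100) into one 0-100 score."""
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter using the cut-offs above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter


# Made-up marks for illustration only.
marks = {"quizzes": 80, "midterm": 70, "final": 60,
         "project": 90, "assignments": 100}
total = weighted_score(marks)
print(round(total, 1), letter_grade(total))
```

Keeping the weights and cut-offs in plain data structures means a change to the scheme only touches the two tables, not the functions.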

Late penalty

| Days late | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Grade reduced | 0% | 25% | 50% | 75% |
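
A minimal sketch of the late-penalty table, under two assumptions: the percentage is deducted from the mark earned (the table does not say whether the reduction is relative or in absolute points), and submissions more than 3 days late are left undefined because the table stops there. `apply_late_penalty` is a hypothetical helper name.

```python
# Reduction by days late, copied from the late-penalty table.
LATE_REDUCTION = {0: 0.00, 1: 0.25, 2: 0.50, 3: 0.75}


def apply_late_penalty(mark, days_late):
    """Return the mark after the late reduction.

    Assumption: the reduction is a fraction of the earned mark.
    """
    if days_late not in LATE_REDUCTION:
        raise ValueError("the table only covers 0 to 3 days late")
    return mark * (1 - LATE_REDUCTION[days_late])


print(apply_late_penalty(80, 2))
```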

Code Assessment

  • Ignite Code Presentation: A 5-minute presentation to the class, with 20 slides that advance automatically, explaining key aspects of the program:
    • Describe what the program was designed to do
    • Describe the algorithm and key implementation choices
    • Highlights of the software
    • Describe any novel approaches taken
    • Sample input and output
    • Current limitations of the software
  • Project Jupyter notebook: Provide a copy of the project Jupyter file, a dynamic research notebook that supplies the background information for a research paper:
    • Introduction: Capture the rationale, key goals and objectives of the project
    • Methodology: Descriptions and citations about the data sources and methodologies used, research steps and working code broken into documented fragments
    • Results: Comparison of research outcomes to expectation
    • Discussion: Interpretation of the results and possible future research directions

Red Flags for AI-Generated Code

Automatic point deductions may apply for:

  • Code that the student cannot explain or modify
  • Sophisticated techniques not covered in class without explanation
  • Inconsistent coding style within the same assignment
  • Comments that don’t match the actual code
  • Over-engineering for simple problems
  • Generic variable names throughout (e.g., data, result, output)
  • Perfect code with no evidence of iteration or debugging
  • Ignite presentation that runs over 5 minutes or fails to cover the essentials

Ideal Software

  1. Correctness & Functionality
  • Program is fully functional, robust, and correctly implements all specified requirements, including edge cases.
  • Handles unexpected inputs gracefully.
  • Output is accurate and consistently matches expected results.
  2. Code Quality & Readability
  • Code is exceptionally clean, well-organized, and easy to understand.
  • Follows established coding conventions (e.g., naming, indentation, spacing) consistently.
  • Uses appropriate data structures and algorithms.
  • Functions/methods are cohesive and have clear responsibilities.
  • Minimal duplication.
  3. Documentation & Comments
  • Comprehensive and clear documentation.
  • Program-level documentation (e.g., header comments explaining purpose, author, date, usage) is present and informative.
  • Functions/methods are well-documented, explaining parameters, return values, and purpose.
  • In-line comments explain complex logic effectively where necessary, avoiding redundancy.
  • Notebook file is thorough.
  4. Problem-Solving & Design
  • Demonstrates excellent understanding of the problem.
  • Solution is elegant, efficient, and well-structured, reflecting thoughtful design choices.
  • Breaks down the problem into logical, manageable components.
  5. Originality & Understanding (Anti-ChatGPT)
  • Demonstrates clear individual effort and a deep understanding of the submitted code.
  • Can articulate design decisions, explain specific lines of code, and debug effectively.
  • Code contains unique elements or a distinctive approach that strongly suggests student authorship.

Note: Suspected unoriginal work may be subject to academic integrity procedures outlined in the course syllabus.

What is Data?

  • Volume - amount of data
  • Velocity - speed at which data is generated and processed
  • Variety - different forms of data (e.g., structured vs unstructured, text vs numerical)
  • Veracity - quality/accuracy of data
  • Value - worth of the data

An Example

https://www.facebook.com/share/r/16ZzjQbE7D/

Concepts

  • Data
  • Fact
  • Opinion
  • Hypothesis
  • Theory
  • Truth

Discussion

  • Data
  • Information
  • Insight
  • Truth

The dark side

  • Mistake
  • Error
  • Outlier
  • Variance
  • Falsehood
  • Misrepresentation

Key Disciplines

  • Data Mining: The process of discovering patterns, insights, and anomalies from large datasets. The information extracted can be used for dataset development, decision-making, prediction, and/or understanding.

  • Data Science: An interdisciplinary field that combines statistical methods, data manipulation and analysis, and domain expertise to extract knowledge and insights from data. Data scientists are involved in the entire data lifecycle, from collection and cleaning to analysis, modeling, and communication of results.

  • Artificial Intelligence (AI): A broad field of computer science that focuses on creating intelligent agents that assess their situation and take actions to advance towards defined goals. The goal of AI is to enable machines to simulate human intelligence.

  • Machine Learning (ML): A subfield of Artificial Intelligence that focuses on developing algorithms that allow computers to “learn” from data without being explicitly programmed. Instead of following rigid rules, ML models learn patterns and make predictions or decisions based on the data they are trained on. ML approaches include supervised, unsupervised, and reinforcement learning.

  • Deep Learning (DL): A subfield of Machine Learning that uses artificial neural networks with multiple layers. Deep learning excels at learning complex patterns from large datasets and is used in areas like image recognition, speech recognition, and natural language processing.
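
To make the difference between supervised and unsupervised learning concrete, here is a small sketch in plain Python. The data values and the function names `fit_line` and `two_means` are made up for illustration; real projects would normally rely on a library rather than hand-written routines.

```python
def fit_line(xs, ys):
    """Supervised: learn y = slope * x + intercept from labelled pairs
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_means(values, rounds=10):
    """Unsupervised: split unlabelled numbers into two groups
    (1-D k-means with k = 2, started from the min and the max)."""
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is nothing to separate.
        return list(values), []
    for _ in range(rounds):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


# Supervised: each input (hours) comes with a known answer (score).
slope, intercept = fit_line([1, 2, 3, 4], [52, 54, 56, 58])
print(slope, intercept)

# Unsupervised: no answers are given; the groups emerge from the data.
print(two_means([1, 2, 3, 10, 11, 12]))
```

In the first call the algorithm is told the correct output for every input; in the second it receives only the values and has to find the structure on its own.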

Data Processing Steps

Common Data Mining Goals

  • Description
  • Estimation
  • Classification
  • Clustering
  • Prediction
  • Association

What’s the BIGGEST challenge in doing real-world ML projects?

  • Getting quality data - 59%
  • Model not generalizing well - 20%
  • Deployment issues - 12%
  • Stakeholder alignment - 8%

R vs Python

Similarities

  • Package manager to facilitate loading and updating software libraries

  • Extensive collection of modules and packages for a wide range of functions (maps, data manipulation, etc.)

  • Active support and continued development from academic and corporate user communities

  • Integrated Development Environment and Data Workbook

Differences

| Feature | R | Python |
|---|---|---|
| Overview | R is a language and environment for statistical programming, which includes statistical computing and data graphics. | Python is a general-purpose programming language for data analysis, scientific computing and application development. It aims to reduce program complexity through common approaches. |
| Design objective | Designed by statisticians for data analysis, modelling and representation, for both batch computation and interactive websites. Designed to simplify complex mathematics and statistics. | Designed by engineers and computer scientists to develop GUI, web and embedded hardware applications. |
| Key applications | Forecasting, Data Visualization, Machine Learning | Data collection, Computer Vision, Machine Learning |

More differences

[R vs Python]

Domain Dominance

Popularity (TIOBE)

R Notebook

Finding Rational Estimates of Pi

by Robert Batzinger, Emeritus Instructor

\[N = \left\lfloor D \times \pi \right\rceil \qquad\qquad (1)\]

The rational estimate of Pi is calculated as

\[\pi_{est} = \frac{N}{D} \qquad\qquad (2)\]

\[\chi^2 = \frac{(\pi_{est}-\pi)^2}{\pi} \qquad\qquad (3)\]

\[\epsilon = \left|\pi_{est} - \pi\right| \qquad\qquad (4)\]
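
Equations (1) to (4) can be sketched in Python as follows. The slide presents this material as an R notebook, so this is a Python rendering under that caveat, not the author's code; the function name `rational_pi` and the sample denominators are illustrative choices.

```python
import math


def rational_pi(d):
    """Return (N, estimate, chi_square, epsilon) for denominator d."""
    # Eq. (1): N is D * pi rounded to the nearest integer.
    n = math.floor(d * math.pi + 0.5)
    # Eq. (2): the rational estimate N / D.
    estimate = n / d
    # Eq. (3): chi-square style measure of the deviation from pi.
    chi_square = (estimate - math.pi) ** 2 / math.pi
    # Eq. (4): absolute error of the estimate.
    epsilon = abs(estimate - math.pi)
    return n, estimate, chi_square, epsilon


for d in (7, 106, 113):
    n, estimate, chi_square, epsilon = rational_pi(d)
    print(f"{n}/{d} = {estimate:.8f}  chi2 = {chi_square:.3e}  eps = {epsilon:.3e}")
```

Looping over a range of denominators and keeping those whose error improves on every smaller denominator is one way to search for good rational estimates.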

Software Installation

  • Python
    • Download: https://www.python.org/downloads/
    • Update pip: python -m pip install --upgrade pip
    • Install Jupyter: pip install notebook
    • Launch Jupyter Notebook: jupyter notebook
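
Once the steps above are done, a quick check like the following can confirm that the interpreter and Jupyter are visible to Python. This is a sketch: the helper name `check_setup` is hypothetical, and the minimum version `(3, 8)` is an assumed placeholder, not a stated course requirement.

```python
import importlib.util
import sys


def check_setup(min_version=(3, 8), packages=("pip", "notebook")):
    """Report whether Python is recent enough and which packages are importable."""
    report = {"python_ok": sys.version_info >= min_version}
    for name in packages:
        # find_spec returns None when the package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(sys.version.split()[0], check_setup())
```

A `False` next to `notebook` suggests that `pip install notebook` ran against a different Python installation than the one executing this check.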

Course Textbooks:

  • Peter Wentworth, Jeffrey Elkner, Allen B. Downey and Chris Meyers, 2012. [How to Think Like a Computer Scientist: Learning with Python 3](https://openbookproject.net/thinkcs/python/english3e/). Green Tea Press.