Unit 1: Introduction
2026-06-08
INSTRUCTOR: A.Dr. Robert P. Batzinger
COURSE DESCRIPTION: Fundamental data mining concepts and techniques, followed by more advanced concepts and algorithms. (R-based)
VENUE: PC301 Mon / Thu 9-12
MIDTERM: 2 Jul 2026 TIME 10:00 - 12:00
FINAL: 24 Jul 2026 TIME 13:00 - 16:00 (C)
There will be a 1-hr lab test in which students will be expected to create an accurate assessment of a dataset and its potential.
Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Edition, O’Reilly Press
Garrett Grolemund, 2014. Hands-On Programming with R. O’Reilly Press (https://rstudio-education.github.io/hopr/)
| Characteristic | IT 408 SC | IT 408 |
|---|---|---|
| Midterm & Finals | Same | Same |
| 3 Wed Sessions on Sentiment analysis | (Optional) | Required |
| General Lab test | Required | Required |
| Sentiment analysis lab test | (Extra credit) | Required |
| Course hours | 30 | 45 |
June 2026

| Su | Mn | Tu | We | Th | Fr | Sa |
|---|---|---|---|---|---|---|
| | 1 | 2 | 3 | 4 | 5 | 6 |
| 7 | [8] | 9 | 10 | [11] | 12 | 13 |
| 14 | [15] | 16 | 17 | [18] | 19 | 20 |
| 21 | [22] | 23 | 24 | [25] | 26 | 27 |
| 28 | [29] | 30 | | | | |
July 2026

| Su | Mn | Tu | We | Th | Fr | Sa |
|---|---|---|---|---|---|---|
| | | | (1)* | [[2]] | 3 | 4 |
| 5 | [6] | 7 | (8)* | [9] | 10 | 11 |
| 12 | [13] | 14 | (15)* | [16]L | 17 | 18 |
| 19 | [20] | 21 | 22 | [23] | [[24]] | 25 |
| 26 | 27 | 28 | 29 | 30 | 31 | |

Legend: \* IT408 Special Studies; L = Lab test
\[\begin{array}{cc}
\text{Score } s & \text{Grade} \\ \hline
80 \le s \le 100 & A \\
75 \le s < 80 & B^+ \\
70 \le s < 75 & B \\
65 \le s < 70 & C^+ \\
60 \le s < 65 & C \\
55 \le s < 60 & D^+ \\
50 \le s < 55 & D \\
0 \le s < 50 & F
\end{array}\]
\[\begin{array}{cc}
\text{Days late} & \text{Grade reduced} \\ \hline
0 & 0\% \\
1 & 25\% \\
2 & 50\% \\
3 & 75\% \\
> 3 & 100\%
\end{array}\]
Automatic point deductions may apply for:
Note: Professional understanding and integrity are assumed to accompany authorship and will be constantly tested throughout this course.
Is this real?
https://www.facebook.com/share/r/16ZzjQbE7D/
https://www.facebook.com/reel/1097601701609628
https://www.economist.com/interactive/trump-approval-tracker?utm_campaign=a.the-economist-today&utm_medium=email.internal-newsletter.np&utm_source=salesforce-marketing-cloud&utm_term=6/3/2026&utm_id=2191707
Data: Factual information (such as measurements, statistics, or observations) used as a basis for reasoning, discussion, or calculation.
Fact: A piece of information presented as having objective reality; an event or item of information that can be verified through objective evidence, empirical observation, or rigorous proof.
Opinion: A view, judgment, or appraisal formed in the mind about a particular matter; a belief stronger than impression but less strong than positive knowledge, which cannot be conclusively proven or verified by objective evidence.
Belief: A perception that something seems true, genuine, or real, often on the basis of emotional conviction, authority, or faith, rather than on absolute empirical proof or demonstration.
Hypothesis: A tentative, testable assumption or proposition advanced to explain a phenomenon or relationship. It serves as a starting point for further investigation, experimentation, or data collection, and has not yet been thoroughly proven.
Theory: A well-substantiated, structurally sound explanation of some aspect of the natural or digital world. It incorporates facts, laws, inferences, and tested hypotheses, and is widely accepted within a field because it reliably predicts outcomes and survives repeated testing.
Truth: The property of being in accord with fact or reality. In logic and philosophy, a statement is considered a truth if it accurately describes the actual state of affairs or conforms to an established, verifiable reality.
Data Mining: The process of discovering patterns, insights, and anomalies from large datasets. The information extracted can be used for dataset development, decision-making, prediction, and/or understanding.
Data Science: An interdisciplinary field that combines statistical methods, data manipulation and analysis, and domain expertise to extract knowledge and insights from data. Data scientists are involved in the entire data lifecycle, from collection and cleaning to analysis, modeling, and communication of results.
Data Analytics: A multidisciplinary field focused on analyzing raw data to extract meaningful insights, identify patterns, and draw actionable conclusions. It combines elements of mathematics, statistics, computer programming, and business intelligence to transform disorganized, complex data streams into structured information that guides strategic decision-making.
Artificial Intelligence (AI): A broad field of computer science that focuses on creating intelligent agents that assess their situation and take actions to advance toward defined goals. The goal of AI is to enable machines to simulate human intelligence.
Machine Learning (ML): A subfield of Artificial Intelligence that focuses on developing algorithms that allow computers to “learn” from data without being explicitly programmed. Instead of following rigid rules, ML models learn patterns and make predictions or decisions based on the data they are trained on. ML approaches include supervised, unsupervised, and reinforcement learning.
Deep Learning (DL): A subfield of Machine Learning that uses artificial neural networks with multiple layers. Deep learning excels at learning complex patterns from large datasets and is used in areas like image recognition, speech recognition, and natural language processing.
Big Data: Refers to extremely large datasets that may be analyzed computationally to reveal subtle patterns, trends, and associations, especially relating to human behavior and interactions. These datasets often exceed the memory capacity and processing speed of a single machine.
Business Intelligence (BI): A set of strategies and technologies used for the data analysis of business information to provide actionable insights that help organizations make better business decisions. BI reports and dashboards tend to focus on descriptive analytics (what happened) rather than predictive or prescriptive analytics.
Natural Language Processing (NLP): A subfield of AI that focuses on enabling computers to understand, interpret, and generate human language. NLP techniques are used in applications like spam detection, sentiment analysis, machine translation, and chatbots.
Computer Vision: A subfield of AI that enables computers to “see” and interpret visual information such as images and videos. Applications include facial recognition, object detection, autonomous vehicles, and medical image analysis.
Predictive Analytics: A type of data analytics that uses historical data, statistical algorithms, and machine learning techniques to estimate the likelihood of future outcomes. Its goal is to predict what will happen.
Prescriptive Analytics: The next step beyond predictive analytics, focusing on determining the best course of action for a given situation. It not only predicts what will happen but also suggests why it will happen and recommends actions to take to achieve a desired outcome or mitigate a risk.
Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. The goal is to maximize the cumulative reward over time. It’s often used in robotics, game playing (like AlphaGo), and autonomous systems.
Data Engineering: The aspect of data science that focuses on the practical applications of data collection, storage, processing, and transformation. Data engineers build and maintain data pipelines, databases, and data warehouses to allow convenient access and efficient use.
Ethics in AI/Data Science: The study of the moral implications, societal impact, and responsible development and deployment of AI and data-driven systems, in an attempt to address concerns like bias, fairness, privacy, transparency, and accountability.
These concepts are fundamental to supervised learning, where the model learns from labeled historical data to predict outcomes for new, unseen data.
Training Set: A partition of the dataset used to build and train the data mining model. The algorithm searches for patterns, rules, or mathematical functions within this data.
Testing Set: A separate partition of the dataset, completely independent of the training set, used exclusively to evaluate the final model’s performance and accuracy.
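The partitioning described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not a prescribed method for the course: the helper `train_test_split`, the 25% test fraction, the seed, and the `records` data are all assumptions made up for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled records: (hours_studied, passed)
records = [(h, h >= 5) for h in range(1, 21)]
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that only contains the tail of an ordered dataset, and keeping the two lists disjoint is what makes the test set an independent check.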
Decision Tree: A flowchart-like tree structure used for classification or regression.
Each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or final decision.
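The node, branch, and leaf roles described above can be made concrete with a small hand-built tree. This is only a sketch in Python: the attributes (`outlook`, `humidity`, `windy`), the class labels, and the `classify` helper are hypothetical, and the tree is written by hand rather than learned from data.

```python
# Internal nodes are dicts that test one attribute; branches map each test
# outcome to the next node; leaves are plain class labels (strings).
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "stay in", False: "play"},
        },
    },
}

def classify(node, record):
    """Follow branches from the root until a leaf (a class label) is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]  # result of the test at this node
        node = node["branches"][outcome]     # move down the matching branch
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": False}))
```

A learning algorithm would choose the attribute tested at each node from the training set; here the structure is fixed so that only the traversal is shown.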
Confusion Matrix: A tabular layout used to visualize and evaluate the performance of a supervised learning algorithm.
It maps actual class labels against predicted class labels, breaking down results into:
| | Actual True | Actual False |
|---|---|---|
| Predicted True | True Positives (TP) | False Positives (FP) (Type I Error) |
| Predicted False | False Negatives (FN) (Type II Error) | True Negatives (TN) |
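A minimal sketch, in Python, of how the four cells can be counted from paired actual and predicted labels. The label lists are made-up illustration data and `confusion_counts` is a hypothetical helper. Accuracy, precision, and recall are standard derived metrics that the text above does not define; they are included only to show what the counts are typically used for.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for boolean class labels."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1
        elif p and not a:
            counts["FP"] += 1   # Type I error: predicted True, actually False
        elif not p and a:
            counts["FN"] += 1   # Type II error: predicted False, actually True
        else:
            counts["TN"] += 1
    return counts

# Hypothetical labels for eight test-set records
actual    = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / len(actual)   # share of all predictions that were right
precision = c["TP"] / (c["TP"] + c["FP"])      # share of predicted positives that were right
recall = c["TP"] / (c["TP"] + c["FN"])         # share of actual positives that were found
print(c, accuracy, precision, recall)
```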
Association Rule: An implication expression of the form \(X \rightarrow Y\), where \(X\) and \(Y\) are disjoint itemsets. It signifies that if the items in set \(X\) are present in a transaction, the items in set \(Y\) are also likely to be present.
Support: A metric that measures how frequently an itemset appears in the entire database. Mathematically, for a rule \(X \rightarrow Y\):
\[\text{Support}(X \rightarrow Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}\]
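Confidence: A metric that measures how often the items in \(Y\) appear in transactions that contain \(X\). Here \(\text{Support}(X \cup Y)\) is the support of the rule \(X \rightarrow Y\), that is, the fraction of transactions containing both \(X\) and \(Y\):
\[\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}\]
Lift: A metric that compares the confidence of the rule with the overall frequency of \(Y\). A lift above 1 indicates that \(X\) and \(Y\) occur together more often than would be expected if they were independent:
\[\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}\]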
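The three formulas above can be checked on a toy example. The sketch below is in Python and uses a made-up list of five market-basket transactions; the function names simply mirror the formulas and are not taken from any library. Exact fractions are used so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical transactions, each a set of purchased items
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y):
    """Support of X and Y together, relative to the support of X alone."""
    return support(x | y) / support(x)

def lift(x, y):
    """Confidence of X -> Y relative to the overall frequency of Y."""
    return confidence(x, y) / support(y)

x, y = {"diapers"}, {"beer"}
print(support(x | y), confidence(x, y), lift(x, y))
```

For the rule diapers → beer, three of five transactions contain both items, four contain diapers, and three contain beer, so the support is 3/5, the confidence is 3/4, and the lift is 5/4.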
Package manager to facilitate loading and updating software libraries
Extensive collection of modules and packages for a wide range of functions (maps, data manipulation, etc.)
Active support and continued development from academic and corporate user communities
Integrated Development Environment and Data Workbook
| Feature | R | Python |
|---|---|---|
| Overview | R is a language and environment for statistical computing and data graphics. | Python is a general-purpose programming language for data analysis, scientific computing, and application development. It aims to reduce program complexity through common approaches. |
| Design Objective | Designed by statisticians for data analysis, modelling, and representation, for both batch computation and interactive websites. Aims to simplify complex mathematics and statistics. | Designed by engineers and computer scientists to develop GUI, web, and embedded hardware applications. |
| Key applications | Forecasting, Data Visualization, Machine Learning | Data collection, Computer Vision, Machine Learning |
https://www.tiobe.com/tiobe-index/
Source: api.open-meteo.com
https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true
https://api.open-meteo.com/v1/forecast?latitude=18.7706&longitude=98.9626&hourly=temperature_2m&past_days=7&forecast_days=0&timezone=Asia%2FBangkok
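The hourly request above is an endpoint followed by URL-encoded query parameters, which can be assembled programmatically rather than typed by hand. The sketch below is in Python with the standard library only. The helper names `build_url` and `fetch_json` are hypothetical, the parameter values are copied from the URL above, and `fetch_json` is defined but not called here because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.open-meteo.com/v1/forecast"

def build_url(**params):
    """Append URL-encoded query parameters to the forecast endpoint."""
    return f"{BASE_URL}?{urlencode(params)}"

def fetch_json(url, timeout=10):
    """Download and decode a JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

# Same parameters as the hourly-temperature request shown above
hourly_url = build_url(
    latitude=18.7706,
    longitude=98.9626,
    hourly="temperature_2m",
    past_days=7,
    forecast_days=0,
    timezone="Asia/Bangkok",
)
print(hourly_url)
```

Note that `urlencode` escapes the slash in `Asia/Bangkok` as `%2F`, which is why the timezone looks different in the URL than in ordinary text.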
IT408