IT408 / IT408 SC:
Data Mining

Unit 2: Fundamentals of R

R Batzinger

2026-06-08

R vs Python

Similarities

  • Package manager to facilitate loading and updating software libraries

  • Extensive collection of modules and packages for a wide range of functions (maps, data manipulation, etc.)

  • Active support and continued development from academic and corporate user communities

  • Integrated Development Environment and Data Workbook

Differences

| Feature | R | Python |
|---|---|---|
| Overview | R is a language and environment for statistical computing and data graphics. | Python is a general-purpose programming language for data analysis, scientific computing and application development. It aims to reduce program complexity through common approaches. |
| Design objective | Designed by statisticians for data analysis, modelling and representation, for both batch computation and interactive websites. Designed to simplify complex mathematics and statistics. | Designed by engineers and computer scientists to develop GUI, web and embedded hardware applications. |
| Key applications | Forecasting, data visualization, machine learning | Data collection, computer vision, machine learning |

More differences

R vs Python

Domain Dominance

Popularity (TIOBE)

https://www.tiobe.com/tiobe-index/

Software Installation

  • R 4.6.0
    • Download: https://cran.r-project.org/bin/
  • RStudio
    • Download: https://posit.co/download/rstudio-desktop

R Data Types

  • Numeric (double): the default for numbers. A literal such as `42` is stored as a double unless it carries the `L` suffix. Examples: `3.14`, `42`
  • Integer: whole numbers, created by adding an `L` suffix. Examples: `42L`, `-7L`
  • Character: text strings, always enclosed in single or double quotes. Examples: `"Hello, R!"`, `'A'`
  • Logical: Boolean values, written in full capitals. Examples: `TRUE`, `FALSE` (`T` and `F` also work, but they are ordinary variables that can be overwritten)
  • Complex: numbers with real and imaginary parts. Example: `1 + 4i`
  • Raw: individual bytes, mainly used for binary data (images and signals)
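
The types listed above can be checked interactively. The following is a minimal sketch using base R only; the variable names `values` and `types` are illustrative and not part of the course material.

```r
# One example value per atomic type; typeof() reports the storage type.
values <- list(
  numeric   = 3.14,
  integer   = 42L,
  character = "Hello, R!",
  logical   = TRUE,
  complex   = 1 + 4i,
  raw       = as.raw(255)
)
types <- vapply(values, typeof, character(1))
print(types)

# A bare 42 is a double; only the L suffix makes it an integer.
plain_is_int    <- is.integer(42)
suffixed_is_int <- is.integer(42L)
```

Note that `typeof()` reports `"double"` for numeric values, which is why the slide lists the two names together.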

R Data Structures

  • Scalar: a vector of length 1. `a = 15`
  • Vector: a one-dimensional, homogeneous collection of values. `a = c(1, 2, 3, 4)`
  • Matrix: a two-dimensional, homogeneous array. `m = matrix(1:6, nrow = 2, ncol = 3)`
  • Array: a homogeneous structure with any number of dimensions, for example a 3D stack of matrices. `a = array(1:8, dim = c(2, 2, 2))`
  • List: one-dimensional but heterogeneous. `l = list(name = "Alice", scores = c(95, 88, 100), has_passed = TRUE)`
  • Data frame: a heterogeneous data set in which each column has its own type. `df = data.frame(id = c(1, 2, 3), name = c("Bob", "Carl", "Del"), active = c(TRUE, FALSE, TRUE))`
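
A short sketch of how these structures behave when inspected and indexed, reusing the example values from the list above. The variable names `v` and `arr` are chosen here only to avoid reusing `a` for three different objects.

```r
a   <- 15
v   <- c(1, 2, 3, 4)
m   <- matrix(1:6, nrow = 2, ncol = 3)
arr <- array(1:8, dim = c(2, 2, 2))
l   <- list(name = "Alice", scores = c(95, 88, 100), has_passed = TRUE)
df  <- data.frame(id = c(1, 2, 3),
                  name = c("Bob", "Carl", "Del"),
                  active = c(TRUE, FALSE, TRUE))

length(a)            # a scalar is a vector of length 1
dim(m)               # rows and columns of the matrix
dim(arr)             # three dimensions
m[2, 3]              # matrices are filled column by column
l$scores[2]          # list elements are accessed by name, then indexed
df$name[df$active]   # logical column used as a row filter
sapply(df, class)    # each data frame column keeps its own type
```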

R example

[1] 1.869028
[1] 1.527639

plot(x, y2, type = "l")
hist(y2, breaks = 25)


## Student Heights

Given 6 randomly selected individuals from a normally distributed population, about 4 of them (roughly 68%) are expected to lie within $\pm1$ standard deviation of the mean.


::: {.cell}
::: {.cell-output .cell-output-stdout}

[1] 1.869028



:::

::: {.cell-output .cell-output-stdout}

[1] 1.527639



:::
:::
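
As a rough check on how many of 6 individuals would be expected within $\pm1$ standard deviation, the sketch below assumes heights are normally distributed. The mean of 1.70 m and standard deviation of 0.17 m are hypothetical values chosen for illustration; they are not taken from the course data.

```r
# Probability that a normally distributed value lies within 1 SD of the mean.
p_within <- pnorm(1) - pnorm(-1)

# Expected number of individuals within 1 SD in a sample of 6.
expected_of_6 <- 6 * p_within

# Hypothetical sample of 6 heights in metres (assumed mean and SD).
set.seed(408)
heights <- rnorm(6, mean = 1.70, sd = 0.17)
lower <- mean(heights) - sd(heights)
upper <- mean(heights) + sd(heights)

# Count how many of the 6 fall inside the sample's own 1 SD band.
n_inside <- sum(heights >= lower & heights <= upper)
print(c(lower = lower, upper = upper, inside = n_inside))
```

With only 6 observations the count varies from sample to sample, so a single draw is an illustration rather than a confirmation.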


## R Notebook {.scrollable}

- Title
- Authors
- Date
- Abstract
- Introduction
  - The nature of the problem
  - What work has been done before
  - Key Research Objectives
- Methodology
- Results
- Discussion
- Conclusion
  - Possible steps for future research
- Bibliography

## Data file types

- Excel: `readxl::read_excel`
- SQL: `DBI::dbConnect`, `DBI::dbGetQuery`
- CSV: `readr::read_csv`
- XML: `xml2::read_xml`
- YAML: `yaml::read_yaml`
- JSON: `jsonlite::read_json`
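
A minimal sketch of reading a CSV file. To stay self-contained it writes a small temporary file first and uses base R's `read.csv` rather than `read_csv`, which needs the readr package; the file contents are made up for illustration.

```r
# Write a tiny CSV to a temporary file so the example needs no external data.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,active",
             "1,Bob,TRUE",
             "2,Carl,FALSE",
             "3,Del,TRUE"), path)

# Base R reader; readr::read_csv(path) plays the same role in the tidyverse.
students <- read.csv(path, stringsAsFactors = FALSE)
str(students)

# Remove the temporary file again.
unlink(path)
```

The other readers in the list follow the same pattern: pass a path or connection and get back an R object to inspect with `str()`.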

## Data Sources {.sources}

- Weather APIs: api.open-meteo.com
- Kaggle: https://kaggle.com
- GitHub: https://github.com
- Data.gov.in (Open Government Data Platform India): https://data.gov.in
- EU Open Data Portal: https://data.europa.eu/en
- UCI Machine Learning Repository: https://archive.ics.uci.edu/
- Hugging Face Datasets: https://huggingface.co/datasets
- Open Data on AWS: https://registry.opendata.aws/
- Harvard Dataverse: https://data.harvard.edu/dataverse
- PhysioNet: https://physionet.org/
- World Bank Open Data: https://data.worldbank.org/
- Federal Reserve Economic Data: https://fred.stlouisfed.org/
- GNU Regression, Econometrics and Time-series Library: https://gretl.sourceforge.net/

**Commercial Data Networks**

- Stata: https://www.stata.com/
- Microsoft Power BI
- Tableau
- SAP
- IBM Dashboard (SPSS)
- Splunk
- data.world
- Bitbucket
- Google Docs
- Dropbox
- Flight tracker
- Marine Traffic

## Weather

- **Source:** api.open-meteo.com

- Current temperature at CNX Airport:

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true

- Past week (hourly):

https://api.open-meteo.com/v1/forecast?latitude=18.7706&longitude=98.9626&hourly=temperature_2m,windspeed,winddirection,weathercode,rain,surface_pressure&past_days=7&forecast_days=0&timezone=Asia%2FBangkok

download.file("https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true", "temp.txt")
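
The request URL can also be assembled from its parts, which makes it easier to change the location. This is a sketch only: `open_meteo_url` is a hypothetical helper written for this slide, not a function from any package, and the download step is left as a comment because it needs network access.

```r
# Hypothetical helper: build the current-weather URL for a latitude/longitude.
open_meteo_url <- function(lat, lon) {
  sprintf(
    "https://api.open-meteo.com/v1/forecast?latitude=%.4f&longitude=%.4f&current_weather=true",
    lat, lon
  )
}

# Coordinates of CNX Airport as used in the URL above.
cnx_url <- open_meteo_url(18.7668, 98.9626)
print(cnx_url)

# To fetch the response (requires an internet connection):
# download.file(cnx_url, "temp.txt")
```

The saved file contains JSON, so it can be read afterwards with a JSON reader such as `jsonlite::read_json`.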

R code

Weather codes

| Code | Description |
|---|---|
| 0 | Clear sky |
| 1, 2, 3 | Mainly clear, partly cloudy, and overcast |
| 45, 48 | Fog and depositing rime fog |
| 51, 53, 55 | Drizzle: Light, moderate, and dense intensity |
| 56, 57 | Freezing drizzle: Light and dense intensity |
| 61, 63, 65 | Rain: Slight, moderate, and heavy intensity |
| 66, 67 | Freezing rain: Light and heavy intensity |
| 71, 73, 75 | Snow fall: Slight, moderate, and heavy intensity |
| 77 | Snow grains |
| 80, 81, 82 | Rain showers: Slight, moderate, and violent |
| 85, 86 | Snow showers: Slight and heavy |
| 95 * | Thunderstorm: Slight or moderate |
| 96, 99 * | Thunderstorm with slight and heavy hail |
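
When working with the hourly data, the numeric codes are easier to read once mapped to text. The sketch below uses a named character vector as a lookup table; it covers only a subset of the codes in the table, and `weather_codes` and `describe_weather` are illustrative names.

```r
# Subset of the code table as a named lookup vector (names are the codes).
weather_codes <- c(
  "0"  = "Clear sky",
  "1"  = "Mainly clear",
  "2"  = "Partly cloudy",
  "3"  = "Overcast",
  "61" = "Rain: slight",
  "63" = "Rain: moderate",
  "65" = "Rain: heavy",
  "95" = "Thunderstorm: slight or moderate"
)

# Translate one or more numeric codes; codes missing from the table are flagged.
describe_weather <- function(code) {
  out <- unname(weather_codes[as.character(code)])
  ifelse(is.na(out), "Unknown code", out)
}

descriptions <- describe_weather(c(0, 63, 42))
print(descriptions)
```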

Market Basket Transaction Associations

  • Association Rule: An implication expression of the form \(X \rightarrow Y\), where \(X\) and \(Y\) are disjoint itemsets. It signifies that if the items in set \(X\) are present in a transaction, the items in set \(Y\) are also likely to be present.

  • Support: A metric that measures how frequently an itemset appears in the entire database. Mathematically, for a rule \(X \rightarrow Y\):

\[\text{Support}(X \rightarrow Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}\]

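  • Confidence: A metric that measures how often items in \(Y\) appear in transactions that already contain \(X\). It assesses the reliability of the inference made by the rule. Here \(\text{Support}(X \cup Y)\) is the same quantity as \(\text{Support}(X \rightarrow Y)\) above, and \(\text{Support}(X)\) is the fraction of transactions containing \(X\):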

\[\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}\]

  • Lift: A metric used to measure the strength of an association rule by comparing the co-occurrence of \(X\) and \(Y\) against what would be expected if they were completely independent.

\[\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}\]

  • A lift value \(> 1\) indicates that \(X\) and \(Y\) are positively associated: the presence of \(X\) raises the likelihood of \(Y\). A lift of \(1\) is what independence would give, and a value \(< 1\) indicates a negative association.
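
The three metrics can be computed by hand on a small example. The sketch below uses five made-up transactions and the rule {diapers} → {beer}; the data and the helper `support()` are assumptions for illustration, written in base R rather than with a dedicated association-rule package.

```r
# Five hypothetical market-basket transactions.
transactions <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# Support: fraction of transactions containing every item in `items`.
support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

X <- "diapers"
Y <- "beer"

supp_xy    <- support(c(X, Y))           # 3 of 5 transactions
confidence <- supp_xy / support(X)       # 0.6 / 0.8
lift       <- confidence / support(Y)    # 0.75 / 0.6

print(c(support = supp_xy, confidence = confidence, lift = lift))
```

In this toy data the lift is 1.25, so baskets with diapers contain beer more often than the overall beer rate would suggest.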