IT408 / IT408 SC:
Data Mining

Unit 2: Intro to R and Data Loading

R Batzinger

2026-06-08

Course Textbooks:

Garrett Grolemund,(2014). Hands-On Programming with R. O’Reilly Press https://www.oreilly.com/library/view/hands-on-programming-with/9781449359089/

Hadley Wickham,Mine Cetinkaya-Rundel and Garrett Grolemund,(2026). R for data science: import, tidy, transform, visualize, and model data. 2nd Edition, O’Reilly Press

Max Kuhn and Julia Silge,(2023). Tidy Modeling with R. O’Reilly Press

Peter Bruce and Andrew Bruce,(2017). Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilley Press.

Schedule

  • June
Su Mn Tu We Th Fr Sa
1 2 3 4 5 6
7 [8] 9 10 [11] 12 13
14 [15] 16 17 [18] 19 20
21 [22] 23 24 [25] 26 27
28 [29] 30
  • July
Su Mn Tu We Th Fr Sa
(1)* [[2]] 3 4
5 [6] 7 (8)* [9] 10 11
12 [13] 14 (15)* [16]L 17 18
21 [20] 21 22 [23] [[24]] 25
26 27 28 29 30 31
* IT408 Special Studies; L - Lab test

R Notebook

  • Title
  • Authors
  • Date
  • Abstract
  • Introduction
    • The nature of the problem
    • What work has been done before
    • Key Research Objectives
  • Methodology
  • Results
  • Discussion
  • Conclusion
    • Possible steps for future research
  • Bibliography

RMarkdown Text formatting


*italic* **bold** ~~strikeout~~ `code`

superscript^2^ subscript~2~

[underline]{.underline}

[small caps]{.smallcaps}

italic bold strikeout code

superscript2 subscript2

underline

small caps

Lists

-   Bulleted list item 1

-   Item 2

    -   Item 2a

    -   Item 2b

1.  Numbered list item 1

2.  Item 2.
    The numbers are incremented automatically in the output.
  • Bulleted list item 1

  • Item 2

    • Item 2a

    • Item 2b

  1. Numbered list item 1

  2. Item 2. The numbers are incremented automatically in the output.

Tables


| First Header | Second Header |
|:--------------|---------------:|
| Cell 1 | Cell 3 |
| Cell 2 | Cell 4 |
First Header Second Header
Cell 1 Cell 3
Cell 2 Cell 4

R matrices FUNCTIONS

m = rbind(c(1,2,3), c(4,5,6))
print(m)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
m2 = cbind(c(1,2,3), c(4,5,6))
print(m2)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
print(m*2)
     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
print(m*m)
     [,1] [,2] [,3]
[1,]    1    4    9
[2,]   16   25   36
print(m+m)
     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
print(t(m))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
print(sum(m))
[1] 21
print(mean(m))
[1] 3.5
for(r in 1:3) {
  print(mean(m[,r]))
}
[1] 2.5
[1] 3.5
[1] 4.5
for (r in 1:2) {
  print(mean(m[r,]))
}
[1] 2
[1] 5
print(m %*% t(m))
     [,1] [,2]
[1,]   14   32
[2,]   32   77
fruit = c("apples","banana","cherries")
qty = c(10,3,1)
price = c(99,20,250)
groceries = data.frame(fruit,qty,price)

print(groceries)
     fruit qty price
1   apples  10    99
2   banana   3    20
3 cherries   1   250
print(qty * price)
[1] 990  60 250
groceries[,4] = qty * price

print(groceries)
     fruit qty price  V4
1   apples  10    99 990
2   banana   3    20  60
3 cherries   1   250 250
colnames(groceries) = c("fruit","qty","price","cost")

print(groceries)
     fruit qty price cost
1   apples  10    99  990
2   banana   3    20   60
3 cherries   1   250  250
print(groceries[,c(1,4)])
     fruit cost
1   apples  990
2   banana   60
3 cherries  250
print(groceries[1,4])
[1] 990
print(groceries[groceries[,4]<100,])
   fruit qty price cost
2 banana   3    20   60

Changing data types

  • To Numbers (Numeric/Integer): as.numeric() or as.integer()
b = c("1","2.5","3.14159")
print(b)
[1] "1"       "2.5"     "3.14159"
print(as.integer(b))
[1] 1 2 3
print(as.numeric(b))
[1] 1.00000 2.50000 3.14159
  • To Text (Character):
a = c(1,2,3)
print(a)
[1] 1 2 3
print(as.character(a))
[1] "1" "2" "3"
  • To True/False (Logical): as.logical()
c = c("1","0",TRUE,FALSE,"True","False","T","F")
print(c)
[1] "1"     "0"     "TRUE"  "FALSE" "True"  "False" "T"     "F"    
print(as.logical(c))
[1]    NA    NA  TRUE FALSE  TRUE FALSE  TRUE FALSE
  • To Categories (Factor): as.factor()
e = c("R", "G", "B", "K", "R", "G", "G" )
print(e)
[1] "R" "G" "B" "K" "R" "G" "G"
f = as.factor(e)
print(f)
[1] R G B K R G G
Levels: B G K R
print(as.character(f))
[1] "R" "G" "B" "K" "R" "G" "G"
  • Transforming the values
old_vec <- c(1, 2, 3, 4)
print(old_vec * 10)
[1] 10 20 30 40
print(sqrt(old_vec))
[1] 1.000000 1.414214 1.732051 2.000000
  • Mapping / Recoding Specific Values
library(dplyr)

vec = c("M", "F", "F", "M")
newvec = recode(vec, "M" = "Male", "F" = "Female")
print(newvec)
[1] "Male"   "Female" "Female" "Male"  
vec[vec == "M"] <- "Male"
vec[vec == "F"] <- "Female"
print(vec)
[1] "Male"   "Female" "Female" "Male"  
celsius = c(-40,-20,0,20,40,60,80)
print(celsius)
[1] -40 -20   0  20  40  60  80
fahrenheit = (celsius * 9 / 5) + 32
print(fahrenheit)
[1] -40  -4  32  68 104 140 176
input = c(5, 12, 15, 20)
print(input)
[1]  5 12 15 20
# custom formula/logic to change 
my_formula = function(x) {
  if (x < 15) {
    return(x * 2)
  } else {
    return(x ^ 2)
  }
}

output = sapply(input, my_formula)
print(output)
[1]  10  24 225 400
  • Summing the actual values that meet a condition
print(input)
[1]  5 12 15 20
result <- sum(input[input > 12])
print(result)
[1] 35
  • Counting the number of values that meet a condition
print(input)
[1]  5 12 15 20
result <- sum(input > 12)
print(result)
[1] 2
print(input)
[1]  5 12 15 20
result = sum(input[input > 12 & input <20])
print(result)
[1] 15

Data file types

  • CSV read_csv
  • Excel readxl
  • Json read_json
  • SQL dbConnect; dbGetQuery
  • XML read_xml
  • YAML read_yaml

Comma Separated Values: data.CSV

File contents

Id,Name ,Age,Value
1234,Bob,73,1235
2356,Bill,76,2225
4421,George,65,3152
5319,William,54,3433

Read data file into a R dataframe

df = read.csv("data.CSV")
print(df)
    Id    Name Age Value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

Excel Datafile: data.xlsx

data.xlsx Contents

Loading into dataframe

library(readxl)
df = read_excel("./data.xlsx")
print(df)
# A tibble: 4 × 4
     Id Name      Age Value
  <dbl> <chr>   <dbl> <dbl>
1  1234 Bob        73  1235
2  2356 Bill       76  2225
3  4421 George     65  3152
4  5319 William    54  3433

JavaScriptObject Notation (JSON) file: data.json

File Contents

[
  {
    "id": 1234,
    "name": "Bob",
    "age": 73,
    "value": 1235
  },
  {
    "id": 2356,
    "name": "Bill",
    "age": 76,
    "value": 2225
  },
  {
    "id": 4421,
    "name": "George",
    "age": 65,
    "value": 3152
  },
  {
    "id": 5319,
    "name": "William",
    "age": 54,
    "value": 3433
  }
]

Loading into dataframe

library(jsonlite)
df = fromJSON("data.json")
print(df)
    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

Structured Query Language (SQL): data.sqlite

Dump file contents

An SQL dump file contains everything to reconstruct the SQL database.

CREATE TABLE members (
  id INTEGER PRIMARY KEY,
  name TEXT,
  age INTEGER,
  value INTEGER
);

INSERT INTO members (id, name, age, value)
   VALUES (1234, 'Bob', 73, 1235);
INSERT INTO members (id, name, age, value)
   VALUES (2356, 'Bill', 76, 2225);
INSERT INTO members (id, name, age, value)
   VALUES (4421, 'George', 65, 3152);
INSERT INTO members (id, name, age, value)
   VALUES (5319, 'William', 54, 3433);

Data Extraction from SQLite database:

The SQLite database is encoded in binary format. Access to the database requires an SQL server:

> sqlite3 data.sqlite
sqlite> select * from members;
╭──────┬─────────┬─────┬───────╮
│  id  │  name   │ age │ value │
╞══════╪═════════╪═════╪═══════╡
│ 1234 │ Bob     │  73 │  1235 │
│ 2356 │ Bill    │  76 │  2225 │
│ 4421 │ George  │  65 │  3152 │
│ 5319 │ William │  54 │  3433 │
╰──────┴─────────┴─────┴───────╯
sqlite> .exit;

Loading into dataframe

library(DBI)
library(RSQLite)

dbconnect <- dbConnect(SQLite(),"data.sqlite")

# Read the entire table into a dataframe
df <- dbReadTable(dbconnect, "members")

#  alternatively 

# Select a portion of the table into a dataframe

df <- dbGetQuery(dbconnect, "SELECT * FROM members")

# Disconnection Required when interacting with SQL
dbDisconnect(dbconnect)

print(df)
    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

XML : data.xml

File contents

<records>
  <record>
    <id>1234</id>
    <name>Bob</name>
    <age>73</age>
    <value>1235</value>
  </record>
  <record>
    <id>2356</id>
    <name>Bill</name>
    <age>76</age>
    <value>2225</value>
  </record>
  <record>
    <id>4421</id>
    <name>George</name>
    <age>65</age>
    <value>3152</value>
  </record>
  <record>
    <id>5319</id>
    <name>William</name>
    <age>54</age>
    <value>3433</value>
  </record>
</records>

Loading into dataframe

library(xml2)
library(dplyr)

xml_data = read_xml("data.xml")
cat("--- Step 1: Read file -------------\n")
--- Step 1: Read file -------------
print(xml_data)
{xml_document}
<records>
[1] <record>\n  <id>1234</id>\n  <name>Bob</name>\n  <age>73</age>\n  <value> ...
[2] <record>\n  <id>2356</id>\n  <name>Bill</name>\n  <age>76</age>\n  <value ...
[3] <record>\n  <id>4421</id>\n  <name>George</name>\n  <age>65</age>\n  <val ...
[4] <record>\n  <id>5319</id>\n  <name>William</name>\n  <age>54</age>\n  <va ...
records = xml_find_all(xml_data,"//record")
cat("--- Step 2: Parse into Records -----\n")
--- Step 2: Parse into Records -----
print(records)
{xml_nodeset (4)}
[1] <record>\n  <id>1234</id>\n  <name>Bob</name>\n  <age>73</age>\n  <value> ...
[2] <record>\n  <id>2356</id>\n  <name>Bill</name>\n  <age>76</age>\n  <value ...
[3] <record>\n  <id>4421</id>\n  <name>George</name>\n  <age>65</age>\n  <val ...
[4] <record>\n  <id>5319</id>\n  <name>William</name>\n  <age>54</age>\n  <va ...
df = data.frame(
  id = xml_text(xml_find_all(records, "./id")),
  name = xml_text(xml_find_all(records, "./name")),
  age = as.integer(xml_text(xml_find_all(records, "./age"))),
  value = as.integer(xml_text(xml_find_all(records, "./value")))
)
cat("--- Step 4: Parse into columns -----\n")
--- Step 4: Parse into columns -----
print(df) 
    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

Yet Another Markup Language (YAML): data.yaml

- id: 1234
  name: Bob
  age: 73
  value: 1235
- id: 2356
  name: Bill
  age: 76
  value: 2225
- id: 4421
  name: George
  age: 65
  value: 3152
- id: 5319
  name: William
  age: 54
  value: 3433

Loading into dataframe

library(yaml)

yaml_list = read_yaml("data.yaml")
df = as.data.frame(do.call(rbind,yaml_list))
print(df)
    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

File Contents:

- id: 1234
  name: Bob
  age: 73
  value: 1235
- id: 2356
  name: Bill
  age: 76
  value: 2225
- id: 4421
  name: George
  age: 65
  value: 3152
- id: 5319
  name: William
  age: 54
  value: 3433

Data Sources

  • Weather APIs: api.open-meteo.com
  • Kaggle: https://kaggle.com
  • GitHub: https://github.com
  • Data.gov: https://data.gov.in
  • EU Open Data Portal: https://data.europa.eu/en
  • UCI Machine Learning Repository: https://archive.ics.uci.edu/
  • Hugging Face Dataset: https://huggingface.co/datasets
  • Open Data on AWS: https://registry.opendata.aws/
  • Harvard Data verse: https://data.harvard.edu/dataverse
  • PhysioNet: https://physionet.org/
  • World Bank Open Data: https://data.worldbank.org/
  • Federal Reserve Economic Data: https://fred.stlouisfed.org/
  • GNU Regression, Econometrics and Time-series Library: https://gretl.sourceforge.net/

Commercial Data Networks

  • STAT: https://www.stata.com/
  • Flight tracker: flightaware.com
  • Maritine Traffic: marinetraffic.com, vesselfinder.com

Data Dashboards

  • Microsoft Power BI
  • Tableau
  • SAP
  • IBM Dashboard (SPSS)
  • Splunk
  • Data.world
  • Bit Bucket
  • Google Docs
  • Dropbox

Weather example

  • Source: api.open-meteo.com

  • Current temp: at CNX Airport

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true

  • Past week:

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&past_days=7&forecast_days=0&hourly=temperature_2m&daily=sunrise,sunset

  • historical archive:

https://archive-api.open-meteo.com/v1/archive?latitude=18&longitude=98&start_date=2026-06-15&end_date=2026-06-16&hourly=temperature_2m,wind_speed_10m,wind_direction_10m,rain,precipitation,soil_temperature_0cm,soil_moisture_0_to_1cm,direct_radiation,weather_code,surface_pressure&daily=sunrise,sunset

download.file(“https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true”, “temp.txt”)

https://api.open-meteo.com/v1/archive?latitude=18.7668&longitude=98.9626&start_date=2026-05-28&end_date=2026-06-11&hourly=temperature_2m,visibility,surface_pressure,weather_cod,rain,apparent_temperature,relative_humidity_2m,winddirection_10m,wind_speed_10m,cloud_cover

Downloaded file structure

library(jsonlite)
df = fromJSON("https://archive-api.open-meteo.com/v1/archive?latitude=18.7668&longitude=98.9626&start_date=2026-05-28&end_date=2026-06-11&hourly=temperature_2m,surface_pressure,weather_code,rain,apparent_temperature,relative_humidity_2m,winddirection_10m,wind_speed_10m,cloud_cover&daily=sunrise,sunset")
cat("-- Combined Dataframe ---------\n")
-- Combined Dataframe ---------
str(df)
List of 11
 $ latitude             : num 18.7
 $ longitude            : num 99
 $ generationtime_ms    : num 10.1
 $ utc_offset_seconds   : int 0
 $ timezone             : chr "GMT"
 $ timezone_abbreviation: chr "GMT"
 $ elevation            : num 310
 $ hourly_units         :List of 10
  ..$ time                : chr "iso8601"
  ..$ temperature_2m      : chr "°C"
  ..$ surface_pressure    : chr "hPa"
  ..$ weather_code        : chr "wmo code"
  ..$ rain                : chr "mm"
  ..$ apparent_temperature: chr "°C"
  ..$ relative_humidity_2m: chr "%"
  ..$ winddirection_10m   : chr "°"
  ..$ wind_speed_10m      : chr "km/h"
  ..$ cloud_cover         : chr "%"
 $ hourly               :List of 10
  ..$ time                : chr [1:360] "2026-05-28T00:00" "2026-05-28T01:00" "2026-05-28T02:00" "2026-05-28T03:00" ...
  ..$ temperature_2m      : num [1:360] 25.4 26.6 28 29.1 30.1 30.9 32.6 33.2 31 30.1 ...
  ..$ surface_pressure    : num [1:360] 975 975 976 976 975 ...
  ..$ weather_code        : int [1:360] 2 2 3 2 3 2 1 1 53 53 ...
  ..$ rain                : num [1:360] 0 0 0 0 0 0 0 0 0.9 0.7 ...
  ..$ apparent_temperature: num [1:360] 31 32.4 33.5 35.5 37 38 38.8 39.4 35.8 34.9 ...
  ..$ relative_humidity_2m: int [1:360] 92 86 78 73 67 63 53 53 62 68 ...
  ..$ winddirection_10m   : int [1:360] 222 176 110 124 151 180 188 193 195 218 ...
  ..$ wind_speed_10m      : num [1:360] 2.7 2.5 2.7 1.9 2.3 2.9 5.4 6.3 13.1 7.8 ...
  ..$ cloud_cover         : int [1:360] 63 78 95 78 81 68 47 49 85 70 ...
 $ daily_units          :List of 3
  ..$ time   : chr "iso8601"
  ..$ sunrise: chr "iso8601"
  ..$ sunset : chr "iso8601"
 $ daily                :List of 3
  ..$ time   : chr [1:15] "2026-05-28" "2026-05-29" "2026-05-30" "2026-05-31" ...
  ..$ sunrise: chr [1:15] "2026-05-27T22:46" "2026-05-28T22:46" "2026-05-29T22:46" "2026-05-30T22:46" ...
  ..$ sunset : chr [1:15] "2026-05-28T11:55" "2026-05-29T11:56" "2026-05-30T11:56" "2026-05-31T11:56" ...

Hourly Dataframe structure

hourly = as.data.frame(df$hourly)
print(str(hourly))
'data.frame':   360 obs. of  10 variables:
 $ time                : chr  "2026-05-28T00:00" "2026-05-28T01:00" "2026-05-28T02:00" "2026-05-28T03:00" ...
 $ temperature_2m      : num  25.4 26.6 28 29.1 30.1 30.9 32.6 33.2 31 30.1 ...
 $ surface_pressure    : num  975 975 976 976 975 ...
 $ weather_code        : int  2 2 3 2 3 2 1 1 53 53 ...
 $ rain                : num  0 0 0 0 0 0 0 0 0.9 0.7 ...
 $ apparent_temperature: num  31 32.4 33.5 35.5 37 38 38.8 39.4 35.8 34.9 ...
 $ relative_humidity_2m: int  92 86 78 73 67 63 53 53 62 68 ...
 $ winddirection_10m   : int  222 176 110 124 151 180 188 193 195 218 ...
 $ wind_speed_10m      : num  2.7 2.5 2.7 1.9 2.3 2.9 5.4 6.3 13.1 7.8 ...
 $ cloud_cover         : int  63 78 95 78 81 68 47 49 85 70 ...
NULL

Daily Dataframe

daily = as.data.frame(df$daily)
print(str(daily))
'data.frame':   15 obs. of  3 variables:
 $ time   : chr  "2026-05-28" "2026-05-29" "2026-05-30" "2026-05-31" ...
 $ sunrise: chr  "2026-05-27T22:46" "2026-05-28T22:46" "2026-05-29T22:46" "2026-05-30T22:46" ...
 $ sunset : chr  "2026-05-28T11:55" "2026-05-29T11:56" "2026-05-30T11:56" "2026-05-31T11:56" ...
NULL

Hourly Readings

Daily averages

Weekly averages

wk1 =data.frame(ptdata[1:7,])
wk2 =data.frame(ptdata[8:14,])
par(mfcol=c(1,2))
boxplot(wk1,ylim=c(22.5,30),col=colors,
  main="Temperate by Elevation\nWeek1")
legend(.5,25.5,c('-54cm','-18cm','-6cm',
  '0cm','2m','80m','120m','180m'),
  title="Elevation",
  fill=colors[1:8])
boxplot(wk2,ylim=c(22.5,30),
  main="Temperate by Elevation\nWeek2",
  col=colors)
legend(.5,25.5,c('-54cm','-18cm','-6cm',
'0cm','2m','80m','120m','180m'),
title="Elevation",fill=colors[1:8])

R code

Weather codes

Code Description
0 Clear sky
1, 2, 3 Mainly clear, partly cloudy, and overcast
45, 48 Fog and depositing rime fog
51, 53, 55 Drizzle: Light, moderate, and dense intensity
56, 57 Freezing Drizzle: Light and dense intensity
61, 63, 65 Rain: Slight, moderate and heavy intensity
66, 67 Freezing Rain: Light and heavy intensity
71, 73, 75 Snow fall: Slight, moderate, and heavy intensity
77 Snow grains
80, 81, 82 Rain showers: Slight, moderate, and violent
85, 86 Snow showers slight and heavy
95 * Thunderstorm: Slight or moderate
96, 99 * Thunderstorm with slight and heavy hail