IT408 / IT408 SC:
Data Mining

Unit 2: Intro to R and Data Loading

R Batzinger

2026-06-08

Course Textbooks:

Garrett Grolemund,(2014). Hands-On Programming with R. O’Reilly Press https://www.oreilly.com/library/view/hands-on-programming-with/9781449359089/

Hadley Wickham,Mine Cetinkaya-Rundel and Garrett Grolemund,(2026). R for data science: import, tidy, transform, visualize, and model data. 2nd Edition, O’Reilly Press

Max Kuhn and Julia Silge,(2023). Tidy Modeling with R. O’Reilly Press

Peter Bruce and Andrew Bruce,(2017). Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilley Press.

Schedule

June

Su	Mn	Tu	We	Th	Fr	Sa
	1	2	3	4	5	6
7	[8]	9	10	[11]	12	13
14	[15]	16	17	[18]	19	20
21	[22]	23	24	[25]	26	27
28	[29]	30

July

Su	Mn	Tu	We	Th	Fr	Sa
			(1)*	[[2]]	3	4
5	[6]	7	(8)*	[9]	10	11
12	[13]	14	(15)*	[16]L	17	18
21	[20]	21	22	[23]	[[24]]	25
26	27	28	29	30	31

* IT408 Special Studies; L - Lab test

R Notebook

Title
Authors
Date
Abstract
Introduction
- The nature of the problem
- What work has been done before
- Key Research Objectives
Methodology
Results
Discussion
Conclusion
- Possible steps for future research
Bibliography

RMarkdown Text formatting


*italic* **bold** ~~strikeout~~ `code`

superscript^2^ subscript~2~

[underline]{.underline}

[small caps]{.smallcaps}

italic bold ~~strikeout~~ code

superscript² subscript₂

underline

small caps

Lists

-   Bulleted list item 1

-   Item 2

    -   Item 2a

    -   Item 2b

1.  Numbered list item 1

2.  Item 2.
    The numbers are incremented automatically in the output.

Bulleted list item 1
Item 2
- Item 2a
- Item 2b

Numbered list item 1
Item 2. The numbers are incremented automatically in the output.

Links and images


<http://example.com>

[linked phrase](http://example.com)

![optional caption text](quarto.png){fig-alt="Quarto logo and the word quarto spelled in small case letters"}

http://example.com

linked phrase

Quarto logo and the word quarto spelled in small case letters

optional caption text

Tables


| First Header | Second Header |
|:--------------|---------------:|
| Cell 1 | Cell 3 |
| Cell 2 | Cell 4 |

First Header	Second Header
Cell 1	Cell 3
Cell 2	Cell 4

R matrices FUNCTIONS

m = rbind(c(1,2,3), c(4,5,6))
print(m)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

m2 = cbind(c(1,2,3), c(4,5,6))
print(m2)

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

print(m*2)

     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12

print(m*m)

     [,1] [,2] [,3]
[1,]    1    4    9
[2,]   16   25   36

print(m+m)

     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12

print(t(m))

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

print(sum(m))

[1] 21

print(mean(m))

[1] 3.5

for(r in 1:3) {
  print(mean(m[,r]))
}

[1] 2.5
[1] 3.5
[1] 4.5

for (r in 1:2) {
  print(mean(m[r,]))
}

[1] 2
[1] 5

print(m %*% t(m))

     [,1] [,2]
[1,]   14   32
[2,]   32   77

fruit = c("apples","banana","cherries")
qty = c(10,3,1)
price = c(99,20,250)
groceries = data.frame(fruit,qty,price)

print(groceries)

     fruit qty price
1   apples  10    99
2   banana   3    20
3 cherries   1   250

print(qty * price)

[1] 990  60 250

groceries[,4] = qty * price

print(groceries)

     fruit qty price  V4
1   apples  10    99 990
2   banana   3    20  60
3 cherries   1   250 250

colnames(groceries) = c("fruit","qty","price","cost")

print(groceries)

     fruit qty price cost
1   apples  10    99  990
2   banana   3    20   60
3 cherries   1   250  250

print(groceries[,c(1,4)])

     fruit cost
1   apples  990
2   banana   60
3 cherries  250

print(groceries[1,4])

[1] 990

print(groceries[groceries[,4]<100,])

   fruit qty price cost
2 banana   3    20   60

Changing data types

To Numbers (Numeric/Integer): as.numeric() or as.integer()

b = c("1","2.5","3.14159")
print(b)

[1] "1"       "2.5"     "3.14159"

print(as.integer(b))

[1] 1 2 3

print(as.numeric(b))

[1] 1.00000 2.50000 3.14159

To Text (Character):

a = c(1,2,3)
print(a)

[1] 1 2 3

print(as.character(a))

[1] "1" "2" "3"

To True/False (Logical): as.logical()

c = c("1","0",TRUE,FALSE,"True","False","T","F")
print(c)

[1] "1"     "0"     "TRUE"  "FALSE" "True"  "False" "T"     "F"

print(as.logical(c))

[1]    NA    NA  TRUE FALSE  TRUE FALSE  TRUE FALSE

To Categories (Factor): as.factor()

e = c("R", "G", "B", "K", "R", "G", "G" )
print(e)

[1] "R" "G" "B" "K" "R" "G" "G"

f = as.factor(e)
print(f)

[1] R G B K R G G
Levels: B G K R

print(as.character(f))

[1] "R" "G" "B" "K" "R" "G" "G"

Transforming the values

old_vec <- c(1, 2, 3, 4)
print(old_vec * 10)

[1] 10 20 30 40

print(sqrt(old_vec))

[1] 1.000000 1.414214 1.732051 2.000000

Mapping / Recoding Specific Values

library(dplyr)

vec = c("M", "F", "F", "M")
newvec = recode(vec, "M" = "Male", "F" = "Female")
print(newvec)

[1] "Male"   "Female" "Female" "Male"

vec[vec == "M"] <- "Male"
vec[vec == "F"] <- "Female"
print(vec)

[1] "Male"   "Female" "Female" "Male"

celsius = c(-40,-20,0,20,40,60,80)
print(celsius)

[1] -40 -20   0  20  40  60  80

fahrenheit = (celsius * 9 / 5) + 32
print(fahrenheit)

[1] -40  -4  32  68 104 140 176

input = c(5, 12, 15, 20)
print(input)

[1]  5 12 15 20

# custom formula/logic to change 
my_formula = function(x) {
  if (x < 15) {
    return(x * 2)
  } else {
    return(x ^ 2)
  }
}

output = sapply(input, my_formula)
print(output)

[1]  10  24 225 400

Summing the actual values that meet a condition

print(input)

[1]  5 12 15 20

result <- sum(input[input > 12])
print(result)

[1] 35

Counting the number of values that meet a condition

print(input)

[1]  5 12 15 20

result <- sum(input > 12)
print(result)

[1] 2

print(input)

[1]  5 12 15 20

result = sum(input[input > 12 & input <20])
print(result)

[1] 15

Data file types

CSV read_csv
Excel readxl
Json read_json
SQL dbConnect; dbGetQuery
XML read_xml
YAML read_yaml

Comma Separated Values: data.CSV

File contents

Id,Name ,Age,Value
1234,Bob,73,1235
2356,Bill,76,2225
4421,George,65,3152
5319,William,54,3433

Read data file into a R dataframe

df = read.csv("data.CSV")
print(df)

    Id    Name Age Value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

Excel Datafile: data.xlsx

data.xlsx Contents

Loading into dataframe

library(readxl)
df = read_excel("./data.xlsx")
print(df)

# A tibble: 4 × 4
     Id Name      Age Value
  <dbl> <chr>   <dbl> <dbl>
1  1234 Bob        73  1235
2  2356 Bill       76  2225
3  4421 George     65  3152
4  5319 William    54  3433

JavaScriptObject Notation (JSON) file: data.json

File Contents

[
  {
    "id": 1234,
    "name": "Bob",
    "age": 73,
    "value": 1235
  },
  {
    "id": 2356,
    "name": "Bill",
    "age": 76,
    "value": 2225
  },
  {
    "id": 4421,
    "name": "George",
    "age": 65,
    "value": 3152
  },
  {
    "id": 5319,
    "name": "William",
    "age": 54,
    "value": 3433
  }
]

Loading into dataframe

library(jsonlite)
df = fromJSON("data.json")
print(df)

    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

Structured Query Language (SQL): data.sqlite

Dump file contents

An SQL dump file contains everything to reconstruct the SQL database.

CREATE TABLE members (
  id INTEGER PRIMARY KEY,
  name TEXT,
  age INTEGER,
  value INTEGER
);

INSERT INTO members (id, name, age, value)
   VALUES (1234, 'Bob', 73, 1235);
INSERT INTO members (id, name, age, value)
   VALUES (2356, 'Bill', 76, 2225);
INSERT INTO members (id, name, age, value)
   VALUES (4421, 'George', 65, 3152);
INSERT INTO members (id, name, age, value)
   VALUES (5319, 'William', 54, 3433);

Data Extraction from SQLite database:

The SQLite database is encoded in binary format. Access to the database requires an SQL server:

> sqlite3 data.sqlite
sqlite> select * from members;
╭──────┬─────────┬─────┬───────╮
│  id  │  name   │ age │ value │
╞══════╪═════════╪═════╪═══════╡
│ 1234 │ Bob     │  73 │  1235 │
│ 2356 │ Bill    │  76 │  2225 │
│ 4421 │ George  │  65 │  3152 │
│ 5319 │ William │  54 │  3433 │
╰──────┴─────────┴─────┴───────╯
sqlite> .exit;

Loading into dataframe

library(DBI)
library(RSQLite)

dbconnect <- dbConnect(SQLite(),"data.sqlite")

# Read the entire table into a dataframe
df <- dbReadTable(dbconnect, "members")

#  alternatively 

# Select a portion of the table into a dataframe

df <- dbGetQuery(dbconnect, "SELECT * FROM members")

# Disconnection Required when interacting with SQL
dbDisconnect(dbconnect)

print(df)

    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

XML : data.xml

File contents

<records>
  <record>
    <id>1234</id>
    <name>Bob</name>
    <age>73</age>
    <value>1235</value>
  </record>
  <record>
    <id>2356</id>
    <name>Bill</name>
    <age>76</age>
    <value>2225</value>
  </record>
  <record>
    <id>4421</id>
    <name>George</name>
    <age>65</age>
    <value>3152</value>
  </record>
  <record>
    <id>5319</id>
    <name>William</name>
    <age>54</age>
    <value>3433</value>
  </record>
</records>

Loading into dataframe

library(xml2)
library(dplyr)

xml_data = read_xml("data.xml")
cat("--- Step 1: Read file -------------\n")

--- Step 1: Read file -------------

print(xml_data)

{xml_document}
<records>
[1] <record>\n  <id>1234</id>\n  <name>Bob</name>\n  <age>73</age>\n  <value> ...
[2] <record>\n  <id>2356</id>\n  <name>Bill</name>\n  <age>76</age>\n  <value ...
[3] <record>\n  <id>4421</id>\n  <name>George</name>\n  <age>65</age>\n  <val ...
[4] <record>\n  <id>5319</id>\n  <name>William</name>\n  <age>54</age>\n  <va ...

records = xml_find_all(xml_data,"//record")
cat("--- Step 2: Parse into Records -----\n")

--- Step 2: Parse into Records -----

print(records)

{xml_nodeset (4)}
[1] <record>\n  <id>1234</id>\n  <name>Bob</name>\n  <age>73</age>\n  <value> ...
[2] <record>\n  <id>2356</id>\n  <name>Bill</name>\n  <age>76</age>\n  <value ...
[3] <record>\n  <id>4421</id>\n  <name>George</name>\n  <age>65</age>\n  <val ...
[4] <record>\n  <id>5319</id>\n  <name>William</name>\n  <age>54</age>\n  <va ...

df = data.frame(
  id = xml_text(xml_find_all(records, "./id")),
  name = xml_text(xml_find_all(records, "./name")),
  age = as.integer(xml_text(xml_find_all(records, "./age"))),
  value = as.integer(xml_text(xml_find_all(records, "./value")))
)
cat("--- Step 4: Parse into columns -----\n")

--- Step 4: Parse into columns -----

print(df)

    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

Yet Another Markup Language (YAML): data.yaml

- id: 1234
  name: Bob
  age: 73
  value: 1235
- id: 2356
  name: Bill
  age: 76
  value: 2225
- id: 4421
  name: George
  age: 65
  value: 3152
- id: 5319
  name: William
  age: 54
  value: 3433

Loading into dataframe

library(yaml)

yaml_list = read_yaml("data.yaml")
df = as.data.frame(do.call(rbind,yaml_list))
print(df)

    id    name age value
1 1234     Bob  73  1235
2 2356    Bill  76  2225
3 4421  George  65  3152
4 5319 William  54  3433

File Contents:

- id: 1234
  name: Bob
  age: 73
  value: 1235
- id: 2356
  name: Bill
  age: 76
  value: 2225
- id: 4421
  name: George
  age: 65
  value: 3152
- id: 5319
  name: William
  age: 54
  value: 3433

Data Sources

Weather APIs: api.open-meteo.com
Kaggle: https://kaggle.com
GitHub: https://github.com
Data.gov: https://data.gov.in
EU Open Data Portal: https://data.europa.eu/en
UCI Machine Learning Repository: https://archive.ics.uci.edu/
Hugging Face Dataset: https://huggingface.co/datasets
Open Data on AWS: https://registry.opendata.aws/
Harvard Data verse: https://data.harvard.edu/dataverse
PhysioNet: https://physionet.org/
World Bank Open Data: https://data.worldbank.org/
Federal Reserve Economic Data: https://fred.stlouisfed.org/
GNU Regression, Econometrics and Time-series Library: https://gretl.sourceforge.net/

Commercial Data Networks

STAT: https://www.stata.com/
Flight tracker: flightaware.com
Maritine Traffic: marinetraffic.com, vesselfinder.com

Data Dashboards

Microsoft Power BI
Tableau
SAP
IBM Dashboard (SPSS)
Splunk
Data.world
Bit Bucket
Google Docs
Dropbox

Weather example

Source: api.open-meteo.com
Current temp: at CNX Airport

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true

Past week:

https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&past_days=7&forecast_days=0&hourly=temperature_2m&daily=sunrise,sunset

historical archive:

https://archive-api.open-meteo.com/v1/archive?latitude=18&longitude=98&start_date=2026-06-15&end_date=2026-06-16&hourly=temperature_2m,wind_speed_10m,wind_direction_10m,rain,precipitation,soil_temperature_0cm,soil_moisture_0_to_1cm,direct_radiation,weather_code,surface_pressure&daily=sunrise,sunset

download.file(“https://api.open-meteo.com/v1/forecast?latitude=18.7668&longitude=98.9626&current_weather=true”, “temp.txt”)

https://api.open-meteo.com/v1/archive?latitude=18.7668&longitude=98.9626&start_date=2026-05-28&end_date=2026-06-11&hourly=temperature_2m,visibility,surface_pressure,weather_cod,rain,apparent_temperature,relative_humidity_2m,winddirection_10m,wind_speed_10m,cloud_cover

Downloaded file structure

library(jsonlite)
df = fromJSON("https://archive-api.open-meteo.com/v1/archive?latitude=18.7668&longitude=98.9626&start_date=2026-05-28&end_date=2026-06-11&hourly=temperature_2m,surface_pressure,weather_code,rain,apparent_temperature,relative_humidity_2m,winddirection_10m,wind_speed_10m,cloud_cover&daily=sunrise,sunset")
cat("-- Combined Dataframe ---------\n")

-- Combined Dataframe ---------

str(df)

List of 11
 $ latitude             : num 18.7
 $ longitude            : num 99
 $ generationtime_ms    : num 10.1
 $ utc_offset_seconds   : int 0
 $ timezone             : chr "GMT"
 $ timezone_abbreviation: chr "GMT"
 $ elevation            : num 310
 $ hourly_units         :List of 10
  ..$ time                : chr "iso8601"
  ..$ temperature_2m      : chr "°C"
  ..$ surface_pressure    : chr "hPa"
  ..$ weather_code        : chr "wmo code"
  ..$ rain                : chr "mm"
  ..$ apparent_temperature: chr "°C"
  ..$ relative_humidity_2m: chr "%"
  ..$ winddirection_10m   : chr "°"
  ..$ wind_speed_10m      : chr "km/h"
  ..$ cloud_cover         : chr "%"
 $ hourly               :List of 10
  ..$ time                : chr [1:360] "2026-05-28T00:00" "2026-05-28T01:00" "2026-05-28T02:00" "2026-05-28T03:00" ...
  ..$ temperature_2m      : num [1:360] 25.4 26.6 28 29.1 30.1 30.9 32.6 33.2 31 30.1 ...
  ..$ surface_pressure    : num [1:360] 975 975 976 976 975 ...
  ..$ weather_code        : int [1:360] 2 2 3 2 3 2 1 1 53 53 ...
  ..$ rain                : num [1:360] 0 0 0 0 0 0 0 0 0.9 0.7 ...
  ..$ apparent_temperature: num [1:360] 31 32.4 33.5 35.5 37 38 38.8 39.4 35.8 34.9 ...
  ..$ relative_humidity_2m: int [1:360] 92 86 78 73 67 63 53 53 62 68 ...
  ..$ winddirection_10m   : int [1:360] 222 176 110 124 151 180 188 193 195 218 ...
  ..$ wind_speed_10m      : num [1:360] 2.7 2.5 2.7 1.9 2.3 2.9 5.4 6.3 13.1 7.8 ...
  ..$ cloud_cover         : int [1:360] 63 78 95 78 81 68 47 49 85 70 ...
 $ daily_units          :List of 3
  ..$ time   : chr "iso8601"
  ..$ sunrise: chr "iso8601"
  ..$ sunset : chr "iso8601"
 $ daily                :List of 3
  ..$ time   : chr [1:15] "2026-05-28" "2026-05-29" "2026-05-30" "2026-05-31" ...
  ..$ sunrise: chr [1:15] "2026-05-27T22:46" "2026-05-28T22:46" "2026-05-29T22:46" "2026-05-30T22:46" ...
  ..$ sunset : chr [1:15] "2026-05-28T11:55" "2026-05-29T11:56" "2026-05-30T11:56" "2026-05-31T11:56" ...

Hourly Dataframe structure

hourly = as.data.frame(df$hourly)
print(str(hourly))

'data.frame':   360 obs. of  10 variables:
 $ time                : chr  "2026-05-28T00:00" "2026-05-28T01:00" "2026-05-28T02:00" "2026-05-28T03:00" ...
 $ temperature_2m      : num  25.4 26.6 28 29.1 30.1 30.9 32.6 33.2 31 30.1 ...
 $ surface_pressure    : num  975 975 976 976 975 ...
 $ weather_code        : int  2 2 3 2 3 2 1 1 53 53 ...
 $ rain                : num  0 0 0 0 0 0 0 0 0.9 0.7 ...
 $ apparent_temperature: num  31 32.4 33.5 35.5 37 38 38.8 39.4 35.8 34.9 ...
 $ relative_humidity_2m: int  92 86 78 73 67 63 53 53 62 68 ...
 $ winddirection_10m   : int  222 176 110 124 151 180 188 193 195 218 ...
 $ wind_speed_10m      : num  2.7 2.5 2.7 1.9 2.3 2.9 5.4 6.3 13.1 7.8 ...
 $ cloud_cover         : int  63 78 95 78 81 68 47 49 85 70 ...
NULL

Daily Dataframe

daily = as.data.frame(df$daily)
print(str(daily))

'data.frame':   15 obs. of  3 variables:
 $ time   : chr  "2026-05-28" "2026-05-29" "2026-05-30" "2026-05-31" ...
 $ sunrise: chr  "2026-05-27T22:46" "2026-05-28T22:46" "2026-05-29T22:46" "2026-05-30T22:46" ...
 $ sunset : chr  "2026-05-28T11:55" "2026-05-29T11:56" "2026-05-30T11:56" "2026-05-31T11:56" ...
NULL

Hourly Readings

Daily averages

Weekly averages

wk1 =data.frame(ptdata[1:7,])
wk2 =data.frame(ptdata[8:14,])
par(mfcol=c(1,2))
boxplot(wk1,ylim=c(22.5,30),col=colors,
  main="Temperate by Elevation\nWeek1")
legend(.5,25.5,c('-54cm','-18cm','-6cm',
  '0cm','2m','80m','120m','180m'),
  title="Elevation",
  fill=colors[1:8])
boxplot(wk2,ylim=c(22.5,30),
  main="Temperate by Elevation\nWeek2",
  col=colors)
legend(.5,25.5,c('-54cm','-18cm','-6cm',
'0cm','2m','80m','120m','180m'),
title="Elevation",fill=colors[1:8])

R code

Weather codes

Code	Description
0	Clear sky
1, 2, 3	Mainly clear, partly cloudy, and overcast
45, 48	Fog and depositing rime fog
51, 53, 55	Drizzle: Light, moderate, and dense intensity
56, 57	Freezing Drizzle: Light and dense intensity
61, 63, 65	Rain: Slight, moderate and heavy intensity
66, 67	Freezing Rain: Light and heavy intensity
71, 73, 75	Snow fall: Slight, moderate, and heavy intensity
77	Snow grains
80, 81, 82	Rain showers: Slight, moderate, and violent
85, 86	Snow showers slight and heavy
95 *	Thunderstorm: Slight or moderate
96, 99 *	Thunderstorm with slight and heavy hail

IT408 / IT408 SC: Data Mining

Course Textbooks:

Schedule

R Notebook

RMarkdown Text formatting

Lists

Links and images

Tables

R matrices FUNCTIONS

Changing data types

Data file types

Comma Separated Values: data.CSV

File contents

Read data file into a R dataframe

Excel Datafile: data.xlsx

data.xlsx Contents

Loading into dataframe

JavaScriptObject Notation (JSON) file: data.json

File Contents

Loading into dataframe

Structured Query Language (SQL): data.sqlite

Dump file contents

Data Extraction from SQLite database:

Loading into dataframe

XML : data.xml

File contents

Loading into dataframe

Yet Another Markup Language (YAML): data.yaml

Loading into dataframe

File Contents:

Data Sources

Weather example

Downloaded file structure

Hourly Dataframe structure

Daily Dataframe

Hourly Readings

Daily averages

Weekly averages

R code

Weather codes

IT408 / IT408 SC:
Data Mining