R is widely used for data analysis, statistics, and data science. When dealing with big data and databases, R does not try to load everything into memory. Instead, it connects to external systems such as databases, distributed computing frameworks, and big data engines, and processes data efficiently.
This presentation demonstrates practical examples of: - Working with large datasets - Connecting R to databases - Querying data using SQL from R - Using big data tools such as Spark - Efficient data handling techniques
Big data is characterized by: - Volume: Large amounts of data - Velocity: Fast data generation - Variety: Structured and unstructured data
R handles big data by: - Chunk processing - Database-backed analysis - Distributed computing
This demonstrates how to efficiently handle and summarize large datasets in R using the data.table package. It first creates a table with one million rows containing an ID and a random numeric value. Then, it quickly calculates the average value grouped by the remainder when the ID is divided by 10. This grouping splits the data into 10 groups based on the last digit of the ID. The data.table package is designed for speed and memory efficiency, making it ideal for working with big data. This example highlights how data.table can perform fast aggregations on large datasets with simple and concise code.
install.packages("data.table")
library(data.table)
# Simulate a large dataset
dt <- data.table(
id = 1:1e6,
value = rnorm(1e6)
)
# Fast aggregation
dt[, .(mean_value = mean(value)), by = id %% 10]
# Select rows where value is greater than 2
dt[value > 2]
# Find the maximum value and the corresponding id
dt[which.max(value)]
# Select rows where id is between 1000 and 1020
dt[id >= 1000 & id <= 1020]
This R code demonstrates how to efficiently handle very large CSV files by reading the data in manageable chunks instead of loading the entire file into memory at once. It opens a file connection, reads a fixed number of rows (50,000) per iteration, processes each chunk independently by generating summary statistics, and stops automatically when the end of the file is reached. This chunk-based approach is memory-efficient, scalable, and especially useful in real-world data engineering tasks such as log processing, large dataset analysis, and ETL workflows where system memory is limited.
con <- file("movies_dataset.csv", "r")
chunk_size <- 50000
repeat {
data_chunk <- read.csv(con, nrows = chunk_size)
if (nrow(data_chunk) == 0) break
print(summary(data_chunk))
}
close(con)
DBI stands for Database Interface. It’s a standardized way in programming (especially in R) to connect to and interact with databases.
In R, the DBI package provides a common set of functions to:
Think of DBI as a bridge between your R code and any database, making it easier to work with different database systems without changing much of your code.
For example, using DBI, you can:
install.packages(c("DBI", "RSQLite"))
# Load the package (every session)
# DBI is the standard interface R uses to talk to databases
# (RSQLite is used behind the scenes)
library(DBI)
# Connect to an SQLite database
# What this does:
# • Creates a file called MSTA1207.db in your current working directory
# • If the file already exists → R connects to it
# • If it doesn't exist → SQLite creates it automatically
con <- dbConnect(RSQLite::SQLite(), "MSTA1207.db")
# Check where it's saved:
getwd()
## [1] "C:/Users/cathe/Downloads"
# Drop table if it exists (prevents error on re-run)
dbExecute(con, "DROP TABLE IF EXISTS sales")
## [1] 0
# Create a table in the database
dbExecute(con, "
CREATE TABLE sales (
id INTEGER,
product TEXT,
amount REAL,
sale_date DATE
)
")
## [1] 0
# Insert data into the table
dbExecute(con, "
INSERT INTO sales VALUES
(1, 'Laptop', 80000, '2024-01-12'),
(2, 'Phone', 50000, '2024-01-13')
")
## [1] 2
# Verify that the data is there
sales_data <- dbGetQuery(con, "SELECT * FROM sales")
print(sales_data)
## id product amount sale_date
## 1 1 Laptop 80000 2024-01-12
## 2 2 Phone 50000 2024-01-13
# Read data using SQL
sales_summary <- dbGetQuery(con, "
SELECT product, SUM(amount) AS total_sales
FROM sales
GROUP BY product
")
print(sales_summary)
## product total_sales
## 1 Laptop 80000
## 2 Phone 50000
# 1: Total sales
total_sales <- dbGetQuery(con, "
SELECT SUM(amount) AS total_sales
FROM sales
")
print(total_sales)
## total_sales
## 1 130000
# 2: Sales by product
sales_by_product <- dbGetQuery(con, "
SELECT product, SUM(amount) AS total
FROM sales
GROUP BY product
")
print(sales_by_product)
## product total
## 1 Laptop 80000
## 2 Phone 50000
# 3: Load into R for further analysis
sales_df <- dbGetQuery(con, "SELECT * FROM sales")
print(sales_df)
## id product amount sale_date
## 1 1 Laptop 80000 2024-01-12
## 2 2 Phone 50000 2024-01-13
# Key takeaways:
# ✔ Persistent storage (SQLite)
# ✔ SQL inside R
# ✔ Separation of storage & computation
# ✔ Scalable workflow (same code works with MySQL, PostgreSQL, etc.)
# Install needed packages if you haven't already
install.packages(c("DBI", "RSQLite", "dplyr"))
library(DBI)
library(RSQLite)
library(dplyr)
# Connect to an in-memory SQLite database
con_dplyr <- dbConnect(RSQLite::SQLite(), ":memory:")
# Read orders.csv
orders_df <- read.csv("orders.csv", stringsAsFactors = FALSE)
# Copy the data frame into the database as "orders" table
copy_to(con_dplyr, orders_df, "orders", temporary = FALSE, overwrite = TRUE)
# Reference the "orders" table lazily
orders_tbl <- tbl(con_dplyr, "orders")
# Example lazy query:
# Find total sales (List Price * Quantity * (1 - Discount Percent)) by Region and Category
result_lazy <- orders_tbl %>%
mutate(
Discount_Factor = 1 - Discount.Percent,
Sales = List.Price * Quantity * Discount_Factor
) %>%
group_by(Region, Category) %>%
summarise(
Total_Sales = sum(Sales, na.rm = TRUE),
Total_Quantity = sum(Quantity, na.rm = TRUE),
Average_Discount = mean(Discount.Percent, na.rm = TRUE)
) %>%
arrange(desc(Total_Sales))
# Show SQL query generated (optional)
show_query(result_lazy)
# Collect the results into R (this runs the SQL query)
result <- collect(result_lazy)
# View results
print(result)
# Disconnect from DB when done
dbDisconnect(con_dplyr)
Note: The computation happens in the database, not in R.
install.packages(c("RMySQL", "RPostgres"))
# MySQL example using RMySQL
library(DBI)
library(RMySQL)
con_mysql <- dbConnect(
RMySQL::MySQL(),
dbname = "company_db",
host = "localhost",
user = "root",
password = "password"
)
# Check connection
if (dbIsValid(con_mysql)) {
message("Connected to MySQL database successfully!")
}
# When done
dbDisconnect(con_mysql)
# Working with MySQL using RMariaDB (alternative)
install.packages("RMariaDB")
library(DBI)
library(RMariaDB)
con_maria <- dbConnect(
MariaDB(),
dbname = "movies",
host = "localhost",
user = "wycky",
password = '12345'
)
# Verify connection
dbIsValid(con_maria)
# Query example
movies_data <- dbGetQuery(con_maria, "SELECT * FROM movies LIMIT 10")
print(movies_data)
# Disconnect
dbDisconnect(con_maria)
Using sparklyr, R can work with big data in Apache Spark by connecting to a Spark cluster and performing distributed data processing while still using familiar R syntax. Sparklyr allows R to read large datasets from sources such as HDFS, apply transformations like filtering, grouping, and aggregation using dplyr-style commands, and execute these operations in parallel across the cluster. The heavy computation happens inside Spark, not in R’s memory, and only the final results are collected back into R, making sparklyr a powerful and scalable approach for big data analytics and ETL workflows in R.
install.packages("sparklyr")
library(sparklyr)
# Connect to local Spark instance
sc <- spark_connect(master = "local")
# Copy R data to Spark
spark_df <- copy_to(sc, mtcars, "cars", overwrite = TRUE)
# Perform analysis in Spark
result <- spark_df %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg))
print(result)
# Disconnect
spark_disconnect(sc)
From an R perspective, reading data from Hadoop (HDFS) means accessing large, distributed datasets without loading everything into local memory at once, often through integration with big-data frameworks like Spark. Using tools such as sparklyr or SparkR, R connects to a Hadoop cluster, requests file metadata from HDFS, and processes the data in parallel across multiple nodes rather than on a single machine. The actual computation is pushed down to the cluster, and only the required results are collected back into R, allowing R users to work efficiently with massive datasets stored in HDFS in a scalable and fault-tolerant way.
# Conceptual example
# Data is accessed through Spark or database engines, not directly loaded
library(sparklyr)
sc <- spark_connect(master = "local")
# Reading from HDFS
hdfs_data <- spark_read_csv(sc, "hdfs_data", "hdfs://path/to/data.csv")
# Process data
result <- hdfs_data %>%
filter(column_name > 100) %>%
group_by(category) %>%
summarise(total = sum(value))
collect(result)
R ETL process extracts order data from a CSV file, transforms it by cleaning column names, correcting data types, handling missing values, and creating meaningful business metrics such as revenue, cost, and profit, and then loads the processed data into a structured database table. The pipeline ensures the raw data is converted into an analysis-ready and database-friendly format, making it suitable for reporting, analytics, and downstream data science tasks while following real-world data engineering best practices.
# Load required libraries
library(readr)
library(dplyr)
library(lubridate)
library(DBI)
library(RSQLite)
# EXTRACT - Load data
orders_raw <- read_csv("orders.csv")
# View structure
str(orders_raw)
# TRANSFORM - Clean, format, enrich
orders_clean <- orders_raw %>%
# Clean column names
rename_with(~ gsub(" ", "_", tolower(.))) %>%
# Convert data types
mutate(
order_date = as.Date(order_date, format = "%Y-%m-%d"),
postal_code = as.character(postal_code),
cost_price = as.numeric(cost_price),
list_price = as.numeric(list_price),
quantity = as.integer(quantity),
discount_percent = as.numeric(discount_percent)
) %>%
# Create derived columns
mutate(
discount_amount = list_price * (discount_percent / 100),
selling_price = list_price - discount_amount,
total_cost = cost_price * quantity,
total_revenue = selling_price * quantity,
profit = total_revenue - total_cost
) %>%
# Handle missing values
filter(!is.na(order_id), !is.na(order_date))
# Preview transformed data
glimpse(orders_clean)
# LOAD - Create database connection
con_etl <- dbConnect(SQLite(), "orders_dw.sqlite")
# Load data into database
dbWriteTable(
con_etl,
"fact_orders",
orders_clean,
overwrite = TRUE
)
# Verify load
row_count <- dbGetQuery(con_etl, "SELECT COUNT(*) AS total_rows FROM fact_orders")
print(row_count)
# Close connection
dbDisconnect(con_etl)
# Read the database
library(DBI)
library(RSQLite)
con_read <- dbConnect(SQLite(), "orders_dw.sqlite")
orders_tbl <- dbGetQuery(
con_read,
"SELECT order_id, order_date, region, total_revenue, profit
FROM fact_orders
WHERE profit > 0
LIMIT 10"
)
head(orders_tbl)
dbDisconnect(con_read)
R is fully capable of working with big data when integrated with databases and distributed systems. By combining SQL, Spark, Arrow, and efficient R packages, analysts can scale from small datasets to enterprise-level big data environments.
# Clean up: Close database connection
dbDisconnect(con)