Purpose of this document: A step-by-step walkthrough
of how to import data from three major statistical software formats —
SPSS (.sav), SAS
(.sas7bdat), and Stata (.dta)
— into R, store everything in a SQLite relational
database, and query it back out.
Every section contains notes explaining why we do each
step.
📦 Step 1 — Install
& Load Required Packages
# ─────────────────────────────────────────────────────────────────────────────
# PACKAGE INSTALLATION
# install.packages() downloads a package from CRAN and installs it on your
# machine ONCE. After installation, you only need library() to load it.
#
# We use if (!require(...)) to avoid re-installing packages that are
# already present — this makes the document reproducible and fast.
#
# Packages we need:
# haven → reads SPSS (.sav), SAS (.sas7bdat), and Stata (.dta) files
# foreign → older alternative for SPSS/SAS (good backup option)
# DBI → unified database interface — the "grammar" for all databases
# RSQLite → SQLite database driver; works together with DBI
# dplyr → data manipulation (filter, select, mutate, summarise…)
# knitr → for kable() which prints beautiful HTML tables
# kableExtra → adds extra styling to kable tables
# ─────────────────────────────────────────────────────────────────────────────
packages <- c("haven", "foreign", "DBI", "RSQLite",
"dplyr", "knitr", "kableExtra")
for (pkg in packages) {
if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
library(pkg, character.only = TRUE)
}
}
# ── Confirm everything loaded correctly ──────────────────────────────────────
cat("✅ Packages loaded successfully:\n")
## ✅ Packages loaded successfully:
cat(paste(" •", packages, collapse = "\n"), "\n")
## • haven
## • foreign
## • DBI
## • RSQLite
## • dplyr
## • knitr
## • kableExtra
📝 Note: haven is the modern,
tidyverse-compatible package for reading statistical files. It preserves
SPSS/SAS/Stata labels, value labels,
and missing value codes — something base R functions
cannot do.
📂 Step 2 — Import SPSS
Data (.sav)
# ─────────────────────────────────────────────────────────────────────────────
# IMPORTING AN SPSS FILE
#
# SPSS saves data in a proprietary binary format (.sav).
# haven::read_sav() reads this file and converts it to an R tibble (a
# modern data frame). It also attaches any variable labels and value labels
# that were stored in the SPSS file as R attributes.
#
# IMPORTANT: Replace the path below with the actual path to YOUR .sav file.
# Examples:
# Windows → "C:/Users/YourName/Documents/data_spss.sav"
# Mac/Linux → "~/Documents/data_spss.sav"
# Same folder as this .Rmd → "data_spss.sav"
# ─────────────────────────────────────────────────────────────────────────────
# ── 2a. Read the file ────────────────────────────────────────────────────────
spss_data <- haven::read_sav("data_spss.sav")
# ── 2b. Preview the raw import ───────────────────────────────────────────────
cat("── SPSS Import: Dimensions ──────────────────────────────────────────\n")
## ── SPSS Import: Dimensions ──────────────────────────────────────────
cat("Rows:", nrow(spss_data), "| Columns:", ncol(spss_data), "\n\n")
## Rows: 5 | Columns: 4
cat("── Column names and data types ─────────────────────────────────────\n")
## ── Column names and data types ─────────────────────────────────────
## Rows: 5
## Columns: 4
## $ id <dbl> 29013, 28285, 28924, 26003, 29408
## $ name <chr> "Joy", "Peter", "Eric", "Claude", "Marry"
## $ age <dbl> 20, 19, 23, 18, 20
## $ grade <dbl> 14.0, 12.5, 15.8, 17.0, 16.5
# ── 2c. Display as a formatted table ─────────────────────────────────────────
kable(spss_data,
caption = "Table 1: Raw SPSS Data Import",
align = c("r","l","r","r")) |>
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE,
position = "left") |>
column_spec(2, bold = TRUE, color = "#2c7fb8")
Table 1: Raw SPSS Data Import
|
id
|
name
|
age
|
grade
|
|
29013
|
Joy
|
20
|
14.0
|
|
28285
|
Peter
|
19
|
12.5
|
|
28924
|
Eric
|
23
|
15.8
|
|
26003
|
Claude
|
18
|
17.0
|
|
29408
|
Marry
|
20
|
16.5
|
# ── 2d. Clean and standardise column types ────────────────────────────────────
# haven imports SPSS labelled vectors — we convert them to plain R types
# using zap_labels(). This makes the data easier to work with downstream.
spss_clean <- spss_data |>
haven::zap_labels() |> # remove SPSS label attributes
mutate(
id = as.integer(id), # SPSS stores IDs as float → convert
name = as.character(name),
age = as.integer(age),
grade = as.numeric(grade),
source = "SPSS" # track where each row came from
)
cat("✅ SPSS data cleaned. Preview:\n")
## ✅ SPSS data cleaned. Preview:
## # A tibble: 5 × 5
## id name age grade source
## <int> <chr> <int> <dbl> <chr>
## 1 29013 Joy 20 14 SPSS
## 2 28285 Peter 19 12.5 SPSS
## 3 28924 Eric 23 15.8 SPSS
## 4 26003 Claude 18 17 SPSS
## 5 29408 Marry 20 16.5 SPSS
📝 Note on zap_labels(): SPSS often
encodes categorical responses as numbers (e.g., 1 = Male, 2 = Female)
with value labels. zap_labels() drops the label metadata so
R treats the column as a plain number or character. If you want to
keep labels as factors, use as_factor()
instead.
📂 Step 3 — Import SAS
Data (.sas7bdat)
# ─────────────────────────────────────────────────────────────────────────────
# IMPORTING A SAS FILE
#
# SAS stores data in .sas7bdat binary files. haven::read_sas() reads them
# directly — no SAS licence required!
#
# If you do NOT have a real SAS file, the block below creates a synthetic
# dataset with the same structure as the SPSS file so the rest of the
# document runs without errors. Remove the `if` block once you have a
# real .sas7bdat file.
#
# NOTE: SAS files can also carry format catalogs (.sas7bcat). If you have
# one, pass it via: read_sas("data.sas7bdat", catalog_file = "formats.sas7bcat")
# ─────────────────────────────────────────────────────────────────────────────
sas_file_path <- "data_sas.sas7bdat" # ← change to your actual path
if (file.exists(sas_file_path)) {
# Real SAS file present — read it
sas_data <- haven::read_sas(sas_file_path)
cat("✅ Real SAS file loaded from:", sas_file_path, "\n")
} else {
# No SAS file found — create a synthetic demo dataset
# (mirrors the columns in the SPSS data: id, name, age, grade)
cat("ℹ️ No SAS file found at '", sas_file_path,
"'. Creating a synthetic SAS-like demo dataset.\n", sep = "")
sas_data <- tibble::tibble(
id = c(30001L, 30002L, 30003L, 30004L),
name = c("Alice", "Bob", "Diana", "Frank"),
age = c(22L, 25L, 21L, 24L),
grade = c(15.0, 13.5, 17.2, 11.8)
)
}
## ℹ️ No SAS file found at 'data_sas.sas7bdat'. Creating a synthetic SAS-like demo dataset.
# ── Clean and add source tag ──────────────────────────────────────────────────
sas_clean <- sas_data |>
haven::zap_labels() |>
mutate(
id = as.integer(id),
name = as.character(name),
age = as.integer(age),
grade = as.numeric(grade),
source = "SAS"
)
cat("\n── SAS Data Preview ─────────────────────────────────────────────────\n")
##
## ── SAS Data Preview ─────────────────────────────────────────────────
## # A tibble: 4 × 5
## id name age grade source
## <int> <chr> <int> <dbl> <chr>
## 1 30001 Alice 22 15 SAS
## 2 30002 Bob 25 13.5 SAS
## 3 30003 Diana 21 17.2 SAS
## 4 30004 Frank 24 11.8 SAS
kable(sas_clean,
caption = "Table 2: SAS Data (real file or demo)",
align = c("r","l","r","r","l")) |>
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE) |>
column_spec(2, bold = TRUE, color = "#31a354")
Table 2: SAS Data (real file or demo)
|
id
|
name
|
age
|
grade
|
source
|
|
30001
|
Alice
|
22
|
15.0
|
SAS
|
|
30002
|
Bob
|
25
|
13.5
|
SAS
|
|
30003
|
Diana
|
21
|
17.2
|
SAS
|
|
30004
|
Frank
|
24
|
11.8
|
SAS
|
📝 Note: haven::read_sas() also reads
SAS Transport files (.xpt). These are used
by the FDA and CDC for public data releases. Use
haven::read_xpt("file.xpt") for those.
📂 Step 4 — Import
Stata Data (.dta)
# ─────────────────────────────────────────────────────────────────────────────
# IMPORTING A STATA FILE
#
# Stata saves data as .dta files. haven::read_dta() supports Stata 8–18.
# Stata files often contain:
# - Variable labels (long descriptions of each column)
# - Value labels (text labels for numeric codes)
# - Notes (comments attached to variables)
# haven preserves all of these as R attributes.
#
# Same as SAS: a synthetic dataset is created if no real file is found.
# ─────────────────────────────────────────────────────────────────────────────
stata_file_path <- "data_stata.dta" # ← change to your actual path
if (file.exists(stata_file_path)) {
stata_data <- haven::read_dta(stata_file_path)
cat("✅ Real Stata file loaded from:", stata_file_path, "\n")
} else {
cat("ℹ️ No Stata file found at '", stata_file_path,
"'. Creating a synthetic Stata-like demo dataset.\n", sep = "")
stata_data <- tibble::tibble(
id = c(40001L, 40002L, 40003L),
name = c("Grace", "Henry", "Iris"),
age = c(20L, 27L, 23L),
grade = c(16.0, 14.5, 18.0)
)
}
## ℹ️ No Stata file found at 'data_stata.dta'. Creating a synthetic Stata-like demo dataset.
# ── Clean and add source tag ──────────────────────────────────────────────────
stata_clean <- stata_data |>
haven::zap_labels() |>
mutate(
id = as.integer(id),
name = as.character(name),
age = as.integer(age),
grade = as.numeric(grade),
source = "Stata"
)
cat("\n── Stata Data Preview ───────────────────────────────────────────────\n")
##
## ── Stata Data Preview ───────────────────────────────────────────────
## # A tibble: 3 × 5
## id name age grade source
## <int> <chr> <int> <dbl> <chr>
## 1 40001 Grace 20 16 Stata
## 2 40002 Henry 27 14.5 Stata
## 3 40003 Iris 23 18 Stata
kable(stata_clean,
caption = "Table 3: Stata Data (real file or demo)",
align = c("r","l","r","r","l")) |>
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE) |>
column_spec(2, bold = TRUE, color = "#d95f0e")
Table 3: Stata Data (real file or demo)
|
id
|
name
|
age
|
grade
|
source
|
|
40001
|
Grace
|
20
|
16.0
|
Stata
|
|
40002
|
Henry
|
27
|
14.5
|
Stata
|
|
40003
|
Iris
|
23
|
18.0
|
Stata
|
📝 Note on Stata versions: Stata 13+ uses a
different file encoding than older versions. haven handles
all modern versions automatically.
For very old Stata 5–7 files, the foreign package’s
read.dta() is a reliable fallback.
🔗 Step 5 — Combine All
Sources into One Data Frame
# ─────────────────────────────────────────────────────────────────────────────
# COMBINING ALL THREE DATASETS
#
# dplyr::bind_rows() stacks data frames vertically (like SQL UNION ALL).
# It matches columns by NAME, not position — so column order doesn't matter.
# Missing columns in any data frame are filled with NA automatically.
#
# The 'source' column we added earlier lets us trace every record back to
# its original file after the data is combined.
# ─────────────────────────────────────────────────────────────────────────────
all_data <- dplyr::bind_rows(spss_clean, sas_clean, stata_clean)
cat("── Combined Dataset ─────────────────────────────────────────────────\n")
## ── Combined Dataset ─────────────────────────────────────────────────
cat("Total rows:", nrow(all_data),
"| Sources:", paste(unique(all_data$source), collapse = ", "), "\n\n")
## Total rows: 12 | Sources: SPSS, SAS, Stata
## # A tibble: 12 × 5
## id name age grade source
## <int> <chr> <int> <dbl> <chr>
## 1 29013 Joy 20 14 SPSS
## 2 28285 Peter 19 12.5 SPSS
## 3 28924 Eric 23 15.8 SPSS
## 4 26003 Claude 18 17 SPSS
## 5 29408 Marry 20 16.5 SPSS
## 6 30001 Alice 22 15 SAS
## 7 30002 Bob 25 13.5 SAS
## 8 30003 Diana 21 17.2 SAS
## 9 30004 Frank 24 11.8 SAS
## 10 40001 Grace 20 16 Stata
## 11 40002 Henry 27 14.5 Stata
## 12 40003 Iris 23 18 Stata
kable(all_data,
caption = "Table 4: All Sources Combined",
align = c("r","l","r","r","l")) |>
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = FALSE) |>
row_spec(1:nrow(spss_clean), background = "#eaf4fb") |> # SPSS rows → blue
row_spec((nrow(spss_clean)+1):(nrow(spss_clean)+nrow(sas_clean)),
background = "#eafbea") |> # SAS rows → green
row_spec((nrow(spss_clean)+nrow(sas_clean)+1):nrow(all_data),
background = "#fdf3e7") # Stata rows → orange
Table 4: All Sources Combined
|
id
|
name
|
age
|
grade
|
source
|
|
29013
|
Joy
|
20
|
14.0
|
SPSS
|
|
28285
|
Peter
|
19
|
12.5
|
SPSS
|
|
28924
|
Eric
|
23
|
15.8
|
SPSS
|
|
26003
|
Claude
|
18
|
17.0
|
SPSS
|
|
29408
|
Marry
|
20
|
16.5
|
SPSS
|
|
30001
|
Alice
|
22
|
15.0
|
SAS
|
|
30002
|
Bob
|
25
|
13.5
|
SAS
|
|
30003
|
Diana
|
21
|
17.2
|
SAS
|
|
30004
|
Frank
|
24
|
11.8
|
SAS
|
|
40001
|
Grace
|
20
|
16.0
|
Stata
|
|
40002
|
Henry
|
27
|
14.5
|
Stata
|
|
40003
|
Iris
|
23
|
18.0
|
Stata
|
🗄️ Step 6 — Create
& Connect to a SQLite Database
# ─────────────────────────────────────────────────────────────────────────────
# WHAT IS SQLite?
# SQLite is a self-contained, serverless SQL database engine. The entire
# database is stored in a single .sqlite file on disk. It is perfect for:
# ✔ Learning SQL
# ✔ Small-to-medium datasets
# ✔ Reproducible research (one file = one database)
# ✔ Sharing data with collaborators
#
# WHAT IS DBI?
# DBI (DataBase Interface) is R's standard way to talk to databases.
# It provides a common set of functions regardless of which database you use:
# dbConnect() → open a connection
# dbWriteTable() → write an R data frame to a database table
# dbGetQuery() → run SQL and return results as a data frame
# dbListTables() → list all tables
# dbDisconnect() → close the connection (important — always do this!)
# ─────────────────────────────────────────────────────────────────────────────
# ── 6a. Create / open the database file ──────────────────────────────────────
db_path <- "students_database.sqlite"
con <- DBI::dbConnect(
drv = RSQLite::SQLite(), # which database engine to use
dbname = db_path # file path; use ":memory:" for RAM-only DB
)
cat("✅ Database connected:", db_path, "\n")
## ✅ Database connected: students_database.sqlite
cat(" Connection class:", class(con), "\n")
## Connection class: SQLiteConnection
📝 Note: Using ":memory:" instead of a
file path creates an in-memory database that disappears when R closes.
Use a file path (like above) when you want the data to
persist between R sessions.
📤 Step 7 — Write Data
to the Database
# ─────────────────────────────────────────────────────────────────────────────
# WRITING R DATA FRAMES TO SQL TABLES
#
# dbWriteTable() creates a new table (or replaces it with overwrite = TRUE)
# and inserts all rows from the R data frame.
#
# We create THREE separate tables — one per source — plus ONE combined table.
# This mirrors how a real research database is often structured:
# raw_spss → original SPSS records
# raw_sas → original SAS records
# raw_stata → original Stata records
# all_students → combined view of all three
# ─────────────────────────────────────────────────────────────────────────────
DBI::dbWriteTable(con, "raw_spss", spss_clean, overwrite = TRUE)
DBI::dbWriteTable(con, "raw_sas", sas_clean, overwrite = TRUE)
DBI::dbWriteTable(con, "raw_stata", stata_clean, overwrite = TRUE)
DBI::dbWriteTable(con, "all_students", all_data, overwrite = TRUE)
# ── Confirm tables were created ───────────────────────────────────────────────
tables_in_db <- DBI::dbListTables(con)
cat("✅ Tables now in database:\n")
## ✅ Tables now in database:
cat(paste(" •", tables_in_db, collapse = "\n"), "\n")
## • all_students
## • raw_sas
## • raw_spss
## • raw_stata
# ── Verify row counts match what we loaded ───────────────────────────────────
for (tbl in tables_in_db) {
n <- DBI::dbGetQuery(con, paste("SELECT COUNT(*) AS n FROM", tbl))$n
cat(sprintf(" %-20s %d rows\n", tbl, n))
}
## all_students 12 rows
## raw_sas 4 rows
## raw_spss 5 rows
## raw_stata 3 rows
🔍 Step 8 — Query the
Database with SQL
# ─────────────────────────────────────────────────────────────────────────────
# QUERYING WITH SQL
#
# dbGetQuery(connection, sql_string) sends a SQL SELECT statement to SQLite
# and returns the result as an R data frame — ready to use immediately.
#
# SQL cheatsheet used below:
# SELECT → choose columns
# FROM → specify the table
# WHERE → filter rows (like dplyr::filter)
# ORDER BY → sort rows
# GROUP BY → aggregate rows
# AVG() → average
# COUNT() → count rows
# ROUND() → round a number to N decimal places
# ─────────────────────────────────────────────────────────────────────────────
# ── Query 1: All students, sorted by grade descending ────────────────────────
cat("── Query 1: All students ordered by grade ───────────────────────────\n")
## ── Query 1: All students ordered by grade ───────────────────────────
q1 <- DBI::dbGetQuery(con,
"SELECT id, name, age, grade, source
FROM all_students
ORDER BY grade DESC"
)
kable(q1, caption = "Query 1 — All Students by Grade (highest first)") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Query 1 — All Students by Grade (highest first)
|
id
|
name
|
age
|
grade
|
source
|
|
40003
|
Iris
|
23
|
18.0
|
Stata
|
|
30003
|
Diana
|
21
|
17.2
|
SAS
|
|
26003
|
Claude
|
18
|
17.0
|
SPSS
|
|
29408
|
Marry
|
20
|
16.5
|
SPSS
|
|
40001
|
Grace
|
20
|
16.0
|
Stata
|
|
28924
|
Eric
|
23
|
15.8
|
SPSS
|
|
30001
|
Alice
|
22
|
15.0
|
SAS
|
|
40002
|
Henry
|
27
|
14.5
|
Stata
|
|
29013
|
Joy
|
20
|
14.0
|
SPSS
|
|
30002
|
Bob
|
25
|
13.5
|
SAS
|
|
28285
|
Peter
|
19
|
12.5
|
SPSS
|
|
30004
|
Frank
|
24
|
11.8
|
SAS
|
# ── Query 2: Students with grade ≥ 15 ────────────────────────────────────────
cat("── Query 2: High-performing students (grade ≥ 15) ───────────────────\n")
## ── Query 2: High-performing students (grade ≥ 15) ───────────────────
q2 <- DBI::dbGetQuery(con,
"SELECT id, name, age, grade, source
FROM all_students
WHERE grade >= 15
ORDER BY grade DESC"
)
kable(q2, caption = "Query 2 — Students with Grade ≥ 15") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Query 2 — Students with Grade ≥ 15
|
id
|
name
|
age
|
grade
|
source
|
|
40003
|
Iris
|
23
|
18.0
|
Stata
|
|
30003
|
Diana
|
21
|
17.2
|
SAS
|
|
26003
|
Claude
|
18
|
17.0
|
SPSS
|
|
29408
|
Marry
|
20
|
16.5
|
SPSS
|
|
40001
|
Grace
|
20
|
16.0
|
Stata
|
|
28924
|
Eric
|
23
|
15.8
|
SPSS
|
|
30001
|
Alice
|
22
|
15.0
|
SAS
|
# ── Query 3: Summary statistics per source ───────────────────────────────────
cat("── Query 3: Summary by data source ─────────────────────────────────\n")
## ── Query 3: Summary by data source ─────────────────────────────────
q3 <- DBI::dbGetQuery(con,
"SELECT source,
COUNT(*) AS total_students,
ROUND(AVG(age), 1) AS avg_age,
ROUND(AVG(grade),2) AS avg_grade,
ROUND(MIN(grade),2) AS min_grade,
ROUND(MAX(grade),2) AS max_grade
FROM all_students
GROUP BY source
ORDER BY source"
)
kable(q3, caption = "Query 3 — Summary Statistics by Source File") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)
Query 3 — Summary Statistics by Source File
|
source
|
total_students
|
avg_age
|
avg_grade
|
min_grade
|
max_grade
|
|
SAS
|
4
|
23.0
|
14.38
|
11.8
|
17.2
|
|
SPSS
|
5
|
20.0
|
15.16
|
12.5
|
17.0
|
|
Stata
|
3
|
23.3
|
16.17
|
14.5
|
18.0
|
# ── Query 4: Only the SPSS table ─────────────────────────────────────────────
cat("── Query 4: Reading directly from the raw_spss table ───────────────\n")
## ── Query 4: Reading directly from the raw_spss table ───────────────
q4 <- DBI::dbGetQuery(con,
"SELECT * FROM raw_spss ORDER BY id"
)
kable(q4, caption = "Query 4 — Raw SPSS Table") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Query 4 — Raw SPSS Table
|
id
|
name
|
age
|
grade
|
source
|
|
26003
|
Claude
|
18
|
17.0
|
SPSS
|
|
28285
|
Peter
|
19
|
12.5
|
SPSS
|
|
28924
|
Eric
|
23
|
15.8
|
SPSS
|
|
29013
|
Joy
|
20
|
14.0
|
SPSS
|
|
29408
|
Marry
|
20
|
16.5
|
SPSS
|
🛠️ Step 9 — Using dplyr
with the Database (dbplyr approach)
# ─────────────────────────────────────────────────────────────────────────────
# DPLYR + DATABASE (lazy evaluation)
#
# tbl(connection, "table_name") creates a "lazy" reference to a database table.
# You write normal dplyr code, but dplyr translates it to SQL behind the scenes
# and only fetches data when you call collect().
#
# This is powerful for LARGE tables — the filtering happens IN the database,
# not in R memory. Only the result is transferred to R.
# ─────────────────────────────────────────────────────────────────────────────
students_tbl <- dplyr::tbl(con, "all_students")
# Build a query with dplyr verbs (nothing runs yet — it's lazy)
dplyr_result <- students_tbl |>
dplyr::filter(age >= 20) |>
dplyr::select(name, age, grade, source) |>
dplyr::arrange(dplyr::desc(grade))
# See the SQL that dplyr generated
cat("── SQL generated by dplyr ───────────────────────────────────────────\n")
## ── SQL generated by dplyr ───────────────────────────────────────────
dplyr::show_query(dplyr_result)
## <SQL>
## SELECT `name`, `age`, `grade`, `source`
## FROM `all_students`
## WHERE (`age` >= 20.0)
## ORDER BY `grade` DESC
# Now actually fetch the data into R
cat("\n── Collected result ─────────────────────────────────────────────────\n")
##
## ── Collected result ─────────────────────────────────────────────────
result_df <- dplyr::collect(dplyr_result)
print(result_df)
## # A tibble: 10 × 4
## name age grade source
## <chr> <int> <dbl> <chr>
## 1 Iris 23 18 Stata
## 2 Diana 21 17.2 SAS
## 3 Marry 20 16.5 SPSS
## 4 Grace 20 16 Stata
## 5 Eric 23 15.8 SPSS
## 6 Alice 22 15 SAS
## 7 Henry 27 14.5 Stata
## 8 Joy 20 14 SPSS
## 9 Bob 25 13.5 SAS
## 10 Frank 24 11.8 SAS
kable(result_df,
caption = "dplyr Query: Students aged 20+ sorted by grade") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
dplyr Query: Students aged 20+ sorted by grade
|
name
|
age
|
grade
|
source
|
|
Iris
|
23
|
18.0
|
Stata
|
|
Diana
|
21
|
17.2
|
SAS
|
|
Marry
|
20
|
16.5
|
SPSS
|
|
Grace
|
20
|
16.0
|
Stata
|
|
Eric
|
23
|
15.8
|
SPSS
|
|
Alice
|
22
|
15.0
|
SAS
|
|
Henry
|
27
|
14.5
|
Stata
|
|
Joy
|
20
|
14.0
|
SPSS
|
|
Bob
|
25
|
13.5
|
SAS
|
|
Frank
|
24
|
11.8
|
SAS
|
📝 Note: The lazy/collect pattern is the key
difference between
dplyr on an R data frame (executes immediately) vs
dplyr on a database
(executes only when collect() is called). Always call
collect() when
you are done filtering and want to work with the result in R.
📊 Step 10 — Simple
Visualisation
# ─────────────────────────────────────────────────────────────────────────────
# We query the database one more time and plot the results.
# This shows the full pipeline: Database → R data frame → plot.
# ─────────────────────────────────────────────────────────────────────────────
plot_data <- DBI::dbGetQuery(con,
"SELECT name, grade, source FROM all_students ORDER BY grade DESC"
)
# Colour palette — one colour per source
source_colours <- c(SPSS = "#2c7fb8", SAS = "#31a354", Stata = "#d95f0e")
barplot(
height = plot_data$grade,
names.arg = plot_data$name,
col = source_colours[plot_data$source],
main = "Student Grades by Source File",
xlab = "Student Name",
ylab = "Grade",
ylim = c(0, 22),
las = 1,
border = "white"
)
# Add value labels on top of each bar
text(
x = seq(0.7, by = 1.2, length.out = nrow(plot_data)),
y = plot_data$grade + 0.5,
labels = plot_data$grade,
cex = 0.85
)
# Legend
legend("topright",
legend = names(source_colours),
fill = source_colours,
border = "white",
bty = "n",
title = "Source")

🔎 Step 11 — Inspect
Database Schema
# ─────────────────────────────────────────────────────────────────────────────
# dbGetQuery with PRAGMA is an SQLite-specific command to inspect the
# structure (column names and types) of any table — equivalent to DESCRIBE
# in MySQL or \d in PostgreSQL.
# ─────────────────────────────────────────────────────────────────────────────
cat("── Schema of 'all_students' table ───────────────────────────────────\n")
## ── Schema of 'all_students' table ───────────────────────────────────
schema <- DBI::dbGetQuery(con, "PRAGMA table_info(all_students)")
kable(schema[, c("cid","name","type","notnull","dflt_value","pk")],
caption = "Table Schema: all_students",
col.names = c("Col #", "Name", "Type", "Not NULL", "Default", "PK")) |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table Schema: all_students
|
Col #
|
Name
|
Type
|
Not NULL
|
Default
|
PK
|
|
0
|
id
|
INTEGER
|
0
|
NA
|
0
|
|
1
|
name
|
TEXT
|
0
|
NA
|
0
|
|
2
|
age
|
INTEGER
|
0
|
NA
|
0
|
|
3
|
grade
|
REAL
|
0
|
NA
|
0
|
|
4
|
source
|
TEXT
|
0
|
NA
|
0
|
💾 Step 12 — Export
Results Back to Flat Files
# ─────────────────────────────────────────────────────────────────────────────
# EXPORTING FROM DATABASE → FILES
#
# You can export any query result back to a CSV, Excel, or other format.
# write.csv() is the simplest. row.names = FALSE prevents R from writing
# an extra row-number column that you almost never want.
# ─────────────────────────────────────────────────────────────────────────────
final_export <- DBI::dbGetQuery(con,
"SELECT * FROM all_students ORDER BY source, id"
)
write.csv(final_export, "all_students_export.csv", row.names = FALSE)
cat("✅ Exported to: all_students_export.csv\n")
## ✅ Exported to: all_students_export.csv
# Optionally save the high-grade students only
top_students <- DBI::dbGetQuery(con,
"SELECT * FROM all_students WHERE grade >= 15 ORDER BY grade DESC"
)
write.csv(top_students, "top_students_export.csv", row.names = FALSE)
cat("✅ Exported to: top_students_export.csv (", nrow(top_students), "rows)\n")
## ✅ Exported to: top_students_export.csv ( 7 rows)
🔒 Step 13 — Close the
Database Connection
# ─────────────────────────────────────────────────────────────────────────────
# ALWAYS disconnect from a database when you are done.
#
# Leaving connections open can:
# • Lock the database file (preventing other processes from writing)
# • Cause memory leaks in long-running R sessions
# • Produce warnings when R session ends
#
# DBI::dbDisconnect() cleanly closes the connection.
# ─────────────────────────────────────────────────────────────────────────────
DBI::dbDisconnect(con)
cat("✅ Database connection closed.\n")
## ✅ Database connection closed.
cat(" The database file '", "students_database.sqlite",
"' is still on disk and can be reopened anytime.\n", sep = "")
## The database file 'students_database.sqlite' is still on disk and can be reopened anytime.
📖 Quick Reference —
Package Summary
haven |
Read SPSS, SAS, Stata files |
read_sav(), read_sas(),
read_dta(), zap_labels() |
foreign |
Older SPSS/SAS reader |
read.spss(), read.ssd() |
DBI |
Database interface standard |
dbConnect(), dbGetQuery(),
dbWriteTable(), dbDisconnect() |
RSQLite |
SQLite driver for DBI |
SQLite() (used inside dbConnect) |
dplyr |
Data manipulation + lazy DB queries |
filter(), select(), tbl(),
collect() |
knitr / kableExtra |
Pretty HTML tables |
kable(), kable_styling() |
🚀 How to Publish to
RPubs
Follow these steps inside RStudio:
- Open this
.Rmd file in RStudio
- Click Knit → Knit to HTML
- After knitting, click the Publish button (top-right
of the preview)
- Choose RPubs
- Sign in or create a free account at rpubs.com
- Fill in the title and click Publish
📝 Note: Make sure your SPSS file
(data_spss.sav) is in the same folder as
this .Rmd file before knitting, or update the path on
line 1 of the Import SPSS section.
Generated with R 4.6.0 — 2026-05-24