Purpose of this document: A step-by-step walkthrough of how to import data from three major statistical software formats — SPSS (.sav), SAS (.sas7bdat), and Stata (.dta) — into R, store everything in a SQLite relational database, and query it back out.
Every section contains notes explaining why we do each step.

1 📦 Step 1 — Install & Load Required Packages

# ─────────────────────────────────────────────────────────────────────────────
# PACKAGE INSTALLATION
# install.packages() downloads a package from CRAN and installs it on your
# machine ONCE. After installation, you only need library() to load it.
#
# We use if (!require(...)) to avoid re-installing packages that are
# already present — this makes the document reproducible and fast.
#
# Packages we need:
#   haven      → reads SPSS (.sav), SAS (.sas7bdat), and Stata (.dta) files
#   foreign    → older alternative for SPSS/SAS (good backup option)
#   DBI        → unified database interface — the "grammar" for all databases
#   RSQLite    → SQLite database driver; works together with DBI
#   dplyr      → data manipulation (filter, select, mutate, summarise…)
#   knitr      → for kable() which prints beautiful HTML tables
#   kableExtra → adds extra styling to kable tables
# ─────────────────────────────────────────────────────────────────────────────

packages <- c("haven", "foreign", "DBI", "RSQLite",
              "dplyr", "knitr", "kableExtra")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
    library(pkg, character.only = TRUE)
  }
}

# ── Confirm everything loaded correctly ──────────────────────────────────────
cat("✅ Packages loaded successfully:\n")

## ✅ Packages loaded successfully:

cat(paste(" •", packages, collapse = "\n"), "\n")

##  • haven
##  • foreign
##  • DBI
##  • RSQLite
##  • dplyr
##  • knitr
##  • kableExtra

📝 Note: haven is the modern, tidyverse-compatible package for reading statistical files. It preserves SPSS/SAS/Stata labels, value labels, and missing value codes — something base R functions cannot do.

2 📂 Step 2 — Import SPSS Data (`.sav`)

# ─────────────────────────────────────────────────────────────────────────────
# IMPORTING AN SPSS FILE
#
# SPSS saves data in a proprietary binary format (.sav).
# haven::read_sav() reads this file and converts it to an R tibble (a
# modern data frame). It also attaches any variable labels and value labels
# that were stored in the SPSS file as R attributes.
#
# IMPORTANT: Replace the path below with the actual path to YOUR .sav file.
# Examples:
#   Windows → "C:/Users/YourName/Documents/data_spss.sav"
#   Mac/Linux → "~/Documents/data_spss.sav"
#   Same folder as this .Rmd → "data_spss.sav"
# ─────────────────────────────────────────────────────────────────────────────

# ── 2a. Read the file ────────────────────────────────────────────────────────
spss_data <- haven::read_sav("data_spss.sav")

# ── 2b. Preview the raw import ───────────────────────────────────────────────
cat("── SPSS Import: Dimensions ──────────────────────────────────────────\n")

## ── SPSS Import: Dimensions ──────────────────────────────────────────

cat("Rows:", nrow(spss_data), "| Columns:", ncol(spss_data), "\n\n")

## Rows: 5 | Columns: 4

cat("── Column names and data types ─────────────────────────────────────\n")

## ── Column names and data types ─────────────────────────────────────

glimpse(spss_data)

## Rows: 5
## Columns: 4
## $ id    <dbl> 29013, 28285, 28924, 26003, 29408
## $ name  <chr> "Joy", "Peter", "Eric", "Claude", "Marry"
## $ age   <dbl> 20, 19, 23, 18, 20
## $ grade <dbl> 14.0, 12.5, 15.8, 17.0, 16.5

# ── 2c. Display as a formatted table ─────────────────────────────────────────
kable(spss_data,
      caption = "Table 1: Raw SPSS Data Import",
      align   = c("r","l","r","r")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE,
                position   = "left") |>
  column_spec(2, bold = TRUE, color = "#2c7fb8")

Table 1: Raw SPSS Data Import
id	name	age	grade
29013	Joy	20	14.0
28285	Peter	19	12.5
28924	Eric	23	15.8
26003	Claude	18	17.0
29408	Marry	20	16.5

# ── 2d. Clean and standardise column types ────────────────────────────────────
# haven imports SPSS labelled vectors — we convert them to plain R types
# using zap_labels(). This makes the data easier to work with downstream.

spss_clean <- spss_data |>
  haven::zap_labels() |>                 # remove SPSS label attributes
  mutate(
    id    = as.integer(id),              # SPSS stores IDs as float → convert
    name  = as.character(name),
    age   = as.integer(age),
    grade = as.numeric(grade),
    source = "SPSS"                      # track where each row came from
  )

cat("✅ SPSS data cleaned. Preview:\n")

## ✅ SPSS data cleaned. Preview:

print(spss_clean)

## # A tibble: 5 × 5
##      id name     age grade source
##   <int> <chr>  <int> <dbl> <chr> 
## 1 29013 Joy       20  14   SPSS  
## 2 28285 Peter     19  12.5 SPSS  
## 3 28924 Eric      23  15.8 SPSS  
## 4 26003 Claude    18  17   SPSS  
## 5 29408 Marry     20  16.5 SPSS

📝 Note on zap_labels(): SPSS often encodes categorical responses as numbers (e.g., 1 = Male, 2 = Female) with value labels. zap_labels() drops the label metadata so R treats the column as a plain number or character. If you want to keep labels as factors, use as_factor() instead.

3 📂 Step 3 — Import SAS Data (`.sas7bdat`)

# ─────────────────────────────────────────────────────────────────────────────
# IMPORTING A SAS FILE
#
# SAS stores data in .sas7bdat binary files. haven::read_sas() reads them
# directly — no SAS licence required!
#
# If you do NOT have a real SAS file, the block below creates a synthetic
# dataset with the same structure as the SPSS file so the rest of the
# document runs without errors. Remove the `if` block once you have a
# real .sas7bdat file.
#
# NOTE: SAS files can also carry format catalogs (.sas7bcat). If you have
# one, pass it via: read_sas("data.sas7bdat", catalog_file = "formats.sas7bcat")
# ─────────────────────────────────────────────────────────────────────────────

sas_file_path <- "data_sas.sas7bdat"   # ← change to your actual path

if (file.exists(sas_file_path)) {

  # Real SAS file present — read it
  sas_data <- haven::read_sas(sas_file_path)
  cat("✅ Real SAS file loaded from:", sas_file_path, "\n")

} else {

  # No SAS file found — create a synthetic demo dataset
  # (mirrors the columns in the SPSS data: id, name, age, grade)
  cat("ℹ️  No SAS file found at '", sas_file_path,
      "'. Creating a synthetic SAS-like demo dataset.\n", sep = "")

  sas_data <- tibble::tibble(
    id    = c(30001L, 30002L, 30003L, 30004L),
    name  = c("Alice", "Bob", "Diana", "Frank"),
    age   = c(22L, 25L, 21L, 24L),
    grade = c(15.0, 13.5, 17.2, 11.8)
  )
}

## ℹ️  No SAS file found at 'data_sas.sas7bdat'. Creating a synthetic SAS-like demo dataset.

# ── Clean and add source tag ──────────────────────────────────────────────────
sas_clean <- sas_data |>
  haven::zap_labels() |>
  mutate(
    id     = as.integer(id),
    name   = as.character(name),
    age    = as.integer(age),
    grade  = as.numeric(grade),
    source = "SAS"
  )

cat("\n── SAS Data Preview ─────────────────────────────────────────────────\n")

## 
## ── SAS Data Preview ─────────────────────────────────────────────────

print(sas_clean)

## # A tibble: 4 × 5
##      id name    age grade source
##   <int> <chr> <int> <dbl> <chr> 
## 1 30001 Alice    22  15   SAS   
## 2 30002 Bob      25  13.5 SAS   
## 3 30003 Diana    21  17.2 SAS   
## 4 30004 Frank    24  11.8 SAS

kable(sas_clean,
      caption = "Table 2: SAS Data (real file or demo)",
      align   = c("r","l","r","r","l")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE) |>
  column_spec(2, bold = TRUE, color = "#31a354")

Table 2: SAS Data (real file or demo)
id	name	age	grade	source
30001	Alice	22	15.0	SAS
30002	Bob	25	13.5	SAS
30003	Diana	21	17.2	SAS
30004	Frank	24	11.8	SAS

📝 Note: haven::read_sas() also reads SAS Transport files (.xpt). These are used by the FDA and CDC for public data releases. Use
haven::read_xpt("file.xpt") for those.

4 📂 Step 4 — Import Stata Data (`.dta`)

# ─────────────────────────────────────────────────────────────────────────────
# IMPORTING A STATA FILE
#
# Stata saves data as .dta files. haven::read_dta() supports Stata 8–18.
# Stata files often contain:
#   - Variable labels (long descriptions of each column)
#   - Value labels   (text labels for numeric codes)
#   - Notes          (comments attached to variables)
# haven preserves all of these as R attributes.
#
# Same as SAS: a synthetic dataset is created if no real file is found.
# ─────────────────────────────────────────────────────────────────────────────

stata_file_path <- "data_stata.dta"    # ← change to your actual path

if (file.exists(stata_file_path)) {

  stata_data <- haven::read_dta(stata_file_path)
  cat("✅ Real Stata file loaded from:", stata_file_path, "\n")

} else {

  cat("ℹ️  No Stata file found at '", stata_file_path,
      "'. Creating a synthetic Stata-like demo dataset.\n", sep = "")

  stata_data <- tibble::tibble(
    id    = c(40001L, 40002L, 40003L),
    name  = c("Grace", "Henry", "Iris"),
    age   = c(20L, 27L, 23L),
    grade = c(16.0, 14.5, 18.0)
  )
}

## ℹ️  No Stata file found at 'data_stata.dta'. Creating a synthetic Stata-like demo dataset.

# ── Clean and add source tag ──────────────────────────────────────────────────
stata_clean <- stata_data |>
  haven::zap_labels() |>
  mutate(
    id     = as.integer(id),
    name   = as.character(name),
    age    = as.integer(age),
    grade  = as.numeric(grade),
    source = "Stata"
  )

cat("\n── Stata Data Preview ───────────────────────────────────────────────\n")

## 
## ── Stata Data Preview ───────────────────────────────────────────────

print(stata_clean)

## # A tibble: 3 × 5
##      id name    age grade source
##   <int> <chr> <int> <dbl> <chr> 
## 1 40001 Grace    20  16   Stata 
## 2 40002 Henry    27  14.5 Stata 
## 3 40003 Iris     23  18   Stata

kable(stata_clean,
      caption = "Table 3: Stata Data (real file or demo)",
      align   = c("r","l","r","r","l")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE) |>
  column_spec(2, bold = TRUE, color = "#d95f0e")

Table 3: Stata Data (real file or demo)
id	name	age	grade	source
40001	Grace	20	16.0	Stata
40002	Henry	27	14.5	Stata
40003	Iris	23	18.0	Stata

📝 Note on Stata versions: Stata 13+ uses a different file encoding than older versions. haven handles all modern versions automatically.
For very old Stata 5–7 files, the foreign package’s read.dta() is a reliable fallback.

5 🔗 Step 5 — Combine All Sources into One Data Frame

# ─────────────────────────────────────────────────────────────────────────────
# COMBINING ALL THREE DATASETS
#
# dplyr::bind_rows() stacks data frames vertically (like SQL UNION ALL).
# It matches columns by NAME, not position — so column order doesn't matter.
# Missing columns in any data frame are filled with NA automatically.
#
# The 'source' column we added earlier lets us trace every record back to
# its original file after the data is combined.
# ─────────────────────────────────────────────────────────────────────────────

all_data <- dplyr::bind_rows(spss_clean, sas_clean, stata_clean)

cat("── Combined Dataset ─────────────────────────────────────────────────\n")

## ── Combined Dataset ─────────────────────────────────────────────────

cat("Total rows:", nrow(all_data),
    "| Sources:", paste(unique(all_data$source), collapse = ", "), "\n\n")

## Total rows: 12 | Sources: SPSS, SAS, Stata

print(all_data)

## # A tibble: 12 × 5
##       id name     age grade source
##    <int> <chr>  <int> <dbl> <chr> 
##  1 29013 Joy       20  14   SPSS  
##  2 28285 Peter     19  12.5 SPSS  
##  3 28924 Eric      23  15.8 SPSS  
##  4 26003 Claude    18  17   SPSS  
##  5 29408 Marry     20  16.5 SPSS  
##  6 30001 Alice     22  15   SAS   
##  7 30002 Bob       25  13.5 SAS   
##  8 30003 Diana     21  17.2 SAS   
##  9 30004 Frank     24  11.8 SAS   
## 10 40001 Grace     20  16   Stata 
## 11 40002 Henry     27  14.5 Stata 
## 12 40003 Iris      23  18   Stata

kable(all_data,
      caption = "Table 4: All Sources Combined",
      align   = c("r","l","r","r","l")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),
                full_width = FALSE) |>
  row_spec(1:nrow(spss_clean), background = "#eaf4fb") |>   # SPSS rows → blue
  row_spec((nrow(spss_clean)+1):(nrow(spss_clean)+nrow(sas_clean)),
           background = "#eafbea") |>                        # SAS rows  → green
  row_spec((nrow(spss_clean)+nrow(sas_clean)+1):nrow(all_data),
           background = "#fdf3e7")                           # Stata rows → orange

Table 4: All Sources Combined
id	name	age	grade	source
29013	Joy	20	14.0	SPSS
28285	Peter	19	12.5	SPSS
28924	Eric	23	15.8	SPSS
26003	Claude	18	17.0	SPSS
29408	Marry	20	16.5	SPSS
30001	Alice	22	15.0	SAS
30002	Bob	25	13.5	SAS
30003	Diana	21	17.2	SAS
30004	Frank	24	11.8	SAS
40001	Grace	20	16.0	Stata
40002	Henry	27	14.5	Stata
40003	Iris	23	18.0	Stata

6 🗄️ Step 6 — Create & Connect to a SQLite Database

# ─────────────────────────────────────────────────────────────────────────────
# WHAT IS SQLite?
# SQLite is a self-contained, serverless SQL database engine. The entire
# database is stored in a single .sqlite file on disk. It is perfect for:
#   ✔ Learning SQL
#   ✔ Small-to-medium datasets
#   ✔ Reproducible research (one file = one database)
#   ✔ Sharing data with collaborators
#
# WHAT IS DBI?
# DBI (DataBase Interface) is R's standard way to talk to databases.
# It provides a common set of functions regardless of which database you use:
#   dbConnect()      → open a connection
#   dbWriteTable()   → write an R data frame to a database table
#   dbGetQuery()     → run SQL and return results as a data frame
#   dbListTables()   → list all tables
#   dbDisconnect()   → close the connection (important — always do this!)
# ─────────────────────────────────────────────────────────────────────────────

# ── 6a. Create / open the database file ──────────────────────────────────────
db_path <- "students_database.sqlite"

con <- DBI::dbConnect(
  drv    = RSQLite::SQLite(),    # which database engine to use
  dbname = db_path               # file path; use ":memory:" for RAM-only DB
)

cat("✅ Database connected:", db_path, "\n")

## ✅ Database connected: students_database.sqlite

cat("   Connection class:", class(con), "\n")

##    Connection class: SQLiteConnection

📝 Note: Using ":memory:" instead of a file path creates an in-memory database that disappears when R closes. Use a file path (like above) when you want the data to persist between R sessions.

7 📤 Step 7 — Write Data to the Database

# ─────────────────────────────────────────────────────────────────────────────
# WRITING R DATA FRAMES TO SQL TABLES
#
# dbWriteTable() creates a new table (or replaces it with overwrite = TRUE)
# and inserts all rows from the R data frame.
#
# We create THREE separate tables — one per source — plus ONE combined table.
# This mirrors how a real research database is often structured:
#   raw_spss   → original SPSS records
#   raw_sas    → original SAS records
#   raw_stata  → original Stata records
#   all_students → combined view of all three
# ─────────────────────────────────────────────────────────────────────────────

DBI::dbWriteTable(con, "raw_spss",      spss_clean,  overwrite = TRUE)
DBI::dbWriteTable(con, "raw_sas",       sas_clean,   overwrite = TRUE)
DBI::dbWriteTable(con, "raw_stata",     stata_clean, overwrite = TRUE)
DBI::dbWriteTable(con, "all_students",  all_data,    overwrite = TRUE)

# ── Confirm tables were created ───────────────────────────────────────────────
tables_in_db <- DBI::dbListTables(con)
cat("✅ Tables now in database:\n")

## ✅ Tables now in database:

cat(paste(" •", tables_in_db, collapse = "\n"), "\n")

##  • all_students
##  • raw_sas
##  • raw_spss
##  • raw_stata

# ── Verify row counts match what we loaded ───────────────────────────────────
for (tbl in tables_in_db) {
  n <- DBI::dbGetQuery(con, paste("SELECT COUNT(*) AS n FROM", tbl))$n
  cat(sprintf("   %-20s  %d rows\n", tbl, n))
}

##    all_students          12 rows
##    raw_sas               4 rows
##    raw_spss              5 rows
##    raw_stata             3 rows

8 🔍 Step 8 — Query the Database with SQL

# ─────────────────────────────────────────────────────────────────────────────
# QUERYING WITH SQL
#
# dbGetQuery(connection, sql_string) sends a SQL SELECT statement to SQLite
# and returns the result as an R data frame — ready to use immediately.
#
# SQL cheatsheet used below:
#   SELECT    → choose columns
#   FROM      → specify the table
#   WHERE     → filter rows (like dplyr::filter)
#   ORDER BY  → sort rows
#   GROUP BY  → aggregate rows
#   AVG()     → average
#   COUNT()   → count rows
#   ROUND()   → round a number to N decimal places
# ─────────────────────────────────────────────────────────────────────────────

# ── Query 1: All students, sorted by grade descending ────────────────────────
cat("── Query 1: All students ordered by grade ───────────────────────────\n")

## ── Query 1: All students ordered by grade ───────────────────────────

q1 <- DBI::dbGetQuery(con,
  "SELECT id, name, age, grade, source
   FROM   all_students
   ORDER  BY grade DESC"
)

kable(q1, caption = "Query 1 — All Students by Grade (highest first)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Query 1 — All Students by Grade (highest first)
id	name	age	grade	source
40003	Iris	23	18.0	Stata
30003	Diana	21	17.2	SAS
26003	Claude	18	17.0	SPSS
29408	Marry	20	16.5	SPSS
40001	Grace	20	16.0	Stata
28924	Eric	23	15.8	SPSS
30001	Alice	22	15.0	SAS
40002	Henry	27	14.5	Stata
29013	Joy	20	14.0	SPSS
30002	Bob	25	13.5	SAS
28285	Peter	19	12.5	SPSS
30004	Frank	24	11.8	SAS

# ── Query 2: Students with grade ≥ 15 ────────────────────────────────────────
cat("── Query 2: High-performing students (grade ≥ 15) ───────────────────\n")

## ── Query 2: High-performing students (grade ≥ 15) ───────────────────

q2 <- DBI::dbGetQuery(con,
  "SELECT id, name, age, grade, source
   FROM   all_students
   WHERE  grade >= 15
   ORDER  BY grade DESC"
)

kable(q2, caption = "Query 2 — Students with Grade ≥ 15") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Query 2 — Students with Grade ≥ 15
id	name	age	grade	source
40003	Iris	23	18.0	Stata
30003	Diana	21	17.2	SAS
26003	Claude	18	17.0	SPSS
29408	Marry	20	16.5	SPSS
40001	Grace	20	16.0	Stata
28924	Eric	23	15.8	SPSS
30001	Alice	22	15.0	SAS

# ── Query 3: Summary statistics per source ───────────────────────────────────
cat("── Query 3: Summary by data source ─────────────────────────────────\n")

## ── Query 3: Summary by data source ─────────────────────────────────

q3 <- DBI::dbGetQuery(con,
  "SELECT   source,
            COUNT(*)         AS total_students,
            ROUND(AVG(age),  1) AS avg_age,
            ROUND(AVG(grade),2) AS avg_grade,
            ROUND(MIN(grade),2) AS min_grade,
            ROUND(MAX(grade),2) AS max_grade
   FROM     all_students
   GROUP BY source
   ORDER BY source"
)

kable(q3, caption = "Query 3 — Summary Statistics by Source File") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Query 3 — Summary Statistics by Source File
source	total_students	avg_age	avg_grade	min_grade	max_grade
SAS	4	23.0	14.38	11.8	17.2
SPSS	5	20.0	15.16	12.5	17.0
Stata	3	23.3	16.17	14.5	18.0

# ── Query 4: Only the SPSS table ─────────────────────────────────────────────
cat("── Query 4: Reading directly from the raw_spss table ───────────────\n")

## ── Query 4: Reading directly from the raw_spss table ───────────────

q4 <- DBI::dbGetQuery(con,
  "SELECT * FROM raw_spss ORDER BY id"
)

kable(q4, caption = "Query 4 — Raw SPSS Table") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Query 4 — Raw SPSS Table
id	name	age	grade	source
26003	Claude	18	17.0	SPSS
28285	Peter	19	12.5	SPSS
28924	Eric	23	15.8	SPSS
29013	Joy	20	14.0	SPSS
29408	Marry	20	16.5	SPSS

9 🛠️ Step 9 — Using dplyr with the Database (dbplyr approach)

# ─────────────────────────────────────────────────────────────────────────────
# DPLYR + DATABASE (lazy evaluation)
#
# tbl(connection, "table_name") creates a "lazy" reference to a database table.
# You write normal dplyr code, but dplyr translates it to SQL behind the scenes
# and only fetches data when you call collect().
#
# This is powerful for LARGE tables — the filtering happens IN the database,
# not in R memory. Only the result is transferred to R.
# ─────────────────────────────────────────────────────────────────────────────

students_tbl <- dplyr::tbl(con, "all_students")

# Build a query with dplyr verbs (nothing runs yet — it's lazy)
dplyr_result <- students_tbl |>
  dplyr::filter(age >= 20) |>
  dplyr::select(name, age, grade, source) |>
  dplyr::arrange(dplyr::desc(grade))

# See the SQL that dplyr generated
cat("── SQL generated by dplyr ───────────────────────────────────────────\n")

## ── SQL generated by dplyr ───────────────────────────────────────────

dplyr::show_query(dplyr_result)

## <SQL>
## SELECT `name`, `age`, `grade`, `source`
## FROM `all_students`
## WHERE (`age` >= 20.0)
## ORDER BY `grade` DESC

# Now actually fetch the data into R
cat("\n── Collected result ─────────────────────────────────────────────────\n")

## 
## ── Collected result ─────────────────────────────────────────────────

result_df <- dplyr::collect(dplyr_result)
print(result_df)

## # A tibble: 10 × 4
##    name    age grade source
##    <chr> <int> <dbl> <chr> 
##  1 Iris     23  18   Stata 
##  2 Diana    21  17.2 SAS   
##  3 Marry    20  16.5 SPSS  
##  4 Grace    20  16   Stata 
##  5 Eric     23  15.8 SPSS  
##  6 Alice    22  15   SAS   
##  7 Henry    27  14.5 Stata 
##  8 Joy      20  14   SPSS  
##  9 Bob      25  13.5 SAS   
## 10 Frank    24  11.8 SAS

kable(result_df,
      caption = "dplyr Query: Students aged 20+ sorted by grade") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

dplyr Query: Students aged 20+ sorted by grade
name	age	grade	source
Iris	23	18.0	Stata
Diana	21	17.2	SAS
Marry	20	16.5	SPSS
Grace	20	16.0	Stata
Eric	23	15.8	SPSS
Alice	22	15.0	SAS
Henry	27	14.5	Stata
Joy	20	14.0	SPSS
Bob	25	13.5	SAS
Frank	24	11.8	SAS

📝 Note: The lazy/collect pattern is the key difference between
dplyr on an R data frame (executes immediately) vs dplyr on a database
(executes only when collect() is called). Always call collect() when
you are done filtering and want to work with the result in R.

10 📊 Step 10 — Simple Visualisation

# ─────────────────────────────────────────────────────────────────────────────
# We query the database one more time and plot the results.
# This shows the full pipeline: Database → R data frame → plot.
# ─────────────────────────────────────────────────────────────────────────────

plot_data <- DBI::dbGetQuery(con,
  "SELECT name, grade, source FROM all_students ORDER BY grade DESC"
)

# Colour palette — one colour per source
source_colours <- c(SPSS = "#2c7fb8", SAS = "#31a354", Stata = "#d95f0e")

barplot(
  height = plot_data$grade,
  names.arg = plot_data$name,
  col  = source_colours[plot_data$source],
  main = "Student Grades by Source File",
  xlab = "Student Name",
  ylab = "Grade",
  ylim = c(0, 22),
  las  = 1,
  border = "white"
)

# Add value labels on top of each bar
text(
  x      = seq(0.7, by = 1.2, length.out = nrow(plot_data)),
  y      = plot_data$grade + 0.5,
  labels = plot_data$grade,
  cex    = 0.85
)

# Legend
legend("topright",
       legend = names(source_colours),
       fill   = source_colours,
       border = "white",
       bty    = "n",
       title  = "Source")

11 🔎 Step 11 — Inspect Database Schema

# ─────────────────────────────────────────────────────────────────────────────
# dbGetQuery with PRAGMA is an SQLite-specific command to inspect the
# structure (column names and types) of any table — equivalent to DESCRIBE
# in MySQL or \d in PostgreSQL.
# ─────────────────────────────────────────────────────────────────────────────

cat("── Schema of 'all_students' table ───────────────────────────────────\n")

## ── Schema of 'all_students' table ───────────────────────────────────

schema <- DBI::dbGetQuery(con, "PRAGMA table_info(all_students)")
kable(schema[, c("cid","name","type","notnull","dflt_value","pk")],
      caption = "Table Schema: all_students",
      col.names = c("Col #", "Name", "Type", "Not NULL", "Default", "PK")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table Schema: all_students
Col #	Name	Type	Default
0	id	INTEGER	NA
1	name	TEXT	NA
2	age	INTEGER	NA
3	grade	REAL	NA
4	source	TEXT	NA

12 💾 Step 12 — Export Results Back to Flat Files

# ─────────────────────────────────────────────────────────────────────────────
# EXPORTING FROM DATABASE → FILES
#
# You can export any query result back to a CSV, Excel, or other format.
# write.csv() is the simplest. row.names = FALSE prevents R from writing
# an extra row-number column that you almost never want.
# ─────────────────────────────────────────────────────────────────────────────

final_export <- DBI::dbGetQuery(con,
  "SELECT * FROM all_students ORDER BY source, id"
)

write.csv(final_export, "all_students_export.csv", row.names = FALSE)
cat("✅ Exported to: all_students_export.csv\n")

## ✅ Exported to: all_students_export.csv

# Optionally save the high-grade students only
top_students <- DBI::dbGetQuery(con,
  "SELECT * FROM all_students WHERE grade >= 15 ORDER BY grade DESC"
)
write.csv(top_students, "top_students_export.csv", row.names = FALSE)
cat("✅ Exported to: top_students_export.csv (", nrow(top_students), "rows)\n")

## ✅ Exported to: top_students_export.csv ( 7 rows)

13 🔒 Step 13 — Close the Database Connection

# ─────────────────────────────────────────────────────────────────────────────
# ALWAYS disconnect from a database when you are done.
#
# Leaving connections open can:
#   • Lock the database file (preventing other processes from writing)
#   • Cause memory leaks in long-running R sessions
#   • Produce warnings when R session ends
#
# DBI::dbDisconnect() cleanly closes the connection.
# ─────────────────────────────────────────────────────────────────────────────

DBI::dbDisconnect(con)
cat("✅ Database connection closed.\n")

## ✅ Database connection closed.

cat("   The database file '", "students_database.sqlite",
    "' is still on disk and can be reopened anytime.\n", sep = "")

##    The database file 'students_database.sqlite' is still on disk and can be reopened anytime.

14 📋 Step 14 — Session Information

# ─────────────────────────────────────────────────────────────────────────────
# sessionInfo() prints the exact versions of R and every loaded package.
# This is ESSENTIAL for reproducibility — it lets anyone recreate the exact
# environment you used when running this document.
# Always include it at the end of a published document.
# ─────────────────────────────────────────────────────────────────────────────

sessionInfo()

## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Rwanda.utf8  LC_CTYPE=English_Rwanda.utf8   
## [3] LC_MONETARY=English_Rwanda.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Rwanda.utf8    
## 
## time zone: Africa/Kigali
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51       dplyr_1.2.1      RSQLite_3.52.0  
## [5] DBI_1.3.0        foreign_0.8-91   haven_2.5.5     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        utf8_1.2.6         generics_0.1.4     xml2_1.5.2        
##  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.39      magrittr_2.0.5    
##  [9] evaluate_1.0.5     RColorBrewer_1.1-3 fastmap_1.2.0      blob_1.3.0        
## [13] jsonlite_2.0.0     purrr_1.2.2        viridisLite_0.4.3  scales_1.4.0      
## [17] textshaping_1.0.5  jquerylib_0.1.4    cli_3.6.6          rlang_1.2.0       
## [21] dbplyr_2.5.2       bit64_4.8.2        withr_3.0.2        cachem_1.1.0      
## [25] yaml_2.3.12        tools_4.6.0        tzdb_0.5.0         memoise_2.0.1     
## [29] forcats_1.0.1      vctrs_0.7.3        R6_2.6.1           lifecycle_1.0.5   
## [33] stringr_1.6.0      bit_4.6.0          pkgconfig_2.0.3    pillar_1.11.1     
## [37] bslib_0.11.0       glue_1.8.1         systemfonts_1.3.2  xfun_0.57         
## [41] tibble_3.3.1       tidyselect_1.2.1   rstudioapi_0.18.0  farver_2.1.2      
## [45] htmltools_0.5.9    rmarkdown_2.31     svglite_2.2.2      readr_2.2.0       
## [49] compiler_4.6.0

15 📖 Quick Reference — Package Summary

Package	What it does	Key functions
`haven`	Read SPSS, SAS, Stata files	`read_sav()`, `read_sas()`, `read_dta()`, `zap_labels()`
`foreign`	Older SPSS/SAS reader	`read.spss()`, `read.ssd()`
`DBI`	Database interface standard	`dbConnect()`, `dbGetQuery()`, `dbWriteTable()`, `dbDisconnect()`
`RSQLite`	SQLite driver for DBI	`SQLite()` (used inside `dbConnect`)
`dplyr`	Data manipulation + lazy DB queries	`filter()`, `select()`, `tbl()`, `collect()`
`knitr` / `kableExtra`	Pretty HTML tables	`kable()`, `kable_styling()`

16 🚀 How to Publish to RPubs

Follow these steps inside RStudio:

Open this .Rmd file in RStudio
Click Knit → Knit to HTML
After knitting, click the Publish button (top-right of the preview)
Choose RPubs
Sign in or create a free account at rpubs.com
Fill in the title and click Publish

📝 Note: Make sure your SPSS file (data_spss.sav) is in the same folder as this .Rmd file before knitting, or update the path on line 1 of the Import SPSS section.

Generated with R 4.6.0 — 2026-05-24

📊 Importing Statistical Data Files & Building a SQLite Database in R

Your Name

2026-05-24

1 📦 Step 1 — Install & Load Required Packages

2 📂 Step 2 — Import SPSS Data (`.sav`)

3 📂 Step 3 — Import SAS Data (`.sas7bdat`)

4 📂 Step 4 — Import Stata Data (`.dta`)

5 🔗 Step 5 — Combine All Sources into One Data Frame

6 🗄️ Step 6 — Create & Connect to a SQLite Database

7 📤 Step 7 — Write Data to the Database

8 🔍 Step 8 — Query the Database with SQL

9 🛠️ Step 9 — Using dplyr with the Database (dbplyr approach)

10 📊 Step 10 — Simple Visualisation

11 🔎 Step 11 — Inspect Database Schema

12 💾 Step 12 — Export Results Back to Flat Files

13 🔒 Step 13 — Close the Database Connection

14 📋 Step 14 — Session Information

15 📖 Quick Reference — Package Summary

16 🚀 How to Publish to RPubs

📊 Importing Statistical Data Files & Building a SQLite Database in R

Your Name

2026-05-24

1 📦 Step 1 — Install & Load Required Packages

2 📂 Step 2 — Import SPSS Data (.sav)

3 📂 Step 3 — Import SAS Data (.sas7bdat)

4 📂 Step 4 — Import Stata Data (.dta)

5 🔗 Step 5 — Combine All Sources into One Data Frame

6 🗄️ Step 6 — Create & Connect to a SQLite Database

7 📤 Step 7 — Write Data to the Database

8 🔍 Step 8 — Query the Database with SQL

9 🛠️ Step 9 — Using dplyr with the Database (dbplyr approach)

10 📊 Step 10 — Simple Visualisation

11 🔎 Step 11 — Inspect Database Schema

12 💾 Step 12 — Export Results Back to Flat Files

13 🔒 Step 13 — Close the Database Connection

14 📋 Step 14 — Session Information

15 📖 Quick Reference — Package Summary

16 🚀 How to Publish to RPubs

2 📂 Step 2 — Import SPSS Data (`.sav`)

3 📂 Step 3 — Import SAS Data (`.sas7bdat`)

4 📂 Step 4 — Import Stata Data (`.dta`)