Data Importation Masterclass: From Files to R Dataframes

A comprehensive guide on preparing and importing CSV, TXT, XLSX, and SPSS files into R using the Tidyverse ecosystem.

Author

Abdullah Al Shamim

1 The Philosophy of Machine-Readable Data

Before you start, I recommend you to watch this tutorial first: click here

Note

Before writing a single line of code, your data must be “clean.” R expects a Tidy Structure: - Each Variable is a column. - Each Observation is a row. - Each Value is a single cell.

File Format Best Practices

1. CSV (The Gold Standard)

Encoding: Always save as UTF-8 to prevent gibberish characters.

Quotes: Ensure text containing commas (e.g., “Dhaka, Bangladesh”) is wrapped in double quotes.

Consistency: Every row must have the same number of delimiters.

2. XLSX (Microsoft Excel)

One Sheet, One Table: Avoid storing multiple unrelated tables on a single sheet.

No Merged Cells: R will only read the value in the first cell and mark the rest as NA.

No Color Coding: R does not natively recognize “cell fill” as a data category. Create a status column instead.

3. TXT (Plain Text/Tab-Delimited)

Delimiters: Use Tabs ( rather than spaces to avoid misinterpreting multi-word strings (e.g., “New York”).

Metadata: Start any non-data notes with a # so R can skip them.

4. SPSS (.sav)

Labels: R can extract variable and value labels (e.g., 1 = Male) using the haven package.

Naming: Keep variable names under 32 characters to avoid truncation.

Summary Checklist for Data Prep

Feature Best Practice Why?
First Row Column Names (Headers) R uses these as variable names.
Missing Data Leave empty or use NA Prevents R from guessing the wrong data type.
Data Types One type per column A single “Text” in a “Numeric” column breaks calculations.
Dates YYYY-MM-DD ISO standard; prevents Day/Month confusion.
Special Chars Avoid @, #, $, %, ^ These are often reserved for code logic.

Implementation: Importing Data into R

To import data efficiently, we use the readr, readxl, and haven packages. These are significantly faster and more consistent than R’s “base” functions.

The readr package handles these with high performance.

Code
# Install and load the package
if(!require(readr)) install.packages("readr")
library(readr)

# 1. Import CSV
data_csv <- read_csv("friends.csv", 
                     na = c("", "NA", "Unknown"), 
                     show_col_types = FALSE)

The readxl package is the industry standard and does not require Java.

Code
if(!require(readxl)) install.packages("readxl")
library(readxl)

# Basic import (Defaults to first sheet)
data_xlsx <- read_excel("friends.xlsx", sheet = 1)

# Advanced: Import specific range and sheet
data_subset <- read_excel("friends.xlsx", 
                          sheet = 1, 
                          range = "A1:E50")

The readr package handles these with high performance.

Code
# Install and load the package
if(!require(readr)) install.packages("readr")
library(readr)

# 2. Import Tab-Delimited (.txt)
data_txt <- read_tsv("friends.txt")

# 3. Custom Delimiter (e.g., a pipe '|')
data_custom <- read_delim("friends.txt", delim = "|")

The haven package preserves the rich metadata found in social science data.

Code
if(!require(haven)) install.packages("haven")
library(haven)

# Import SPSS file
data_spss <- read_sav("friends.sav")

# Convert SPSS value labels into R factors
library(dplyr)
data_spss <- data_spss %>% as_factor()

Advanced Import Pro-Tips

Imported names like First Name (2023) are hard to code. Use the janitor package to clean them instantly:

Code
if(!require(janitor)) install.packages("janitor")
data_clean <- data_csv %>% clean_names()

# Result: "first_name_2023"

If your file is larger than 500MB, read_csv might be slow. Use data.table for maximum speed. Suppose this friend.csv file is larger than 500MB:

Code
if(!require(data.table)) install.packages("data.table")
data_huge <- fread("friends.csv")

Never use absolute paths (e.g., C:/Users/Downloads/…). Instead setup Working Directory:

  • Go to File > New Project.
  • Store your data in that project folder.
  • Use relative paths: read_csv(“data.csv”). This makes your code reproducible for others.

Courses that contain short and easy to digest video content are available at premieranalytics.com.bd Each lessons uses data that is built into R or comes with installed packages so you can replicated the work at home. premieranalytics.com.bd also includes teaching on statistics and research methods.