1 The Philosophy of Machine-Readable Data
Before you start, I recommend you to watch this tutorial first: click here
Before writing a single line of code, your data must be “clean.” R expects a Tidy Structure: - Each Variable is a column. - Each Observation is a row. - Each Value is a single cell.
File Format Best Practices
1. CSV (The Gold Standard)
Encoding: Always save as UTF-8 to prevent gibberish characters.
Quotes: Ensure text containing commas (e.g., “Dhaka, Bangladesh”) is wrapped in double quotes.
Consistency: Every row must have the same number of delimiters.
2. XLSX (Microsoft Excel)
One Sheet, One Table: Avoid storing multiple unrelated tables on a single sheet.
No Merged Cells: R will only read the value in the first cell and mark the rest as NA.
No Color Coding: R does not natively recognize “cell fill” as a data category. Create a status column instead.
3. TXT (Plain Text/Tab-Delimited)
Delimiters: Use Tabs ( rather than spaces to avoid misinterpreting multi-word strings (e.g., “New York”).
Metadata: Start any non-data notes with a # so R can skip them.
4. SPSS (.sav)
Labels: R can extract variable and value labels (e.g., 1 = Male) using the haven package.
Naming: Keep variable names under 32 characters to avoid truncation.
Summary Checklist for Data Prep
| Feature | Best Practice | Why? |
| First Row | Column Names (Headers) | R uses these as variable names. |
| Missing Data | Leave empty or use NA |
Prevents R from guessing the wrong data type. |
| Data Types | One type per column | A single “Text” in a “Numeric” column breaks calculations. |
| Dates | YYYY-MM-DD |
ISO standard; prevents Day/Month confusion. |
| Special Chars | Avoid @, #, $, %, ^ |
These are often reserved for code logic. |
Implementation: Importing Data into R
To import data efficiently, we use the readr, readxl, and haven packages. These are significantly faster and more consistent than R’s “base” functions.
The readr package handles these with high performance.
The readxl package is the industry standard and does not require Java.
The readr package handles these with high performance.
Advanced Import Pro-Tips
Imported names like First Name (2023) are hard to code. Use the janitor package to clean them instantly:
If your file is larger than 500MB, read_csv might be slow. Use data.table for maximum speed. Suppose this friend.csv file is larger than 500MB:
Never use absolute paths (e.g., C:/Users/Downloads/…). Instead setup Working Directory:
- Go to File > New Project.
- Store your data in that project folder.
- Use relative paths: read_csv(“data.csv”). This makes your code reproducible for others.
Courses that contain short and easy to digest video content are available at premieranalytics.com.bd Each lessons uses data that is built into R or comes with installed packages so you can replicated the work at home. premieranalytics.com.bd also includes teaching on statistics and research methods.