In R, every column has a Class (Data Type). If the class is wrong (e.g., a “Size” category being treated as plain text), your summaries and plots can be misleading. We will use the friends.csv dataset to demonstrate how to check and change column classes.
Phase 1: Data Acquisition and Initial Inquiry
The first step is bringing the data into R and performing a “Structural Audit.”
We use read.csv() to fetch data from a URL.
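The URL of friends.csv is not shown in this text, so the sketch below is only a stand-in: it writes a few rows (values taken from the output printed later in this lesson) to a temporary file and reads them back. The name csv_path is an assumption made for illustration. read.csv() accepts a URL string in the same position as the file path.

```r
# Stand-in for the real friends.csv: a few rows written to a temporary file.
# csv_path is a hypothetical name; the real lesson reads from a URL instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Age,Height,Size",
  "34,1.77,Medium",
  "33,1.76,Medium",
  "40,1.87,Tall",
  "36,1.57,Short"
), csv_path)

# read.csv() takes a file path or a URL string as its first argument.
# stringsAsFactors = FALSE keeps text columns as character in every R version.
friends <- read.csv(csv_path, stringsAsFactors = FALSE)
head(friends)
```

To read the real file, replace csv_path with the dataset's URL in quotes.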
Before changing anything, we must see how R automatically interpreted the columns.
'data.frame': 10 obs. of 6 variables:
$ Name : chr "Rahim" "Karim" "Badsha" "Rafiq" ...
$ Sex : chr "Male" "Male" "Male" "Male" ...
$ Eye.color: chr "Brown" "Blue" "Black" "Brown" ...
$ Age : int 34 33 40 36 43 44 52 29 43 51
$ Height : num 1.77 1.76 1.87 1.57 1.98 1.66 1.65 1.99 1.55 1.88
$ Size : chr "Medium" "Medium" "Tall" "Short" ...
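The listing above is the kind of output str() prints. Here is a minimal sketch on a stand-in data frame that rebuilds only the Age, Height, and Size columns from that listing (Name, Sex, and Eye.color are left out because their values are truncated in the output).

```r
# Stand-in data frame: values copied from the str() output shown in this lesson.
friends <- data.frame(
  Age    = c(34L, 33L, 40L, 36L, 43L, 44L, 52L, 29L, 43L, 51L),
  Height = c(1.77, 1.76, 1.87, 1.57, 1.98, 1.66, 1.65, 1.99, 1.55, 1.88),
  Size   = c("Medium", "Medium", "Tall", "Short", "Tall",
             "Short", "Short", "Tall", "Short", "Tall"),
  stringsAsFactors = FALSE
)

# The structural audit: one line per column with its class and first values.
str(friends)

# The same class information as a named character vector.
sapply(friends, class)
```

Here Size is reported as chr, which is the problem the next phase fixes.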
Next, summary() gives a systematic look at the distribution, means, and missing values of each column.
Name Sex Eye.color Age
Length:10 Length:10 Length:10 Min. :29.00
Class :character Class :character Class :character 1st Qu.:34.50
Mode :character Mode :character Mode :character Median :41.50
Mean :40.50
3rd Qu.:43.75
Max. :52.00
Height Size
Min. :1.550 Length:10
1st Qu.:1.653 Class :character
Median :1.765 Mode :character
Mean :1.768
3rd Qu.:1.877
Max. :1.990
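The summary above can be reproduced with summary(). The sketch below uses a stand-in data frame holding only Age and Height (values copied from the output in this lesson) and adds an explicit missing-value count, which is an extra check and not part of the original output.

```r
# Stand-in data frame with the two numeric columns from the lesson's output.
friends <- data.frame(
  Age    = c(34L, 33L, 40L, 36L, 43L, 44L, 52L, 29L, 43L, 51L),
  Height = c(1.77, 1.76, 1.87, 1.57, 1.98, 1.66, 1.65, 1.99, 1.55, 1.88)
)

# Min, quartiles, median, mean, and max for each numeric column.
summary(friends)

# Count missing values per column; summary() only lists NA's when some exist.
colSums(is.na(friends))

# The mean reported for Age in the summary.
mean(friends$Age)   # 40.5
```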
Phase 2: Variable Type Transformation (Refining the Data)
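Often, R imports categories as Character (text) and whole numbers as Numeric (decimals). Converting such columns to Factor and Integer helps grouping, plotting, and modelling behave as intended, and can reduce memory use. In this dataset Age was already read as Integer, so the main column to fix is Size.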
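Age is already an integer in this dataset, so the following is only a sketch of what the conversion looks like when a whole-number column arrives as Numeric. The vector age_num is hypothetical.

```r
# Hypothetical vector: ages stored as decimals (class "numeric").
age_num <- c(34, 33, 40)
class(age_num)

# as.integer() converts the storage type to integer.
age_int <- as.integer(age_num)
class(age_int)

# Caution: as.integer() drops the fractional part, it does not round.
as.integer(34.9)   # 34
```

Because the decimals are dropped and not rounded, it is worth checking that a column really holds whole numbers before converting it.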
The Size column contains categories (Short, Medium, Tall). We convert it to a Factor so R treats it as a grouping variable.
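A minimal sketch of this conversion, using a standalone vector called size (a stand-in for friends$Size, with values taken from the output in this lesson):

```r
# Stand-in for friends$Size: the ten values shown in the lesson's output.
size <- c("Medium", "Medium", "Tall", "Short", "Tall",
          "Short", "Short", "Tall", "Short", "Tall")

# Convert text to a factor so R treats it as a grouping variable.
size_f <- as.factor(size)
class(size_f)

# as.factor() sorts the levels alphabetically.
levels(size_f)   # "Medium" "Short" "Tall"

# Count how many rows fall in each group.
table(size_f)
```

On the real data frame the same step would be an assignment back into the column, for example friends$Size <- as.factor(friends$Size).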
Phase 3: Advanced Factor Control (Levels & Ordering)
By default, R sorts factor levels alphabetically (Medium, Short, Tall). However, “Size” has a logical order, so we define the Levels manually.
We tell R that the logical flow is Short \(\rightarrow\) Medium \(\rightarrow\) Tall.
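One way to set this order is the levels argument of factor(). The sketch below again uses a standalone size vector as a stand-in for friends$Size; the exact call used to produce the lesson's output is not shown, so treat this as one reasonable version.

```r
# Stand-in for friends$Size: the ten values shown in the lesson's output.
size <- c("Medium", "Medium", "Tall", "Short", "Tall",
          "Short", "Short", "Tall", "Short", "Tall")

# Define the levels in their logical order, not the alphabetical one.
size_f <- factor(size, levels = c("Short", "Medium", "Tall"))
levels(size_f)       # "Short" "Medium" "Tall"

# The integer codes now follow that order: Short = 1, Medium = 2, Tall = 3.
as.integer(size_f)   # 2 2 3 1 3 1 1 3 1 3
```

Adding ordered = TRUE would create an ordered factor that also supports comparisons such as size_f < "Tall". The structure printed later in this lesson shows a plain Factor, so that argument is left out here.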
Phase 4: Feature Engineering (Creating Logical Rules)
We can create new columns based on logical conditions. These are called Logical Vectors (TRUE or FALSE).
Let’s create a column called Old for anyone over the age of 40.
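A minimal sketch of this rule on a stand-in data frame that holds only the Age column (values copied from the output in this lesson):

```r
# Stand-in data frame with the Age values from the lesson's output.
friends <- data.frame(
  Age = c(34L, 33L, 40L, 36L, 43L, 44L, 52L, 29L, 43L, 51L)
)

# The comparison returns one TRUE or FALSE per row; 40 itself is not "over 40".
friends$Old <- friends$Age > 40
friends$Old

# TRUE counts as 1, so sum() gives the number of people flagged as Old.
sum(friends$Old)   # 5

# A logical column can be used directly to filter rows.
friends[friends$Old, ]
```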
Now, look at the structure again to confirm the changes: Size is a Factor with its levels in logical order, and the new Old column is Logical.
'data.frame': 10 obs. of 7 variables:
$ Name : chr "Rahim" "Karim" "Badsha" "Rafiq" ...
$ Sex : chr "Male" "Male" "Male" "Male" ...
$ Eye.color: chr "Brown" "Blue" "Black" "Brown" ...
$ Age : int 34 33 40 36 43 44 52 29 43 51
$ Height : num 1.77 1.76 1.87 1.57 1.98 1.66 1.65 1.99 1.55 1.88
$ Size : Factor w/ 3 levels "Short","Medium",..: 2 2 3 1 3 1 1 3 1 3
$ Old : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
🎓 Systematic Summary for Learners
| Function | Purpose | Beginner Tip |
|---|---|---|
| str() | Shows structure | Always use this first. It is the “X-ray” of your data. |
| as.factor() | Creates categories | Use this for columns like Gender, Color, or Rank. |
| levels() | Checks group order | Essential for ensuring your Bar Charts don’t look messy. |
| Logical (>) | Creates flags | Used for filtering or creating “Binary” (Yes/No) variables. |
Pro-Learner Insight: The “Why”
Why Factors? When you run a regression or create a plot, R uses Factors to define the groups, for example separate bars or group colors.
Why Integers? Using integers for columns like Age or ID prevents R from storing meaningless decimals (like age 34.00000001).
Why Logic? Logical columns are a common starting point for Machine Learning (e.g., predicting if someone is “Old” or not).