Systemic Data Handling and Type Conversion

Author

Abdullah Al Shamim

In R, every column has a Class (Data Type). If the class is wrong (e.g., a “Size” category being treated as plain text), summaries, models, and plots can group or order the data incorrectly. We will use the friends.csv dataset to demonstrate the conversions.

Phase 1: Data Acquisition and Initial Inquiry

The first step is bringing the data into R and performing a “Structural Audit.”

We use read.csv() to fetch data from a URL.

Code
file_url <- "https://raw.githubusercontent.com/ShamimtheAnalyst/YouTube-Repo/refs/heads/main/friends.csv"

friends <- read.csv(file_url)

Before changing anything, we must see how R automatically interpreted the columns.

Code
# Open the dataset in a spreadsheet viewer
View(friends) 
Code
# Examine the "Skeleton" of the data
str(friends)
'data.frame':   10 obs. of  6 variables:
 $ Name     : chr  "Rahim" "Karim" "Badsha" "Rafiq" ...
 $ Sex      : chr  "Male" "Male" "Male" "Male" ...
 $ Eye.color: chr  "Brown" "Blue" "Black" "Brown" ...
 $ Age      : int  34 33 40 36 43 44 52 29 43 51
 $ Height   : num  1.77 1.76 1.87 1.57 1.98 1.66 1.65 1.99 1.55 1.88
 $ Size     : chr  "Medium" "Medium" "Tall" "Short" ...

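summary() gives a per-column overview: quartiles and the mean for numeric columns, length and class for character columns, and a count of NA values for numeric columns that contain any.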

Code
# Statistical snapshot of the entire dataset
summary(friends)
     Name               Sex             Eye.color              Age       
 Length:10          Length:10          Length:10          Min.   :29.00  
 Class :character   Class :character   Class :character   1st Qu.:34.50  
 Mode  :character   Mode  :character   Mode  :character   Median :41.50  
                                                          Mean   :40.50  
                                                          3rd Qu.:43.75  
                                                          Max.   :52.00  
     Height          Size          
 Min.   :1.550   Length:10         
 1st Qu.:1.653   Class :character  
 Median :1.765   Mode  :character  
 Mean   :1.768                     
 3rd Qu.:1.877                     
 Max.   :1.990                     
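The same audit can be done column by column. The sketch below uses a small hypothetical data frame (demo, not friends.csv) with one missing value added on purpose; the names demo, col_classes, and na_counts are illustrative only.

```r
# Hypothetical data frame with one missing Age (illustrative values, not friends.csv)
demo <- data.frame(
  Name = c("A", "B", "C"),
  Age  = c(30L, NA, 35L),
  Size = c("Short", "Tall", "Medium"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(demo, class)
print(col_classes)

# Number of missing values in each column
na_counts <- colSums(is.na(demo))
print(na_counts)
```

Unlike summary(), colSums(is.na()) also counts missing values in character columns.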

Phase 2: Variable Type Transformation (Refining the Data)

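Since R 4.0, read.csv() imports text columns as Character, and whole numbers may arrive as Numeric (double precision) depending on how the file was written. Converting categories to Factor and whole numbers to Integer tells R how to treat them in summaries, models, and plots, and can reduce memory use.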
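As an alternative to converting after import, column classes can be declared while reading the file. This is a minimal sketch, assuming a tiny inline CSV (csv_text, with made-up values) in place of the real file; colClasses is a standard read.csv() argument.

```r
# Hypothetical inline CSV standing in for a real file (values are illustrative only)
csv_text <- "Name,Age,Size
A,30,Short
B,41,Tall
C,35,Medium"

# Declare the class of each column at import time, matched by column name
demo <- read.csv(
  text = csv_text,
  colClasses = c(Name = "character", Age = "integer", Size = "factor")
)

str(demo)
```

Declaring classes up front makes the intended types explicit, though the factor levels still default to alphabetical order.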

The Size column contains categories (Short, Medium, Tall). We convert it to a Factor so R treats it as a grouping variable.

Code
# Convert to factor
friends$Size <- as.factor(friends$Size)
Code
# Verify the change
str(friends$Size)
 Factor w/ 3 levels "Medium","Short",..: 1 1 3 2 3 2 2 3 2 3
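To see what the numbers in that output mean, here is a small sketch using only the first four Size values shown by str(friends). Each value is stored as an integer code pointing into the alphabetically sorted levels; the vector names sizes and size_f are illustrative.

```r
# First four Size values as shown in the str(friends) output
sizes <- c("Medium", "Medium", "Tall", "Short")

size_f <- as.factor(sizes)

# Levels are the unique values, sorted alphabetically by default
print(levels(size_f))

# Each element is stored as the position of its level
print(as.integer(size_f))

# A factor can be tabulated directly by level
print(table(size_f))
```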

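Age is a whole number. In this dataset read.csv() already imported it as integer (see the str() output above), so the conversion below leaves it unchanged; it matters when a whole-number column arrives as numeric. An integer vector uses 4 bytes per value instead of 8 for numeric. Note that as.integer() drops any fractional part rather than rounding.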

Code
# Convert Age to integer
friends$Age <- as.integer(friends$Age)

# Verify the change
str(friends$Age)
 int [1:10] 34 33 40 36 43 44 52 29 43 51
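The difference between numeric and integer storage can be checked with typeof(). The sketch below uses a short illustrative vector (age_num) rather than the full column, and also shows how as.integer() handles fractional values.

```r
# Numbers typed without an L suffix are stored as double (numeric)
age_num <- c(34, 33, 40)
print(typeof(age_num))

# as.integer() changes the storage type
age_int <- as.integer(age_num)
print(typeof(age_int))

# Fractional parts are dropped (truncated toward zero), not rounded
truncated <- as.integer(c(34.9, -1.5))
print(truncated)
```

If rounding is intended, apply round() before as.integer().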

Phase 3: Advanced Factor Control (Levels & Ordering)

By default, R sorts factor levels alphabetically (Medium, Short, Tall). However, “Size” has a logical order, so we define the levels manually.

Code
# See how R currently orders the categories
levels(friends$Size)
[1] "Medium" "Short"  "Tall"  

We tell R that the logical flow is Short → Medium → Tall.

Code
# Re-define the factor with a custom order
friends$Size <- factor(friends$Size, levels = c("Short", "Medium", "Tall"))
Code
# Check the levels again - now they are logically ordered
levels(friends$Size)
[1] "Short"  "Medium" "Tall"  
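Setting levels controls the display order, but the factor is still unordered as far as comparisons go. If comparisons such as "smaller than Tall" are needed, one option is ordered = TRUE, sketched here on the first four Size values; size_ord and typo are illustrative names. The sketch also shows that a value missing from levels silently becomes NA.

```r
sizes <- c("Medium", "Medium", "Tall", "Short")

# ordered = TRUE makes the level order meaningful for comparisons
size_ord <- factor(sizes, levels = c("Short", "Medium", "Tall"), ordered = TRUE)

# Element-wise comparison against a level
below_tall <- size_ord < "Tall"
print(below_tall)

# The largest level present in the data
print(max(size_ord))

# Caution: a value not listed in levels (here a lowercase typo) becomes NA
typo <- factor(c("Short", "medium"), levels = c("Short", "Medium", "Tall"))
print(is.na(typo))
```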

Phase 4: Feature Engineering (Creating Logical Rules)

We can create new columns based on logical conditions. These are called Logical Vectors (TRUE or FALSE).

Let’s create a column called Old for anyone over the age of 40.

Code
# If Age > 40, R returns TRUE; otherwise FALSE
friends$Old <- friends$Age > 40

# Check the class of the new column
class(friends$Old)
[1] "logical"
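Logical vectors are useful beyond flagging rows, because TRUE counts as 1 and FALSE as 0 in arithmetic. A short sketch using the ten Age values printed by str(friends); the standalone vectors age and old are used here so the example runs without the data frame.

```r
# Age values as shown in the str(friends) output
age <- c(34L, 33L, 40L, 36L, 43L, 44L, 52L, 29L, 43L, 51L)

old <- age > 40

# How many people are over 40
n_old <- sum(old)
print(n_old)

# Proportion of people over 40
prop_old <- mean(old)
print(prop_old)

# Use the logical vector as a filter
print(age[old])
```

Note that 40 itself is not flagged, since the condition is strictly greater than 40.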

Now, look at the structure again to confirm the changes: Size is a factor with levels in logical order, and the new Old column is logical.

Code
str(friends)
'data.frame':   10 obs. of  7 variables:
 $ Name     : chr  "Rahim" "Karim" "Badsha" "Rafiq" ...
 $ Sex      : chr  "Male" "Male" "Male" "Male" ...
 $ Eye.color: chr  "Brown" "Blue" "Black" "Brown" ...
 $ Age      : int  34 33 40 36 43 44 52 29 43 51
 $ Height   : num  1.77 1.76 1.87 1.57 1.98 1.66 1.65 1.99 1.55 1.88
 $ Size     : Factor w/ 3 levels "Short","Medium",..: 2 2 3 1 3 1 1 3 1 3
 $ Old      : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...

🎓 Systemic Summary for Learners

| Function | Purpose | Beginner Tip |
|---|---|---|
| str() | Shows structure | Always use this first. It is the “X-ray” of your data. |
| as.factor() | Creates categories | Use this for columns like Gender, Color, or Rank. |
| levels() | Checks group order | Essential for ensuring your Bar Charts don’t look messy. |
| Logical (>) | Creates flags | Used for filtering or creating “Binary” (Yes/No) variables. |

Pro-Learner Insight: The “Why”

  • Why Factors? When you create a plot, R uses Factors to draw separate bars or group colors. In a regression, each factor level beyond the first becomes its own indicator (dummy) variable.

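  • Why Integers? Storing columns like Age or ID as integers records that the values are whole numbers and avoids floating-point noise in the stored values (like age 34.00000001). Results of calculations such as mean() are still numeric.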

  • Why Logic? Logical columns work as filters and as binary targets in classification models (e.g., predicting if someone is “Old” or not).
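The point about factors in regression can be made concrete with model.matrix(), which builds the design matrix that modelling functions such as lm() use. This is a sketch on the first four Size values only; size and mm are illustrative names, and the default treatment contrasts are assumed.

```r
size <- factor(c("Medium", "Medium", "Tall", "Short"),
               levels = c("Short", "Medium", "Tall"))

# One indicator column per level, except the first level (the baseline)
mm <- model.matrix(~ size)
print(colnames(mm))
print(mm[, c("sizeMedium", "sizeTall")])
```

Because “Short” is the first level, it serves as the baseline, which is one more reason the level order matters.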
