## Objective
Learn how to import data into R from different file formats and sources, including CSV files, Excel files, text files, JSON files, XML files, built-in datasets, and online datasets.
Step 1: Create a CSV file
Example file: students.csv
| ID | Name | Marks |
|---|---|---|
| 1 | Alice | 80 |
| 2 | John | 75 |
| 3 | David | 90 |
Step 2: Import CSV in R
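The import step can be sketched as follows. This is a minimal example, assuming the file from Step 1 is written to a temporary directory so the code runs on its own; the path and the variable name students_csv are illustrative, not part of the original exercise.

```r
# Write the example table from Step 1 to a temporary CSV file
# (the temporary location is an assumption made to keep the sketch runnable)
csv_path <- file.path(tempdir(), "students.csv")
writeLines(c("ID,Name,Marks",
             "1,Alice,80",
             "2,John,75",
             "3,David,90"), csv_path)

# Import the CSV file; the first row is used as column names by default
students_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure and the first rows
str(students_csv)
head(students_csv)
```

With a real file, replace csv_path with the path to students.csv, or set the working directory with setwd() first.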
Example text file: data.txt
| ID | Name | Score |
|---|---|---|
| 1 | Alice | 78 |
| 2 | John | 85 |
| 3 | David | 90 |
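A whitespace-separated text file like data.txt can be read with read.table(). The sketch below assumes the file has a header row and is written to a temporary directory; the variable name scores_txt is hypothetical.

```r
# Create data.txt in a temporary directory (assumed location, for a runnable sketch)
txt_path <- file.path(tempdir(), "data.txt")
writeLines(c("ID Name Score",
             "1 Alice 78",
             "2 John 85",
             "3 David 90"), txt_path)

# header = TRUE treats the first line as column names;
# sep = "" (the default) splits on any run of whitespace
scores_txt <- read.table(txt_path, header = TRUE, sep = "",
                         stringsAsFactors = FALSE)

# Show the imported data and a quick check on the numeric column
scores_txt
mean(scores_txt$Score)
```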
Example using the built-in Titanic dataset.
# Load dataset
data(Titanic)
# Display dataset
Titanic
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
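Titanic is stored as a four-dimensional table, which is why it prints as several two-way slices. As a rough sketch, it can be converted to a data frame for further analysis; the names titanic_df and by_class are illustrative.

```r
# Titanic ships with base R (package 'datasets') as a 4-dimensional table
data(Titanic)

# Check the size of each dimension and the category labels
dim(Titanic)
dimnames(Titanic)

# Convert to a data frame: one row per combination of categories plus a Freq column
titanic_df <- as.data.frame(Titanic)
head(titanic_df)

# Total number of people recorded in the dataset
sum(titanic_df$Freq)

# Cross-tabulate counts by class and survival
by_class <- xtabs(Freq ~ Class + Survived, data = titanic_df)
by_class
```

Running data() with no arguments lists the other datasets that are available in the same way.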
## Objective
Learn how to merge datasets using common variables (keys) in R.
# Part 1: Create Two Datasets
Create two simple datasets:
## Dataset 1: Students Info
students <- data.frame(
ID = c(1, 2, 3, 4),
Name = c("Alice", "John", "David", "Mary"),
Age = c(20, 21, 22, 23)
)
students
## ID Name Age
## 1 1 Alice 20
## 2 2 John 21
## 3 3 David 22
## 4 4 Mary 23
## Dataset 2: Marks Info
marks <- data.frame(
ID = c(1, 2, 3, 4),
Course = c("Math", "Math", "Math", "Math"),
Score = c(80, 75, 90, 85)
)
marks
## ID Course Score
## 1 1 Math 80
## 2 2 Math 75
## 3 3 Math 90
## 4 4 Math 85
Inner Join (only matching records)
merged_data <- merge(students, marks, by = "ID")
merged_data
## ID Name Age Course Score
## 1 1 Alice 20 Math 80
## 2 2 John 21 Math 75
## 3 3 David 22 Math 90
## 4 4 Mary 23 Math 85
Add Name to second dataset first
marks2 <- data.frame(
ID = c(1, 2, 3, 4),
Name = c("Alice", "John", "David", "Mary"),
Score = c(80, 75, 90, 85)
)
# Merge using two keys
merged_2keys <- merge(students, marks2, by = c("ID", "Name"))
merged_2keys
## ID Name Age Score
## 1 1 Alice 20 80
## 2 2 John 21 75
## 3 3 David 22 90
## 4 4 Mary 23 85
Left Join (all records from the first dataset)
left_join <- merge(students, marks, by = "ID", all.x = TRUE)
left_join
## ID Name Age Course Score
## 1 1 Alice 20 Math 80
## 2 2 John 21 Math 75
## 3 3 David 22 Math 90
## 4 4 Mary 23 Math 85
Full Join (all records from both datasets)
full_join <- merge(students, marks, by = "ID", all = TRUE)
full_join
## ID Name Age Course Score
## 1 1 Alice 20 Math 80
## 2 2 John 21 Math 75
## 3 3 David 22 Math 90
## 4 4 Mary 23 Math 85
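In the examples above every ID appears in both datasets, so the inner, left, and full joins return the same four rows. The sketch below uses two hypothetical data frames (students_a and marks_b) whose IDs only partly overlap, so the join types actually differ.

```r
# Hypothetical data: ID 1 has no marks, ID 5 has no student record
students_a <- data.frame(ID = c(1, 2, 3, 4),
                         Name = c("Alice", "John", "David", "Mary"),
                         stringsAsFactors = FALSE)
marks_b <- data.frame(ID = c(2, 3, 4, 5),
                      Score = c(75, 90, 85, 60))

# Inner join: keeps only IDs present in both data frames (2, 3, 4)
inner <- merge(students_a, marks_b, by = "ID")

# Left join: keeps every row of students_a; Score is NA for ID 1
left <- merge(students_a, marks_b, by = "ID", all.x = TRUE)

# Right join: keeps every row of marks_b; Name is NA for ID 5
right <- merge(students_a, marks_b, by = "ID", all.y = TRUE)

# Full join: keeps every ID from either data frame (1 to 5)
full <- merge(students_a, marks_b, by = "ID", all = TRUE)

# Compare the number of rows returned by each join type
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```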
In summary:
| Type | Function |
|---|---|
| Inner Join | merge(x, y, by="ID") |
| Left Join | merge(x, y, by="ID", all.x=TRUE) |
| Full Join | merge(x, y, by="ID", all=TRUE) |
| Multi-key merge | by = c("ID","Name") |
Merging datasets in R helps combine information from different sources. The examples above used one key variable (ID), two key variables (ID and Name), and several join types (inner, left, full).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
students <- data.frame(
ID = c(1,2,3,4,5),
Name = c("Alice","John","David","Mary","Peter"),
Age = c(20,21,22,23,21),
Marks = c(80,75,90,85,70),
Department = c("IT","CS","IT","CS","IT")
)
students
## ID Name Age Marks Department
## 1 1 Alice 20 80 IT
## 2 2 John 21 75 CS
## 3 3 David 22 90 IT
## 4 4 Mary 23 85 CS
## 5 5 Peter 21 70 IT
# install.packages("dplyr") # run once beforehand if dplyr is not installed; not needed here because dplyr is already loaded
students %>%
select(Name, Marks)
## Name Marks
## 1 Alice 80
## 2 John 75
## 3 David 90
## 4 Mary 85
## 5 Peter 70
students %>%
filter(Marks > 80)
## ID Name Age Marks Department
## 1 3 David 22 90 IT
## 2 4 Mary 23 85 CS
students %>%
arrange(desc(Marks))
## ID Name Age Marks Department
## 1 3 David 22 90 IT
## 2 4 Mary 23 85 CS
## 3 1 Alice 20 80 IT
## 4 2 John 21 75 CS
## 5 5 Peter 21 70 IT
students %>%
rename(Student_Name = Name, Score = Marks)
## ID Student_Name Age Score Department
## 1 1 Alice 20 80 IT
## 2 2 John 21 75 CS
## 3 3 David 22 90 IT
## 4 4 Mary 23 85 CS
## 5 5 Peter 21 70 IT
students %>%
mutate(Grade = ifelse(Marks >= 80, "A", "B"))
## ID Name Age Marks Department Grade
## 1 1 Alice 20 80 IT A
## 2 2 John 21 75 CS B
## 3 3 David 22 90 IT A
## 4 4 Mary 23 85 CS A
## 5 5 Peter 21 70 IT B
students %>%
group_by(Department) %>%
summarise(
avg_marks = mean(Marks),
total_students = n()
)
## # A tibble: 2 × 3
## Department avg_marks total_students
## <chr> <dbl> <int>
## 1 CS 80 2
## 2 IT 80 3
students %>%
filter(Marks > 70) %>%
mutate(Status = ifelse(Marks >= 80, "Pass", "Good")) %>%
arrange(desc(Marks)) %>%
select(Name, Marks, Status)
## Name Marks Status
## 1 David 90 Pass
## 2 Mary 85 Pass
## 3 Alice 80 Pass
## 4 John 75 Good
- select() is used to choose columns
- filter() is used to choose rows
- arrange() is used to sort data
- rename() is used to change column names
- mutate() is used to create new variables
- group_by() is used to group data
- summarise() is used to compute summary values for each group
- %>% is used to chain operations
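To make the effect of each verb concrete, the sketch below reproduces the chained example in base R, one step per verb. It is only a comparison under the same example data, not a replacement for dplyr; the names res and dept_means are illustrative.

```r
students <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("Alice", "John", "David", "Mary", "Peter"),
  Age = c(20, 21, 22, 23, 21),
  Marks = c(80, 75, 90, 85, 70),
  Department = c("IT", "CS", "IT", "CS", "IT"),
  stringsAsFactors = FALSE
)

# filter(Marks > 70): keep rows that satisfy a condition
res <- students[students$Marks > 70, ]

# mutate(Status = ...): add a new column
res$Status <- ifelse(res$Marks >= 80, "Pass", "Good")

# arrange(desc(Marks)): sort rows in descending order of Marks
res <- res[order(-res$Marks), ]

# select(Name, Marks, Status): keep only the named columns
res <- res[, c("Name", "Marks", "Status")]
res

# group_by(Department) + summarise(mean(Marks)): one mean per department
dept_means <- aggregate(Marks ~ Department, data = students, FUN = mean)
dept_means
```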
In R, trace() and recover() are debugging tools used to inspect and fix errors in functions.
## 1. trace() in R
trace() allows you to insert debugging code into an existing function without rewriting it. You can print messages or inspect variables when the function runs.
Example
Step 1: Create a simple function
add_numbers <- function(a, b) {
return(a + b)
}
Step 2: Add trace
trace(add_numbers, tracer = quote({
cat("a =", a, "\n")
cat("b =", b, "\n")
}))
## [1] "add_numbers"
Step 3: Run function
add_numbers(5, 3)
## Tracing add_numbers(5, 3) on entry
## a = 5
## b = 3
## [1] 8
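The tracing code stays attached to the function until it is removed. A minimal sketch of the full cycle, assuming the function lives in the global environment and using untrace() to restore the original:

```r
add_numbers <- function(a, b) {
  return(a + b)
}

# Attach tracing code that runs on entry to the function;
# print = FALSE suppresses the "Tracing ... on entry" banner
trace(add_numbers,
      tracer = quote(cat("a =", a, "b =", b, "\n")),
      print = FALSE)

# The call now prints the argument values before returning the sum
traced_result <- add_numbers(5, 3)

# Remove the tracing code and restore the original function
untrace(add_numbers)

# The call runs silently again
plain_result <- add_numbers(5, 3)
```

Calling untrace() when debugging is finished avoids leaving diagnostic output in later runs.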
## 2. recover() in R
recover() is used to inspect the call stack when an error occurs. It lets you choose which function's environment to explore after the error.
Step 1: Set recover mode
options(error = recover)
fun1 <- function(x) {
fun2(x)
}
fun2 <- function(x) {
x / 0 # Note: in R this returns Inf rather than raising an error, so recover() is not triggered
}
fun1(10)
## [1] Inf
- trace() is used to monitor function execution by inserting debugging code
- recover() is used to analyze errors by navigating the function call stack after a failure
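Numeric division by zero returns Inf in R instead of signalling an error, so a function has to call stop() (or hit a genuine error) before recover() has anything to show. The sketch below uses two hypothetical functions, safe_divide() and safe_outer(), to raise a real error; the recover() lines are commented out because the frame menu only appears in an interactive session.

```r
# Hypothetical inner function that signals an explicit error
safe_divide <- function(x, y) {
  if (y == 0) {
    stop("division by zero is not allowed")
  }
  x / y
}

# Hypothetical outer function, so the call stack has two frames
safe_outer <- function(x) {
  safe_divide(x, 0)
}

# In an interactive session, uncomment these lines to get the frame menu:
# options(error = recover)
# safe_outer(10)

# Non-interactive check that the error is really raised
msg <- tryCatch(safe_outer(10), error = function(e) conditionMessage(e))
msg

# Restore the default error handling when debugging is finished
options(error = NULL)
```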
This function calculates the mean, median, minimum, maximum, standard deviation, variance, and sample size (n).
summary_stats <- function(x) {
if (!is.numeric(x)) {
stop("Input must be a numeric vector")
}
result <- list(
Count = sum(!is.na(x)), # sample size, excluding missing values to match na.rm = TRUE below
Mean = mean(x, na.rm = TRUE),
Median = median(x, na.rm = TRUE),
Min = min(x, na.rm = TRUE),
Max = max(x, na.rm = TRUE),
Variance = var(x, na.rm = TRUE),
Std_Dev = sd(x, na.rm = TRUE)
)
return(result)
}
Example: Simple vector
data <- c(10, 20, 30, 40, 50)
summary_stats(data)
## $Count
## [1] 5
##
## $Mean
## [1] 30
##
## $Median
## [1] 30
##
## $Min
## [1] 10
##
## $Max
## [1] 50
##
## $Variance
## [1] 250
##
## $Std_Dev
## [1] 15.81139
Example: a function for a data frame column (involving datasets)
summary_column <- function(data, col_name) {
if (!is.data.frame(data)) {
stop("Data must be a data frame")
}
x <- data[[col_name]]
if (!is.numeric(x)) {
stop("Selected column must be numeric")
}
return(summary_stats(x))
}
df <- data.frame(
age = c(20, 21, 22, 23, 24),
score = c(80, 90, 70, 85, 95)
)
summary_column(df, "score")
## $Count
## [1] 5
##
## $Mean
## [1] 84
##
## $Median
## [1] 85
##
## $Min
## [1] 70
##
## $Max
## [1] 95
##
## $Variance
## [1] 92.5
##
## $Std_Dev
## [1] 9.617692
The function summary_stats() computes essential descriptive statistics for a numeric dataset including mean, median, minimum, maximum, variance, standard deviation, and count. It ensures input validation by checking whether the data is numeric. This function helps in quick exploratory data analysis.
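The validation and missing-value handling described above can be checked with a small sketch. To keep it runnable on its own, it uses mini_stats(), a condensed stand-in with the same numeric check but only two statistics; the data frame df_na and its values are hypothetical.

```r
# Condensed stand-in for summary_stats(): same validation, fewer statistics
mini_stats <- function(x) {
  if (!is.numeric(x)) {
    stop("Input must be a numeric vector")
  }
  c(Mean = mean(x, na.rm = TRUE), Std_Dev = sd(x, na.rm = TRUE))
}

# Hypothetical data with one missing score and one non-numeric column
df_na <- data.frame(
  age = c(20, 21, 22, 23, 24),
  score = c(80, 90, NA, 85, 95),
  name = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Missing values are ignored because of na.rm = TRUE
mini_stats(df_na$score)

# Non-numeric input triggers the validation error
err <- tryCatch(mini_stats(df_na$name), error = function(e) conditionMessage(e))
err

# Apply the function to every numeric column at once
numeric_cols <- df_na[sapply(df_na, is.numeric)]
stats_table <- sapply(numeric_cols, mini_stats)
stats_table
```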
These functions are useful for applying operations over vectors, lists, and grouped data.
## 1. lapply(): List Apply
lapply() applies a function to each element of a list or vector and always returns a list.
Example
nums <- list(a = 1:5, b = 6:10)
lapply(nums, mean)
## $a
## [1] 3
##
## $b
## [1] 8
Use it when you want results in list form and need safe, consistent output.
## 2. sapply()
sapply() is similar to lapply() but simplifies the output into a vector or matrix if possible.
Example
nums <- list(a = 1:5, b = 6:10)
sapply(nums, mean)
## a b
## 3 8
Use it when you want cleaner output than a list; it is useful for quick analysis.
## 3. vapply()
vapply() works like sapply() but requires you to specify the output type, making it safer and more predictable.
Example
nums <- list(a = 1:5, b = 6:10)
vapply(nums, mean, numeric(1))
## a b
## 3 8
It is best for programming and assignments that require reliability, because it prevents unexpected output formats.
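The difference in predictability can be seen in a short sketch. It assumes the same nums list as above and uses range(), which returns two values per element, to show where sapply() changes its output shape and vapply() does not.

```r
nums <- list(a = 1:5, b = 6:10)

# range() returns two values per element, so sapply() silently returns a matrix
simplified <- sapply(nums, range)

# vapply() with numeric(1) rejects the length-2 results instead of changing shape
mismatch <- tryCatch(
  vapply(nums, range, numeric(1)),
  error = function(e) "vapply() stopped: result did not match numeric(1)"
)

# Declaring the real shape, numeric(2), succeeds and gives a 2 x 2 matrix
declared <- vapply(nums, range, numeric(2))

# With empty input, sapply() returns an empty list,
# while vapply() keeps the declared type
empty_sapply <- sapply(list(), length)
empty_vapply <- vapply(list(), length, integer(1))
```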
## 4. mapply()
mapply() applies a function to multiple input vectors at the same time, element by element.
Example
mapply(sum, 1:5, 6:10)
## [1] 7 9 11 13 15
Use it when working with multiple datasets or vectors simultaneously.
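mapply() also works with a custom function of several arguments. The sketch below uses hypothetical scores and weights; it also shows that the result is a list when the individual results differ in length.

```r
# Hypothetical scores and weights, paired element by element
scores <- c(80, 75, 90)
weights <- c(0.5, 0.3, 0.2)

# Multiply each score by its matching weight
weighted <- mapply(function(score, weight) score * weight, scores, weights)
weighted

# Weighted total across the three scores
total <- sum(weighted)
total

# When results have different lengths, mapply() returns a list:
# 1 repeated 3 times, 2 repeated 2 times, 3 repeated once
rep_list <- mapply(rep, 1:3, 3:1)
rep_list
```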
## 5. split()
split() divides data into groups based on a factor.
Example
x <- c(10, 20, 30, 40, 50)
group <- c("A", "A", "B", "B", "B")
split(x, group)
## $A
## [1] 10 20
##
## $B
## [1] 30 40 50
Use it for grouping data before analysis.
## 6. tapply()
tapply() applies a function to subsets of a vector defined by a factor.
Example
values <- c(10, 20, 30, 40, 50)
groups <- c("A", "A", "B", "B", "B")
tapply(values, groups, mean)
## A B
## 15 40
It is useful for grouped statistical calculations (mean, sum, etc.).
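tapply() can be read as a shortcut for splitting a vector and then applying a function to each piece. A small sketch of that equivalence, using the same values and groups as above:

```r
values <- c(10, 20, 30, 40, 50)
groups <- c("A", "A", "B", "B", "B")

# One step: grouped means with tapply()
by_tapply <- tapply(values, groups, mean)

# Two steps: split into groups, then apply mean() to each group
pieces <- split(values, groups)
by_split <- sapply(pieces, mean)

# Both approaches give the same group means
by_tapply
by_split
all.equal(as.numeric(by_tapply), unname(by_split))
```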
## 7. map()
map() is part of the purrr package and is a modern alternative to lapply().
Example
library(purrr)
nums <- list(1:5, 6:10)
map(nums, mean)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 8
Use it for modern, tidyverse-friendly coding; it always returns a list.
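purrr also provides typed variants of map() that return a vector of a declared type, similar in spirit to vapply(). The sketch below assumes purrr is installed and is skipped otherwise; the variable names are illustrative.

```r
# Run only if purrr is available (assumption: it was installed beforehand)
if (requireNamespace("purrr", quietly = TRUE)) {
  nums <- list(a = 1:5, b = 6:10)

  # map() always returns a list
  as_list <- purrr::map(nums, mean)

  # map_dbl() returns a double vector and fails if a result is not a single number
  as_double <- purrr::map_dbl(nums, mean)

  # map_int() returns an integer vector, here the length of each element
  as_int <- purrr::map_int(nums, length)

  print(as_double)
  print(as_int)
}
```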
The apply family of functions in R provides concise alternatives to explicit loops for data processing. They are widely used in data science for cleaning, transforming, and analyzing data. Among them, vapply() is the safest for programming, while tapply() and split() are most useful for grouped analysis. The map() function from purrr is the modern counterpart in the tidyverse ecosystem.