Assignment 1: Importing data from different sources

Objective

Learn how to import data into R from different file formats and sources, including CSV files, Excel files, text files, JSON files, XML files, built-in datasets, and online datasets.

A: Importing CSV Files

Step 1: Create a CSV file

Example file: students.csv

ID,Name,Marks
1,Alice,80
2,John,75
3,David,90

Step 2: Import CSV in R
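
No code is given for this step, so the following is a minimal sketch of one way to do it, assuming the file is comma-separated with a header row. It writes the example table to a temporary file (a stand-in for students.csv in your working directory) and reads it back with base R's read.csv(). The object names are illustrative only.

```r
# Create the example file in a temporary location (stand-in for students.csv)
csv_path <- file.path(tempdir(), "students.csv")
writeLines(c("ID,Name,Marks",
             "1,Alice,80",
             "2,John,75",
             "3,David,90"), csv_path)

# Import the CSV file; header = TRUE is the default for read.csv()
students_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure and the first rows
str(students_csv)
head(students_csv)
```

With a real file, pass its path directly, for example read.csv("students.csv") when the file is in the working directory.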

B: Importing Excel Files
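
This part lists no steps, so here is a hedged sketch that assumes the readxl package is used (other packages, such as openxlsx, would also work). To stay runnable without a file of your own, it reads an example workbook that ships with readxl; replace the path with your own .xlsx or .xls file.

```r
# install.packages("readxl")  # run once if the package is not installed
library(readxl)

# Path to an example workbook bundled with readxl
# (replace with the path to your own Excel file)
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook
sheet_names <- excel_sheets(xlsx_path)
print(sheet_names)

# Read the first sheet; change the sheet argument to pick another one
excel_data <- read_excel(xlsx_path, sheet = 1)
head(excel_data)
```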

C: Importing TXT Files

Example text file: data.txt

ID Name Score
1 Alice 78
2 John 85
3 David 90
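
A minimal sketch of importing this text file, assuming the values are separated by whitespace and the first line holds the column names. The temporary file is a stand-in for data.txt, and the object names are illustrative.

```r
# Create the example file in a temporary location (stand-in for data.txt)
txt_path <- file.path(tempdir(), "data.txt")
writeLines(c("ID Name Score",
             "1 Alice 78",
             "2 John 85",
             "3 David 90"), txt_path)

# read.table() splits on whitespace by default; header = TRUE uses the
# first line as column names
scores_txt <- read.table(txt_path, header = TRUE, stringsAsFactors = FALSE)

scores_txt
```

For tab-separated or otherwise delimited files, set the sep argument, for example sep = "\t".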

D: Importing JSON Files

Step 1: Install package
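
The package is not named here, so this sketch assumes jsonlite, a commonly used JSON package for R. The JSON text is hypothetical example data. fromJSON() accepts a JSON string, a file path, or a URL.

```r
# install.packages("jsonlite")  # run once if the package is not installed
library(jsonlite)

# Example JSON text (hypothetical data); a file path or URL can be
# passed to fromJSON() in place of this string
json_text <- '[
  {"ID": 1, "Name": "Alice", "Marks": 80},
  {"ID": 2, "Name": "John",  "Marks": 75},
  {"ID": 3, "Name": "David", "Marks": 90}
]'

# An array of records is simplified to a data frame by default
students_json <- fromJSON(json_text)

students_json
```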

E: Importing XML Files

Step 1: Install package

Step 2: Import XML file
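
Neither step names a package or shows code, so this sketch assumes the xml2 package (the XML package is another option). The XML text and its element names are hypothetical. read_xml() also accepts a file path in place of the literal string.

```r
# install.packages("xml2")  # run once if the package is not installed
library(xml2)

# Example XML text (hypothetical data)
xml_input <- '<students>
  <student><id>1</id><name>Alice</name><marks>80</marks></student>
  <student><id>2</id><name>John</name><marks>75</marks></student>
  <student><id>3</id><name>David</name><marks>90</marks></student>
</students>'

doc <- read_xml(xml_input)

# Select every <student> node, then pull one child value per node
student_nodes <- xml_find_all(doc, ".//student")
students_xml <- data.frame(
  ID    = as.integer(xml_text(xml_find_first(student_nodes, "./id"))),
  Name  = xml_text(xml_find_first(student_nodes, "./name")),
  Marks = as.numeric(xml_text(xml_find_first(student_nodes, "./marks"))),
  stringsAsFactors = FALSE
)

students_xml
```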

F: Importing Built-in Datasets

Example using the built-in Titanic dataset.

# Load dataset
data(Titanic)

# Display dataset
Titanic
## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20
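
The output above shows that Titanic is stored as a four-dimensional table rather than a data frame. The following sketch shows one way to convert it with as.data.frame(), which gives one row per combination of categories and puts the counts in a Freq column. The name titanic_df is illustrative.

```r
data(Titanic)

# Titanic is a 4-dimensional contingency table, not a data frame
class(Titanic)
dim(Titanic)

# Convert to a data frame: one row per combination, counts in Freq
titanic_df <- as.data.frame(Titanic)
head(titanic_df)

# Total number of people recorded in the table
sum(titanic_df$Freq)
```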

G: Importing Data from the Web
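
No example is given for this part. Base R's read.csv() accepts a URL wherever it accepts a file path, so a rough sketch looks like the one below. Because no web address is given, the example uses a file:// URL pointing at a temporary file as a stand-in, so it also runs offline. For real web data, replace data_url with the https:// address of a CSV file.

```r
# Build a small CSV file and address it through a file:// URL.
# This is a stand-in so the example runs offline.
web_path <- tempfile(fileext = ".csv")
writeLines(c("ID,Name,Marks", "1,Alice,80", "2,John,75"), web_path)
data_url <- paste0("file://", normalizePath(web_path, winslash = "/"))

# read.csv() accepts a URL wherever it accepts a file path
web_data <- read.csv(data_url, stringsAsFactors = FALSE)

web_data
```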

Assignment 2: Merging datasets in R

Objective

To learn how to merge datasets using common variables (keys) in R.

Part 1: Create Two Datasets

Creating two simple datasets:

Dataset 1: Students Info

students <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("Alice", "John", "David", "Mary"),
  Age = c(20, 21, 22, 23)
)

students
##   ID  Name Age
## 1  1 Alice  20
## 2  2  John  21
## 3  3 David  22
## 4  4  Mary  23

Dataset 2: Marks Info

marks <- data.frame(
  ID = c(1, 2, 3, 4),
  Course = c("Math", "Math", "Math", "Math"),
  Score = c(80, 75, 90, 85)
)

marks
##   ID Course Score
## 1  1   Math    80
## 2  2   Math    75
## 3  3   Math    90
## 4  4   Math    85

Part 2: Merge Using One Variable (ID)

Inner Join (only matching records)

merged_data <- merge(students, marks, by = "ID")

merged_data
##   ID  Name Age Course Score
## 1  1 Alice  20   Math    80
## 2  2  John  21   Math    75
## 3  3 David  22   Math    90
## 4  4  Mary  23   Math    85

Part 3: Merge Using 2 Variables (ID + Name Example)

Add Name to second dataset first

marks2 <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("Alice", "John", "David", "Mary"),
  Score = c(80, 75, 90, 85)
)

# Merge using two keys
merged_2keys <- merge(students, marks2, by = c("ID", "Name"))

merged_2keys
##   ID  Name Age Score
## 1  1 Alice  20    80
## 2  2  John  21    75
## 3  3 David  22    90
## 4  4  Mary  23    85

Part 4: Left Join (Keep all students)

left_join <- merge(students, marks, by = "ID", all.x = TRUE)

left_join
##   ID  Name Age Course Score
## 1  1 Alice  20   Math    80
## 2  2  John  21   Math    75
## 3  3 David  22   Math    90
## 4  4  Mary  23   Math    85

Part 5: Full Join (Keep all data)

full_join <- merge(students, marks, by = "ID", all = TRUE)

full_join
##   ID  Name Age Course Score
## 1  1 Alice  20   Math    80
## 2  2  John  21   Math    75
## 3  3 David  22   Math    90
## 4  4  Mary  23   Math    85
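
In the examples above every ID appears in both datasets, so the inner, left, and full joins all return the same four rows. The sketch below uses hypothetical datasets (students_b and marks_b, not part of the assignment data) in which some IDs have no match, so the join types give different results.

```r
# Hypothetical data: ID 4 has no marks, ID 5 has no student record
students_b <- data.frame(ID = c(1, 2, 3, 4),
                         Name = c("Alice", "John", "David", "Mary"))
marks_b <- data.frame(ID = c(1, 2, 3, 5),
                      Score = c(80, 75, 90, 60))

# Inner join keeps only IDs found in both: 1, 2, 3
inner_b <- merge(students_b, marks_b, by = "ID")
# Left join keeps all students: Score is NA for ID 4
left_b <- merge(students_b, marks_b, by = "ID", all.x = TRUE)
# Right join keeps all marks: Name is NA for ID 5
right_b <- merge(students_b, marks_b, by = "ID", all.y = TRUE)
# Full join keeps every ID from either dataset
full_b <- merge(students_b, marks_b, by = "ID", all = TRUE)

# Compare the number of rows each join keeps
c(inner = nrow(inner_b), left = nrow(left_b),
  right = nrow(right_b), full = nrow(full_b))
```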

In summary,

Type             Function
Inner Join       merge(x, y, by="ID")
Left Join        merge(x, y, by="ID", all.x=TRUE)
Full Join        merge(x, y, by="ID", all=TRUE)
Multi-key merge  by = c("ID","Name")

Merging datasets in R helps combine information from different sources using one variable (ID), two variables (ID + Name), and multiple join types (inner, left, full).

Assignment 3: Data Manipulation in R using dplyr

Dataset Example

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
students <- data.frame(
  ID = c(1,2,3,4,5),
  Name = c("Alice","John","David","Mary","Peter"),
  Age = c(20,21,22,23,21),
  Marks = c(80,75,90,85,70),
  Department = c("IT","CS","IT","CS","IT")
)

students
##   ID  Name Age Marks Department
## 1  1 Alice  20    80         IT
## 2  2  John  21    75         CS
## 3  3 David  22    90         IT
## 4  4  Mary  23    85         CS
## 5  5 Peter  21    70         IT

1. select() → Choose columns

# dplyr is already loaded above; if it is missing, run
# install.packages("dplyr") once before calling library(dplyr)
students %>%
  select(Name, Marks)
##    Name Marks
## 1 Alice    80
## 2  John    75
## 3 David    90
## 4  Mary    85
## 5 Peter    70

2. filter() → Select rows by condition

students %>%
  filter(Marks > 80)
##   ID  Name Age Marks Department
## 1  3 David  22    90         IT
## 2  4  Mary  23    85         CS

3. arrange() → Sort data

students %>%
  arrange(desc(Marks))
##   ID  Name Age Marks Department
## 1  3 David  22    90         IT
## 2  4  Mary  23    85         CS
## 3  1 Alice  20    80         IT
## 4  2  John  21    75         CS
## 5  5 Peter  21    70         IT

4. rename() → Change column names

students %>%
  rename(Student_Name = Name, Score = Marks)
##   ID Student_Name Age Score Department
## 1  1        Alice  20    80         IT
## 2  2         John  21    75         CS
## 3  3        David  22    90         IT
## 4  4         Mary  23    85         CS
## 5  5        Peter  21    70         IT

5. mutate() → Create new variables

students %>%
  mutate(Grade = ifelse(Marks >= 80, "A", "B"))
##   ID  Name Age Marks Department Grade
## 1  1 Alice  20    80         IT     A
## 2  2  John  21    75         CS     B
## 3  3 David  22    90         IT     A
## 4  4  Mary  23    85         CS     A
## 5  5 Peter  21    70         IT     B

6. group_by() + summarise() → Group analysis

students %>%
  group_by(Department) %>%
  summarise(
    avg_marks = mean(Marks),
    total_students = n()
  )
## # A tibble: 2 × 3
##   Department avg_marks total_students
##   <chr>          <dbl>          <int>
## 1 CS                80              2
## 2 IT                80              3

7. Pipe operator %>% → Combine everything

students %>%
  filter(Marks > 70) %>%
  mutate(Status = ifelse(Marks >= 80, "Pass", "Good")) %>%
  arrange(desc(Marks)) %>%
  select(Name, Marks, Status)
##    Name Marks Status
## 1 David    90   Pass
## 2  Mary    85   Pass
## 3 Alice    80   Pass
## 4  John    75   Good

To summarize

select() is used to choose columns, filter() to choose rows, arrange() to sort data, rename() to change column names, mutate() to create new variables, group_by() to group data, summarise() to compute per-group summaries, and %>% to chain operations.

Assignment 4: How to use trace() and recover()

In R, trace() and recover() are debugging tools used to inspect and fix errors in functions.

1. trace() in R

trace() allows you to insert debugging code into an existing function without rewriting it. You can print messages or inspect variables when the function runs.

Example

Step 1: Create a simple function

add_numbers <- function(a, b) {
  return(a + b)
}

Step 2: Add trace

trace(add_numbers, tracer = quote({
  cat("a =", a, "\n")
  cat("b =", b, "\n")
}))
## [1] "add_numbers"

Step 3: Run function

add_numbers(5, 3)
## Tracing add_numbers(5, 3) on entry 
## a = 5 
## b = 3
## [1] 8
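
A traced function stays traced until the tracer is removed. The following sketch shows the full cycle with untrace(), which restores the original function. It repeats the add_numbers definition so that it runs on its own; the variable names traced_result and plain_result are illustrative.

```r
add_numbers <- function(a, b) {
  return(a + b)
}

# Attach the tracer; print = FALSE suppresses the "Tracing ..." banner
trace(add_numbers, tracer = quote(cat("a =", a, " b =", b, "\n")),
      print = FALSE)
traced_result <- add_numbers(5, 3)   # prints the values of a and b

# Remove the tracer so later calls run the original function silently
untrace(add_numbers)
plain_result <- add_numbers(5, 3)
```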

2. recover() in R

recover() is used to inspect the call stack when an error occurs. It lets you choose which function environment you want to explore after an error.

Example

Step 1: Set recover mode

options(error = recover)

Step 2: Create nested functions

fun1 <- function(x) {
  fun2(x)
}

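fun2 <- function(x) {
  x + "a"   # Adding a number to a string raises an error
}

Step 3: Run function

fun1(10)
## Error in x + "a" : non-numeric argument to binary operator

Dividing by zero does not raise an error in R (10 / 0 returns Inf), so this example uses an operation that does. In an interactive session, recover() then lists the active calls (fun1 and fun2) and prompts for a frame number; entering a number opens that function's environment for inspection, and 0 exits.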
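
Setting options(error = recover) stays in effect for the rest of the session. A small sketch of how the previous setting might be saved and restored once debugging is finished (old_opts is an illustrative name):

```r
# options() returns the previous values, so they can be restored later
old_opts <- options(error = recover)

# ... run the failing code interactively and inspect the frames here ...

# Restore the previous error handler when debugging is finished
options(old_opts)
```

recover() only prompts in an interactive session, so this pattern is intended for console use rather than for scripts or knitted reports.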

In summary

trace() is used to monitor function execution by inserting debugging code. recover() is used to analyze errors by navigating the function call stack after a failure.

Assignment 5: Creating a function to find summary statistics

1. Function: Summary Statistics in R

This function calculates the mean, median, minimum, maximum, standard deviation, variance, and sample size (n).

summary_stats <- function(x) {
  if (!is.numeric(x)) {
    stop("Input must be a numeric vector")
  }
  
  result <- list(
    Count = sum(!is.na(x)),  # sample size, excluding missing values
    Mean = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Min = min(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE),
    Variance = var(x, na.rm = TRUE),
    Std_Dev = sd(x, na.rm = TRUE)
  )
  
  return(result)
}

Example: Simple vector

data <- c(10, 20, 30, 40, 50)

summary_stats(data)
## $Count
## [1] 5
## 
## $Mean
## [1] 30
## 
## $Median
## [1] 30
## 
## $Min
## [1] 10
## 
## $Max
## [1] 50
## 
## $Variance
## [1] 250
## 
## $Std_Dev
## [1] 15.81139

Example: A function for a data frame column (working with datasets)

summary_column <- function(data, col_name) {
  if (!is.data.frame(data)) {
    stop("Data must be a data frame")
  }
  
  x <- data[[col_name]]
  
  if (!is.numeric(x)) {
    stop("Selected column must be numeric")
  }
  
  return(summary_stats(x))
}
df <- data.frame(
  age = c(20, 21, 22, 23, 24),
  score = c(80, 90, 70, 85, 95)
)

summary_column(df, "score")
## $Count
## [1] 5
## 
## $Mean
## [1] 84
## 
## $Median
## [1] 85
## 
## $Min
## [1] 70
## 
## $Max
## [1] 95
## 
## $Variance
## [1] 92.5
## 
## $Std_Dev
## [1] 9.617692

In Summary,

The function summary_stats() computes essential descriptive statistics for a numeric dataset including mean, median, minimum, maximum, variance, standard deviation, and count. It ensures input validation by checking whether the data is numeric. This function helps in quick exploratory data analysis.

Assignment 6: How to use sapply(), vapply(), lapply(), map(), mapply(), split() and tapply() in R

These functions are useful for applying operations over vectors, lists, and grouped data.

1. lapply() — List Apply

lapply() applies a function to each element of a list or vector and always returns a list.

Example

nums <- list(a = 1:5, b = 6:10)
lapply(nums, mean)
## $a
## [1] 3
## 
## $b
## [1] 8

Use it when you want the results as a list and need a safe, consistent output type.

2. sapply() — Simplified Apply

sapply() is similar to lapply() but simplifies the output into a vector or matrix if possible.

Example

nums <- list(a = 1:5, b = 6:10)
sapply(nums, mean)
## a b 
## 3 8

Use it when you want cleaner output than a list; it is useful for quick analysis.

3. vapply() — Verified Apply

vapply() works like sapply() but requires you to specify the type and length of each result, making it safer and more predictable.

Example

nums <- list(a = 1:5, b = 6:10)
vapply(nums, mean, numeric(1))
## a b 
## 3 8

It is best for programming tasks that require reliability, because it prevents unexpected output formats.
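
To make the difference concrete, here is a small sketch using a hypothetical helper, evens(), whose result length depends on its input. sapply() quietly changes its return type, while vapply() stops with an error when a result does not match the declared shape.

```r
# A function whose result length depends on the input
evens <- function(x) x[x %% 2 == 0]

inputs_same <- list(a = 1:4, b = 5:8)   # two even numbers in each element
inputs_diff <- list(a = 1:4, b = 5:6)   # two even numbers, then one

# sapply() changes its return type with the data
class(sapply(inputs_same, evens))   # a matrix
class(sapply(inputs_diff, evens))   # a list

# vapply() enforces the declared shape and stops on a mismatch
ok <- vapply(inputs_same, evens, numeric(2))
failed <- tryCatch(
  vapply(inputs_diff, evens, numeric(2)),
  error = function(e) "error"
)
```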

4. mapply() — Multiple Apply

mapply() applies a function to multiple input vectors at the same time.

Example

mapply(sum, 1:5, 6:10)
## [1]  7  9 11 13 15

Use it when working with multiple vectors or lists in parallel.
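
As a further sketch, mapply() also works with your own multi-argument functions. Arguments that should stay the same for every call can be passed through MoreArgs. The data and names below are illustrative.

```r
# Element-wise use of a two-argument function: 1*4, 2*5, 3*6
products <- mapply(function(x, y) x * y, 1:3, 4:6)

# Arguments that should stay fixed for every call go in MoreArgs
labels <- mapply(function(name, score, sep) paste(name, score, sep = sep),
                 c("Alice", "John"), c(80, 75),
                 MoreArgs = list(sep = ": "),
                 USE.NAMES = FALSE)

products
labels
```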

5. split() — Data Grouping

split() divides data into groups based on a factor.

Example

x <- c(10, 20, 30, 40, 50)
group <- c("A", "A", "B", "B", "B")

split(x, group)
## $A
## [1] 10 20
## 
## $B
## [1] 30 40 50

Use it to group data before analysis.

6. tapply() — Apply Function Over Groups

tapply() applies a function to subsets of a vector defined by a factor.

Example

values <- c(10, 20, 30, 40, 50)
groups <- c("A", "A", "B", "B", "B")

tapply(values, groups, mean)
##  A  B 
## 15 40

It is useful for grouped statistical calculations (mean, sum, etc.).
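
split() and tapply() are closely related. A short sketch showing that splitting the data and then applying a function to each group gives the same group means as a single tapply() call (by_split and by_tapply are illustrative names):

```r
values <- c(10, 20, 30, 40, 50)
groups <- c("A", "A", "B", "B", "B")

# Two-step version: split into groups, then apply the function to each
by_split <- sapply(split(values, groups), mean)

# One-step version
by_tapply <- tapply(values, groups, mean)

by_split
by_tapply
```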

7. map() — Modern Apply Function (purrr package)

map() is part of the purrr package and is a modern alternative to lapply().

Example

library(purrr)

nums <- list(1:5, 6:10)
map(nums, mean)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 8

Use it for tidyverse-friendly code; like lapply(), it always returns a list.
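
purrr also provides typed variants of map() that return an atomic vector and raise an error if a result has the wrong type or length, similar in spirit to vapply(). A brief sketch, assuming purrr is installed:

```r
# install.packages("purrr")  # run once if the package is not installed
library(purrr)

nums <- list(a = 1:5, b = 6:10)

# map_dbl() returns a double vector, one number per list element
means <- map_dbl(nums, mean)

# map_chr() does the same for character results
ranges <- map_chr(nums, function(x) paste(min(x), max(x), sep = "-"))

means
ranges
```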

Conclusion

The apply family of functions in R provides concise alternatives to explicit loops for data processing. They are widely used in data science for cleaning, transforming, and analyzing data. Among them, vapply() is the safest for programming because it checks the type of each result, while tapply() and split() are most useful for grouped analysis. The map() function is the tidyverse counterpart to lapply().