R is a powerful language and environment for statistical computing and graphics. At the heart of its data manipulation capabilities lies the data frame, a fundamental data structure analogous to a table in a relational database or a spreadsheet. This report provides a detailed overview of data frames in R, covering their creation, manipulation, and common operations.

1 What is a Data Frame?

A data frame is a list of vectors of equal length. Think of it as a table in which each column is a variable and each row is an observation.

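Key characteristics of a data frame: all columns have the same length, each column holds values of a single type (numeric, character, logical, factor, and so on) while different columns may hold different types, and every column has a name.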
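The list-of-vectors structure can be checked directly in R. The following is a minimal sketch using a hypothetical two-column data frame; the column names and values are illustrative only.

```r
# Hypothetical example data; names and values are illustrative only.
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 28)
)

is.list(df)        # TRUE: a data frame is a list under the hood
length(df)         # number of list elements, i.e. number of columns
sapply(df, class)  # each column carries its own type
dim(df)            # number of rows, then number of columns
```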

2 Creating Data Frames

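Data frames can be created in several ways: from vectors with data.frame(), by reading a file with read.csv(), or by converting a list with as.data.frame().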

# Creating a data frame from vectors of equal length
name <- c("Alice", "Bob", "Charlie")
age <- c(25, 30, 28)
city <- c("New York", "London", "Paris")

df <- data.frame(name, age, city)
print(df)

# Reading data from a CSV file
# ("my_data.csv" is a placeholder path; the file must exist for this to run)
df_from_csv <- read.csv("my_data.csv")

# Converting a list of equal-length vectors to a data frame
my_list <- list(names = c("Alice", "Bob"), ages = c(25, 30))
df_from_list <- as.data.frame(my_list)

3 Accessing and Manipulating Data

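Once we have a data frame, we can access and manipulate its data in various ways: extract a column with $ or [[ ]], select rows and columns with [row, column] indexing, filter rows with a logical condition, and add or remove columns by assignment.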
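A minimal sketch of some common access and manipulation patterns in base R follows. The data frame and the derived column name age_next_year are hypothetical and chosen only for illustration.

```r
# Hypothetical example data for illustration.
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 28),
  city = c("New York", "London", "Paris")
)

ages <- df$age                        # one column, returned as a vector
first_row <- df[1, ]                  # first row, all columns
name_city <- df[, c("name", "city")]  # selected columns, all rows
over_26 <- df[df$age > 26, ]          # rows that satisfy a condition

df$age_next_year <- df$age + 1        # add a derived column
df$city <- NULL                       # remove a column
```

Note that df[1, ] and df[, c("name", "city")] return data frames, while df$age returns a plain vector.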

4 Common Operations on Data Frames

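R provides numerous functions for working with data frames, including str() and summary() for inspecting structure and contents, head() and tail() for previewing rows, nrow() and ncol() for dimensions, order() for sorting, and merge() for joining two data frames.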
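The sketch below illustrates a few base R functions that are commonly applied to data frames. Both data frames, including the lookup table named countries, are hypothetical examples rather than part of any real data set.

```r
# Hypothetical example data for illustration.
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 28),
  city = c("New York", "London", "Paris")
)

str(df)          # compact description of columns and types
summary(df$age)  # minimum, quartiles, mean, maximum
head(df, 2)      # first two rows

n_rows <- nrow(df)
n_cols <- ncol(df)

# Sort rows by age in ascending order
sorted <- df[order(df$age), ]

# Join with a second (hypothetical) table on the shared "city" column
countries <- data.frame(
  city = c("New York", "London", "Paris"),
  country = c("USA", "UK", "France")
)
merged <- merge(df, countries, by = "city")
```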

5 Factors in Data Frames

Factors are a special data type in R used to represent categorical variables, and they are crucial for statistical modeling. In versions of R before 4.0.0, data.frame() and read.csv() converted character vectors to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay as characters. We can control this behavior with the stringsAsFactors argument in data.frame() or read.csv(). It is often good practice to convert character vectors to factors explicitly, when needed, using the factor() function.
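As a minimal sketch of explicit conversion, the example below builds a factor with a chosen level order and converts a character column in place. The size categories and column names are hypothetical.

```r
# Hypothetical categorical data for illustration.
sizes <- c("small", "large", "medium", "small")

# Convert to a factor with an explicit level order
size_factor <- factor(sizes, levels = c("small", "medium", "large"))
levels(size_factor)  # levels in the order given above
table(size_factor)   # counts per level

# Keep the column as character on creation, then convert explicitly
df <- data.frame(id = 1:4, size = sizes, stringsAsFactors = FALSE)
class(df$size)       # "character"
df$size <- factor(df$size, levels = c("small", "medium", "large"))
class(df$size)       # "factor"
```

Setting levels explicitly avoids relying on the default alphabetical ordering, which matters when the categories have a natural order.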

6 Working with Missing Data

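Missing values are represented by NA in R. Several functions are useful for handling missing data: is.na() identifies missing values, na.omit() drops rows that contain them, complete.cases() flags rows without them, and many summary functions such as mean() accept na.rm = TRUE to ignore them.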
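A minimal sketch of handling NA values with base R functions follows. The data frame, with one missing age, is a hypothetical example.

```r
# Hypothetical example data with one missing age.
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, NA, 28)
)

is.na(df$age)                    # FALSE TRUE FALSE
n_missing <- sum(is.na(df$age))  # count of missing values

# Without na.rm = TRUE, mean() would return NA here
mean_age <- mean(df$age, na.rm = TRUE)

# Two ways to keep only rows with no missing values
complete <- na.omit(df)
also_complete <- df[complete.cases(df), ]
```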

7 Example: Data Analysis with Data Frames

# Sample data frame
data <- data.frame(
  gender = factor(c("Male", "Female", "Male", "Female")),
  age = c(25, 30, 22, 28),
  income = c(50000, 60000, 45000, 55000)
)

# Calculate average income by gender
aggregate(income ~ gender, data = data, FUN = mean)

# Subset data for females older than 25
female_over_25 <- subset(data, gender == "Female" & age > 25)

print(female_over_25)

8 Conclusion

Data frames are an indispensable tool in R for data analysis and manipulation. Understanding their structure, creation, and manipulation techniques is essential for effectively working with data in R. This report has provided a comprehensive overview of data frames, covering their key features and common operations. By mastering these concepts, we can leverage the full power of R for our data-driven projects.