We will learn the following:
1.3.1. What is an R DataFrame?
1.3.2. Creating R DataFrames
1.3.3. Common DataFrame methods
An R DataFrame is a two-dimensional, tabular data structure. It resembles a spreadsheet or a SQL table: rows represent observations and columns represent variables or attributes. Each column holds values of a single type, but different columns can hold different types, such as numeric, character, or factor. This makes the DataFrame the standard structure for organizing and manipulating structured data in R and for running statistical analyses on it.
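As a minimal sketch of how column types work, the snippet below builds a small DataFrame from made-up values. The object name `people` and its contents are illustrative only and do not come from any dataset used in this chapter.

```r
# Illustrative data; the names and values are hypothetical.
people <- data.frame(
  name   = c("Ann", "Bob", "Cara"),        # character column
  age    = c(31L, 45L, 28L),               # integer column
  smoker = factor(c("No", "Yes", "No")),   # factor column
  stringsAsFactors = FALSE                 # keep text as character on older R versions
)

# Each column has one class, but the classes differ across columns.
col_classes <- sapply(people, class)
print(col_classes)

# Internally, a data frame is a list of equal-length columns.
print(is.list(people))
```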
Features of a DataFrame
Homogeneous Columns
Homogeneous or Heterogeneous Rows
Cells
Labeled Axes
Row Index
Column Index
This is illustrated in the figure below.
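The features listed above can also be sketched in code. The example below uses a hypothetical DataFrame named `scores` with explicit row labels to show how a cell is addressed through the row index and the column index, either by position or by label.

```r
# Illustrative data frame with explicit row labels (hypothetical values).
scores <- data.frame(
  math    = c(90, 75, 82),
  reading = c(85, 88, 79),
  row.names = c("s1", "s2", "s3")
)

# A cell is addressed by a row index and a column index.
cell_by_position <- scores[2, 1]           # row 2, column 1
cell_by_label    <- scores["s2", "math"]   # the same cell, by axis labels

# A column is homogeneous: every value has the same type.
math_column <- scores$math

# A row is returned as a one-row data frame and may mix types in general.
second_row <- scores[2, ]
```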
Creating DataFrames is a foundational skill in R, because most analyses
start from data in tabular form. You can create a DataFrame directly with
the data.frame() function, which combines columns of different data types
into a single table that is straightforward to manipulate and analyze.
We will learn the following:
1.3.2.1 From vectors
1.3.2.2 Using R editor
STEP I: Create vectors for each of the columns in the DataFrame.
id <- 101:110
gender <- c(
  "Female", "Male", "Female", "Female", "Male",
  "Female", "Male", "Male", "Female", "Male"
)
education <- c(
  "Masters", "Bachelors", "Doctoral", "Masters", "Bachelors",
  "Doctoral", "Bachelors", "Bachelors", "Masters", "Doctoral"
)
age <- c(32, 46, 43, 35, 36, 44, 30, 42, 30, 28)
income <- c(
  528.87, 422.02, 466.81, 598.32, 510.16,
  517.48, 399.02, 546.79, 467.69, 551.13
)
STEP II: Use the data.frame() function to create a DataFrame from the above vectors.
df <- data.frame(
  id, gender, education, age, income
)
View(df)
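View() opens a viewer only in an interactive session. The sketch below shows one way to check the result from a script instead. It repeats the first three values of each vector so that it runs on its own; the names `df_small` and `result` are introduced here for illustration.

```r
# First three values of the vectors defined above, repeated here so the
# snippet is self-contained.
id        <- 101:103
gender    <- c("Female", "Male", "Female")
education <- c("Masters", "Bachelors", "Doctoral")
age       <- c(32, 46, 43)
income    <- c(528.87, 422.02, 466.81)

# stringsAsFactors = FALSE keeps text columns as character on older R
# versions too (it is the default from R 4.0.0 onward).
df_small <- data.frame(id, gender, education, age, income,
                       stringsAsFactors = FALSE)

# Column names are taken from the names of the vectors.
print(names(df_small))

# Vectors of incompatible lengths cannot be combined: data.frame() raises
# an error, which try() captures here instead of stopping the script.
result <- try(data.frame(id = 101:103, age = c(32, 46)), silent = TRUE)
print(inherits(result, "try-error"))
```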
A DataFrame can also be built by typing values into a spreadsheet-like
editor in R. First, use the data.frame() function to initialize an empty
DataFrame with the required columns and data types. Then call
the fix() function to open the editor on that DataFrame.
STEP I: Initialize the DataFrame columns
df <- data.frame(
  id = integer(),
  gender = character(),
  education = character(),
  age = integer(),
  income = double()
)
STEP II: Use fix() to open the R editor
fix(df)
STEP III: Enter the values
Enter the values in the spreadsheet that pops up. To save the data, simply close the window.
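The editor requires an interactive session. Where one is not available, rows can be appended in code instead. The sketch below is one possible alternative, not part of the fix() workflow itself; the names `df_empty`, `new_row`, and `df_filled` are introduced here for illustration.

```r
# Empty data frame with typed columns, as in STEP I above.
df_empty <- data.frame(
  id = integer(),
  gender = character(),
  education = character(),
  age = integer(),
  income = double(),
  stringsAsFactors = FALSE
)

# One new row, built as a one-row data frame with matching column names.
new_row <- data.frame(
  id = 101L, gender = "Female", education = "Masters",
  age = 32L, income = 528.87,
  stringsAsFactors = FALSE
)

# rbind() appends rows when the column names match.
df_filled <- rbind(df_empty, new_row)
print(df_filled)
```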
In this section we look at common functions for quickly extracting basic information from a DataFrame.
Load dataset
df <- readRDS(file = "~/STEMResearch/datasets/hh_income.RDS")
View(df)
We will explore the following aspects of the DataFrame:
Dimensions
dim(df)
## [1] 10 9
# number of rows
nrow(df) # same as dim(df)[1]
## [1] 10
# number of columns
ncol(df) # same as dim(df)[2]
## [1] 9
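The relationship between these three functions can be checked on any DataFrame. The sketch below uses a small hypothetical DataFrame named `toy`, since the dataset file loaded above may not be available to every reader.

```r
# Hypothetical data frame with 4 rows and 3 columns.
toy <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, TRUE, TRUE)
)

# dim() returns the number of rows first, then the number of columns.
dims <- dim(toy)
print(dims)

# nrow() and ncol() return the two elements of dim() individually.
print(nrow(toy) == dims[1])
print(ncol(toy) == dims[2])

# length() of a data frame counts columns, not rows, because a data
# frame is a list of columns.
print(length(toy))
```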
Attributes
attributes(df)
## $names
## [1] "id" "race" "gender" "married" "education"
## [6] "age" "income" "expenditure" "ses"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
Column names
colnames(df)
## [1] "id" "race" "gender" "married" "education"
## [6] "age" "income" "expenditure" "ses"
Row names
rownames(df)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
row.names(df)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
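The same functions can be used to change the labels, not just read them. A minimal sketch on a hypothetical DataFrame `toy` follows. It also shows why rownames() prints quoted values: automatic row numbers are returned as character strings.

```r
# Hypothetical data frame with automatic row names.
toy <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# For a data frame, names() and colnames() return the same labels.
same_labels <- identical(names(toy), colnames(toy))

# Automatic row names are returned as character strings.
default_rows <- rownames(toy)

# Assigning to colnames() or rownames() replaces the labels.
colnames(toy)[2] <- "score"
rownames(toy) <- c("r1", "r2", "r3")

print(colnames(toy))
print(rownames(toy))
```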
Adding a comment
comment(df) <- "Household income for civil servants"
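A comment is stored as an attribute and is not shown when the DataFrame is printed. The sketch below, on a hypothetical DataFrame `toy`, shows how it might be read back.

```r
# Hypothetical data frame.
toy <- data.frame(a = 1:3)

# Attach a comment, then read it back with the same function.
comment(toy) <- "Toy data for illustration"
stored_comment <- comment(toy)
print(stored_comment)

# The comment appears among the object's attributes.
has_comment_attr <- "comment" %in% names(attributes(toy))
print(has_comment_attr)
```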
First n observations
head(df, n = 2)
##   id  race gender married education age income expenditure  ses
## 1  1 White Female     Yes   Masters  32 528.87      417.77 High
## 2  2 Other   Male     Yes Bachelors  46 422.02      501.42 High
Last n observations
tail(df, n = 3)
##    id  race gender married education age income expenditure  ses
## 8   8 Black   Male      No Bachelors  42 546.79      393.82  Low
## 9   9 Other Female     Yes   Masters  30 467.69      460.62  Low
## 10 10 Black   Male     Yes  Doctoral  28 551.13      353.95 High
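Both functions default to n = 6 and also accept a negative n, which means "all rows except that many". A short sketch on a hypothetical ten-row DataFrame `toy`:

```r
# Hypothetical data frame with 10 rows.
toy <- data.frame(id = 1:10, value = (1:10)^2)

# Without n, head() returns the first six rows.
default_head <- head(toy)

# A negative n drops rows from the other end:
# head(toy, n = -2) keeps everything except the last two rows.
all_but_last_two <- head(toy, n = -2)

# tail() keeps the original row names, so the rows are still
# labeled 8, 9, and 10.
last_three <- tail(toy, n = 3)
print(last_three)
```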
Data types
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 9 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10
## $ race : Ord.factor w/ 3 levels "White"<"Black"<..: 1 3 1 2 2 3 1 2 3 2
## $ gender : chr "Female" "Male" "Female" "Female" ...
## $ married : chr "Yes" "Yes" "No" "No" ...
## $ education : Ord.factor w/ 3 levels "Bachelors"<"Masters"<..: 2 1 3 2 1 3 1 1 2 3
## $ age : int 32 46 43 35 36 44 30 42 30 28
## $ income : num 529 422 467 598 510 ...
## $ expenditure: num 418 501 336 454 340 ...
## $ ses : Ord.factor w/ 3 levels "Low"<"Middle"<..: 3 3 2 1 2 2 1 1 1 3
## - attr(*, "comment")= chr "Household income for civil servants"
library(dplyr)
glimpse(df)
## Rows: 10
## Columns: 9
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ race <ord> White, Other, White, Black, Black, Other, White, Black, Ot…
## $ gender <chr> "Female", "Male", "Female", "Female", "Male", "Female", "M…
## $ married <chr> "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"…
## $ education <ord> Masters, Bachelors, Doctoral, Masters, Bachelors, Doctoral…
## $ age <int> 32, 46, 43, 35, 36, 44, 30, 42, 30, 28
## $ income <dbl> 528.87, 422.02, 466.81, 598.32, 510.16, 517.48, 399.02, 54…
## $ expenditure <dbl> 417.77, 501.42, 336.07, 454.08, 340.33, 372.84, 348.82, 39…
## $ ses <ord> High, High, Middle, Low, Middle, Middle, Low, Low, Low, Hi…
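The output above reports some columns as ordered factors (`Ord.factor` in str(), `<ord>` in glimpse()). How the dataset was prepared is not shown here, so the following is only a sketch of one way such a column could be created, using the first five education values and the level order displayed by str().

```r
# First five education values from the output above.
education <- c("Masters", "Bachelors", "Doctoral", "Masters", "Bachelors")

# An ordered factor stores the levels together with their ranking.
education_ord <- factor(
  education,
  levels  = c("Bachelors", "Masters", "Doctoral"),
  ordered = TRUE
)

# The integer codes are the positions of each value in the level order;
# these are the numbers str() prints after the levels.
codes <- as.integer(education_ord)
print(codes)

# Ordered factors support comparisons based on the level order.
at_least_masters <- education_ord >= "Masters"
print(at_least_masters)
```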
Summarize
summary(df[, 6:9])
##       age            income       expenditure        ses
##  Min.   :28.00   Min.   :399.0   Min.   :336.1   Low   :4
##  1st Qu.:30.50   1st Qu.:467.0   1st Qu.:350.1   Middle:3
##  Median :35.50   Median :513.8   Median :383.3   High  :3
##  Mean   :36.60   Mean   :500.8   Mean   :398.0
##  3rd Qu.:42.75   3rd Qu.:542.3   3rd Qu.:445.0
##  Max.   :46.00   Max.   :598.3   Max.   :501.4
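For a numeric column, summary() reports the minimum, quartiles, median, mean, and maximum; for a factor it reports counts per level. The sketch below recomputes the age statistics with the ten age values listed earlier in this section, so the figures can be compared with the first column of the table above.

```r
# The ten age values used earlier in this section.
age <- c(32, 46, 43, 35, 36, 44, 30, 42, 30, 28)

# summary() on a single numeric vector returns a named set of statistics.
age_summary <- summary(age)
print(age_summary)

# The same figures can be obtained one at a time.
age_median    <- median(age)                      # 35.5
age_mean      <- mean(age)                        # 36.6
age_quartiles <- quantile(age, c(0.25, 0.75))     # 30.5 and 42.75
print(age_quartiles)
```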