GitHub - https://github.com/rickysoo/dataframe_r
Contact - ricky [at] rickysoo.com
A data frame is a two-dimensional data structure in the R language, consisting of rows and columns. It is a special case of the list data structure: each column is one element of the list, and all columns have the same length.
Figure: The anatomy of a data frame (picture source: https://www.geeksforgeeks.org/dataframe-operations-in-r/)
A data frame can be created from vectors with the data.frame() function. Each vector is built with the c() function, which combines data items of the same type, and becomes one column. In the example below, a data frame is created from a character vector, a numeric vector, and a date vector.
members <- data.frame(
Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'), # Character vector
Height = c(170, 172, 180, 168), # Numeric vector
Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04')) # Date vector
)
print(members)
## Name Height Birthday
## 1 Ricky 170 1990-01-01
## 2 Fatimah 172 1991-02-02
## 3 Kumanan 180 1993-03-03
## 4 Jamaine 168 1994-04-04
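All columns passed to data.frame() must fit the same number of rows. The sketch below illustrates this with a made-up `Team` column (not part of the original data): a single value is recycled to every row, while vectors of incompatible lengths raise an error.

```r
# A length-1 value is recycled to fill every row
team <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan'),
  Team = 'Red' # Hypothetical column, recycled 3 times
)
print(team)

# Vectors whose lengths do not fit together raise an error,
# which is caught here so that the script keeps running
result <- tryCatch(
  data.frame(Name = c('Ricky', 'Fatimah', 'Kumanan'), Height = c(170, 172)),
  error = function(e) conditionMessage(e)
)
print(result) # The error message as a character string
```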
A data frame can be created by converting from a list. Note that as.data.frame() replaces the spaces in the column names with dots by default.
list_numbers <- list('Column 1' = 1:4, 'Column 2' = 5:8, 'Column 3' = 9:12)
df_numbers <- as.data.frame(list_numbers)
df_numbers
## Column.1 Column.2 Column.3
## 1 1 5 9
## 2 2 6 10
## 3 3 7 11
## 4 4 8 12
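The output above shows "Column.1" rather than "Column 1" because the names are made syntactically valid during conversion. As a small sketch, passing check.names = FALSE should keep the names as they are, at the cost of needing backticks when the column is accessed with $.

```r
list_numbers <- list('Column 1' = 1:4, 'Column 2' = 5:8, 'Column 3' = 9:12)

# Default behaviour: spaces in names are replaced with dots
df_default <- as.data.frame(list_numbers)
print(names(df_default))

# Keep the original names, including the spaces
df_spaces <- as.data.frame(list_numbers, check.names = FALSE)
print(names(df_spaces))

# Names with spaces need backticks when used with $
print(df_spaces$`Column 1`)
```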
A data frame can be created by converting from a matrix.
matrix_numbers <- matrix(1:12, nrow = 4, ncol = 3)
df_numbers <- as.data.frame(matrix_numbers)
colnames(df_numbers) <- c('Column 1', 'Column 2', 'Column 3') # Assign column names
df_numbers
## Column 1 Column 2 Column 3
## 1 1 5 9
## 2 2 6 10
## 3 3 7 11
## 4 4 8 12
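When the matrix has no column names, the converted data frame gets the default names V1, V2 and so on, which is why the names are assigned afterwards in the example above. An alternative sketch is to name the matrix columns up front with the dimnames argument, so that the conversion carries them over.

```r
# Without column names, the data frame columns are named V1, V2, V3
matrix_numbers <- matrix(1:12, nrow = 4, ncol = 3)
print(names(as.data.frame(matrix_numbers)))

# Name the matrix columns first; rows are left unnamed (NULL)
named_matrix <- matrix(
  1:12, nrow = 4, ncol = 3,
  dimnames = list(NULL, c('Column 1', 'Column 2', 'Column 3'))
)
df_named <- as.data.frame(named_matrix)
print(names(df_named))
```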
A data frame can be created by importing data from a local file or from the web, such as a comma-separated values (CSV) file, using the read.csv() function.
# Import from a CSV file on local computer
audiobooks <- read.csv('audiobooks.csv')
# Import the same CSV file from the web
audiobooks <- read.csv('https://raw.githubusercontent.com/rickysoo/top_audiobooks/main/TopAudiobooks-20201107-122322.csv')
head(audiobooks) # Show first 6 rows
## Rank Title Subtitle
## 1 1 Greenlights <NA>
## 2 2 The Ice Diaries The Untold Story of the Cold War's Most Daring Mission
## 3 3 Tiny Habits The Small Changes That Change Everything
## 4 4 A Time for Mercy A Jack Brigance Novel
## 5 5 The Sentinel A Jack Reacher Novel
## 6 6 Clanlands Whisky, Warfare, and a Scottish Adventure Like No Other
## Author
## 1 Matthew McConaughey
## 2 Captain William R. Anderson, Don Keith - contributor
## 3 BJ Fogg PhD
## 4 John Grisham
## 5 Lee Child, Andrew Child
## 6 Sam Heughan, Graham McTavish, Diana Gabaldon - foreword
## Narrator Length
## 1 Matthew McConaughey 6 hrs and 42 mins
## 2 Roger Mueller 10 hrs and 1 min
## 3 BJ Fogg PhD 11 hrs and 22 mins
## 4 Michael Beck 19 hrs and 59 mins
## 5 Scott Brick 10 hrs and 39 mins
## 6 Graham McTavish, Sam Heughan, Diana Gabaldon - foreword 10 hrs and 22 mins
## Release Language Stars Ratings Price
## 1 10-20-20 English 5 out of 5 stars 7,172 ratings $28.00
## 2 10-15-19 English 5 out of 5 stars 6 ratings $30.79
## 3 01-14-20 English 4.5 out of 5 stars 742 ratings $34.95
## 4 10-13-20 English 5 out of 5 stars 3,534 ratings $31.50
## 5 10-27-20 English 4.5 out of 5 stars 1,394 ratings $31.50
## 6 11-03-20 English 5 out of 5 stars 158 ratings $21.81
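Imported columns such as Price arrive as text, so they usually need cleaning before any calculation. The sketch below uses a tiny made-up CSV file written to a temporary location (the rows and the `PriceNumber` column are assumptions for illustration, not the real audiobook data) to show two common options of read.csv() and one way to convert prices to numbers.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = '.csv')
writeLines(c(
  'Rank,Title,Subtitle,Price',
  '1,Book A,,$28.00',
  '2,Book B,A Subtitle,$30.79'
), csv_path)

# Treat empty fields as missing values and keep text columns as character
books <- read.csv(csv_path, na.strings = '', stringsAsFactors = FALSE)
str(books)

# Prices are text such as "$28.00"; remove the dollar sign and convert
books$PriceNumber <- as.numeric(sub('$', '', books$Price, fixed = TRUE))
print(books$PriceNumber)
```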
There are a number of functions to show the characteristics of a data frame.
# Data frame is a special case of list
typeof(members)
## [1] "list"
# The class is data.frame
class(members)
## [1] "data.frame"
# Check if "members" is a data frame
is.data.frame(members)
## [1] TRUE
# Number of columns
ncol(members)
## [1] 3
# Column names
names(members)
## [1] "Name" "Height" "Birthday"
# Number of rows
nrow(members)
## [1] 4
# Row names
row.names(members)
## [1] "1" "2" "3" "4"
# The dimension
dim(members)
## [1] 4 3
# The row and column names
dimnames(members)
## [[1]]
## [1] "1" "2" "3" "4"
##
## [[2]]
## [1] "Name" "Height" "Birthday"
# The list of attributes
attributes(members)
## $names
## [1] "Name" "Height" "Birthday"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4
# Internal structure of the data frame
str(members)
## 'data.frame': 4 obs. of 3 variables:
## $ Name : chr "Ricky" "Fatimah" "Kumanan" "Jamaine"
## $ Height : num 170 172 180 168
## $ Birthday: Date, format: "1990-01-01" "1991-02-02" ...
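Because a data frame is a list of columns, functions such as sapply() can be applied to every column at once. As a quick sketch, this lists the class of each column and picks out the names of the numeric columns.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Class of every column, returned as a named character vector
column_classes <- sapply(members, class)
print(column_classes)

# Names of the numeric columns only
numeric_columns <- names(members)[sapply(members, is.numeric)]
print(numeric_columns)
```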
A number of functions can be used to show the whole or part of the data frame in order to examine the data.
print(members) # Print the whole data frame
## Name Height Birthday
## 1 Ricky 170 1990-01-01
## 2 Fatimah 172 1991-02-02
## 3 Kumanan 180 1993-03-03
## 4 Jamaine 168 1994-04-04
View(members) # View the data frame in data viewer in RStudio
head(audiobooks, n = 3) # Show the first 3 rows of the data frame. The default is 6 rows.
## Rank Title Subtitle
## 1 1 Greenlights <NA>
## 2 2 The Ice Diaries The Untold Story of the Cold War's Most Daring Mission
## 3 3 Tiny Habits The Small Changes That Change Everything
## Author Narrator
## 1 Matthew McConaughey Matthew McConaughey
## 2 Captain William R. Anderson, Don Keith - contributor Roger Mueller
## 3 BJ Fogg PhD BJ Fogg PhD
## Length Release Language Stars Ratings Price
## 1 6 hrs and 42 mins 10-20-20 English 5 out of 5 stars 7,172 ratings $28.00
## 2 10 hrs and 1 min 10-15-19 English 5 out of 5 stars 6 ratings $30.79
## 3 11 hrs and 22 mins 01-14-20 English 4.5 out of 5 stars 742 ratings $34.95
tail(audiobooks, n = 3) # Show the last 3 rows of the data frame. The default is 6 rows.
## Rank Title
## 98 98 If You Tell
## 99 99 Think and Grow Rich
## 100 100 The Housekeeper
## Subtitle
## 98 A True Story of Murder, Family Secrets, and the Unbreakable Bond of Sisterhood
## 99 <NA>
## 100 A Twisted Psychological Thriller
## Author Narrator Length Release Language
## 98 Gregg Olsen Karen Peakes 10 hrs and 34 mins 12-01-19 English
## 99 Napoleon Hill Erik Synnestvedt 9 hrs and 35 mins 10-16-07 English
## 100 Natalie Barelli Susie Berneis 8 hrs and 4 mins 01-02-20 English
## Stars Ratings Price
## 98 4.5 out of 5 stars 11,827 ratings $29.99
## 99 4.5 out of 5 stars 21,869 ratings $24.95
## 100 4 out of 5 stars 6,503 ratings $34.99
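The n argument of head() and tail() also accepts negative numbers, which means "all rows except this many". A short sketch using the small members data frame:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

all_but_last <- head(members, n = -1) # Every row except the last one
print(all_but_last)

all_but_first <- tail(members, n = -1) # Every row except the first one
print(all_but_first)
```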
The “dplyr” package provides some useful functions for showing random rows of a data frame.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sample_n(audiobooks, size = 3) # Sample 3 rows randomly from the data frame
## Rank Title
## 1 83 A Kingdom of Flesh and Fire
## 2 78 The Practice
## 3 18 Harry Potter and the Sorcerer's Stone, Book 1
## Subtitle Author
## 1 A Blood and Ash Novel (Blood and Ash, Book 2) Jennifer L. Armentrout
## 2 Shipping Creative Work Seth Godin
## 3 <NA> J.K. Rowling
## Narrator Length Release Language Stars
## 1 Stina Nielsen 24 hrs and 21 mins 11-03-20 English 5 out of 5 stars
## 2 Seth Godin 5 hrs and 38 mins 11-03-20 English 5 out of 5 stars
## 3 Jim Dale 8 hrs and 18 mins 11-20-15 English 5 out of 5 stars
## Ratings Price
## 1 69 ratings $30.09
## 2 21 ratings $28.00
## 3 123,702 ratings $29.99
sample_frac(audiobooks, size = 0.05) # Sample 5% of the rows randomly from the data frame
## Rank Title Subtitle Author
## 1 82 Daylight <NA> David Baldacci
## 2 57 The Sandman <NA> Neil Gaiman, Dirk Maggs
## 3 47 Sorry I Missed You <NA> Suzy Krause
## 4 33 Midnight Sun <NA> Stephenie Meyer
## 5 96 Then She Was Gone A Novel Lisa Jewell
## Narrator
## 1 Brittany Pressley, Kyf Brewer
## 2 Riz Ahmed, Kat Dennings, Taron Egerton, Neil Gaiman, James McAvoy, Samantha Morton, Bebe Neuwirth, Andy Serkis, Michael Sheen
## 3 Amanda Ronconi
## 4 Jake Abel
## 5 Helen Duff
## Length Release Language Stars Ratings Price
## 1 11 hrs and 37 mins 11-17-20 English <NA> Not rated yet $30.79
## 2 10 hrs and 54 mins 07-15-20 English 4.5 out of 5 stars 19,230 ratings $34.95
## 3 9 hrs and 17 mins 06-01-20 English 4.5 out of 5 stars 406 ratings $25.19
## 4 25 hrs and 49 mins 08-04-20 English 4.5 out of 5 stars 11,835 ratings $34.21
## 5 10 hrs and 12 mins 04-17-18 English 4.5 out of 5 stars 53,332 ratings $34.99
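Random samples differ on every run unless the random seed is fixed. The sketch below shows a base R alternative that needs no extra package: sample() picks random row indexes, and set.seed() makes the choice repeatable. The seed value 42 is arbitrary.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

set.seed(42) # Arbitrary seed, so the same rows are picked on every run
picked_rows <- sample(nrow(members), size = 2) # 2 distinct row indexes
sampled <- members[picked_rows, ]
print(sampled)
```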
A summary of the data frame can be shown using the summary() function. For character variables, it shows only the length, the class and the storage mode of the column, not the most frequent value. For numeric and date variables, it shows the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
summary(members)
## Name Height Birthday
## Length:4 Min. :168.0 Min. :1990-01-01
## Class :character 1st Qu.:169.5 1st Qu.:1990-10-25
## Mode :character Median :171.0 Median :1992-02-17
## Mean :172.5 Mean :1992-02-17
## 3rd Qu.:174.0 3rd Qu.:1993-06-10
## Max. :180.0 Max. :1994-04-04
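The same statistics can be obtained for a single column, and individual values can be pulled out by name for further use. A minimal sketch:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Summary of one column is a named set of statistics
height_summary <- summary(members$Height)
print(height_summary)

# Extract single statistics by name
print(height_summary[['Mean']])   # 172.5
print(height_summary[['Median']]) # 171

# The individual functions give the same values
print(mean(members$Height))
print(median(members$Height))
```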
Select a single element from a data frame by using the row and column indexes.
data <- members[1, 1] # Row 1, column 1
print(data)
## [1] "Ricky"
Select a single element from a data frame by using the double brackets notation.
data <- members[[1]][1] # Row 1, column 1
print(data)
## [1] "Ricky"
Select a single element from a data frame by using the column name in double brackets.
data <- members[['Name']][1] # Row 1, column 1
print(data)
## [1] "Ricky"
Select multiple elements from a data frame by using the row and column indexes.
data <- members[1:2, 1:2] # Rows 1 and 2, columns 1 and 2
print(data)
## Name Height
## 1 Ricky 170
## 2 Fatimah 172
Column names can be used in selecting the columns.
data <- members[1:2, c('Name', 'Height')] # Rows 1 and 2, columns "Name" and "Height"
print(data)
## Name Height
## 1 Ricky 170
## 2 Fatimah 172
Select a single row from a data frame by using the row index.
data <- members[1, ] # Row 1
print(data)
## Name Height Birthday
## 1 Ricky 170 1990-01-01
Select multiple rows from a data frame by using the row numbers.
data <- members[1:2, ] # Rows 1 and 2
print(data)
## Name Height Birthday
## 1 Ricky 170 1990-01-01
## 2 Fatimah 172 1991-02-02
data <- members[c(1, 3), ] # Rows 1 and 3
print(data)
## Name Height Birthday
## 1 Ricky 170 1990-01-01
## 3 Kumanan 180 1993-03-03
Select a single column from a data frame by using the column index.
data <- members[ , 1] # Column 1
print(data)
## [1] "Ricky" "Fatimah" "Kumanan" "Jamaine"
Select multiple columns from a data frame by using the column indexes.
data <- members[ , 1:2] # Columns 1 and 2
print(data)
## Name Height
## 1 Ricky 170
## 2 Fatimah 172
## 3 Kumanan 180
## 4 Jamaine 168
data <- members[, c(1, 3)] # Columns 1 and 3
print(data)
## Name Birthday
## 1 Ricky 1990-01-01
## 2 Fatimah 1991-02-02
## 3 Kumanan 1993-03-03
## 4 Jamaine 1994-04-04
Column names can be used in selecting the columns.
data <- members[ , c('Name', 'Birthday')] # Columns "Name" and "Birthday"
print(data)
## Name Birthday
## 1 Ricky 1990-01-01
## 2 Fatimah 1991-02-02
## 3 Kumanan 1993-03-03
## 4 Jamaine 1994-04-04
A column can be selected using the column name in single brackets.
data <- members['Name']
print(data)
## Name
## 1 Ricky
## 2 Fatimah
## 3 Kumanan
## 4 Jamaine
Data can be selected using logical vectors.
data <- members[c(TRUE, FALSE, TRUE, FALSE), c(TRUE, TRUE, FALSE)] # Select rows 1 and 3, and columns 1 and 2
print(data)
## Name Height
## 1 Ricky 170
## 3 Kumanan 180
A column can be selected using the format dataframe$column.
data <- members$Name
print(data)
## [1] "Ricky" "Fatimah" "Kumanan" "Jamaine"
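One thing to be aware of is that the $ notation accepts an abbreviated column name as long as it is unambiguous, while double brackets require the exact name. The sketch below shows the difference; the abbreviation "Heig" is used only for illustration.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# $ matches "Heig" to the Height column (partial matching)
partial <- members$Heig
print(partial)

# Double brackets need the exact name, so this returns NULL
exact <- members[['Heig']]
print(exact)
```

Typing the full column name avoids surprises when a similarly named column is added later.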
Data can be conditionally selected by including a condition in the row position of the brackets.
data <- members[members$Height > 170, ] # Show members with height more than 170cm
data
## Name Height Birthday
## 2 Fatimah 172 1991-02-02
## 3 Kumanan 180 1993-03-03
Data can be conditionally selected by using the subset() function.
data <- subset(members, Height > 170) # Show members with height more than 170cm
data
## Name Height Birthday
## 2 Fatimah 172 1991-02-02
## 3 Kumanan 180 1993-03-03
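Conditions can be combined with & (and) and | (or), and subset() can pick columns at the same time through its select argument. A short sketch, with the cut-off date chosen only for illustration:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Taller than 170cm AND born before 1992
tall_and_older <- members[
  members$Height > 170 & members$Birthday < as.Date('1992-01-01'),
]
print(tall_and_older)

# Filter rows and keep only two columns in one call
tall_names <- subset(members, Height > 170, select = c(Name, Height))
print(tall_names)
```

When the condition contains missing values, bracket selection returns rows of NA for them, while subset() drops those rows.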
When single brackets [] are used, the data is returned as a data frame.
data <- members[1]
class(data)
## [1] "data.frame"
data <- members['Name']
class(data)
## [1] "data.frame"
When double brackets [[]] are used, the data is returned as a vector.
data <- members[[1]]
class(data)
## [1] "character"
data <- members[['Name']]
class(data)
## [1] "character"
When the $ notation is used, the data is returned as a vector.
data <- members$Name
class(data)
## [1] "character"
Selecting a single row returns a data frame.
data <- members[1, ]
class(data)
## [1] "data.frame"
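Use “drop = TRUE” when selecting a row to return a list instead of a data frame. A list is returned rather than an atomic vector because the columns in a row can have different types.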
data <- members[1, , drop = TRUE]
class(data)
## [1] "list"
Selecting a single column returns a vector.
data <- members[ , 1]
class(data)
## [1] "character"
Use “drop = FALSE” when selecting a column to return a data frame instead of a vector.
data <- members[, 1, drop = FALSE] # Set drop = FALSE
class(data)
## [1] "data.frame"
Selecting a single element returns a vector.
data <- members[1, 1]
class(data)
## [1] "character"
Use “drop = FALSE” when selecting an element to return a data frame instead of a vector.
data <- members[1, 1, drop = FALSE]
class(data)
## [1] "data.frame"
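The return type matters when the number of selected columns is not known in advance, for example when the column indexes come from a variable. In the sketch below (the variable name `wanted_columns` is made up for illustration), code that expects a data frame breaks when only one column is selected, unless drop = FALSE is set.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

wanted_columns <- 1 # Happens to contain a single column index

# The result is a vector, which has no columns
dropped <- members[, wanted_columns]
print(ncol(dropped)) # NULL

# With drop = FALSE the result is always a data frame
kept <- members[, wanted_columns, drop = FALSE]
print(ncol(kept)) # 1
```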
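Data can be sorted by a column using the order() function, which returns the row indexes in ascending order of the column values. The indexes are then used to rearrange the rows.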
height_order <- order(members$Height)
print(height_order)
## [1] 4 1 2 3
members[height_order, ]
## Name Height Birthday
## 4 Jamaine 168 1994-04-04
## 1 Ricky 170 1990-01-01
## 2 Fatimah 172 1991-02-02
## 3 Kumanan 180 1993-03-03
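order() also supports descending order and sorting by more than one column. The sketch below adds a made-up `Team` column (an assumption for illustration only) and sorts by team first, then by height from tallest to shortest within each team.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Tallest first
descending_order <- order(members$Height, decreasing = TRUE)
print(members[descending_order, ])

# Hypothetical grouping column, not part of the original data
members$Team <- c('Red', 'Blue', 'Red', 'Blue')

# Sort by Team, then by Height descending (negating a numeric column
# reverses its order)
team_order <- order(members$Team, -members$Height)
print(members[team_order, ])
```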
Quickly visualize the data in a data frame using the plot() function, which draws a matrix of scatterplots for every pair of columns.
plot(members)
Don’t forget to save any updated data frame to a CSV file by using the write.csv() function.
write.csv(members, 'members.csv')
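By default write.csv() also writes the row names as an extra unnamed first column, which comes back as a column named "X" when the file is read again. The sketch below saves to a temporary file with row.names = FALSE and reads the data back; note that the dates return as text and need converting again.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Save without the row names column (temporary file used for the demo)
csv_path <- tempfile(fileext = '.csv')
write.csv(members, csv_path, row.names = FALSE)

# Read the file back; Birthday is plain text at this point
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
str(restored)

# Convert the text back to dates
restored$Birthday <- as.Date(restored$Birthday)
print(restored)
```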