Not-So-Simple Introduction to Data Frame in R

GitHub - https://github.com/rickysoo/dataframe_r
Contact - ricky [at] rickysoo.com

1. What is Data Frame?

A data frame is a two-dimensional data structure in the R language, consisting of rows and columns. It is a special case of the list data structure, in which every column is a vector of the same length.
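
As a small sketch of the list nature of a data frame, each column is one item of the list, and all items have the same length. The scores data frame here is made up for illustration only.

```r
# Hypothetical data frame used only for this illustration
scores <- data.frame(
  Student = c('A', 'B', 'C'),
  Score = c(80, 75, 92)
)

is.list(scores)        # A data frame is built on a list
length(scores)         # Number of list items, i.e. columns
sapply(scores, length) # Every column has the same number of items
```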

The anatomy of a data frame

Picture Source: https://www.geeksforgeeks.org/dataframe-operations-in-r/

2. Creating Data Frame

Creating Data Frame from Vectors

A data frame can be created from vectors by passing them as columns to the data.frame() function. Each vector is built with the c() function, which combines data items of the same type. In the example below, a data frame is created from a character vector, a numeric vector, and a date vector.

members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'), # Character vector
  Height = c(170, 172, 180, 168), # Numeric vector
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04')) # Date vector
)

print(members)

##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
## 4 Jamaine    168 1994-04-04

Creating Data Frame from Lists

A data frame can be created by converting from a list using the as.data.frame() function. Each item of the list becomes a column.

list_numbers <- list('Column 1' = 1:4, 'Column 2' = 5:8, 'Column 3' = 9:12)
df_numbers <- as.data.frame(list_numbers)
df_numbers

##   Column.1 Column.2 Column.3
## 1        1        5        9
## 2        2        6       10
## 3        3        7       11
## 4        4        8       12
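
Note that the space in each name was replaced by a dot in the output above, because names are converted to syntactically valid R names by default. One possible way to keep the names unchanged, assuming the original names are wanted, is to pass the list to data.frame() with check.names = FALSE. This is a sketch of an alternative, not a required step.

```r
list_numbers <- list('Column 1' = 1:4, 'Column 2' = 5:8, 'Column 3' = 9:12)

# check.names = FALSE keeps the names as given instead of making them syntactic
df_numbers <- data.frame(list_numbers, check.names = FALSE)
names(df_numbers)
```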

Creating Data Frame from Matrix

A data frame can be created by converting from a matrix.

matrix_numbers <- matrix(1:12, nrow = 4, ncol = 3)
df_numbers <- as.data.frame(matrix_numbers)
colnames(df_numbers) <- c('Column 1', 'Column 2', 'Column 3') # Assign column names
df_numbers

##   Column 1 Column 2 Column 3
## 1        1        5        9
## 2        2        6       10
## 3        3        7       11
## 4        4        8       12

Creating Data Frame from File

A data frame can be created by importing data from a local file or from the web, such as a comma-separated values (CSV) file, using the read.csv() function.

# Import from a CSV file on local computer
audiobooks <- read.csv('audiobooks.csv')

# Import the same CSV file from the web
audiobooks <- read.csv('https://raw.githubusercontent.com/rickysoo/top_audiobooks/main/TopAudiobooks-20201107-122322.csv')

head(audiobooks) # Show first 6 rows

##   Rank            Title                                                Subtitle
## 1    1      Greenlights                                                    <NA>
## 2    2  The Ice Diaries  The Untold Story of the Cold War's Most Daring Mission
## 3    3      Tiny Habits                The Small Changes That Change Everything
## 4    4 A Time for Mercy                                   A Jack Brigance Novel
## 5    5     The Sentinel                                    A Jack Reacher Novel
## 6    6        Clanlands Whisky, Warfare, and a Scottish Adventure Like No Other
##                                                    Author
## 1                                     Matthew McConaughey
## 2    Captain William R. Anderson, Don Keith - contributor
## 3                                             BJ Fogg PhD
## 4                                            John Grisham
## 5                                 Lee Child, Andrew Child
## 6 Sam Heughan, Graham McTavish, Diana Gabaldon - foreword
##                                                  Narrator             Length
## 1                                     Matthew McConaughey  6 hrs and 42 mins
## 2                                           Roger Mueller   10 hrs and 1 min
## 3                                             BJ Fogg PhD 11 hrs and 22 mins
## 4                                            Michael Beck 19 hrs and 59 mins
## 5                                             Scott Brick 10 hrs and 39 mins
## 6 Graham McTavish, Sam Heughan, Diana Gabaldon - foreword 10 hrs and 22 mins
##    Release Language              Stars       Ratings  Price
## 1 10-20-20  English   5 out of 5 stars 7,172 ratings $28.00
## 2 10-15-19  English   5 out of 5 stars     6 ratings $30.79
## 3 01-14-20  English 4.5 out of 5 stars   742 ratings $34.95
## 4 10-13-20  English   5 out of 5 stars 3,534 ratings $31.50
## 5 10-27-20  English 4.5 out of 5 stars 1,394 ratings $31.50
## 6 11-03-20  English   5 out of 5 stars   158 ratings $21.81
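
To try read.csv() without a file at hand, its text argument can take CSV content directly. The sketch below uses made-up data, not the audiobooks file, and the name csv_text is chosen only for this example.

```r
# Made-up CSV content for illustration only
csv_text <- 'Title,Price
Book A,10.50
Book B,12.00'

books <- read.csv(text = csv_text) # Parse the text as if it were a file
str(books)                         # Inspect the column types chosen by read.csv()
```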

3. Exploring Data Frame

There are a number of functions to show the characteristics of a data frame.

# Data frame is a special case of list
typeof(members)

## [1] "list"

# The class is data.frame
class(members)

## [1] "data.frame"

# Check if "members" is a data frame
is.data.frame(members)

## [1] TRUE

# Number of columns
ncol(members)

## [1] 3

# Column names
names(members)

## [1] "Name"     "Height"   "Birthday"

# Number of rows
nrow(members)

## [1] 4

# Row names
row.names(members)

## [1] "1" "2" "3" "4"

# The dimension
dim(members)

## [1] 4 3

# The row and column names
dimnames(members)

## [[1]]
## [1] "1" "2" "3" "4"
## 
## [[2]]
## [1] "Name"     "Height"   "Birthday"

# The list of attributes
attributes(members)

## $names
## [1] "Name"     "Height"   "Birthday"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4

# Internal structure of the data frame
str(members)

## 'data.frame':    4 obs. of  3 variables:
##  $ Name    : chr  "Ricky" "Fatimah" "Kumanan" "Jamaine"
##  $ Height  : num  170 172 180 168
##  $ Birthday: Date, format: "1990-01-01" "1991-02-02" ...
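
For a compact view of the class of each column, sapply() can apply class() over the columns, since a data frame is a list of columns. A minimal sketch using the same members data:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Apply class() to each column and collect the results in a named vector
column_classes <- sapply(members, class)
column_classes
```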

4. Showing Data Frame

Base Package Functions

A number of functions can be used to show the whole or part of the data frame in order to examine the data.

print(members) # Print the whole data frame

##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
## 4 Jamaine    168 1994-04-04

View(members) # View the data frame in data viewer in RStudio

head(audiobooks, n = 3) # Show the first 3 rows of the data frame. The default is 6 rows.

##   Rank           Title                                               Subtitle
## 1    1     Greenlights                                                   <NA>
## 2    2 The Ice Diaries The Untold Story of the Cold War's Most Daring Mission
## 3    3     Tiny Habits               The Small Changes That Change Everything
##                                                 Author            Narrator
## 1                                  Matthew McConaughey Matthew McConaughey
## 2 Captain William R. Anderson, Don Keith - contributor       Roger Mueller
## 3                                          BJ Fogg PhD         BJ Fogg PhD
##               Length  Release Language              Stars       Ratings  Price
## 1  6 hrs and 42 mins 10-20-20  English   5 out of 5 stars 7,172 ratings $28.00
## 2   10 hrs and 1 min 10-15-19  English   5 out of 5 stars     6 ratings $30.79
## 3 11 hrs and 22 mins 01-14-20  English 4.5 out of 5 stars   742 ratings $34.95

tail(audiobooks, n = 3) # Show the last 3 rows of the data frame. The default is 6 rows.

##     Rank               Title
## 98    98         If You Tell
## 99    99 Think and Grow Rich
## 100  100     The Housekeeper
##                                                                           Subtitle
## 98  A True Story of Murder, Family Secrets, and the Unbreakable Bond of Sisterhood
## 99                                                                            <NA>
## 100                                               A Twisted Psychological Thriller
##              Author         Narrator             Length  Release Language
## 98      Gregg Olsen     Karen Peakes 10 hrs and 34 mins 12-01-19  English
## 99    Napoleon Hill Erik Synnestvedt  9 hrs and 35 mins 10-16-07  English
## 100 Natalie Barelli    Susie Berneis   8 hrs and 4 mins 01-02-20  English
##                  Stars        Ratings  Price
## 98  4.5 out of 5 stars 11,827 ratings $29.99
## 99  4.5 out of 5 stars 21,869 ratings $24.95
## 100   4 out of 5 stars  6,503 ratings $34.99

dplyr Package Functions

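The “dplyr” package provides some useful functions for showing a random sample of rows from a data frame. Since dplyr 1.0.0, sample_n() and sample_frac() are superseded by slice_sample(), but they still work.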

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sample_n(audiobooks, size = 3) # Sample 3 rows randomly from the data frame

##   Rank                                         Title
## 1   83                   A Kingdom of Flesh and Fire
## 2   78                                  The Practice
## 3   18 Harry Potter and the Sorcerer's Stone, Book 1
##                                        Subtitle                 Author
## 1 A Blood and Ash Novel (Blood and Ash, Book 2) Jennifer L. Armentrout
## 2                        Shipping Creative Work             Seth Godin
## 3                                          <NA>           J.K. Rowling
##        Narrator             Length  Release Language            Stars
## 1 Stina Nielsen 24 hrs and 21 mins 11-03-20  English 5 out of 5 stars
## 2    Seth Godin  5 hrs and 38 mins 11-03-20  English 5 out of 5 stars
## 3      Jim Dale  8 hrs and 18 mins 11-20-15  English 5 out of 5 stars
##           Ratings  Price
## 1      69 ratings $30.09
## 2      21 ratings $28.00
## 3 123,702 ratings $29.99

sample_frac(audiobooks, size = 0.05) # Sample 5% of the rows randomly from the data frame

##   Rank              Title Subtitle                  Author
## 1   82           Daylight     <NA>          David Baldacci
## 2   57        The Sandman     <NA> Neil Gaiman, Dirk Maggs
## 3   47 Sorry I Missed You     <NA>             Suzy Krause
## 4   33       Midnight Sun     <NA>         Stephenie Meyer
## 5   96  Then She Was Gone  A Novel             Lisa Jewell
##                                                                                                                        Narrator
## 1                                                                                                 Brittany Pressley, Kyf Brewer
## 2 Riz Ahmed, Kat Dennings, Taron Egerton, Neil Gaiman, James McAvoy, Samantha Morton, Bebe Neuwirth, Andy Serkis, Michael Sheen
## 3                                                                                                                Amanda Ronconi
## 4                                                                                                                     Jake Abel
## 5                                                                                                                    Helen Duff
##               Length  Release Language              Stars        Ratings  Price
## 1 11 hrs and 37 mins 11-17-20  English               <NA>  Not rated yet $30.79
## 2 10 hrs and 54 mins 07-15-20  English 4.5 out of 5 stars 19,230 ratings $34.95
## 3  9 hrs and 17 mins 06-01-20  English 4.5 out of 5 stars    406 ratings $25.19
## 4 25 hrs and 49 mins 08-04-20  English 4.5 out of 5 stars 11,835 ratings $34.21
## 5 10 hrs and 12 mins 04-17-18  English 4.5 out of 5 stars 53,332 ratings $34.99
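
A similar random sample can be drawn in base R without dplyr by sampling the row indexes. This is a sketch of an alternative, not what sample_n() does internally. The call to set.seed() is added only so that the result is repeatable.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

set.seed(123)                           # Fix the random seed for a repeatable sample
rows <- sample(nrow(members), size = 2) # Pick 2 row indexes without replacement
members[rows, ]                         # Show the sampled rows
```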

Showing Summary

A summary of the data frame can be shown using the summary() function. For character variables, it shows the length, the class and the storage mode of the column. The “Mode” shown here is not the most frequent value. For numeric and date variables, it shows the minimum, the maximum, the mean, the median and the quartiles.

summary(members)

##      Name               Height         Birthday         
##  Length:4           Min.   :168.0   Min.   :1990-01-01  
##  Class :character   1st Qu.:169.5   1st Qu.:1990-10-25  
##  Mode  :character   Median :171.0   Median :1992-02-17  
##                     Mean   :172.5   Mean   :1992-02-17  
##                     3rd Qu.:174.0   3rd Qu.:1993-06-10  
##                     Max.   :180.0   Max.   :1994-04-04
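
If the most frequent value of a character column is wanted instead of its storage mode, table() is one option for counting the values. A short sketch with a made-up vector named colours:

```r
colours <- c('red', 'blue', 'red', 'green', 'red') # Made-up character vector

mode(colours)                                    # Storage mode, as reported by summary()
counts <- table(colours)                         # Frequency of each value
most_frequent <- names(counts)[which.max(counts)] # Value with the highest count
most_frequent
```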

5. Accessing Data in Data Frame

Selecting Element(s)

Select a single element from a data frame by using the row and column indexes.

data <- members[1, 1] # Row 1, column 1
print(data)

## [1] "Ricky"

Select a single element from a data frame by using the double brackets notation to pick the column, followed by the row index.

data <- members[[1]][1] # Row 1, column 1
print(data)

## [1] "Ricky"

Select a single element from a data frame by using the column name in double brackets.

data <- members[['Name']][1] # Row 1, column 1
print(data)

## [1] "Ricky"

Select multiple elements from a data frame by using the row and column indexes.

data <- members[1:2, 1:2] # Rows 1 and 2, columns 1 and 2
print(data)

##      Name Height
## 1   Ricky    170
## 2 Fatimah    172

Column names can be used in selecting the columns.

data <- members[1:2, c('Name', 'Height')] # Rows 1 and 2, columns "Name" and "Height"
print(data)

##      Name Height
## 1   Ricky    170
## 2 Fatimah    172

Selecting Row(s)

Select a single row from a data frame by using the row index.

data <- members[1, ] # Row 1
print(data)

##    Name Height   Birthday
## 1 Ricky    170 1990-01-01

Select multiple rows from a data frame by using the row numbers.

data <- members[1:2, ] # Rows 1 and 2
print(data)

##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02

data <- members[c(1, 3), ] # Rows 1 and 3
print(data)

##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 3 Kumanan    180 1993-03-03

Selecting Column(s)

Select a single column from a data frame by using the column index.

data <- members[ , 1] # Column 1
print(data)

## [1] "Ricky"   "Fatimah" "Kumanan" "Jamaine"

Select multiple columns from a data frame by using the column indexes.

data <- members[ , 1:2] # Columns 1 and 2
print(data)

##      Name Height
## 1   Ricky    170
## 2 Fatimah    172
## 3 Kumanan    180
## 4 Jamaine    168

data <- members[, c(1, 3)] # Columns 1 and 3
print(data)

##      Name   Birthday
## 1   Ricky 1990-01-01
## 2 Fatimah 1991-02-02
## 3 Kumanan 1993-03-03
## 4 Jamaine 1994-04-04

Column names can be used in selecting the columns.

data <- members[ , c('Name', 'Birthday')] # Columns "Name" and "Birthday"
print(data)

##      Name   Birthday
## 1   Ricky 1990-01-01
## 2 Fatimah 1991-02-02
## 3 Kumanan 1993-03-03
## 4 Jamaine 1994-04-04

A single column can be selected using the column name in single brackets. The result is a data frame with one column.

data <- members['Name']
print(data)

##      Name
## 1   Ricky
## 2 Fatimah
## 3 Kumanan
## 4 Jamaine

Selecting Using Logical Vectors

Data can be selected using logical vectors, where TRUE keeps a row or column and FALSE drops it.

data <- members[c(TRUE, FALSE, TRUE, FALSE), c(TRUE, TRUE, FALSE)] # Select rows 1 and 3, and columns 1 and 2
print(data)

##      Name Height
## 1   Ricky    170
## 3 Kumanan    180

Selecting Using $ Operator

A column can be selected using the format dataframe$column.

data <- members$Name
print(data)

## [1] "Ricky"   "Fatimah" "Kumanan" "Jamaine"

Selecting Based on Condition

Data can be conditionally selected by placing a condition in the row position of the brackets.

data <- members[members$Height > 170, ] # Show members with height more than 170cm
data

##      Name Height   Birthday
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03

Data can be conditionally selected by using the subset() function, in which columns can be referred to by name directly.

data <- subset(members, Height > 170) # Show members with height more than 170cm
data

##      Name Height   Birthday
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
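
Conditions can be combined with & (and) and | (or), and subset() also accepts a select argument to keep only some columns. A minimal sketch using the same members data, where the name tall_early is chosen only for this example:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Taller than 168cm and born before 1993, using brackets
members[members$Height > 168 & members$Birthday < as.Date('1993-01-01'), ]

# Same rows with subset(), keeping only the Name and Height columns
tall_early <- subset(
  members,
  Height > 168 & Birthday < as.Date('1993-01-01'),
  select = c(Name, Height)
)
tall_early
```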

6. Data Frame vs Vector

Single Brackets

When single brackets [] are used with a single index or name, the data is returned as a data frame.

data <- members[1]
class(data)

## [1] "data.frame"

data <- members['Name']
class(data)

## [1] "data.frame"

Double Brackets

When double brackets [[]] are used, the data is returned as a vector.

data <- members[[1]]
class(data)

## [1] "character"

data <- members[['Name']]
class(data)

## [1] "character"

$ Notation

When the $ notation is used, the data is returned as a vector.

data <- members$Name
class(data)

## [1] "character"

Selecting Single Row

Selecting a single row returns a data frame.

data <- members[1, ]
class(data)

## [1] "data.frame"

Use “drop = TRUE” when selecting a row to return a list instead of a data frame. The result is a list rather than an atomic vector because the columns can have different types.

data <- members[1, , drop = TRUE]
class(data)

## [1] "list"
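
If an atomic vector is really needed from a row, unlist() can flatten the list, but all values are then coerced to one common type. In this sketch everything becomes character because Name is character, and the date loses its Date class.

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

row_list <- members[1, , drop = TRUE] # A list with one item per column
row_vector <- unlist(row_list)        # Flatten the list to an atomic vector
class(row_vector)                     # All values now share one type
```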

Selecting Single Column

Selecting a single column returns a vector.

data <- members[ , 1]
class(data)

## [1] "character"

Use “drop = FALSE” when selecting a column to return a data frame instead of a vector.

data <- members[, 1, drop = FALSE] # Set drop = FALSE
class(data)

## [1] "data.frame"

Selecting Single Element

Selecting a single element returns a vector.

data <- members[1, 1]
class(data)

## [1] "character"

Use “drop = FALSE” when selecting an element to return a data frame instead of a vector.

data <- members[1, 1, drop = FALSE]
class(data)

## [1] "data.frame"

7. More Fun with Data Frame!

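Data can be sorted using the order() function. Given a column, it returns the row indexes in ascending order of the values, and those indexes are then used to select the rows.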

height_order <- order(members$Height)
print(height_order)

## [1] 4 1 2 3

members[height_order, ]

##      Name Height   Birthday
## 4 Jamaine    168 1994-04-04
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
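
order() can also sort in descending order through its decreasing argument. A short sketch with the same data, where the name tallest_first is chosen only for this example:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Row indexes from the tallest to the shortest member
tallest_first <- members[order(members$Height, decreasing = TRUE), ]
tallest_first
```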

Quickly visualize the data in a data frame using the plot() function!

plot(members)

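Don’t forget to save any updated data frame to a CSV file by using the write.csv() function. Setting row.names = FALSE keeps the row numbers from being written as an extra column.

write.csv(members, 'members.csv', row.names = FALSE)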
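
To check what was saved, the file can be read back with read.csv(). The sketch below writes to a temporary file created by tempfile(), so that no real file is overwritten. The date column comes back as text and is converted again with as.Date().

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

path <- tempfile(fileext = '.csv')          # Temporary file path for this sketch
write.csv(members, path, row.names = FALSE) # Do not write row numbers as a column

restored <- read.csv(path)                      # Read the saved file back
restored$Birthday <- as.Date(restored$Birthday) # Convert the text back to dates
str(restored)
```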

And finally, it’s…

The end!

Picture Source: https://www.pexels.com