A data frame is a table-like, two-dimensional data structure in R. It is a list of vectors of equal length: each vector forms a column, and the elements at the same position across the vectors form a row. Each column holds the values of one variable, and each row holds one set of values, one from each column. Data frames are used to store data tables.
Characteristics of a data frame include:
Column names should be non-empty;
Row names should be unique;
Data stored in a data frame can be of numeric, logical, character or factor type;
Each column should contain the same number of data items.
A data frame is created by using the data.frame() function.
x <- data.frame("Food"=c("Burger","Laksa","Ramen","Satay"), "Score"=c(60,30,60,90), "Pass"=c(TRUE,FALSE,TRUE,TRUE))
#Return x.
print(x)
## Food Score Pass
## 1 Burger 60 TRUE
## 2 Laksa 30 FALSE
## 3 Ramen 60 TRUE
## 4 Satay 90 TRUE
#The header, which is the top line of the table, displays the name of each column.
#Each line after the header is a data row, which begins with the name of the row followed by the actual data.
#Each data member of a row is called a cell.
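The characteristics listed earlier can be checked directly. The following is a minimal sketch; the helper fails() and the sample values are illustrative and not part of the example above. It shows that data.frame() rejects columns whose lengths cannot be aligned, and that duplicate row names are refused.

```r
# Illustrative helper: TRUE if evaluating the expression raises an error.
fails <- function(expr) {
  tryCatch({
    expr
    FALSE
  }, error = function(e) TRUE)
}

# Lengths 3 and 2 cannot be aligned (2 does not divide 3), so data.frame() stops.
unequal_lengths <- fails(
  data.frame(Food = c("Burger", "Laksa", "Ramen"), Score = c(60, 30))
)

# Row names must be unique, so duplicated row names also stop data.frame().
duplicate_row_names <- fails(
  data.frame(Food = c("Burger", "Laksa"), row.names = c("Cafe1", "Cafe1"))
)

print(c(unequal_lengths = unequal_lengths,
        duplicate_row_names = duplicate_row_names))
```

Note that data.frame() recycles a shorter column when its length divides the longest one exactly, so only lengths that do not fit raise an error.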
The structure of a data frame is returned by using the str() function.
str(x)
## 'data.frame': 4 obs. of 3 variables:
## $ Food : chr "Burger" "Laksa" "Ramen" "Satay"
## $ Score: num 60 30 60 90
## $ Pass : logi TRUE FALSE TRUE TRUE
A statistical summary of each column, along with the nature of its data, is returned by using the summary() function.
summary(x)
## Food Score Pass
## Length:4 Min. :30.0 Mode :logical
## Class :character 1st Qu.:52.5 FALSE:1
## Mode :character Median :60.0 TRUE :3
## Mean :60.0
## 3rd Qu.:67.5
## Max. :90.0
The names of the variables in a data frame are returned by using the names() function.
names(x)
## [1] "Food" "Score" "Pass"
#The results are the column headers of data frame x.
The number of columns in a data frame is returned by using the ncol() function.
ncol(x)
## [1] 3
The number of rows in a data frame is returned by using the nrow() function.
nrow(x)
## [1] 4
Because a data frame is a list of columns, the length() function returns the number of list components, that is, the number of variables (columns).
length(x)
## [1] 3
The dimensions of a data frame are returned by using the dim() function.
dim(x)
## [1] 4 3
#The results reflect that data frame x is a two-dimensional 4 rows x 3 columns table.
The class of an object, which determines how it is treated by R's object-oriented functions, is returned by using the class() function.
class(x)
## [1] "data.frame"
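The internal storage type of an object is returned by using the typeof() function. For a data frame it is "list", since a data frame is stored as a list of columns.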
typeof(x)
## [1] "list"
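The list nature of a data frame can be confirmed with a few checks. This is a minimal sketch; the data frame cafes is an illustrative stand-in for x, and stringsAsFactors=FALSE is passed explicitly so that the result does not depend on the R version.

```r
cafes <- data.frame(
  Food = c("Burger", "Laksa"),
  Score = c(60, 30),
  Pass = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A data frame is a list whose components are its columns.
print(is.list(cafes))
print(length(cafes) == ncol(cafes))

# List-style iteration therefore visits one column at a time.
column_classes <- sapply(cafes, class)
print(column_classes)
```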
Row names are assigned by using the row.names() function.
row.names(x) <- c("Cafe1","Cafe2","Cafe3","Cafe4")
#Return x.
print(x)
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 60 TRUE
## Cafe4 Satay 90 TRUE
Components of data frame can be accessed like a list or like a matrix.
When accessing like a list, either single bracket [, double bracket [[ or dollar sign $ operator is used to access columns of data frame.
Single bracket [ example:
#Return second column of data frame x.
x[2]
## Score
## Cafe1 60
## Cafe2 30
## Cafe3 60
## Cafe4 90
#Return second and third column of data frame x.
x[c(2,3)]
## Score Pass
## Cafe1 60 TRUE
## Cafe2 30 FALSE
## Cafe3 60 TRUE
## Cafe4 90 TRUE
Data can also be extracted from specific column of data frame using column name.
#Return column of Food.
x["Food"]
## Food
## Cafe1 Burger
## Cafe2 Laksa
## Cafe3 Ramen
## Cafe4 Satay
#Return column of Food and Score.
x[c("Food","Score")]
## Food Score
## Cafe1 Burger 60
## Cafe2 Laksa 30
## Cafe3 Ramen 60
## Cafe4 Satay 90
Double bracket [[ or dollar sign $ example:
#Return all components in first column.
x[[1]]
## [1] "Burger" "Laksa" "Ramen" "Satay"
#Return second component in first column.
x[[1]][2]
## [1] "Laksa"
x[[c(1,2)]]
## [1] "Laksa"
#Both x[[1]][2] and x[[c(1,2)]] return the same result, because [[ with a vector index is applied recursively: first component 1, then element 2 of it.
#Return all components of Score.
x[["Score"]]
## [1] 60 30 60 90
#Return third component of Score.
x[["Score"]][3]
## [1] 60
#Return all components of Pass.
x$Pass
## [1] TRUE FALSE TRUE TRUE
#Return fourth component of Pass.
x$Pass[4]
## [1] TRUE
Accessing with the double bracket [[ and with the dollar sign $ is similar: both reduce the selected column to a vector. The single bracket [ differs in that it returns a data frame.
When accessing like a matrix, the row and column indices are given inside the bracket, separated by a comma.
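The difference in return type can be verified directly. The following is a minimal sketch in which the data frame cafes and the variable names are illustrative.

```r
cafes <- data.frame(
  Food = c("Burger", "Laksa", "Ramen"),
  Score = c(60, 30, 60),
  stringsAsFactors = FALSE
)

by_single <- cafes["Score"]    # single bracket: one-column data frame
by_double <- cafes[["Score"]]  # double bracket: numeric vector
by_dollar <- cafes$Score       # dollar sign: numeric vector

print(class(by_single))
print(class(by_double))
print(identical(by_double, by_dollar))
```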
#Return second and third row.
x[2:3,] #Leaving the column index blank selects all columns.
## Food Score Pass
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 60 TRUE
#Return second and third row; second column.
x[2:3,2]
## [1] 30 60
#Return second and third column.
x[,2:3] #Leaving the row index blank selects all rows.
## Score Pass
## Cafe1 60 TRUE
## Cafe2 30 FALSE
## Cafe3 60 TRUE
## Cafe4 90 TRUE
#Return second row; second and third column.
x[2,2:3]
## Score Pass
## Cafe2 30 FALSE
#Return entire row of Cafe2.
x["Cafe2",]
## Food Score Pass
## Cafe2 Laksa 30 FALSE
#Return second column of row Cafe2.
x["Cafe2",2]
## [1] 30
#Return second row and column of Food.
x[2,"Food"]
## [1] "Laksa"
#same as x[["Food"]][2] and x$Food[2]
In the cases x[2:3,2], x["Cafe2",2] and x[2,"Food"], the returned value is a vector and not a data frame, since the data was extracted from a single column.
class(x[2:3,])
## [1] "data.frame"
class(x[2:3,2])
## [1] "numeric"
class(x[,2:3])
## [1] "data.frame"
class(x[2,2:3])
## [1] "data.frame"
class(x["Cafe2",])
## [1] "data.frame"
class(x["Cafe2",2])
## [1] "numeric"
class(x[2,"Food"])
## [1] "character"
This behavior can be avoided by passing the argument drop=FALSE as follows.
x[2:3,2,drop=FALSE]
## Score
## Cafe2 30
## Cafe3 60
class(x[2:3,2,drop=FALSE])
## [1] "data.frame"
x["Cafe2",2,drop=FALSE]
## Score
## Cafe2 30
class(x["Cafe2",2,drop=FALSE])
## [1] "data.frame"
x[2,"Food",drop=FALSE]
## Food
## Cafe2 Laksa
class(x[2,"Food",drop=FALSE])
## [1] "data.frame"
A negative index inside the bracket selects the entire data frame except the rows or columns at that index.
#Return all rows and columns except the second column.
x[-2]
## Food Pass
## Cafe1 Burger TRUE
## Cafe2 Laksa FALSE
## Cafe3 Ramen TRUE
## Cafe4 Satay TRUE
x[,-2]
## Food Pass
## Cafe1 Burger TRUE
## Cafe2 Laksa FALSE
## Cafe3 Ramen TRUE
## Cafe4 Satay TRUE
#Both x[-2] and x[,-2] return the same result.
#x[-2] is accessed like a list while x[,-2] is accessed like a matrix.
#Return all rows and columns except the third row.
x[-3,]
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe4 Satay 90 TRUE
#Return all rows and columns except the second and third rows.
x[-c(2,3),]
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe4 Satay 90 TRUE
x[c(-2,-3),]
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe4 Satay 90 TRUE
#Both x[-c(2,3),] and x[c(-2,-3),] return the same result.
#They differ only in how the index argument is written.
#Return all rows and columns except the first and third columns.
x[c(-1,-3)]
## Score
## Cafe1 60
## Cafe2 30
## Cafe3 60
## Cafe4 90
class(x[c(-1,-3)])
## [1] "data.frame"
x[,c(-1,-3)]
## [1] 60 30 60 90
class(x[,c(-1,-3)])
## [1] "numeric"
#Although x[c(-1,-3)] and x[,c(-1,-3)] return the same values, x[c(-1,-3)] is a data frame while x[,c(-1,-3)] is a vector.
#This is because x[c(-1,-3)] is accessed like a list, while x[,c(-1,-3)] is accessed like a matrix, and extracting a single column that way returns a vector.
It is possible to filter the rows of a data frame by a condition. The rows and columns to return are specified inside the bracket that follows the name of the data frame.
#Return rows in which Score is more than 40.
x[x$Score>40,]
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe3 Ramen 60 TRUE
## Cafe4 Satay 90 TRUE
#Return rows in which Pass is TRUE.
x[x$Pass==TRUE,]
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe3 Ramen 60 TRUE
## Cafe4 Satay 90 TRUE
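Conditions can also be combined. The following is a minimal sketch, assuming the same values as the original data frame but stored under the illustrative name cafes; it uses & (and), | (or) and ! (not) on logical vectors.

```r
cafes <- data.frame(
  Food = c("Burger", "Laksa", "Ramen", "Satay"),
  Score = c(60, 30, 60, 90),
  Pass = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# & keeps rows where both conditions hold.
high_and_pass <- cafes[cafes$Score > 60 & cafes$Pass, ]

# | keeps rows where at least one condition holds.
low_or_fail <- cafes[cafes$Score < 50 | !cafes$Pass, ]

# A logical column can be used directly; comparing it with TRUE is optional.
passed <- cafes[cafes$Pass, ]

print(high_and_pass)
print(low_or_fail)
print(passed)
```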
The first n rows of a data frame are returned by using the head() function.
#Return first 3 rows of data frame x.
head(x,n=3) #n by default = 6
## Food Score Pass
## Cafe1 Burger 60 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 60 TRUE
The last n rows of a data frame are returned by using the tail() function.
#Return last 2 rows of data frame x.
tail(x,n=2)
## Food Score Pass
## Cafe3 Ramen 60 TRUE
## Cafe4 Satay 90 TRUE
Modify data frame. A data frame is modified through reassignment, in the same way as a matrix.
#Modify the first value in Food to Pizza.
x[["Food"]][1] <- "Pizza"
#Return x
print(x)
## Food Score Pass
## Cafe1 Pizza 60 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 60 TRUE
## Cafe4 Satay 90 TRUE
#Modify the third value in Pass to FALSE.
x$Pass[3] <- FALSE
#Return x
print(x)
## Food Score Pass
## Cafe1 Pizza 60 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 60 FALSE
## Cafe4 Satay 90 TRUE
#Modify the values in the first to third rows of the second column to 70, 20 and 70.
x[1:3,2] <- c(70,20,70)
#Return x
print(x)
## Food Score Pass
## Cafe1 Pizza 70 TRUE
## Cafe2 Laksa 20 FALSE
## Cafe3 Ramen 70 FALSE
## Cafe4 Satay 90 TRUE
#Modify every value in column Score by adding 10.
x$Score <- x$Score + 10
#Return x
print(x)
## Food Score Pass
## Cafe1 Pizza 80 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 80 FALSE
## Cafe4 Satay 100 TRUE
Subset data frame. It is possible to subset a data frame based on whether or not a certain condition is true by using the subset() function.
#Subset data frame with the condition that Pass is TRUE.
subset(x,Pass==TRUE)
## Food Score Pass
## Cafe1 Pizza 80 TRUE
## Cafe4 Satay 100 TRUE
#Subset data frame with the condition that Score is more than 50.
subset(x,Score>50)
## Food Score Pass
## Cafe1 Pizza 80 TRUE
## Cafe3 Ramen 80 FALSE
## Cafe4 Satay 100 TRUE
Rows whose values contain a given pattern are selected by using the grep() function, which returns the positions of the matching elements.
#Return a data frame t which includes the rows whose Score contains "8".
t <- x[grep("8",x$Score),]
#Return t
print(t)
## Food Score Pass
## Cafe1 Pizza 80 TRUE
## Cafe3 Ramen 80 FALSE
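Since grep() matches the pattern anywhere in the value, it can select more rows than intended. The following is a minimal sketch with illustrative data (the Score of 18 is made up to show the effect); it compares grep() with grepl() and anchors the pattern with ^ to match only values that start with 8.

```r
cafes <- data.frame(
  Food = c("Pizza", "Laksa", "Ramen", "Satay"),
  Score = c(80, 30, 18, 100),
  stringsAsFactors = FALSE
)

# grep() returns the positions of the matches; numbers are matched as text.
positions <- grep("8", cafes$Score)

# grepl() returns one TRUE/FALSE per element, which combines well with & and |.
flags <- grepl("8", cafes$Score)

# "^8" only matches values whose first character is 8.
starts_with_8 <- cafes[grepl("^8", cafes$Score), ]

print(positions)
print(flags)
print(starts_with_8)
```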
Rows can be kept by using a logical vector.
#Keep the first, second and fourth rows using a logical vector.
x[c(TRUE,TRUE,FALSE,TRUE),]
## Food Score Pass
## Cafe1 Pizza 80 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe4 Satay 100 TRUE
The opposite rows can be kept by adding the exclamation mark !, which negates the logical vector.
#Keep the remaining row by negating the logical vector.
x[!c(TRUE,TRUE,FALSE,TRUE),]
## Food Score Pass
## Cafe3 Ramen 80 FALSE
Add new components into data frame. Rows are added to a data frame by using the rbind() function, which returns a new data frame and leaves x unchanged.
rbind(x,list("Rojak",60,TRUE))
## Food Score Pass
## Cafe1 Pizza 80 TRUE
## Cafe2 Laksa 30 FALSE
## Cafe3 Ramen 80 FALSE
## Cafe4 Satay 100 TRUE
## 5 Rojak 60 TRUE
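To keep the added row and give it a meaningful row name, the new row can be built as a one-row data frame and the result of rbind() assigned to a variable. This is a minimal sketch; the names cafes, new_row and extended, and the row name Cafe3, are illustrative.

```r
cafes <- data.frame(
  Food = c("Pizza", "Laksa"),
  Score = c(80, 30),
  Pass = c(TRUE, FALSE),
  row.names = c("Cafe1", "Cafe2"),
  stringsAsFactors = FALSE
)

# A one-row data frame with matching column names and its own row name.
new_row <- data.frame(
  Food = "Rojak", Score = 60, Pass = TRUE,
  row.names = "Cafe3",
  stringsAsFactors = FALSE
)

# rbind() returns a new data frame; cafes itself is not modified.
extended <- rbind(cafes, new_row)

print(extended)
print(nrow(cafes))
```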
Similarly, columns are added to a data frame by using the cbind() function.
cbind(x,"Customer"=seq(1,10,length.out=4)) #seq() returns 4 equally spaced numeric values starting at 1 and ending at 10.
## Food Score Pass Customer
## Cafe1 Pizza 80 TRUE 1
## Cafe2 Laksa 30 FALSE 4
## Cafe3 Ramen 80 FALSE 7
## Cafe4 Satay 100 TRUE 10
Since a data frame is implemented as a list, new columns are added through simple list-like assignments.
x$Price <- seq(1,8,by=2) #seq() returns values starting from 1 in increments of 2, not exceeding 8.
#Return x.
print(x)
## Food Score Pass Price
## Cafe1 Pizza 80 TRUE 1
## Cafe2 Laksa 30 FALSE 3
## Cafe3 Ramen 80 FALSE 5
## Cafe4 Satay 100 TRUE 7
x$Recommendation <- c("Yes", "No", "No", "Yes")
#Return x.
print(x)
## Food Score Pass Price Recommendation
## Cafe1 Pizza 80 TRUE 1 Yes
## Cafe2 Laksa 30 FALSE 3 No
## Cafe3 Ramen 80 FALSE 5 No
## Cafe4 Satay 100 TRUE 7 Yes
Delete components in data frame. A column is deleted by assigning NULL to it.
#Delete Recommendation column.
x$Recommendation <- NULL
#Return x.
print(x)
## Food Score Pass Price
## Cafe1 Pizza 80 TRUE 1
## Cafe2 Laksa 30 FALSE 3
## Cafe3 Ramen 80 FALSE 5
## Cafe4 Satay 100 TRUE 7
Similarly, rows are deleted by negative indexing and reassignment.
#Delete the first row and assign the result to a new variable h.
h <- x[-1,]
#Return h.
print(h)
## Food Score Pass Price
## Cafe2 Laksa 30 FALSE 3
## Cafe3 Ramen 80 FALSE 5
## Cafe4 Satay 100 TRUE 7
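Negative indices only work with positions, not with row names. To delete rows by name, a logical vector built with %in% can be used instead. This is a minimal sketch with illustrative names (cafes, remaining).

```r
cafes <- data.frame(
  Food = c("Pizza", "Laksa", "Ramen"),
  Score = c(80, 30, 80),
  row.names = c("Cafe1", "Cafe2", "Cafe3"),
  stringsAsFactors = FALSE
)

# Keep every row whose name is not in the list of names to delete.
to_delete <- c("Cafe1")
remaining <- cafes[!(row.names(cafes) %in% to_delete), ]

print(remaining)
```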
When presenting a data frame, it is highly recommended to summarize and visualize its numeric values in order to get a better overview of the data.
The mean is returned by using the mean() function.
mean(x$Score)
## [1] 72.5
The median is returned by using the median() function.
median(x$Score)
## [1] 80
The sample variance is returned by using the var() function.
var(x$Score)
## [1] 891.6667
The sample standard deviation is returned by using the sd() function.
sd(x$Score)
## [1] 29.86079
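The values returned by var() and sd() can be reproduced by hand. The following is a minimal sketch using the same four scores; var() divides the sum of squared deviations by n - 1, and sd() is the square root of that. The variable names are illustrative.

```r
scores <- c(80, 30, 80, 100)
n <- length(scores)

# Sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((scores - mean(scores))^2) / (n - 1)

# The standard deviation is the square root of the variance.
manual_sd <- sqrt(manual_var)

print(c(manual = manual_var, builtin = var(scores)))
print(c(manual = manual_sd, builtin = sd(scores)))
```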
Sample quantiles are returned by using the quantile() function. By default it returns the minimum, the three quartiles and the maximum.
quantile(x$Score)
## 0% 25% 50% 75% 100%
## 30.0 67.5 80.0 85.0 100.0
A shortcut to most of the above results is the summary() function.
summary(x)
## Food Score Pass Price
## Length:4 Min. : 30.0 Mode :logical Min. :1.0
## Class :character 1st Qu.: 67.5 FALSE:2 1st Qu.:2.5
## Mode :character Median : 80.0 TRUE :2 Median :4.0
## Mean : 72.5 Mean :4.0
## 3rd Qu.: 85.0 3rd Qu.:5.5
## Max. :100.0 Max. :7.0
Numeric values are summed up by using the sum() function.
sum(x$Score)
## [1] 290
The distinct values in a column are counted and displayed as a table by using the table() function.
table(x$Food)
##
## Laksa Pizza Ramen Satay
## 1 1 1 1
table(x$Score)
##
## 30 80 100
## 1 2 1
table(x$Pass)
##
## FALSE TRUE
## 2 2
table(x$Price)
##
## 1 3 5 7
## 1 1 1 1
A histogram is produced by using the hist() function.
hist(x$Score, breaks=10, col="blue", main="Histogram of Score of Food", xlab="Score of Food")
A stem-and-leaf plot is produced by using the stem() function.
stem(x$Score)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 0
## 4 |
## 6 |
## 8 | 00
## 10 | 0
Mosaic plot is produced by using mosaicplot() function.
mosaicplot(x$Score)
A bar graph is produced by using the barplot() function.
barplot(x$Score, col=c("Red","Orange","Yellow","Green"), names.arg=x$Food)
A box plot is produced by using the boxplot() function.
boxplot(x$Score, col="Purple")
A scatter plot is produced by using the plot() function.
plot(Score~Price, data=x, col="red", main="Relationship between Score and Price", ylab="Score", xlab="Price")
Numeric columns can be extracted and a function applied to each of their rows or columns by using the apply() function. The second argument is 1 for rows and 2 for columns.
#Extract all numeric values in data frame.
z <- x[c("Score","Price")]
#Return z.
print(z)
## Score Price
## Cafe1 80 1
## Cafe2 30 3
## Cafe3 80 5
## Cafe4 100 7
#Return the median of each column of z.
apply(z, 2, median)
## Score Price
## 80 4
#Return the median of each row of z.
apply(z, 1, median)
## Cafe1 Cafe2 Cafe3 Cafe4
## 40.5 16.5 42.5 53.5
#Return the mean of each column of z.
apply(z, 2, mean)
## Score Price
## 72.5 4.0
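For column-wise results there are alternatives that avoid the conversion to a matrix that apply() performs. The following is a minimal sketch, assuming a numeric-only data frame with the same values as z; colMeans() and sapply() are base R functions, and the result names are illustrative.

```r
z <- data.frame(Score = c(80, 30, 80, 100), Price = c(1, 3, 5, 7))

# apply() over columns, as in the text above.
by_apply <- apply(z, 2, mean)

# colMeans() computes the same column means directly.
by_colmeans <- colMeans(z)

# sapply() treats the data frame as a list and visits one column at a time.
medians <- sapply(z, median)

print(by_apply)
print(by_colmeans)
print(medians)
```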
Rows are sorted in ascending order by using the order() function, which returns the row positions in sorted order.
#Sort by Score in ascending order.
x[order(x$Score),]
## Food Score Pass Price
## Cafe2 Laksa 30 FALSE 3
## Cafe1 Pizza 80 TRUE 1
## Cafe3 Ramen 80 FALSE 5
## Cafe4 Satay 100 TRUE 7
Rows are sorted in descending order by passing the argument decreasing=TRUE as follows.
#Sort by Price in descending order.
x[order(x$Price, decreasing=TRUE),] #decreasing=FALSE by default.
## Food Score Pass Price
## Cafe4 Satay 100 TRUE 7
## Cafe3 Ramen 80 FALSE 5
## Cafe2 Laksa 30 FALSE 3
## Cafe1 Pizza 80 TRUE 1
A descending sort can also be obtained by using the rev() function together with the order() function.
#Sort by Price in descending order.
x[rev(order(x$Price)),]
## Food Score Pass Price
## Cafe4 Satay 100 TRUE 7
## Cafe3 Ramen 80 FALSE 5
## Cafe2 Laksa 30 FALSE 3
## Cafe1 Pizza 80 TRUE 1
Negating a numeric column inside the order() function also sorts it in descending order.
#Sort by Score in descending order.
x[order(-x$Score),]
## Food Score Pass Price
## Cafe4 Satay 100 TRUE 7
## Cafe1 Pizza 80 TRUE 1
## Cafe3 Ramen 80 FALSE 5
## Cafe2 Laksa 30 FALSE 3
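In the output above the two rows with a Score of 80 keep their original relative order. A second sort key can be passed to order() to decide such ties. This is a minimal sketch with the same values under the illustrative name cafes; it sorts by Score in descending order and breaks ties by Price, also in descending order.

```r
cafes <- data.frame(
  Food = c("Pizza", "Laksa", "Ramen", "Satay"),
  Score = c(80, 30, 80, 100),
  Price = c(1, 3, 5, 7),
  stringsAsFactors = FALSE
)

# Later arguments of order() are only used to break ties in earlier ones.
sorted <- cafes[order(-cafes$Score, -cafes$Price), ]

print(sorted)
```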
Two data frames are joined together by using the merge() function, which by default matches rows on the columns that share a name in both data frames.
#Create data frame y.
y <- data.frame("Price"=c(1,3,5,7), "Cuisine"=c("Western","Indonesian","Japanese","Malaysian"))
#Return y.
print(y)
## Price Cuisine
## 1 1 Western
## 2 3 Indonesian
## 3 5 Japanese
## 4 7 Malaysian
Data frame x and data frame y are joined on their common column Price.
merge(x,y)
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 2 3 Laksa 30 FALSE Indonesian
## 3 5 Ramen 80 FALSE Japanese
## 4 7 Satay 100 TRUE Malaysian
The merge above is a perfect match since every value in one Price column has a match in the other.
If the values in the two Price columns do not all match, the arguments all, all.x and all.y control which unmatched rows are kept.
To demonstrate, the values in data frame y are changed by reassignment.
y$Price <- c(1,4,5,8)
#Return y.
print(y)
## Price Cuisine
## 1 1 Western
## 2 4 Indonesian
## 3 5 Japanese
## 4 8 Malaysian
Natural join or inner join. Only rows that match in both data frames are kept; specify argument all=FALSE.
merge(x,y) #By default all=FALSE.
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 2 5 Ramen 80 FALSE Japanese
Full outer join or outer join. All rows from both data frames are kept and missing values are filled with NA; specify argument all=TRUE.
merge(x,y,all=TRUE)
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 2 3 Laksa 30 FALSE <NA>
## 3 4 <NA> NA NA Indonesian
## 4 5 Ramen 80 FALSE Japanese
## 5 7 Satay 100 TRUE <NA>
## 6 8 <NA> NA NA Malaysian
Left outer join or left join. All rows of data frame x and only those from y that match are kept; specify argument all.x=TRUE.
merge(x,y,all.x=TRUE)
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 2 3 Laksa 30 FALSE <NA>
## 3 5 Ramen 80 FALSE Japanese
## 4 7 Satay 100 TRUE <NA>
Right outer join or right join. All rows of data frame y and only those from x that match are kept; specify argument all.y=TRUE.
merge(x,y,all.y=TRUE)
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 2 4 <NA> NA NA Indonesian
## 3 5 Ramen 80 FALSE Japanese
## 4 8 <NA> NA NA Malaysian
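The examples above rely on both data frames naming the key column Price. When the key columns have different names, the by.x and by.y arguments of merge() name the key in each data frame. This is a minimal sketch; the data frames menu and cuisines and the column name Cost are made up for illustration.

```r
menu <- data.frame(
  Food = c("Pizza", "Ramen"),
  Price = c(1, 5),
  stringsAsFactors = FALSE
)
cuisines <- data.frame(
  Cost = c(1, 5, 8),
  Cuisine = c("Western", "Japanese", "Malaysian"),
  stringsAsFactors = FALSE
)

# Inner join matching menu$Price against cuisines$Cost.
# The key column in the result takes its name from by.x.
joined <- merge(menu, cuisines, by.x = "Price", by.y = "Cost")

print(joined)
```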
All rows that contain NA values are removed by using the na.omit() function.
#Take the example from merge(x,y,all=TRUE).
q <- merge(x,y,all=TRUE)
#Return q.
print(q)
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 2 3 Laksa 30 FALSE <NA>
## 3 4 <NA> NA NA Indonesian
## 4 5 Ramen 80 FALSE Japanese
## 5 7 Satay 100 TRUE <NA>
## 6 8 <NA> NA NA Malaysian
na.omit(q)
## Price Food Score Pass Cuisine
## 1 1 Pizza 80 TRUE Western
## 4 5 Ramen 80 FALSE Japanese
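na.omit() drops a row if any of its columns is NA. When only some columns matter, is.na() on those columns gives finer control, and complete.cases() gives the same selection as na.omit() in the form of a logical vector. This is a minimal sketch with a small illustrative data frame named merged.

```r
merged <- data.frame(
  Price = c(1, 3, 4),
  Food = c("Pizza", "Laksa", NA),
  Cuisine = c("Western", NA, "Indonesian"),
  stringsAsFactors = FALSE
)

# TRUE for rows without any NA; selects the same rows as na.omit().
no_missing <- merged[complete.cases(merged), ]

# Only require Food to be present; NA in other columns is tolerated.
has_food <- merged[!is.na(merged$Food), ]

print(no_missing)
print(has_food)
```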
Multiple columns of a data frame are concatenated into a single column by using the stack() function. The result has a values column and an ind column that records the source column of each value.
#Concatenate Score and Price into a single column.
stack(x,select=c(Score,Price))
## values ind
## 1 80 Score
## 2 30 Score
## 3 80 Score
## 4 100 Score
## 5 1 Price
## 6 3 Price
## 7 5 Price
## 8 7 Price
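The reverse operation is available through the unstack() function. This is a minimal sketch, assuming a numeric-only data frame with the same values as the Score and Price columns; the names long and wide are illustrative.

```r
z <- data.frame(Score = c(80, 30, 80, 100), Price = c(1, 3, 5, 7))

# Long form: one values column and one ind column naming the source column.
long <- stack(z)

# unstack() splits the values column by ind, giving one column per source.
wide <- unstack(long)

print(dim(long))
print(wide)
```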
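A data frame is converted into a matrix by using the as.matrix() function. Because a matrix holds a single type and x has a character column, every value is converted to character.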
matrix_x <- as.matrix(x)
#Return matrix_x.
print(matrix_x)
## Food Score Pass Price
## Cafe1 "Pizza" " 80" "TRUE" "1"
## Cafe2 "Laksa" " 30" "FALSE" "3"
## Cafe3 "Ramen" " 80" "FALSE" "5"
## Cafe4 "Satay" "100" "TRUE" "7"
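To obtain a numeric matrix, only the numeric columns should be passed to as.matrix(). This is a minimal sketch with the same values under the illustrative name cafes.

```r
cafes <- data.frame(
  Food = c("Pizza", "Laksa", "Ramen", "Satay"),
  Score = c(80, 30, 80, 100),
  Price = c(1, 3, 5, 7),
  stringsAsFactors = FALSE
)

# With a character column present, the whole matrix becomes character.
mixed_matrix <- as.matrix(cafes)

# With numeric columns only, the matrix stays numeric.
numeric_matrix <- as.matrix(cafes[c("Score", "Price")])

print(typeof(mixed_matrix))
print(typeof(numeric_matrix))
print(colSums(numeric_matrix))
```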
A data frame is converted into a list by using the as.list() function. Each column becomes one named component of the list.
list_x <- as.list(x)
#Return list_x.
print(list_x)
## $Food
## [1] "Pizza" "Laksa" "Ramen" "Satay"
##
## $Score
## [1] 80 30 80 100
##
## $Pass
## [1] TRUE FALSE FALSE TRUE
##
## $Price
## [1] 1 3 5 7