WQD7004 ASSIGNMENT 1

Part 1: Introduction

A data frame is a table, or two-dimensional array-like data structure, in R. It is a list of vectors of equal length. Each component forms a column, and the contents of the components form the rows. Each column holds the values of one variable, and each row holds one set of values, one from each column. Data frames are used to store data tables.

Characteristics of a data frame include:
Column names should be non-empty;
Row names should be unique;
Data stored in a data frame can be of numeric, logical, character or factor type;
Each column should contain the same number of data items.
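
The characteristics above can be checked with a minimal sketch. The data frame df and its columns a and b are made-up names used only for illustration.

```r
# Columns whose lengths cannot be recycled to a common length are rejected.
unequal <- tryCatch(data.frame(a = 1:3, b = 1:2),
                    error = function(e) "error")

# Duplicate row names are rejected as well.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
duplicated_names <- tryCatch(
  suppressWarnings({row.names(df) <- c("r1", "r1", "r2"); "ok"}),
  error = function(e) "error")

print(unequal)
print(duplicated_names)
```

Both calls are expected to fail, so each variable ends up holding the string produced by the error handler.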

Part 2: Create data frame

A data frame is created by using the data.frame() function.

x <- data.frame("Food"=c("Burger","Laksa","Ramen","Satay"), "Score"=c(60,30,60,90), "Pass"=c(TRUE,FALSE,TRUE,TRUE))
#Return x.
print(x)

##     Food Score  Pass
## 1 Burger    60  TRUE
## 2  Laksa    30 FALSE
## 3  Ramen    60  TRUE
## 4  Satay    90  TRUE

#The header, the top line of the table, displays the name of each column.
#Each line after it is a data row, which begins with the row name followed by the actual data.
#Each data member of a row is called a cell.

The structure of a data frame is returned by the str() function.

str(x)

## 'data.frame':    4 obs. of  3 variables:
##  $ Food : chr  "Burger" "Laksa" "Ramen" "Satay"
##  $ Score: num  60 30 60 90
##  $ Pass : logi  TRUE FALSE TRUE TRUE

A statistical summary of each column of a data frame is returned by the summary() function.

summary(x)

##      Food               Score         Pass        
##  Length:4           Min.   :30.0   Mode :logical  
##  Class :character   1st Qu.:52.5   FALSE:1        
##  Mode  :character   Median :60.0   TRUE :3        
##                     Mean   :60.0                  
##                     3rd Qu.:67.5                  
##                     Max.   :90.0

The names of the variables in a data frame are returned by the names() function.

names(x)

## [1] "Food"  "Score" "Pass"

#The results are the headers of the columns in data frame x.

Number of columns in data frame is returned by using ncol() function.

ncol(x)

## [1] 3

Number of rows in data frame is returned by using nrow() function.

nrow(x)

## [1] 4

The number of list components, that is columns (variables), in a data frame is returned by the length() function.

length(x)

## [1] 3

Dimension of data frame is returned by using dim() function.

dim(x)

## [1] 4 3

#The result shows that data frame x is a two-dimensional table with 4 rows and 3 columns.

The class of an object is returned by the class() function.

class(x)

## [1] "data.frame"

The internal storage type of an object is returned by the typeof() function. A data frame is stored as a list.

typeof(x)

## [1] "list"
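
The relation between class(), typeof() and length() can be seen in a short sketch. Here df is a stand-in data frame, not the x used above.

```r
df <- data.frame(Food = c("Burger", "Laksa"), Score = c(60, 30),
                 stringsAsFactors = FALSE)

is.list(df)              # a data frame is stored as a list ...
is.data.frame(df)        # ... that carries the class "data.frame"
length(df) == ncol(df)   # each list component is one column

# Class of each individual column.
col_classes <- sapply(df, class)
print(col_classes)
```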

The name of each row is assigned by using the row.names() function.

row.names(x) <- c("Cafe1","Cafe2","Cafe3","Cafe4")
#Return x.
print(x)

##         Food Score  Pass
## Cafe1 Burger    60  TRUE
## Cafe2  Laksa    30 FALSE
## Cafe3  Ramen    60  TRUE
## Cafe4  Satay    90  TRUE

Part 3: Access components of data frame

The components of a data frame can be accessed like a list or like a matrix.

When accessing like a list, the single bracket [, double bracket [[ or dollar sign $ operator is used to access the columns of the data frame.

Single bracket [ example:

#Return second column of data frame x.
x[2]

##       Score
## Cafe1    60
## Cafe2    30
## Cafe3    60
## Cafe4    90

#Return second and third columns of data frame x.
x[c(2,3)]

##       Score  Pass
## Cafe1    60  TRUE
## Cafe2    30 FALSE
## Cafe3    60  TRUE
## Cafe4    90  TRUE

Columns can also be extracted from a data frame by column name.

#Return column of Food.
x["Food"]

##         Food
## Cafe1 Burger
## Cafe2  Laksa
## Cafe3  Ramen
## Cafe4  Satay

#Return column of Food and Score.
x[c("Food","Score")]

##         Food Score
## Cafe1 Burger    60
## Cafe2  Laksa    30
## Cafe3  Ramen    60
## Cafe4  Satay    90

Double bracket [[ or dollar sign $ example:

#Return all components in first column.
x[[1]]

## [1] "Burger" "Laksa"  "Ramen"  "Satay"

#Return second component in first column.
x[[1]][2]

## [1] "Laksa"

x[[c(1,2)]]

## [1] "Laksa"

#Both x[[1]][2] and x[[c(1,2)]] return the same result, because [[ with a vector index is applied recursively.

#Return all components of Score.
x[["Score"]]

## [1] 60 30 60 90

#Return third component of Score.
x[["Score"]][3]

## [1] 60

#Return all components of Pass.
x$Pass

## [1]  TRUE FALSE  TRUE  TRUE

#Return fourth component of Pass.
x$Pass[4]

## [1] TRUE

Accessing with double bracket [[ and with dollar sign $ is similar. Both differ from single bracket [: indexing with [ returns a data frame, whereas [[ and $ return the column itself as a vector.
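
The difference between the three operators can be confirmed with a small sketch. The data frame df is an illustrative stand-in for x.

```r
df <- data.frame(Food = c("Burger", "Laksa"), Score = c(60, 30),
                 stringsAsFactors = FALSE)

by_single <- df["Score"]     # one-column data frame
by_double <- df[["Score"]]   # numeric vector
by_dollar <- df$Score        # numeric vector

is.data.frame(by_single)
is.numeric(by_double)
identical(by_double, by_dollar)
```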

When accessing like a matrix, row and column indices are used.

#Return second and third rows.
x[2:3,] #Leaving the column index blank selects all columns.

##        Food Score  Pass
## Cafe2 Laksa    30 FALSE
## Cafe3 Ramen    60  TRUE

#Return second and third rows of the second column.
x[2:3,2]

## [1] 30 60

#Return second and third columns.
x[,2:3] #Leaving the row index blank selects all rows.

##       Score  Pass
## Cafe1    60  TRUE
## Cafe2    30 FALSE
## Cafe3    60  TRUE
## Cafe4    90  TRUE

#Return second row; second and third column.
x[2,2:3]

##       Score  Pass
## Cafe2    30 FALSE

#Return entire row of Cafe2.
x["Cafe2",]

##        Food Score  Pass
## Cafe2 Laksa    30 FALSE

#Return second column of row Cafe2.
x["Cafe2",2]

## [1] 30

#Return column Food of second row.
x[2,"Food"]

## [1] "Laksa"

#same as x[["Food"]][2] and x$Food[2]

In the cases x[2:3,2], x["Cafe2",2] and x[2,"Food"], the returned value is a vector and not a data frame, since the data is extracted from a single column.

class(x[2:3,])

## [1] "data.frame"

class(x[2:3,2])

## [1] "numeric"

class(x[,2:3])

## [1] "data.frame"

class(x[2,2:3])

## [1] "data.frame"

class(x["Cafe2",])

## [1] "data.frame"

class(x["Cafe2",2])

## [1] "numeric"

class(x[2,"Food"])

## [1] "character"

This behavior can be avoided by passing the argument drop=FALSE as follows.

x[2:3,2,drop=FALSE]

##       Score
## Cafe2    30
## Cafe3    60

class(x[2:3,2,drop=FALSE])

## [1] "data.frame"

x["Cafe2",2,drop=FALSE]

##       Score
## Cafe2    30

class(x["Cafe2",2,drop=FALSE])

## [1] "data.frame"

x[2,"Food",drop=FALSE]

##        Food
## Cafe2 Laksa

class(x[2,"Food",drop=FALSE])

## [1] "data.frame"
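
drop=FALSE matters most inside functions, where the number of selected columns is not known in advance. A minimal sketch, assuming a hypothetical helper select_cols() (not a base R function) and an illustrative data frame df:

```r
# select_cols() is a hypothetical helper written for this example.
select_cols <- function(df, cols) {
  df[, cols, drop = FALSE]   # drop = FALSE keeps the data frame structure
}

df <- data.frame(Food = c("Burger", "Laksa"), Score = c(60, 30),
                 stringsAsFactors = FALSE)

one <- select_cols(df, "Score")             # still a data frame
two <- select_cols(df, c("Food", "Score"))  # data frame with two columns
print(one)
```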

A negative index inside the brackets selects the entire data frame except the rows or columns at the given positions.

#Return all rows and columns except the second column.
x[-2]

##         Food  Pass
## Cafe1 Burger  TRUE
## Cafe2  Laksa FALSE
## Cafe3  Ramen  TRUE
## Cafe4  Satay  TRUE

x[,-2]

##         Food  Pass
## Cafe1 Burger  TRUE
## Cafe2  Laksa FALSE
## Cafe3  Ramen  TRUE
## Cafe4  Satay  TRUE

#Both x[-2] and x[,-2] return the same result.
#x[-2] is accessed like a list while x[,-2] is accessed like a matrix.

#Return all rows and columns except the third row.
x[-3,]

##         Food Score  Pass
## Cafe1 Burger    60  TRUE
## Cafe2  Laksa    30 FALSE
## Cafe4  Satay    90  TRUE

#Return all rows and columns except the second and third rows.
x[-c(2,3),]

##         Food Score Pass
## Cafe1 Burger    60 TRUE
## Cafe4  Satay    90 TRUE

x[c(-2,-3),]

##         Food Score Pass
## Cafe1 Burger    60 TRUE
## Cafe4  Satay    90 TRUE

#Both x[-c(2,3),] and x[c(-2,-3),] return the same result.
#They differ only in how the argument is written.

#Return all rows and columns except the first and third columns.
x[c(-1,-3)]

##       Score
## Cafe1    60
## Cafe2    30
## Cafe3    60
## Cafe4    90

class(x[c(-1,-3)])

## [1] "data.frame"

x[,c(-1,-3)]

## [1] 60 30 60 90

class(x[,c(-1,-3)])

## [1] "numeric"

#Although x[c(-1,-3)] and x[,c(-1,-3)] select the same values, x[c(-1,-3)] is a data frame while x[,c(-1,-3)] is a vector.
#This is because x[c(-1,-3)] is accessed like a list, while x[,c(-1,-3)] is accessed like a matrix, where extracting a single column returns a vector.
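
Negative indices work with positions only, not with column names. One possible way to exclude columns by name is to apply setdiff() to names(). The data frame df below is illustrative.

```r
df <- data.frame(Food = c("Burger", "Laksa"), Score = c(60, 30),
                 Pass = c(TRUE, FALSE), stringsAsFactors = FALSE)

# -"Score" is not valid R, so build the vector of names to keep instead.
keep <- setdiff(names(df), "Score")
without_score <- df[keep]
print(without_score)
```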

It is possible to slice a data frame by condition. The rows and columns to return are specified inside the brackets that follow the name of the data frame.

#Return rows in which Score is more than 40.
x[x$Score>40,]

##         Food Score Pass
## Cafe1 Burger    60 TRUE
## Cafe3  Ramen    60 TRUE
## Cafe4  Satay    90 TRUE

#Return rows in which Pass is TRUE.
x[x$Pass==TRUE,]

##         Food Score Pass
## Cafe1 Burger    60 TRUE
## Cafe3  Ramen    60 TRUE
## Cafe4  Satay    90 TRUE
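
Conditions can be combined with & and |. When a column contains NA, the comparison yields NA and [ returns a row filled with NA; wrapping the condition in which() is one way to avoid this. The NA score below is invented for illustration.

```r
df <- data.frame(Food = c("Burger", "Laksa", "Ramen", "Satay"),
                 Score = c(60, NA, 60, 90),
                 Pass = c(TRUE, FALSE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

with_na <- df[df$Score > 40, ]                 # contains a row of NAs
no_na   <- df[which(df$Score > 40), ]          # which() drops the NA
both    <- df[which(df$Score > 40 & df$Pass), ]  # two conditions combined
print(no_na)
```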

The first n rows of a data frame are returned by the head() function.

#Return first 3 rows of data frame x.
head(x,n=3) #n by default = 6

##         Food Score  Pass
## Cafe1 Burger    60  TRUE
## Cafe2  Laksa    30 FALSE
## Cafe3  Ramen    60  TRUE

The last n rows of a data frame are returned by the tail() function.

#Return last 2 rows of data frame x.
tail(x,n=2)

##        Food Score Pass
## Cafe3 Ramen    60 TRUE
## Cafe4 Satay    90 TRUE

Part 4: Modify, subset, add and delete data frame

Modify data frame. A data frame is modified through reassignment, in the same way a matrix is modified.

#Modify the first value in Food to Pizza.
x[["Food"]][1] <- "Pizza"
#Return x
print(x)

##        Food Score  Pass
## Cafe1 Pizza    60  TRUE
## Cafe2 Laksa    30 FALSE
## Cafe3 Ramen    60  TRUE
## Cafe4 Satay    90  TRUE

#Modify the third value in Pass to FALSE.
x$Pass[3] <- FALSE
#Return x
print(x)

##        Food Score  Pass
## Cafe1 Pizza    60  TRUE
## Cafe2 Laksa    30 FALSE
## Cafe3 Ramen    60 FALSE
## Cafe4 Satay    90  TRUE

#Modify the values in the first to third rows of the second column to 70, 20 and 70.
x[1:3,2] <- c(70,20,70)
#Return x
print(x)

##        Food Score  Pass
## Cafe1 Pizza    70  TRUE
## Cafe2 Laksa    20 FALSE
## Cafe3 Ramen    70 FALSE
## Cafe4 Satay    90  TRUE

#Add 10 to every value in column Score.
x$Score <- x$Score + 10
#Return x
print(x)

##        Food Score  Pass
## Cafe1 Pizza    80  TRUE
## Cafe2 Laksa    30 FALSE
## Cafe3 Ramen    80 FALSE
## Cafe4 Satay   100  TRUE

Subset data frame. A data frame can be subset by a condition by using the subset() function.

#Subset data frame with the condition that Pass is TRUE.
subset(x,Pass==TRUE)

##        Food Score Pass
## Cafe1 Pizza    80 TRUE
## Cafe4 Satay   100 TRUE

#Subset data frame with the condition that Score is more than 50.
subset(x,Score>50)

##        Food Score  Pass
## Cafe1 Pizza    80  TRUE
## Cafe3 Ramen    80 FALSE
## Cafe4 Satay   100  TRUE
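
subset() also takes a select argument that limits the columns returned. A small sketch on an illustrative data frame df (not the modified x above):

```r
df <- data.frame(Food = c("Burger", "Laksa", "Ramen", "Satay"),
                 Score = c(60, 30, 60, 90),
                 Pass = c(TRUE, FALSE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

# Keep rows with Score above 50 and only the Food and Score columns.
high <- subset(df, Score > 50, select = c(Food, Score))
print(high)
```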

Rows whose values contain a given pattern are selected by using the grep() function.

#Create a data frame t with the rows whose Score contains "8".
t <- x[grep("8",x$Score),]
#Return t
print(t)

##        Food Score  Pass
## Cafe1 Pizza    80  TRUE
## Cafe3 Ramen    80 FALSE

Rows can be kept by using a logical vector.

#Keep the first, second and fourth rows using a logical vector.
x[c(T,T,F,T),]

##        Food Score  Pass
## Cafe1 Pizza    80  TRUE
## Cafe2 Laksa    30 FALSE
## Cafe4 Satay   100  TRUE

The complement can be kept by adding the negation operator !, which reverses each logical value.

#Keep only the third row by negating the logical vector.
x[!c(T,T,F,T),]

##        Food Score  Pass
## Cafe3 Ramen    80 FALSE

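Add new components into data frame. Rows are added to a data frame by using the rbind() function. The result is returned as a new data frame; x itself is unchanged unless the result is assigned back.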

rbind(x,list("Rojak",60,TRUE))

##        Food Score  Pass
## Cafe1 Pizza    80  TRUE
## Cafe2 Laksa    30 FALSE
## Cafe3 Ramen    80 FALSE
## Cafe4 Satay   100  TRUE
## 5     Rojak    60  TRUE
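
Binding a one-row data frame with matching column names, instead of a bare list, lets the new row carry its own row name. A sketch with an illustrative copy of the data; the names df, new_row and df2 are assumptions of this example:

```r
df <- data.frame(Food = c("Pizza", "Laksa", "Ramen", "Satay"),
                 Score = c(80, 30, 80, 100),
                 Pass = c(TRUE, FALSE, FALSE, TRUE),
                 row.names = c("Cafe1", "Cafe2", "Cafe3", "Cafe4"),
                 stringsAsFactors = FALSE)

new_row <- data.frame(Food = "Rojak", Score = 60, Pass = TRUE,
                      row.names = "Cafe5", stringsAsFactors = FALSE)

# Assign the result; rbind() does not modify df in place.
df2 <- rbind(df, new_row)
print(df2)
```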

Similarly, columns are added to a data frame by using the cbind() function.

cbind(x,"Customer"=seq(1,10,length=4)) #seq() returns 4 equally spaced values from 1 to 10.

##        Food Score  Pass Customer
## Cafe1 Pizza    80  TRUE        1
## Cafe2 Laksa    30 FALSE        4
## Cafe3 Ramen    80 FALSE        7
## Cafe4 Satay   100  TRUE       10

Since a data frame is implemented as a list, new columns can be added through simple list-like assignment.

x$Price <- seq(1,8,by=2) #seq() returns values starting from 1 in steps of 2, not exceeding 8.
#Return x.
print(x)

##        Food Score  Pass Price
## Cafe1 Pizza    80  TRUE     1
## Cafe2 Laksa    30 FALSE     3
## Cafe3 Ramen    80 FALSE     5
## Cafe4 Satay   100  TRUE     7

x$Recommendation <- c("Yes", "No", "No", "Yes")
#Return x.
print(x)

##        Food Score  Pass Price Recommendation
## Cafe1 Pizza    80  TRUE     1            Yes
## Cafe2 Laksa    30 FALSE     3             No
## Cafe3 Ramen    80 FALSE     5             No
## Cafe4 Satay   100  TRUE     7            Yes

Delete components in data frame. A column is deleted by assigning NULL to it.

#Delete Recommendation column.
x$Recommendation <- NULL
#Return x.
print(x)

##        Food Score  Pass Price
## Cafe1 Pizza    80  TRUE     1
## Cafe2 Laksa    30 FALSE     3
## Cafe3 Ramen    80 FALSE     5
## Cafe4 Satay   100  TRUE     7

Similarly, rows are deleted through reassignment with a negative index.

#Drop the first row and assign the result to a new variable h.
h <- x[-1,]
#Return h.
print(h)

##        Food Score  Pass Price
## Cafe2 Laksa    30 FALSE     3
## Cafe3 Ramen    80 FALSE     5
## Cafe4 Satay   100  TRUE     7

Part 5: Visualize data frame

Summarizing and plotting the numeric values gives a better overview of the data frame.

Mean is returned by mean() function.

mean(x$Score)

## [1] 72.5

Median is returned by median() function.

median(x$Score)

## [1] 80

Variance is returned by var() function.

var(x$Score)

## [1] 891.6667

Standard deviation is returned by sd() function.

sd(x$Score)

## [1] 29.86079
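
var() computes the sample variance, which divides by n - 1, and sd() is its square root. A quick check on the Score values used above:

```r
score <- c(80, 30, 80, 100)
n <- length(score)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((score - mean(score))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(score))
all.equal(manual_sd, sd(score))
```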

Quantiles are returned by the quantile() function. By default the minimum, the three quartiles and the maximum are returned.

quantile(x$Score)

##    0%   25%   50%   75%  100% 
##  30.0  67.5  80.0  85.0 100.0

A shortcut to most of the results above is the summary() function.

summary(x)

##      Food               Score          Pass             Price    
##  Length:4           Min.   : 30.0   Mode :logical   Min.   :1.0  
##  Class :character   1st Qu.: 67.5   FALSE:2         1st Qu.:2.5  
##  Mode  :character   Median : 80.0   TRUE :2         Median :4.0  
##                     Mean   : 72.5                   Mean   :4.0  
##                     3rd Qu.: 85.0                   3rd Qu.:5.5  
##                     Max.   :100.0                   Max.   :7.0

Numeric values are summed up by using sum() function.

sum(x$Score)

## [1] 290

The occurrences of each value in a column are counted and displayed as a table by using the table() function.

table(x$Food)

## 
## Laksa Pizza Ramen Satay 
##     1     1     1     1

table(x$Score)

## 
##  30  80 100 
##   1   2   1

table(x$Pass)

## 
## FALSE  TRUE 
##     2     2

table(x$Price)

## 
## 1 3 5 7 
## 1 1 1 1

A histogram is produced by using the hist() function.

hist(x$Score, breaks=10, col="blue", main="Histogram of Score of Food", xlab="Score of Food")

A stem-and-leaf plot is produced by using the stem() function.

stem(x$Score)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    2 | 0
##    4 | 
##    6 | 
##    8 | 00
##   10 | 0

Mosaic plot is produced by using mosaicplot() function.

mosaicplot(x$Score)

A bar graph is produced by using the barplot() function.

barplot(x$Score, col=c("Red","Orange","Yellow","Green"), names.arg=x$Food)

Box plot is produced by using boxplot() function.

boxplot(x$Score, col="Purple" )

Scatter plot is produced by using plot() function.

plot(Score~Price, data=x, col="red", main="Relationship between Score and Price", ylab="Score", xlab="Price")

Part 6: Useful Tips for Data Frame

Calculations over the rows or columns of numeric values are performed by using the apply() function.

#Extract the numeric columns of the data frame.
z <- x[c("Score","Price")]
#Return z.
print(z)

##       Score Price
## Cafe1    80     1
## Cafe2    30     3
## Cafe3    80     5
## Cafe4   100     7

#Return median of each column of z (margin 2).
apply(z, 2, median)

## Score Price 
##    80     4

#Return median of each row of z (margin 1).
apply(z, 1, median)

## Cafe1 Cafe2 Cafe3 Cafe4 
##  40.5  16.5  42.5  53.5

#Return mean of all columns z.
apply(z, 2, mean)

## Score Price 
##  72.5   4.0
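
apply() first converts the data frame to a matrix. For column-wise or row-wise summaries, sapply(), colMeans() and rowMeans() are common alternatives. The sketch below rebuilds z by hand so that it runs on its own.

```r
z <- data.frame(Score = c(80, 30, 80, 100), Price = c(1, 3, 5, 7))

col_medians <- sapply(z, median)  # median of each column
col_means   <- colMeans(z)        # mean of each column
row_means   <- rowMeans(z)        # mean of each row

print(col_medians)
print(col_means)
```

With only two columns, the row means coincide with the row medians shown above.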

Rows are sorted in ascending order by using the order() function.

#Sort Score in ascending order.
x[order(x$Score),]

##        Food Score  Pass Price
## Cafe2 Laksa    30 FALSE     3
## Cafe1 Pizza    80  TRUE     1
## Cafe3 Ramen    80 FALSE     5
## Cafe4 Satay   100  TRUE     7

Rows are sorted in descending order by passing the argument decreasing=TRUE as follows.

#Sort Price in descending order.
x[order(x$Price, decreasing=TRUE),] #decreasing=FALSE as default.

##        Food Score  Pass Price
## Cafe4 Satay   100  TRUE     7
## Cafe3 Ramen    80 FALSE     5
## Cafe2 Laksa    30 FALSE     3
## Cafe1 Pizza    80  TRUE     1

Rows can also be sorted in descending order by using the rev() function together with the order() function.

#Sort Price in descending order.
x[rev(order(x$Price)),]

##        Food Score  Pass Price
## Cafe4 Satay   100  TRUE     7
## Cafe3 Ramen    80 FALSE     5
## Cafe2 Laksa    30 FALSE     3
## Cafe1 Pizza    80  TRUE     1

Negating a numeric column inside the order() function also sorts in descending order.

#Sort Score in descending order.
x[order(-x$Score),]

##        Food Score  Pass Price
## Cafe4 Satay   100  TRUE     7
## Cafe1 Pizza    80  TRUE     1
## Cafe3 Ramen    80 FALSE     5
## Cafe2 Laksa    30 FALSE     3
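
order() accepts several keys, and later keys break ties in earlier ones. A sketch that sorts by Score in descending order and, for equal scores, by Price in descending order; df is an illustrative copy of the data:

```r
df <- data.frame(Food = c("Pizza", "Laksa", "Ramen", "Satay"),
                 Score = c(80, 30, 80, 100),
                 Price = c(1, 3, 5, 7),
                 stringsAsFactors = FALSE)

# The two rows with Score 80 are ordered by Price, highest first.
sorted <- df[order(-df$Score, -df$Price), ]
print(sorted)
```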

Two data frames are joined together by using merge() function.

#Create data frame y.
y <- data.frame("Price"=c(1,3,5,7), "Cuisine"=c("Western","Indonesian","Japanese","Malaysian"))
#Return y.
print(y)

##   Price    Cuisine
## 1     1    Western
## 2     3 Indonesian
## 3     5   Japanese
## 4     7  Malaysian

Data frame x and data frame y are joined together by using merge() function.

merge(x,y)

##   Price  Food Score  Pass    Cuisine
## 1     1 Pizza    80  TRUE    Western
## 2     3 Laksa    30 FALSE Indonesian
## 3     5 Ramen    80 FALSE   Japanese
## 4     7 Satay   100  TRUE  Malaysian

The merge above is a perfect match, since all values in the two Price columns match.

If the values in the two Price columns do not all match, the all, all.x and all.y arguments control which rows are kept.

Values in data frame y are changed by reassignment.

y$Price <- c(1,4,5,8)
#Return y.
print(y)

##   Price    Cuisine
## 1     1    Western
## 2     4 Indonesian
## 3     5   Japanese
## 4     8  Malaysian

Natural join or inner join. Only rows that match in both data frames are kept; specify argument all=FALSE.

merge(x,y) #By default all=FALSE.

##   Price  Food Score  Pass  Cuisine
## 1     1 Pizza    80  TRUE  Western
## 2     5 Ramen    80 FALSE Japanese

Full outer join or outer join. All rows from both data frames are kept; specify argument all=TRUE.

merge(x,y,all=T)

##   Price  Food Score  Pass    Cuisine
## 1     1 Pizza    80  TRUE    Western
## 2     3 Laksa    30 FALSE       <NA>
## 3     4  <NA>    NA    NA Indonesian
## 4     5 Ramen    80 FALSE   Japanese
## 5     7 Satay   100  TRUE       <NA>
## 6     8  <NA>    NA    NA  Malaysian

Left outer join or left join. All rows of data frame x and only those from y that match are kept; specify argument all.x=TRUE.

merge(x,y,all.x=T)

##   Price  Food Score  Pass  Cuisine
## 1     1 Pizza    80  TRUE  Western
## 2     3 Laksa    30 FALSE     <NA>
## 3     5 Ramen    80 FALSE Japanese
## 4     7 Satay   100  TRUE     <NA>

Right outer join or right join. All rows of data frame y and only those from x that match are kept; specify argument all.y=TRUE.

merge(x,y,all.y=T)

##   Price  Food Score  Pass    Cuisine
## 1     1 Pizza    80  TRUE    Western
## 2     4  <NA>    NA    NA Indonesian
## 3     5 Ramen    80 FALSE   Japanese
## 4     8  <NA>    NA    NA  Malaysian
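
By default merge() joins on the columns that share a name. When the key columns are named differently, by.x and by.y state them explicitly. The data frames menu and origin and the column name Cost are invented for this sketch.

```r
menu <- data.frame(Price = c(1, 3, 5, 7),
                   Food = c("Pizza", "Laksa", "Ramen", "Satay"),
                   stringsAsFactors = FALSE)
origin <- data.frame(Cost = c(1, 4, 5, 8),
                     Cuisine = c("Western", "Indonesian", "Japanese", "Malaysian"),
                     stringsAsFactors = FALSE)

# Inner join: match menu$Price against origin$Cost.
joined <- merge(menu, origin, by.x = "Price", by.y = "Cost")
print(joined)
```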

All rows that contain NA values are removed by using the na.omit() function.

#Take the result of merge(x,y,all=T) as an example.
q <- merge(x,y,all=T)
#Return q.
print(q)

##   Price  Food Score  Pass    Cuisine
## 1     1 Pizza    80  TRUE    Western
## 2     3 Laksa    30 FALSE       <NA>
## 3     4  <NA>    NA    NA Indonesian
## 4     5 Ramen    80 FALSE   Japanese
## 5     7 Satay   100  TRUE       <NA>
## 6     8  <NA>    NA    NA  Malaysian

na.omit(q)

##   Price  Food Score  Pass  Cuisine
## 1     1 Pizza    80  TRUE  Western
## 4     5 Ramen    80 FALSE Japanese
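
na.omit() drops a row if any column is NA. To drop rows only when particular columns are missing, complete.cases() can be applied to those columns. The small data frame below is made up for illustration and reuses the name q.

```r
q <- data.frame(Price = c(1, 3, 4),
                Food = c("Pizza", "Laksa", NA),
                Cuisine = c("Western", NA, "Indonesian"),
                stringsAsFactors = FALSE)

all_complete <- na.omit(q)  # rows without any NA
# Rows where Price and Food are present, whatever Cuisine holds.
food_known <- q[complete.cases(q[, c("Price", "Food")]), ]
print(food_known)
```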

Multiple columns of data frame are concatenated or combined into a single column by using stack() function.

#Concatenate Score and Price into single column.
stack(x,select =c(Score,Price))

##   values   ind
## 1     80 Score
## 2     30 Score
## 3     80 Score
## 4    100 Score
## 5      1 Price
## 6      3 Price
## 7      5 Price
## 8      7 Price
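
A stacked data frame can be turned back into separate columns with unstack(). A sketch on an illustrative data frame df:

```r
df <- data.frame(Score = c(80, 30, 80, 100), Price = c(1, 3, 5, 7))

long <- stack(df)      # two columns: values and ind
wide <- unstack(long)  # back to one column per original variable
print(wide)
```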

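A data frame is converted into a matrix by using the as.matrix() function. Because x contains a character column, every value in the matrix is converted to character.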

matrix_x <- as.matrix(x)
#Return matrix_x.
print(matrix_x)

##       Food    Score Pass    Price
## Cafe1 "Pizza" " 80" "TRUE"  "1"  
## Cafe2 "Laksa" " 30" "FALSE" "3"  
## Cafe3 "Ramen" " 80" "FALSE" "5"  
## Cafe4 "Satay" "100" "TRUE"  "7"
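
To obtain a numeric matrix, one option is to select only the numeric columns before converting. The data frame df is an illustrative copy of part of x.

```r
df <- data.frame(Food = c("Pizza", "Laksa"),
                 Score = c(80, 30),
                 Price = c(1, 3),
                 stringsAsFactors = FALSE)

mixed <- as.matrix(df)                              # character matrix
numeric_only <- as.matrix(df[c("Score", "Price")])  # numeric matrix

is.character(mixed)
is.numeric(numeric_only)
```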

Data frame is converted into list by using as.list() function.

list_x <- as.list(x)
#Return list_x.
print(list_x)

## $Food
## [1] "Pizza" "Laksa" "Ramen" "Satay"
## 
## $Score
## [1]  80  30  80 100
## 
## $Pass
## [1]  TRUE FALSE FALSE  TRUE
## 
## $Price
## [1] 1 3 5 7