The characteristics of a data frame should be:
1. The column names should be non-empty.
2. The row names should be unique.
3. The data stored in a data frame can be of numeric, factor or character type.
4. Each column should contain the same number of data items.
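Rules 2 and 4 can be checked directly. The sketch below is only an illustration: the names `unequal` and `duplicated_names` and the toy columns are made up and are not part of the assignment data. It shows that data.frame() refuses columns whose lengths do not fit together and refuses duplicate row names.

```r
# Rule 4: columns of length 3 and 2 cannot be combined.
# (R only recycles a shorter column when its length divides the longest exactly.)
unequal <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)

# Rule 2: duplicate row names are rejected.
duplicated_names <- tryCatch(
  data.frame(a = 1:3, row.names = c("r1", "r1", "r2")),
  error = function(e) "error"
)

print(unequal)
print(duplicated_names)
```

Both calls fail, so both variables end up holding the string returned by the error handler.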
Functions such as read.table(), read.csv(), read.delim() and read.fwf() can be used to read data from an external source and convert it into a data frame.
In this assignment, the data frame is created from vectors instead of importing a dataset.
The demonstration of the data frame functions is divided into 8 parts:
Part 1: Creating a data frame.
Part 2: Return back value in vector or data frame.
Part 3: Subset in data frame.
Part 4: Sorting and reordering in data frame.
Part 5: Extracting specific data from data frame.
Part 6: Showing partial of the dataframe.
Part 7: Additional methods that can be used in dataframe.
Part 8: Export and import data frame to CSV.
Part 1: Creating a data frame.
The example below creates a data frame of students from 3 vectors.
x <- data.frame(
  std_id = c(1:10),
  std_name = c("Rick","Dan","Michelle","Ryan","Adde","Ali",
               "Bashri","Siti","Susan","Claris"),
  std_gender = c("M","M","F","M","F","M","M","F","F","F"))
print(x)
## std_id std_name std_gender
## 1 1 Rick M
## 2 2 Dan M
## 3 3 Michelle F
## 4 4 Ryan M
## 5 5 Adde F
## 6 6 Ali M
## 7 7 Bashri M
## 8 8 Siti F
## 9 9 Susan F
## 10 10 Claris F
We can modify a data frame or add new components to it by using rbind() for rows and cbind() for columns.
In this example, I will demonstrate adding new columns.
Combining two data frames side by side requires both to have the same number of rows. Here x and y each have 10 rows, and data.frame(x, y) gives the same result as cbind(x, y).
y <-data.frame( std_height = c(170,176,161,181,155,176,163,166,160,150),
std_hometown = c("PENANG","KEDAH","SELANGOR","KUALA LUMPUR","JOHOR","PENANG","KEDAH","SELANGOR","KUALA LUMPUR","JOHOR"),
std_grade = c("A","A","A","B","B","C","A","C","C","B"),
std_age = c(14,15,16,16,15,17,15,13,15,16))
xy<-data.frame(x,y)
print(xy)
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## 7 7 Bashri M 163 KEDAH A 15
## 8 8 Siti F 166 SELANGOR C 13
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 10 10 Claris F 150 JOHOR B 16
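rbind() is mentioned above but not demonstrated. As a minimal sketch, assuming a small stand-in data frame `students` and a hypothetical extra record `new_row` (neither is part of the assignment data), a row can be appended as long as it uses the same column names:

```r
students <- data.frame(
  std_id = 1:3,
  std_name = c("Rick", "Dan", "Michelle"),
  std_gender = c("M", "M", "F")
)

# The new row must use the same column names as the existing data frame.
new_row <- data.frame(std_id = 4L, std_name = "Ryan", std_gender = "M")

students <- rbind(students, new_row)
print(nrow(students))
print(students$std_name)
```

Unlike cbind(), which needs matching row counts, rbind() needs matching columns.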
For this part, a new column is added into the combined data frame using cbind().
xy <- cbind (xy, stud_attitude= c("Good","Average","Bad","Good","Average","Bad","Good","Average","Bad","Good"))
print(xy)
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## 7 7 Bashri M 163 KEDAH A 15
## 8 8 Siti F 166 SELANGOR C 13
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 10 10 Claris F 150 JOHOR B 16
## stud_attitude
## 1 Good
## 2 Average
## 3 Bad
## 4 Good
## 5 Average
## 6 Bad
## 7 Good
## 8 Average
## 9 Bad
## 10 Good
By applying str(), we can get the structure of the dataframe.
str(xy)
## 'data.frame': 10 obs. of 8 variables:
## $ std_id : int 1 2 3 4 5 6 7 8 9 10
## $ std_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
## $ std_gender : chr "M" "M" "F" "M" ...
## $ std_height : num 170 176 161 181 155 176 163 166 160 150
## $ std_hometown : chr "PENANG" "KEDAH" "SELANGOR" "KUALA LUMPUR" ...
## $ std_grade : chr "A" "A" "A" "B" ...
## $ std_age : num 14 15 16 16 15 17 15 13 15 16
## $ stud_attitude: chr "Good" "Average" "Bad" "Good" ...
By applying summary(), we can print a summary of the data frame. For numeric variables it shows the minimum, first quartile, median, mean, third quartile and maximum; for character variables it shows only the length, class and mode.
print(summary(xy))
## std_id std_name std_gender std_height
## Min. : 1.00 Length:10 Length:10 Min. :150.0
## 1st Qu.: 3.25 Class :character Class :character 1st Qu.:160.2
## Median : 5.50 Mode :character Mode :character Median :164.5
## Mean : 5.50 Mean :165.8
## 3rd Qu.: 7.75 3rd Qu.:174.5
## Max. :10.00 Max. :181.0
## std_hometown std_grade std_age stud_attitude
## Length:10 Length:10 Min. :13.0 Length:10
## Class :character Class :character 1st Qu.:15.0 Class :character
## Mode :character Mode :character Median :15.0 Mode :character
## Mean :15.2
## 3rd Qu.:16.0
## Max. :17.0
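For character columns, summary() does not count the categories. If per-category counts are wanted, one option is to convert the column to a factor first. The sketch below uses a stand-in vector `gender` holding the same values as std_gender; the variable names are illustrative.

```r
gender <- c("M", "M", "F", "M", "F", "M", "M", "F", "F", "F")

# summary() on a factor counts the occurrences of each level.
gender_counts <- summary(factor(gender))
print(gender_counts)

# table() gives the same counts without converting the vector.
print(table(gender))
```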
The following functions can be used to inspect a data frame:
Function names() shows the names of all columns in the data frame.
names(xy)
## [1] "std_id" "std_name" "std_gender" "std_height"
## [5] "std_hometown" "std_grade" "std_age" "stud_attitude"
Function row.names() shows the names of all rows in the data frame.
row.names(xy)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Function nrow() shows the number of rows available in the data frame.
nrow(xy)
## [1] 10
Function ncol() shows the number of columns available in the data frame.
ncol(xy)
## [1] 8
Function length() shows the number of columns in the data frame, because a data frame is a list of column vectors.
length(xy)
## [1] 8
Function typeof() shows the storage mode of an object. For a data frame it returns “list”, because a data frame is stored as a list of columns.
typeof(xy)
## [1] "list"
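The results of length() and typeof() follow from how R stores a data frame: as a list with one element per column. A short sketch with a stand-in data frame `df` (an illustrative name, not part of the assignment):

```r
df <- data.frame(id = 1:3, name = c("Rick", "Dan", "Michelle"))

print(is.list(df))             # a data frame is a list of columns
print(length(df) == ncol(df))  # so length() counts columns, not rows
print(dim(df))                 # number of rows and columns together
print(sapply(df, class))       # class of each column
```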
Part 2: Return back value in vector or data frame.
If we want to access a column of the data frame as a vector, in the same way as an element of a list, we can use these 3 methods:
Method 1: Using “data frame name”$“variable name”
xy$std_name
## [1] "Rick" "Dan" "Michelle" "Ryan" "Adde" "Ali"
## [7] "Bashri" "Siti" "Susan" "Claris"
Method 2: Using “data frame name”[[“variable name”]]
xy[["std_name"]]
## [1] "Rick" "Dan" "Michelle" "Ryan" "Adde" "Ali"
## [7] "Bashri" "Siti" "Susan" "Claris"
Method 3: Using “data frame name”[[column number]]
xy[[2]]
## [1] "Rick" "Dan" "Michelle" "Ryan" "Adde" "Ali"
## [7] "Bashri" "Siti" "Susan" "Claris"
class(xy[[2]])
## [1] "character"
But if we want the result to remain a data frame, we can use single brackets:
Method 1:
xy["std_name"]
## std_name
## 1 Rick
## 2 Dan
## 3 Michelle
## 4 Ryan
## 5 Adde
## 6 Ali
## 7 Bashri
## 8 Siti
## 9 Susan
## 10 Claris
Method 2:
xy[2]
## std_name
## 1 Rick
## 2 Dan
## 3 Michelle
## 4 Ryan
## 5 Adde
## 6 Ali
## 7 Bashri
## 8 Siti
## 9 Susan
## 10 Claris
class(xy[2])
## [1] "data.frame"
Whether R returns a vector or a data frame when indexing with [row, column] is controlled by the drop argument.
For example, x2 is set to column 2 (std_name). Because the drop argument is not specified, its default value TRUE applies and the column is returned as a vector.
x2<-xy[,2]
print(x2)
## [1] "Rick" "Dan" "Michelle" "Ryan" "Adde" "Ali"
## [7] "Bashri" "Siti" "Susan" "Claris"
class(x2)
## [1] "character"
If we set drop=FALSE, the column is returned as a data frame.
x2<-xy[,2,drop=FALSE]
print(x2)
## std_name
## 1 Rick
## 2 Dan
## 3 Michelle
## 4 Ryan
## 5 Adde
## 6 Ali
## 7 Bashri
## 8 Siti
## 9 Susan
## 10 Claris
class(x2)
## [1] "data.frame"
There are also other ways to get the result as a vector or as a data frame:
Methods that return a vector:
class(xy[["std_name"]])
## [1] "character"
class(xy[[2]])
## [1] "character"
class(xy$std_name)
## [1] "character"
Methods that return a data frame:
class(xy["std_name"])
## [1] "data.frame"
class(xy[2])
## [1] "data.frame"
If more than one column is retrieved, the result is returned as a data frame.
class(xy[c("std_name","std_id")])
## [1] "data.frame"
If we specify only a row, the result is also returned as a data frame.
print(xy[2,])
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 2 2 Dan M 176 KEDAH A 15
## stud_attitude
## 2 Average
class(xy[2,])
## [1] "data.frame"
Part 3: Subset in data frame.
To get a subset of a data frame, we can use the following examples:
Eg 1: How would you show only the first 6 rows of the data frame?
Method: Extract rows 1 to 6 and all columns.
xy[1:6,]
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## stud_attitude
## 1 Good
## 2 Average
## 3 Bad
## 4 Good
## 5 Average
## 6 Bad
Eg 2: How would you get 6 consecutive rows starting from row 2?
Method: Extract rows 2 to 7 and all columns. (The last 6 rows of this 10-row data frame would be xy[5:10,].)
xy[2:7,]
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## 7 7 Bashri M 163 KEDAH A 15
## stud_attitude
## 2 Average
## 3 Bad
## 4 Good
## 5 Average
## 6 Bad
## 7 Good
Eg 3: How would you get the value in row 3, column 5 (std_hometown)? The answer should be SELANGOR.
xy[3,"std_hometown"]
## [1] "SELANGOR"
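Hard-coded ranges such as 1:6 depend on the size of the data frame. As a sketch, assuming a stand-in data frame `df` (the names here are illustrative), the last row can be dropped and the last few rows selected relative to nrow():

```r
df <- data.frame(id = 1:5, height = c(170, 176, 161, 181, 155))

n <- nrow(df)

# A negative index removes rows: everything except the last row.
all_but_last <- df[-n, ]

# The last 3 rows, computed from the number of rows.
last_three <- df[(n - 2):n, ]

print(all_but_last)
print(last_three)
```

Because the indices are computed from n, the same two lines keep working if rows are added later.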
Part 4: Sorting and reordering in data frame.
This example sorts the std_height values from shortest to tallest. Note that sort() only returns the sorted vector; the data frame itself is left unchanged, as print(xy) shows.
sort(xy$std_height)
## [1] 150 155 160 161 163 166 170 176 176 181
print(xy)
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## 7 7 Bashri M 163 KEDAH A 15
## 8 8 Siti F 166 SELANGOR C 13
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 10 10 Claris F 150 JOHOR B 16
## stud_attitude
## 1 Good
## 2 Average
## 3 Bad
## 4 Good
## 5 Average
## 6 Bad
## 7 Good
## 8 Average
## 9 Bad
## 10 Good
The order() function returns a vector of indices: the position of the smallest element first, then the position of the second smallest, and so on. Here the result is stored in ranks.
ranks <- order(xy$std_height)
print(ranks)
## [1] 10 5 9 3 7 8 1 2 6 4
xy$std_height
## [1] 170 176 161 181 155 176 163 166 160 150
Comparing ranks with the values of std_height shows that:
The lowest value is 150: its index is 10, which comes first in ranks.
The second lowest value is 155: its index is 5, which comes second in ranks.
The highest value is 181: its index is 4, which comes last in ranks.
The ranks vector therefore contains row indices that can be used to reorder the data frame by std_height.
xy[ranks,]
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 10 10 Claris F 150 JOHOR B 16
## 5 5 Adde F 155 JOHOR B 15
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 3 3 Michelle F 161 SELANGOR A 16
## 7 7 Bashri M 163 KEDAH A 15
## 8 8 Siti F 166 SELANGOR C 13
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 6 6 Ali M 176 PENANG C 17
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## stud_attitude
## 10 Good
## 5 Average
## 9 Bad
## 3 Bad
## 7 Good
## 8 Average
## 1 Good
## 2 Average
## 6 Bad
## 4 Good
To get the data frame in descending order, the decreasing argument can be set to TRUE.
xy[order(xy$std_height,decreasing = TRUE),]
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 2 2 Dan M 176 KEDAH A 15
## 6 6 Ali M 176 PENANG C 17
## 1 1 Rick M 170 PENANG A 14
## 8 8 Siti F 166 SELANGOR C 13
## 7 7 Bashri M 163 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 5 5 Adde F 155 JOHOR B 15
## 10 10 Claris F 150 JOHOR B 16
## stud_attitude
## 4 Good
## 2 Average
## 6 Bad
## 1 Good
## 8 Average
## 7 Good
## 3 Bad
## 9 Bad
## 5 Average
## 10 Good
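order() also accepts several keys, which decides how ties are broken (in the data above, Dan and Ali are both 176). The sketch below uses a stand-in data frame `df`; sorting by height in descending order and then by name is an illustrative choice, not something the assignment requires.

```r
df <- data.frame(
  name = c("Rick", "Dan", "Ali", "Ryan"),
  height = c(170, 176, 176, 181)
)

# Negating a numeric key sorts it in descending order;
# the second key (name) breaks ties alphabetically.
sorted <- df[order(-df$height, df$name), ]
print(sorted)
```

With these values the tallest student comes first, and Ali is placed before Dan because the names decide the tie at 176.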
Part 5: Extracting specific data from data frame.
This example shows how to extract specific data from a data frame.
A new data frame x1 is defined from columns 1 to 3 of xy (std_id, std_name and std_gender).
This method lets us pull only the columns we need into a separate data frame. Note that the new column names are prefixed with xy. because the columns are passed in as xy$ expressions.
x1 <- data.frame(xy$std_id,xy$std_name,xy$std_gender)
print(x1)
## xy.std_id xy.std_name xy.std_gender
## 1 1 Rick M
## 2 2 Dan M
## 3 3 Michelle F
## 4 4 Ryan M
## 5 5 Adde F
## 6 6 Ali M
## 7 7 Bashri M
## 8 8 Siti F
## 9 9 Susan F
## 10 10 Claris F
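If the original column names should be kept, one alternative is to index the data frame by column name or position instead of rebuilding it with data.frame(). A minimal sketch, assuming a stand-in data frame `df` with the same first three column names:

```r
df <- data.frame(
  std_id = 1:3,
  std_name = c("Rick", "Dan", "Michelle"),
  std_gender = c("M", "M", "F"),
  std_age = c(14, 15, 16)
)

# Single-bracket indexing returns a data frame and keeps the column names.
by_name <- df[c("std_id", "std_name", "std_gender")]
by_position <- df[1:3]

print(names(by_name))
print(identical(by_name, by_position))
```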
Part 6: Showing partial of the data frame.
For a large dataset, we can use functions such as head() and tail() to get a quick view of the elements and structure of the dataset.
By default, head() shows the first 6 rows of the data frame.
print(head(xy))
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## stud_attitude
## 1 Good
## 2 Average
## 3 Bad
## 4 Good
## 5 Average
## 6 Bad
By default, tail() shows the last 6 rows of the data frame.
print(tail(xy))
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## 7 7 Bashri M 163 KEDAH A 15
## 8 8 Siti F 166 SELANGOR C 13
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 10 10 Claris F 150 JOHOR B 16
## stud_attitude
## 5 Average
## 6 Bad
## 7 Good
## 8 Average
## 9 Bad
## 10 Good
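The number of rows shown by head() and tail() can be changed with the n argument. The sketch below uses a stand-in data frame `df` with 10 rows; the variable names are illustrative.

```r
df <- data.frame(id = 1:10)

first_three <- head(df, n = 3)         # first 3 rows
last_two <- tail(df, n = 2)            # last 2 rows
without_last_two <- head(df, n = -2)   # all rows except the last 2

print(first_three$id)
print(last_two$id)
print(nrow(without_last_two))
```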
Part 7: Additional methods that can be used in dataframe.
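Method: Accessing a data frame with logical vectors as indices.
Eg 1: Use two logical vectors (TRUE and FALSE values), one for rows and one for columns, to select components of the data frame.
print(xy[c(TRUE,TRUE,FALSE,TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE),c(TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE)])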
## std_id std_name std_grade std_age stud_attitude
## 1 1 Rick A 14 Good
## 2 2 Dan A 15 Average
## 4 4 Ryan B 16 Good
## 5 5 Adde B 15 Average
## 7 7 Bashri A 15 Good
## 9 9 Susan C 15 Bad
Eg 2: Select all rows and use a 2-element logical vector as the column index.
The vector is recycled to cover all 8 columns, so every other column (columns 1, 3, 5 and 7) is selected.
print(xy[,c(TRUE,FALSE)])
## std_id std_gender std_hometown std_age
## 1 1 M PENANG 14
## 2 2 M KEDAH 15
## 3 3 F SELANGOR 16
## 4 4 M KUALA LUMPUR 16
## 5 5 F JOHOR 15
## 6 6 M PENANG 17
## 7 7 M KEDAH 15
## 8 8 F SELANGOR 13
## 9 9 F KUALA LUMPUR 15
## 10 10 F JOHOR 16
Eg 3: Select the rows where std_grade equals "A" and print them.
print(xy[xy$std_grade=="A",])
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 7 7 Bashri M 163 KEDAH A 15
## stud_attitude
## 1 Good
## 2 Average
## 3 Bad
## 7 Good
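Logical conditions can also be combined. As a sketch, assuming a stand-in data frame `df` and an illustrative filter (grade A and age at least 15), the same rows can be selected with a logical index or with subset():

```r
df <- data.frame(
  name = c("Rick", "Dan", "Michelle", "Ryan"),
  grade = c("A", "A", "A", "B"),
  age = c(14, 15, 16, 16)
)

# & combines the two conditions element by element.
a_and_15_plus <- df[df$grade == "A" & df$age >= 15, ]

# subset() expresses the same filter without repeating df$.
same_filter <- subset(df, grade == "A" & age >= 15)

print(a_and_15_plus)
print(same_filter)
```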
Part 8: Export and import data frame to CSV.
A data frame can be exported and imported in R by using the functions write.csv() and read.csv().
The example below exports the data frame created in this assignment in CSV format under the file name “assignment 1.csv”.
write.csv(xy,"assignment 1.csv",row.names = FALSE)
To re-import “assignment 1.csv” into R, we can use read.csv(), as shown below:
assignment1 = read.csv("assignment 1.csv")
assignment1
## std_id std_name std_gender std_height std_hometown std_grade std_age
## 1 1 Rick M 170 PENANG A 14
## 2 2 Dan M 176 KEDAH A 15
## 3 3 Michelle F 161 SELANGOR A 16
## 4 4 Ryan M 181 KUALA LUMPUR B 16
## 5 5 Adde F 155 JOHOR B 15
## 6 6 Ali M 176 PENANG C 17
## 7 7 Bashri M 163 KEDAH A 15
## 8 8 Siti F 166 SELANGOR C 13
## 9 9 Susan F 160 KUALA LUMPUR C 15
## 10 10 Claris F 150 JOHOR B 16
## stud_attitude
## 1 Good
## 2 Average
## 3 Bad
## 4 Good
## 5 Average
## 6 Bad
## 7 Good
## 8 Average
## 9 Bad
## 10 Good
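A round trip through CSV keeps the values but not always the column types. The sketch below uses a stand-in data frame `df` and a temporary file (both are assumptions for illustration, not the assignment's file) to show that whole-number numeric columns are read back as integer.

```r
df <- data.frame(
  id = 1:3,
  height = c(170, 176, 161),
  name = c("Rick", "Dan", "Michelle")
)

# tempfile() avoids leaving a file in the working directory.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
restored <- read.csv(path, stringsAsFactors = FALSE)

print(names(restored))
print(class(df$height))        # class before export
print(class(restored$height))  # class after re-import
unlink(path)
```

If the original types matter, the colClasses argument of read.csv() can be used to set them on import.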