A data frame is a two-dimensional data structure used to store tabular data in R. Its columns are the variables of a data set and its rows are the observations. Different columns can hold objects of different classes.
You can regard a data frame as a special case of a list in which every component has the same length.
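The list view of a data frame can be checked directly. The following is a minimal sketch using the built-in mtcars data set; the variable names is_df, is_lst and col_len are only illustrative.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs to be loaded.
is_df   <- is.data.frame(mtcars)  # is it a data frame?
is_lst  <- is.list(mtcars)        # is it also a list?
col_len <- lengths(mtcars)        # length of each component (column)

print(is_df)
print(is_lst)
# Every column has as many elements as the data frame has rows.
print(all(col_len == nrow(mtcars)))
```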
Below, the first six rows of the built-in example data frame mtcars are printed.
# Load the data
data(mtcars)
# Print the first six rows of the built in data frame mtcars
print(head(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
It is common to work with a large data set in the form of a data frame, so you first want a clear understanding of its elements and structure.
Showing a small part of the entire data set is often useful, and the functions head(), tail(), str() and summary() help with this.
Function head() shows the first 6 rows of a data frame by default. Here we print the first 6 rows of mtcars using head().
print(head(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Function tail() shows the last 6 rows of a data frame by default. Here we print the last 6 rows of mtcars using tail().
print(tail(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
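Both head() and tail() accept an argument n that controls how many rows are returned. The sketch below assumes nothing beyond base R and mtcars; the variable names are illustrative.

```r
first_three      <- head(mtcars, n = 3)    # first 3 rows instead of the default 6
last_two         <- tail(mtcars, n = 2)    # last 2 rows
all_but_last_ten <- head(mtcars, n = -10)  # a negative n drops rows from the end

print(first_three)
print(last_two)
print(nrow(all_but_last_ten))
```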
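Function str() shows the structure of a data frame: its class, the number of observations (rows) and variables (columns), and the name, type and first few values of each column.
Here we print the structure of the data frame mtcars. Because str() prints its report itself and returns NULL, wrapping it in print() adds the trailing NULL line; calling str(mtcars) on its own avoids it.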
print(str(mtcars))
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## NULL
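The trailing NULL comes from print() receiving the return value of str(). A small sketch of this, together with two other base R ways to inspect the structure (sapply() with class(), and dim()), might look as follows; the variable names are illustrative.

```r
# str() writes its report as a side effect; its return value is NULL.
result <- str(mtcars)
print(is.null(result))

# The class of every column, collected into a named character vector.
col_classes <- sapply(mtcars, class)
print(col_classes)

# Number of rows and columns in one call.
print(dim(mtcars))
```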
Function summary() shows the minimum, first quartile, median, mean, third quartile and maximum of each variable. Here we print the summary of the first four columns of mtcars.
print(summary(mtcars[,1:4]))
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
Instead of using a built-in data frame, you will probably want to create your own. Here are three methods:
Method one: read.csv() or read.table() reads a file and returns a data frame. We use read.csv() to read a local file HelloWorld.csv into a data frame dataframe_1 and then print it.
dataframe_1 <- read.csv("C:/WQD7004/HelloWorld.csv")
print(dataframe_1)
## X col_1 col_2
## 1 row_1 1 "hello"
## 2 row_2 2 "world"
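The column X appears because the first header field of the file is empty. The sketch below reproduces this with a temporary file instead of the local path above; the file contents are an assumption made for illustration, not the actual HelloWorld.csv. It also shows the row.names argument of read.csv(), which turns the first column into row names.

```r
# Write a small CSV to a temporary file (contents are hypothetical).
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"","col_1","col_2"',
             '"row_1",1,"hello"',
             '"row_2",2,"world"'), csv_path)

# Default: the unnamed first column is read as a regular column called X.
df_default <- read.csv(csv_path)
print(df_default)

# row.names = 1 uses the first column as row names instead.
df_named <- read.csv(csv_path, row.names = 1)
print(df_named)

unlink(csv_path)  # remove the temporary file
```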
Method two: A data frame can also be coerced from other types of objects, such as lists. Here we coerce a list whose components all have the same length to a data frame dataframe_2 and then print it.
dataframe_2 <- as.data.frame(list(a = 1:4, b = 100:103))
print(dataframe_2)
## a b
## 1 1 100
## 2 2 101
## 3 3 102
## 4 4 103
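The equal-length requirement matters: if the component lengths are incompatible, the coercion fails. A minimal sketch, with hypothetical list names and tryCatch() used only to capture the error message:

```r
equal_list   <- list(a = 1:4, b = 100:103)  # both components have length 4
unequal_list <- list(a = 1:4, b = 100:102)  # lengths 4 and 3 do not fit together

ok_df <- as.data.frame(equal_list)
print(dim(ok_df))

# Capture the error instead of stopping the script.
error_message <- tryCatch({
  as.data.frame(unequal_list)
  NA_character_  # only reached if no error occurs
}, error = function(e) conditionMessage(e))
print(error_message)
```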
Method three: Explicitly create a data frame with the function data.frame(). The vectors you pass as arguments become the columns of the data frame. Every column has the same length, but the columns can hold different types of data. Here we create a data frame dataframe_3 using data.frame() and then print it.
dataframe_3 <- data.frame(
  name = c('Andrew', 'Zada', 'Fanny', 'Iris', 'Jack'),
  ID = 10:14,
  Hobby = c('movie', 'music', 'dance', 'swim', 'reading'),
  From_China = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)
print(dataframe_3)
## name ID Hobby From_China
## 1 Andrew 10 movie TRUE
## 2 Zada 11 music FALSE
## 3 Fanny 12 dance TRUE
## 4 Iris 13 swim FALSE
## 5 Jack 14 reading TRUE
Apart from head(), tail(), str() and summary(), the following functions help you inspect a data frame you created yourself.
names() returns the names of all columns. Here we print the column names of dataframe_3 using names().
print(names(dataframe_3))
## [1] "name" "ID" "Hobby" "From_China"
row.names() returns the names of all rows. Here we print the row names of dataframe_3 using row.names().
print(row.names(dataframe_3))
## [1] "1" "2" "3" "4" "5"
ncol() returns the number of columns. Here we print the number of columns of dataframe_3 using ncol().
print(ncol(dataframe_3))
## [1] 4
nrow() returns the number of rows. Here we print the number of rows of dataframe_3 using nrow().
print(nrow(dataframe_3))
## [1] 5
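The row and column counts can also be obtained in one call with dim(), and colnames() and rownames() are matrix-style equivalents of names() and row.names() for data frames. A small sketch on a hypothetical data frame called people:

```r
# A small illustrative data frame (names and values are made up).
people <- data.frame(name = c("Ann", "Bob", "Cid"), age = c(30, 25, 41))

print(dim(people))       # number of rows, then number of columns
print(colnames(people))  # same result as names(people)
print(rownames(people))  # same result as row.names(people)
```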
typeof() tells you the (R internal) type or storage mode of any object. Here we print typeof(dataframe_3).
print(typeof(dataframe_3))
## [1] "list"
For dataframe_3, you may notice that the row names are plain sequence numbers rather than descriptive labels, and that the column names are capitalized inconsistently (name versus ID), so you may want to change them.
Change the names of rows: Assign new row names to a data frame using the function row.names(). We reassign the row names of dataframe_3 and then print dataframe_3.
row.names(dataframe_3) <- c('row_1', 'row_2', 'row_3', 'row_4', 'row_5')
print(dataframe_3)
## name ID Hobby From_China
## row_1 Andrew 10 movie TRUE
## row_2 Zada 11 music FALSE
## row_3 Fanny 12 dance TRUE
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
Change the names of columns: Assign new column names to a data frame using the function names(). We reassign the column names of dataframe_3 and then print dataframe_3.
names(dataframe_3) <- c("NAME","ID","HOBBY","FROM_CHINA")
print(dataframe_3)
## NAME ID HOBBY FROM_CHINA
## row_1 Andrew 10 movie TRUE
## row_2 Zada 11 music FALSE
## row_3 Fanny 12 dance TRUE
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
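Change a single name using a logical vector: the comparison names(dataframe_3) == 'HOBBY' is TRUE only at the position of that column, so the assignment replaces just that name. We rename column HOBBY to INTEREST and then print dataframe_3.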
names(dataframe_3)[names(dataframe_3) == 'HOBBY'] <- 'INTEREST'
print(dataframe_3)
## NAME ID INTEREST FROM_CHINA
## row_1 Andrew 10 movie TRUE
## row_2 Zada 11 music FALSE
## row_3 Fanny 12 dance TRUE
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
Method 1: Access the columns of a data frame like the components of a list
Example 1: Use the [ operator to select the column NAME of dataframe_3. Then print it and its class; we see it is a data frame.
print(dataframe_3['NAME'])
## NAME
## row_1 Andrew
## row_2 Zada
## row_3 Fanny
## row_4 Iris
## row_5 Jack
print(class(dataframe_3['NAME']))
## [1] "data.frame"
Example 2: Use [[ operator to select the column NAME. Then print it and its class. We find it is a vector.
print(dataframe_3[['NAME']])
## [1] "Andrew" "Zada" "Fanny" "Iris" "Jack"
print(class(dataframe_3[['NAME']]))
## [1] "character"
Example 3: Use the $ operator to select the column NAME, then print it and its class. Again it is a vector.
print(dataframe_3$NAME)
## [1] "Andrew" "Zada" "Fanny" "Iris" "Jack"
print(class(dataframe_3$NAME))
## [1] "character"
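The three operators can be compared side by side. The sketch below uses a hypothetical two-column data frame called people so that it runs on its own; it also shows that [ can select several columns at once, which [[ and $ cannot.

```r
# Illustrative data (not the dataframe_3 used in the text).
people <- data.frame(NAME = c("Ann", "Bob"), ID = 1:2,
                     stringsAsFactors = FALSE)

as_frame <- people["NAME"]    # [  keeps the data frame structure
as_vec_1 <- people[["NAME"]]  # [[ extracts the column itself
as_vec_2 <- people$NAME       # $  is shorthand for [[ with a name

print(class(as_frame))
print(identical(as_vec_1, as_vec_2))

# [ accepts a vector of names and returns a narrower data frame.
two_cols <- people[c("NAME", "ID")]
print(dim(two_cols))
```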
Example 4: Here we access the value in the second row of column NAME and print it.
print(dataframe_3$NAME[2])
## [1] "Zada"
Method 2: Access like a matrix by providing indices for rows and columns
Example 1: Select the second row of dataframe_3, print it and its class. We notice it is a data frame.
print(dataframe_3[2,])
## NAME ID INTEREST FROM_CHINA
## row_2 Zada 11 music FALSE
print(class(dataframe_3[2,]))
## [1] "data.frame"
Example 2: Select the second column of dataframe_3, print it and its class. We see it is a vector.
print(dataframe_3[,2])
## [1] 10 11 12 13 14
print(class(dataframe_3[,2]))
## [1] "integer"
If you want a data frame instead of a vector, set drop = FALSE. Here we set the argument drop in dataframe_3[,2] to FALSE, print the result and its class, and see that its class is data.frame.
print(dataframe_3[,2,drop = FALSE])
## ID
## row_1 10
## row_2 11
## row_3 12
## row_4 13
## row_5 14
print(class(dataframe_3[,2,drop = FALSE]))
## [1] "data.frame"
Example 3: Select the value in the second row and column NAME of dataframe_3 and then print it.
print(dataframe_3[2,'NAME'])
## [1] "Zada"
Advanced method 1: Access through logical vectors as indices
Example 1: We use two logical vectors, one for the rows and one for the columns, to select part of dataframe_3 and then print it.
print(dataframe_3[c(TRUE, FALSE, TRUE, TRUE, FALSE),c(TRUE, TRUE, TRUE, FALSE)])
## NAME ID INTEREST
## row_1 Andrew 10 movie
## row_3 Fanny 12 dance
## row_4 Iris 13 swim
Example 2: We select all rows and give a 2-element logical vector as the column index. It is recycled to a 4-element logical vector, so the first and third columns are selected.
print(dataframe_3[,c(TRUE, FALSE)])
## NAME INTEREST
## row_1 Andrew movie
## row_2 Zada music
## row_3 Fanny dance
## row_4 Iris swim
## row_5 Jack reading
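The recycling can be made explicit with rep_len(), which repeats a vector until it reaches a given length. This is a minimal sketch on a hypothetical four-column data frame; the names people, short_index and full_index are illustrative.

```r
# Four columns, so a 2-element logical index has to be recycled twice.
people <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

short_index <- c(TRUE, FALSE)
full_index  <- rep_len(short_index, ncol(people))  # TRUE FALSE TRUE FALSE
print(full_index)

# Both indices select the same columns.
print(identical(people[, short_index], people[, full_index]))
```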
Example 3: We select all rows where ID > 12 and print them.
print(dataframe_3[dataframe_3$ID > 12, ])
## NAME ID INTEREST FROM_CHINA
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
Advanced method 2: Use the function subset() to select the rows of a data frame for which a condition is true. Here we subset dataframe_3 to the rows where ID is above 12 and then print the result.
print(subset(dataframe_3, subset = dataframe_3$ID > 12))
## NAME ID INTEREST FROM_CHINA
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
Now that you know how to access the components of a data frame, here are some exercises for practice.
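As a recap of the two advanced methods, the sketch below compares logical indexing with subset() on a hypothetical data frame. The missing ID value is an assumption added to show one practical difference: a logical index that contains NA produces a row of NAs, whereas subset() treats NA as FALSE. Inside subset() the column can be referred to by its bare name.

```r
# Illustrative data with one missing ID (not the dataframe_3 from the text).
people <- data.frame(NAME = c("Ann", "Bob", "Cid", "Dee"),
                     ID = c(10, NA, 13, 14),
                     stringsAsFactors = FALSE)

by_index  <- people[people$ID > 12, ]         # the NA comparison yields an NA row
by_subset <- subset(people, ID > 12)          # the NA comparison counts as FALSE
by_which  <- people[which(people$ID > 12), ]  # which() also skips the NA

print(by_index)
print(by_subset)
print(isTRUE(all.equal(by_subset, by_which)))
```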
Exercise 1: Retrieve the value in row 1, column 2 and then print it
print(dataframe_3[1,2])
## [1] 10
Exercise 2: Retrieve the value in row 1, column 2 using the row and column names and print it
print(dataframe_3['row_1','ID'])
## [1] 10
Exercise 3: Retrieve the data of one row, e.g. row_4, and print it
print(dataframe_3['row_4',])
## NAME ID INTEREST FROM_CHINA
## row_4 Iris 13 swim FALSE
Exercise 4: Retrieve two rows and then print them
print(dataframe_3[c('row_1','row_3'),])
## NAME ID INTEREST FROM_CHINA
## row_1 Andrew 10 movie TRUE
## row_3 Fanny 12 dance TRUE
Exercise 5: Retrieve data for a column, print it and its class
print(dataframe_3[['INTEREST']])
## [1] "movie" "music" "dance" "swim" "reading"
print(class(dataframe_3[['INTEREST']]))
## [1] "character"
Exercise 6: Use drop when retrieving data using names, print it and its class
print(dataframe_3[,'INTEREST', drop = FALSE])
## INTEREST
## row_1 movie
## row_2 music
## row_3 dance
## row_4 swim
## row_5 reading
print(class(dataframe_3[,'INTEREST', drop = FALSE]))
## [1] "data.frame"
Modify a component of a data frame: Reassign the value, just as you would modify an element of a matrix. Here we change the value in the first row of column NAME to the string ZHONGLIANG and then print dataframe_3.
dataframe_3[1, 'NAME'] <- 'ZHONGLIANG'
print(dataframe_3)
## NAME ID INTEREST FROM_CHINA
## row_1 ZHONGLIANG 10 movie TRUE
## row_2 Zada 11 music FALSE
## row_3 Fanny 12 dance TRUE
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
Add a component to a data frame: Use rbind() to add a row and cbind() to add a column.
Add a new row to dataframe_3 using rbind() and then print the result.
print(rbind(dataframe_3, row_6 = c('Trump', 100, 'joke', FALSE)))
## NAME ID INTEREST FROM_CHINA
## row_1 ZHONGLIANG 10 movie TRUE
## row_2 Zada 11 music FALSE
## row_3 Fanny 12 dance TRUE
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
## row_6 Trump 100 joke FALSE
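One caveat with building the new row using c(): c() can hold only one type, so mixed values are converted to character strings before rbind() sees them, and the affected columns of the result become character as well even though the printout looks the same. A minimal sketch of a type-preserving alternative, passing the new row as a one-row data frame, is shown below; the data frame people and its contents are hypothetical.

```r
# Illustrative data (not the dataframe_3 from the text).
people <- data.frame(NAME = c("Ann", "Bob"), ID = 10:11,
                     MEMBER = c(TRUE, FALSE), stringsAsFactors = FALSE)

# New row as a one-row data frame: each value keeps its own type.
new_row <- data.frame(NAME = "Cid", ID = 12L, MEMBER = FALSE,
                      row.names = "row_3", stringsAsFactors = FALSE)
typed <- rbind(people, new_row)
print(sapply(typed, class))

# New row built with c(): every value is a character string by then.
coerced <- rbind(people, c("Cid", 12, FALSE))
print(sapply(coerced, class))
```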
Add a new column to dataframe_3 using cbind() and print the result.
print(cbind(dataframe_3, GRADE = c(100, 87, 67, 22, 99)))
## NAME ID INTEREST FROM_CHINA GRADE
## row_1 ZHONGLIANG 10 movie TRUE 100
## row_2 Zada 11 music FALSE 87
## row_3 Fanny 12 dance TRUE 67
## row_4 Iris 13 swim FALSE 22
## row_5 Jack 14 reading TRUE 99
Print the data frame dataframe_3 again to check whether it has changed.
print(dataframe_3)
## NAME ID INTEREST FROM_CHINA
## row_1 ZHONGLIANG 10 movie TRUE
## row_2 Zada 11 music FALSE
## row_3 Fanny 12 dance TRUE
## row_4 Iris 13 swim FALSE
## row_5 Jack 14 reading TRUE
Note: Adding a new column through list-like assignment also works, since a data frame is implemented as a list. We add a new column AGE and print dataframe_3.
dataframe_3$AGE <- c(24, 31, 16, 42, 50)
print(dataframe_3)
## NAME ID INTEREST FROM_CHINA AGE
## row_1 ZHONGLIANG 10 movie TRUE 24
## row_2 Zada 11 music FALSE 31
## row_3 Fanny 12 dance TRUE 16
## row_4 Iris 13 swim FALSE 42
## row_5 Jack 14 reading TRUE 50
From the code above, we can see that rbind() and cbind() return a new data frame and leave the original unchanged, while the assignment operation modifies the original.
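To keep the result of cbind() or rbind(), assign it back to a variable. A small sketch on a hypothetical data frame called scores:

```r
# Illustrative data (names and values are made up).
scores <- data.frame(NAME = c("Ann", "Bob"), stringsAsFactors = FALSE)

# Without assignment, the original is untouched.
with_grade <- cbind(scores, GRADE = c(90, 75))
print(names(scores))      # still only the original column
print(names(with_grade))  # the returned copy has the new column

# Assigning the result back makes the change permanent.
scores <- cbind(scores, GRADE = c(90, 75))
scores <- rbind(scores, data.frame(NAME = "Cid", GRADE = 60,
                                   stringsAsFactors = FALSE))
print(scores)
```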
Remove a component of a data frame: You can also remove a column or a row.
Removing a column works the same way as removing a component of a list: assign NULL to it. Here we remove column ID and then print dataframe_3.
dataframe_3$ID <- NULL
print(dataframe_3)
## NAME INTEREST FROM_CHINA AGE
## row_1 ZHONGLIANG movie TRUE 24
## row_2 Zada music FALSE 31
## row_3 Fanny dance TRUE 16
## row_4 Iris swim FALSE 42
## row_5 Jack reading TRUE 50
Remove a row from a data frame through reassignment. Here we select all rows except the first, assign the result back to dataframe_3, and then print dataframe_3.
dataframe_3 <- dataframe_3[-1, ]
print(dataframe_3)
## NAME INTEREST FROM_CHINA AGE
## row_2 Zada music FALSE 31
## row_3 Fanny dance TRUE 16
## row_4 Iris swim FALSE 42
## row_5 Jack reading TRUE 50
All values in column AGE are numeric, so you may want to sort dataframe_3 by that column.
Applied to a vector, the function order() returns the positions of its elements in ascending order of value, that is, the index vector that would sort it.
Print the column AGE of dataframe_3.
print(dataframe_3$AGE)
## [1] 31 16 42 50
Compute the ordering of column AGE of dataframe_3 and then print it.
AGE_order <- order(dataframe_3$AGE)
print(AGE_order)
## [1] 2 1 3 4
The result starts with 2 because the smallest value, 16, sits at position 2 of the column; the following elements give the positions of the next-smallest values in turn. Indexing the rows of dataframe_3 with this vector therefore rearranges it so that it begins with the youngest age and ends with the oldest one. Then we print the ordered dataframe_3.
dataframe_3 <- dataframe_3[AGE_order, ]
print(dataframe_3)
## NAME INTEREST FROM_CHINA AGE
## row_3 Fanny dance TRUE 16
## row_2 Zada music FALSE 31
## row_4 Iris swim FALSE 42
## row_5 Jack reading TRUE 50
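Since order() is easy to confuse with rank(), the sketch below contrasts the two on a hypothetical vector where their results differ, and shows sorting in descending order with the decreasing argument. The names ages and people are illustrative.

```r
ages <- c(42, 16, 50, 31)

sort_index <- order(ages)  # positions of the values, smallest value first
age_ranks  <- rank(ages)   # rank of each value within the vector

print(sort_index)
print(age_ranks)
print(ages[sort_index])    # indexing with the order vector sorts the values

# Sorting a data frame from oldest to youngest.
people <- data.frame(NAME = c("Ann", "Bob", "Cid", "Dee"), AGE = ages,
                     stringsAsFactors = FALSE)
oldest_first <- people[order(people$AGE, decreasing = TRUE), ]
print(oldest_first)
```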