What is a data frame in R?

* In R, a data frame is a two-dimensional data structure. Internally it is a list whose components all have the same length.

* Each component is one column of the table, and the elements at the same position across the components form one row.
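
The "list of equal-length components" idea can be checked directly. The following is a minimal sketch; the data frame `df` and its columns `id` and `score` are made up for illustration only.

```r
# Hypothetical two-column data frame, used only for illustration
df <- data.frame(id = 1:3, score = c(90, 85, 70))

is.list(df)          # a data frame is a list under the hood
is.data.frame(df)    # ...with the extra class "data.frame"
sapply(df, length)   # every component (column) has the same length
dim(df)              # number of rows, number of columns

# Components whose lengths differ and cannot be recycled are rejected
res <- try(data.frame(id = 1:3, score = c(90, 85)), silent = TRUE)
inherits(res, "try-error")
```

The last call is wrapped in try() only so that the script keeps running; at the console the same data.frame() call would stop with an error.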


Creating a data frame in R

* A data frame can be created using the “data.frame()” function in R.

x <- data.frame("SN"=1:8,
                "Name"=c("Samuel","Dory","Ken","Danny","Sarah","Dan","Kenny","Derrick"),
                "Age"=c(23,21,24,22,22,41,28,31),
                "Blood Type"=c("A","A","B","B","B","B","O","A"),
                stringsAsFactors = FALSE)

* Check whether a variable is a data frame using the class() function, then print the data frame.

class(x)
## [1] "data.frame"
x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  21          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A

* Checking the structure of a data frame using the str() function.

str(x)
## 'data.frame':    8 obs. of  4 variables:
##  $ SN        : int  1 2 3 4 5 6 7 8
##  $ Name      : chr  "Samuel" "Dory" "Ken" "Danny" ...
##  $ Age       : num  23 21 24 22 22 41 28 31
##  $ Blood.Type: chr  "A" "A" "B" "B" ...

* Checking the statistical summary of a data frame using the summary() function.

summary(x)
##        SN           Name                Age         Blood.Type       
##  Min.   :1.00   Length:8           Min.   :21.00   Length:8          
##  1st Qu.:2.75   Class :character   1st Qu.:22.00   Class :character  
##  Median :4.50   Mode  :character   Median :23.50   Mode  :character  
##  Mean   :4.50                      Mean   :26.50                     
##  3rd Qu.:6.25                      3rd Qu.:28.75                     
##  Max.   :8.00                      Max.   :41.00

* Checking the variables of a data frame using the names() function.

names(x)
## [1] "SN"         "Name"       "Age"        "Blood.Type"

* Checking the number of columns of a data frame using the ncol() function.

ncol(x)
## [1] 4

* Checking the number of rows of a data frame using the nrow() function.

nrow(x)
## [1] 8

* Checking the number of components (columns) using the length() function. Because a data frame is a list of columns, this gives the same result as ncol().

length(x)
## [1] 4

* Checking the names of the rows (observations) using the row.names() function.

row.names(x)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"


Accessing a data frame in R

* Accessing a data frame in R is very similar to accessing a matrix or a list.

* Show the data frame

x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  21          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A

* Accessing a data frame by column using [], [[]] or $. [] returns a data frame, whereas [[]] and $ return a vector.

x["Age"]
##   Age
## 1  23
## 2  21
## 3  24
## 4  22
## 5  22
## 6  41
## 7  28
## 8  31
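# data.frame() converted the name "Blood Type" to "Blood.Type", so x[["Blood Type"]] would return NULL
x[["Blood.Type"]]
## [1] "A" "A" "B" "B" "B" "B" "O" "A"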
x$Name
## [1] "Samuel"  "Dory"    "Ken"     "Danny"   "Sarah"   "Dan"     "Kenny"  
## [8] "Derrick"
x[[2]]
## [1] "Samuel"  "Dory"    "Ken"     "Danny"   "Sarah"   "Dan"     "Kenny"  
## [8] "Derrick"
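
The column was created as "Blood Type", yet the printed data frame shows Blood.Type. By default data.frame() passes column names through make.names() (check.names = TRUE), which replaces the space with a dot. The following is a minimal sketch of how check.names = FALSE keeps the original name; the data frames `y` and `z` are made up for illustration.

```r
# Hypothetical data frame with a space in a column name
y <- data.frame("Blood Type" = c("A", "B"), check.names = FALSE,
                stringsAsFactors = FALSE)
names(y)              # the name is kept as "Blood Type"
y[["Blood Type"]]     # works because the name was not altered
y$`Blood Type`        # $ needs backticks for a name containing a space

# With the default check.names = TRUE the space becomes a dot
z <- data.frame("Blood Type" = c("A", "B"), stringsAsFactors = FALSE)
names(z)
```

Names with spaces have to be quoted or wrapped in backticks everywhere they are used, so keeping the converted name is often the more convenient choice.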

* When accessing a data frame using [], the result is returned as a data frame.

a <- x["Name"]
class(a)
## [1] "data.frame"

* When accessing a data frame using [[]], the result returned will be reduced to vector form

b <- x[["Age"]]
class(b)
## [1] "numeric"

* When accessing a data frame using $, the result returned will be reduced to vector form

c <- x$Name
class(c)
## [1] "character"

* Accessing a data frame by row using [row, ] indexing

x[1,]
##   SN   Name Age Blood.Type
## 1  1 Samuel  23          A
x[1:3,]
##   SN   Name Age Blood.Type
## 1  1 Samuel  23          A
## 2  2   Dory  21          A
## 3  3    Ken  24          B

* A data frame can also be accessed like a matrix, with [row, column] indexing.

* Accessing the top few rows of a data frame

head(x,n=3)
##   SN   Name Age Blood.Type
## 1  1 Samuel  23          A
## 2  2   Dory  21          A
## 3  3    Ken  24          B

* Accessing a specific variable for specific rows/observations

x[1:2,2]
## [1] "Samuel" "Dory"

* Accessing a data frame with conditions

x[x$Age>30,]
##   SN    Name Age Blood.Type
## 6  6     Dan  41          B
## 8  8 Derrick  31          A

* To keep the result as a data frame instead of having it reduced to a vector, use “drop=FALSE”

x[,3,drop=FALSE]
##   Age
## 1  23
## 2  21
## 3  24
## 4  22
## 5  22
## 6  41
## 7  28
## 8  31
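
For comparison, here is a small sketch of the same selection with and without drop=FALSE. The data frame `scores` is a made-up example.

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(id = 1:3, score = c(90, 85, 70))

v <- scores[, 2]                 # a single column is simplified to a vector
d <- scores[, 2, drop = FALSE]   # stays a one-column data frame

class(v)
class(d)
dim(d)
```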

* Subset a data frame under a specific condition

subset(x, subset=Age>30)
##   SN    Name Age Blood.Type
## 6  6     Dan  41          B
## 8  8 Derrick  31          A
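
Both approaches can handle more than one condition, and subset() can also pick columns through its select argument. The following is a minimal sketch on a small made-up data frame `people`; the names `older_a` and `names_only` are likewise illustrative.

```r
# Hypothetical data frame, used only for illustration
people <- data.frame(Name = c("Dan", "Kenny", "Derrick", "Dory"),
                     Age = c(41, 28, 31, 21),
                     Blood.Type = c("B", "O", "A", "A"),
                     stringsAsFactors = FALSE)

# Logical indexing with two conditions joined by & (element-wise AND)
older_a <- people[people$Age > 25 & people$Blood.Type == "A", ]

# subset() with select keeps only the listed columns
names_only <- subset(people, subset = Age > 25, select = c(Name, Age))

older_a
names_only
```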


Modifying a data frame in R

* Modification of a data frame through reassignment

x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  21          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A
x[2,"Age"] <- 26

x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  26          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A

* Adding rows/observations to a data frame using the rbind() function. rbind() returns a new data frame; x itself is unchanged unless the result is assigned back.

x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  26          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A
rbind(x,list(9,"Tom",15,"O"))
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  26          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A
## 9  9     Tom  15          O
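
Note that the call above only prints the combined result. The sketch below, on a made-up data frame `roster`, shows that the original object keeps its size until the result of rbind() is assigned back.

```r
# Hypothetical data frame, used only for illustration
roster <- data.frame(SN = 1:2, Name = c("Samuel", "Dory"),
                     stringsAsFactors = FALSE)

rbind(roster, list(3, "Tom"))   # prints a 3-row result...
nrow(roster)                    # ...but roster still has 2 rows

# Assign the result back to keep the new row
roster <- rbind(roster, data.frame(SN = 3L, Name = "Tom",
                                   stringsAsFactors = FALSE))
nrow(roster)
```

The same holds for cbind(): the new column is kept only if the result is assigned to a variable.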

* Adding columns to a data frame using the cbind() function. Like rbind(), cbind() returns a new data frame and leaves x unchanged.

x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  26          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A
cbind(x,"Gender"=c("M","F","M","M","F","M","M","M"))
##   SN    Name Age Blood.Type Gender
## 1  1  Samuel  23          A      M
## 2  2    Dory  26          A      F
## 3  3     Ken  24          B      M
## 4  4   Danny  22          B      M
## 5  5   Sarah  22          B      F
## 6  6     Dan  41          B      M
## 7  7   Kenny  28          O      M
## 8  8 Derrick  31          A      M

* Adding columns to a data frame based on existing columns

x
##   SN    Name Age Blood.Type
## 1  1  Samuel  23          A
## 2  2    Dory  26          A
## 3  3     Ken  24          B
## 4  4   Danny  22          B
## 5  5   Sarah  22          B
## 6  6     Dan  41          B
## 7  7   Kenny  28          O
## 8  8 Derrick  31          A
x$AgeSN <- (x$SN * x$Age)

x
##   SN    Name Age Blood.Type AgeSN
## 1  1  Samuel  23          A    23
## 2  2    Dory  26          A    52
## 3  3     Ken  24          B    72
## 4  4   Danny  22          B    88
## 5  5   Sarah  22          B   110
## 6  6     Dan  41          B   246
## 7  7   Kenny  28          O   196
## 8  8 Derrick  31          A   248

* Adding a vector as a variable to a data frame

x
##   SN    Name Age Blood.Type AgeSN
## 1  1  Samuel  23          A    23
## 2  2    Dory  26          A    52
## 3  3     Ken  24          B    72
## 4  4   Danny  22          B    88
## 5  5   Sarah  22          B   110
## 6  6     Dan  41          B   246
## 7  7   Kenny  28          O   196
## 8  8 Derrick  31          A   248
ID <- c(121,231,452,109,223,76,90,564)
x$ID <- ID

x
##   SN    Name Age Blood.Type AgeSN  ID
## 1  1  Samuel  23          A    23 121
## 2  2    Dory  26          A    52 231
## 3  3     Ken  24          B    72 452
## 4  4   Danny  22          B    88 109
## 5  5   Sarah  22          B   110 223
## 6  6     Dan  41          B   246  76
## 7  7   Kenny  28          O   196  90
## 8  8 Derrick  31          A   248 564

* Adding columns to a data frame with rounded values

x
##   SN    Name Age Blood.Type AgeSN  ID
## 1  1  Samuel  23          A    23 121
## 2  2    Dory  26          A    52 231
## 3  3     Ken  24          B    72 452
## 4  4   Danny  22          B    88 109
## 5  5   Sarah  22          B   110 223
## 6  6     Dan  41          B   246  76
## 7  7   Kenny  28          O   196  90
## 8  8 Derrick  31          A   248 564
nums <- c(1.21,2.23,3.12,4.222,4.4,7.888,1.1,1.0)
x$nums <- nums
x$nums <- round(x$nums,1)

x
##   SN    Name Age Blood.Type AgeSN  ID nums
## 1  1  Samuel  23          A    23 121  1.2
## 2  2    Dory  26          A    52 231  2.2
## 3  3     Ken  24          B    72 452  3.1
## 4  4   Danny  22          B    88 109  4.2
## 5  5   Sarah  22          B   110 223  4.4
## 6  6     Dan  41          B   246  76  7.9
## 7  7   Kenny  28          O   196  90  1.1
## 8  8 Derrick  31          A   248 564  1.0


Deleting from a data frame in R

* Delete a variable of a data frame

x
##   SN    Name Age Blood.Type AgeSN  ID nums
## 1  1  Samuel  23          A    23 121  1.2
## 2  2    Dory  26          A    52 231  2.2
## 3  3     Ken  24          B    72 452  3.1
## 4  4   Danny  22          B    88 109  4.2
## 5  5   Sarah  22          B   110 223  4.4
## 6  6     Dan  41          B   246  76  7.9
## 7  7   Kenny  28          O   196  90  1.1
## 8  8 Derrick  31          A   248 564  1.0
x$ID <- NULL

x
##   SN    Name Age Blood.Type AgeSN nums
## 1  1  Samuel  23          A    23  1.2
## 2  2    Dory  26          A    52  2.2
## 3  3     Ken  24          B    72  3.1
## 4  4   Danny  22          B    88  4.2
## 5  5   Sarah  22          B   110  4.4
## 6  6     Dan  41          B   246  7.9
## 7  7   Kenny  28          O   196  1.1
## 8  8 Derrick  31          A   248  1.0

* Delete a row from a data frame

x
##   SN    Name Age Blood.Type AgeSN nums
## 1  1  Samuel  23          A    23  1.2
## 2  2    Dory  26          A    52  2.2
## 3  3     Ken  24          B    72  3.1
## 4  4   Danny  22          B    88  4.2
## 5  5   Sarah  22          B   110  4.4
## 6  6     Dan  41          B   246  7.9
## 7  7   Kenny  28          O   196  1.1
## 8  8 Derrick  31          A   248  1.0
x <- x[-2,]

x
##   SN    Name Age Blood.Type AgeSN nums
## 1  1  Samuel  23          A    23  1.2
## 3  3     Ken  24          B    72  3.1
## 4  4   Danny  22          B    88  4.2
## 5  5   Sarah  22          B   110  4.4
## 6  6     Dan  41          B   246  7.9
## 7  7   Kenny  28          O   196  1.1
## 8  8 Derrick  31          A   248  1.0
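
After a row is deleted, the remaining rows keep their original labels (1, 3, 4, ... in the output above). The following is a minimal sketch, on a made-up data frame `d`, of resetting the row names and of dropping several columns by name in one step.

```r
# Hypothetical data frame, used only for illustration
d <- data.frame(SN = 1:4,
                Name = c("Samuel", "Dory", "Ken", "Danny"),
                Age = c(23, 26, 24, 22),
                stringsAsFactors = FALSE)

d <- d[-2, ]
row.names(d)          # "1" "3" "4": the original labels are kept
rownames(d) <- NULL   # reset to consecutive labels
row.names(d)          # "1" "2" "3"

# Drop several columns at once by name; drop = FALSE keeps a data frame
d <- d[, !(names(d) %in% c("SN", "Age")), drop = FALSE]
names(d)
```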


Reading from local data files

* Read text files using the read.table() function and CSV files using the read.csv() function

df <- read.table("wvvi.us.txt")

df <- read.csv("wwd.us.csv")
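
The two functions differ mainly in their defaults. The sketch below does not use the files named above; it writes a small made-up table to a temporary file so that it can run on its own. The names `tmp`, `sample_df`, `from_csv` and `from_table` are illustrative.

```r
# Write a small, made-up table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
sample_df <- data.frame(Name = c("Samuel", "Dory"), Age = c(23, 21),
                        stringsAsFactors = FALSE)
write.csv(sample_df, tmp, row.names = FALSE)

# read.csv() assumes a header row and comma separators
from_csv <- read.csv(tmp, stringsAsFactors = FALSE)

# read.table() defaults to header = FALSE and whitespace separators,
# so both must be set explicitly to read the same file
from_table <- read.table(tmp, header = TRUE, sep = ",",
                         stringsAsFactors = FALSE)

str(from_csv)
str(from_table)

unlink(tmp)  # remove the temporary file
```

Whether a real text file needs header = TRUE or a particular sep depends on how that file is laid out, so it is worth checking the result with str() or head() after reading.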