R DataFrame

A dataframe is a two-dimensional data structure in R made up of rows and columns. It is a special case of a list in which every component has the same length. Unlike a matrix, whose elements must all share one type, each column of a dataframe can hold a different type: character, integer, logical and so on.

1. Create a Dataframe

The code below creates a dataframe with character, numeric and logical columns.

name <- c("Fatimah Nizam", "Basyir Nizam", "Adam Sinclair", "Harry Styles",
          "Maisarah Zairi", "Fateh Malik","Henry Golding", "Kendall Jenner", "Michael Jackson", "Gigi Hadid")
age <- c(24, 15, 30, 29, 17, 16, 35, 25, 14, 15)
teen <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,TRUE, TRUE)
people <- data.frame(name,age,teen, stringsAsFactors = FALSE)

people
##               name age  teen
## 1    Fatimah Nizam  24 FALSE
## 2     Basyir Nizam  15  TRUE
## 3    Adam Sinclair  30 FALSE
## 4     Harry Styles  29 FALSE
## 5   Maisarah Zairi  17  TRUE
## 6      Fateh Malik  16  TRUE
## 7    Henry Golding  35 FALSE
## 8   Kendall Jenner  25 FALSE
## 9  Michael Jackson  14  TRUE
## 10      Gigi Hadid  15  TRUE

This creates a dataframe named people. The stringsAsFactors argument is a logical flag that controls whether character columns are converted to factors (TRUE) or kept as plain strings (FALSE). Since R 4.0.0 the default is FALSE.
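
To see what stringsAsFactors changes, the short sketch below builds the same two-row dataframe twice, once with each setting, and compares the class of the text column. The object names as_text and as_factor are made up for this illustration and are not used elsewhere.

```r
# Hypothetical two-row example; the object names are illustrative only.
name <- c("Fatimah Nizam", "Basyir Nizam")
age  <- c(24, 15)

as_text   <- data.frame(name, age, stringsAsFactors = FALSE)
as_factor <- data.frame(name, age, stringsAsFactors = TRUE)

class(as_text$name)     # "character": strings are kept as plain text
class(as_factor$name)   # "factor": strings are converted to categories
levels(as_factor$name)  # the distinct values stored by the factor
```

Plain strings are usually easier to work with when the column holds free text such as names, which is why FALSE is used here.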

To check the class of the dataframe, use the class() function.

class(people)
## [1] "data.frame"

The typeof() function returns the internal storage type of an object, which is lower-level than the class reported by class().

typeof(people)
## [1] "list"

The result shows that a dataframe is stored internally as a list. Each column is one component of that list, and all components have the same length. This is what defines a dataframe.
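
As a quick check of the list structure, the sketch below inspects a small stand-in dataframe. The name people_demo is hypothetical and only mirrors the three column types used above.

```r
# Small stand-in for the people dataframe (same three column types).
people_demo <- data.frame(
  name = c("Fatimah Nizam", "Basyir Nizam"),
  age  = c(24, 15),
  teen = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(people_demo)         # TRUE: the dataframe itself is a list
sapply(people_demo, typeof)  # storage type of each column
sapply(people_demo, length)  # every column has the same length
```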

To check the length of the dataframe, which is the number of columns because each column is one list component:

length(people)
## [1] 3

To assign names to the rows of the dataframe:

row.names(people) <- c("name1","name2","name3","name4","name5","name6","name7","name8","name9", "name10")
people
##                   name age  teen
## name1    Fatimah Nizam  24 FALSE
## name2     Basyir Nizam  15  TRUE
## name3    Adam Sinclair  30 FALSE
## name4     Harry Styles  29 FALSE
## name5   Maisarah Zairi  17  TRUE
## name6      Fateh Malik  16  TRUE
## name7    Henry Golding  35 FALSE
## name8   Kendall Jenner  25 FALSE
## name9  Michael Jackson  14  TRUE
## name10      Gigi Hadid  15  TRUE
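
Once rows have names, they can be used for lookups in the same way as numeric indices. The sketch below assumes a three-row stand-in called people_demo (a hypothetical name) and also shows how to reset the row names to the default numbering.

```r
# Hypothetical three-row dataframe with custom row names.
people_demo <- data.frame(
  name = c("Fatimah Nizam", "Basyir Nizam", "Adam Sinclair"),
  age  = c(24, 15, 30),
  stringsAsFactors = FALSE
)
row.names(people_demo) <- c("name1", "name2", "name3")

adam       <- people_demo["name3", ]      # one row, selected by its label
basyir_age <- people_demo["name2", "age"] # single value: row label plus column name

row.names(people_demo) <- NULL            # reset to the default 1, 2, 3
row.names(people_demo)
```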

2. Ways to rename the columns of a dataframe

There are several ways to rename the columns. Solution 1 is to use the names() function.

names(people) <- c("Full Name", "Age", "Teenager")
people
##              Full Name Age Teenager
## name1    Fatimah Nizam  24    FALSE
## name2     Basyir Nizam  15     TRUE
## name3    Adam Sinclair  30    FALSE
## name4     Harry Styles  29    FALSE
## name5   Maisarah Zairi  17     TRUE
## name6      Fateh Malik  16     TRUE
## name7    Henry Golding  35    FALSE
## name8   Kendall Jenner  25    FALSE
## name9  Michael Jackson  14     TRUE
## name10      Gigi Hadid  15     TRUE

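Solution 2 is to assign the new names directly in the data.frame() call. Because the dataframe is rebuilt from the original vectors, the row names set earlier are replaced by the default numbering.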

people <- data.frame(Full_Name = name, Age = age, Teenager = teen)
people
##          Full_Name Age Teenager
## 1    Fatimah Nizam  24    FALSE
## 2     Basyir Nizam  15     TRUE
## 3    Adam Sinclair  30    FALSE
## 4     Harry Styles  29    FALSE
## 5   Maisarah Zairi  17     TRUE
## 6      Fateh Malik  16     TRUE
## 7    Henry Golding  35    FALSE
## 8   Kendall Jenner  25    FALSE
## 9  Michael Jackson  14     TRUE
## 10      Gigi Hadid  15     TRUE
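
Both solutions rename every column at once. When only one column needs a new name, a possible approach is to index into names() first. The sketch below uses a hypothetical people_demo dataframe.

```r
# Hypothetical dataframe with the original lower-case column names.
people_demo <- data.frame(
  name = c("Fatimah Nizam", "Basyir Nizam"),
  age  = c(24, 15),
  teen = c(FALSE, TRUE)
)

# Rename only the column currently called "teen".
names(people_demo)[names(people_demo) == "teen"] <- "Teenager"

# Rename by position: the first column becomes "Full_Name".
names(people_demo)[1] <- "Full_Name"

names(people_demo)
```

Matching on the old name is a little safer than using the position, since it keeps working if the column order changes.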

3. Determine the number of rows and columns

The nrow() and ncol() functions return the number of rows and columns in a dataframe. paste0() joins the label and the count into a single string, without a separator, so that print() shows them together on one line.

row <- nrow(people)
print(paste0("Number of rows: ", row))
## [1] "Number of rows: 10"
col <- ncol(people)
print(paste0("Number of columns: ", col))
## [1] "Number of columns: 3"

To get both dimensions of a dataframe at once, use the dim() function. It returns the number of rows followed by the number of columns.

dim(people)
## [1] 10  3

4. Accessing the components of a dataframe

To view the first rows of a dataframe, use the head() function; to view the last rows, use tail(). Both return 6 rows by default.

print("The first 6 rows")
## [1] "The first 6 rows"
head(people)
##        Full_Name Age Teenager
## 1  Fatimah Nizam  24    FALSE
## 2   Basyir Nizam  15     TRUE
## 3  Adam Sinclair  30    FALSE
## 4   Harry Styles  29    FALSE
## 5 Maisarah Zairi  17     TRUE
## 6    Fateh Malik  16     TRUE
print("The last 6 rows")
## [1] "The last 6 rows"
tail(people)
##          Full_Name Age Teenager
## 5   Maisarah Zairi  17     TRUE
## 6      Fateh Malik  16     TRUE
## 7    Henry Golding  35    FALSE
## 8   Kendall Jenner  25    FALSE
## 9  Michael Jackson  14     TRUE
## 10      Gigi Hadid  15     TRUE
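
The number of rows returned by head() and tail() can be changed with the n argument. The sketch below is a minimal illustration on a hypothetical ten-row dataframe called people_demo.

```r
# Hypothetical ten-row dataframe; id simply numbers the rows.
people_demo <- data.frame(
  id  = 1:10,
  age = c(24, 15, 30, 29, 17, 16, 35, 25, 14, 15)
)

first_three <- head(people_demo, n = 3)   # rows 1 to 3
last_two    <- tail(people_demo, n = 2)   # rows 9 and 10
all_but_two <- head(people_demo, n = -2)  # negative n drops rows from the end
```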

To access all of the rows except the last one:

people[1:9,]
##         Full_Name Age Teenager
## 1   Fatimah Nizam  24    FALSE
## 2    Basyir Nizam  15     TRUE
## 3   Adam Sinclair  30    FALSE
## 4    Harry Styles  29    FALSE
## 5  Maisarah Zairi  17     TRUE
## 6     Fateh Malik  16     TRUE
## 7   Henry Golding  35    FALSE
## 8  Kendall Jenner  25    FALSE
## 9 Michael Jackson  14     TRUE

To exclude the second row from the dataframe:

people[-2,]
##          Full_Name Age Teenager
## 1    Fatimah Nizam  24    FALSE
## 3    Adam Sinclair  30    FALSE
## 4     Harry Styles  29    FALSE
## 5   Maisarah Zairi  17     TRUE
## 6      Fateh Malik  16     TRUE
## 7    Henry Golding  35    FALSE
## 8   Kendall Jenner  25    FALSE
## 9  Michael Jackson  14     TRUE
## 10      Gigi Hadid  15     TRUE
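
The row numbers in the two examples above are typed by hand. A variation that does not depend on knowing the row count is sketched below, using a hypothetical five-row people_demo dataframe.

```r
# Hypothetical five-row dataframe; id simply numbers the rows.
people_demo <- data.frame(id = 1:5, age = c(24, 15, 30, 29, 17))

# Drop the last row, whatever its index happens to be.
without_last <- people_demo[-nrow(people_demo), ]

# Drop several rows at once by negating a vector of indices.
without_some <- people_demo[-c(2, 4), ]
```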

Ways to select a single element from a dataframe, either by row and column index or by row index and column name:

people[3,2]
## [1] 30
people[3,"Age"]
## [1] 30

Several specific elements can be selected at once by passing vectors of row indices and column names. For example, suppose we want the Age and Teenager details for Fatimah Nizam (row 1) and Gigi Hadid (row 10).

people[c(1,10),c("Age", "Teenager")]
##    Age Teenager
## 1   24    FALSE
## 10  15     TRUE

The names Fatimah Nizam and Gigi Hadid are not displayed because the Full_Name column was not selected.
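
If the row numbers are not known in advance, one option is to select the rows by matching on the name column instead. The sketch below assumes a three-row stand-in called people_demo; the helper vector wanted is also a made-up name.

```r
# Hypothetical stand-in for the people dataframe.
people_demo <- data.frame(
  Full_Name = c("Fatimah Nizam", "Basyir Nizam", "Gigi Hadid"),
  Age       = c(24, 15, 15),
  Teenager  = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Full_Name appears in the wanted vector.
wanted <- c("Fatimah Nizam", "Gigi Hadid")
picked <- people_demo[people_demo$Full_Name %in% wanted, ]
picked
```

Leaving the column position empty after the comma keeps all columns, so the names are shown alongside the other details.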

5. Accessing the columns’ elements of a dataframe

There are a few ways to access a column. The $ operator, double brackets [[ ]] and single brackets [ ] can all be used.

people$Age
##  [1] 24 15 30 29 17 16 35 25 14 15
people[["Age"]]
##  [1] 24 15 30 29 17 16 35 25 14 15
people["Age"]
##    Age
## 1   24
## 2   15
## 3   30
## 4   29
## 5   17
## 6   16
## 7   35
## 8   25
## 9   14
## 10  15

The results of people$Age, people[["Age"]] and people["Age"] differ. The first two return a vector, while the last returns a one-column dataframe. Remember, a dataframe is a list of vectors that all have the same length.

The same holds when you use the column's index instead of its name: people[2] returns a one-column dataframe and people[[2]] returns a vector.
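
The differences between these access styles can be checked directly. The sketch below uses a hypothetical two-row people_demo dataframe; all variable names on the left-hand side are illustrative.

```r
# Hypothetical two-row stand-in for the people dataframe.
people_demo <- data.frame(
  Full_Name = c("Fatimah Nizam", "Basyir Nizam"),
  Age       = c(24, 15),
  stringsAsFactors = FALSE
)

by_dollar  <- people_demo$Age        # numeric vector
by_double  <- people_demo[["Age"]]   # numeric vector
by_single  <- people_demo["Age"]     # one-column dataframe
by_index   <- people_demo[[2]]       # same vector, selected by position
kept_frame <- people_demo[, "Age", drop = FALSE]  # row-and-column form that stays a dataframe

is.data.frame(by_single)  # TRUE
is.numeric(by_dollar)     # TRUE
people_demo$age           # NULL: column names are case sensitive
```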

6. Extending the dataframe

There are several ways to add a column to a dataframe. Solution 1 is to assign a vector to a new column name in the people dataframe.

# Create the vector that will become the height column
height <- c(165, 177, 163, 162, 157, 170, 180, 167, 175, 171)
# Assign the vector to a new column of the dataframe
people$height <- height
# or, equivalently
people[["height"]] <- height
# result
people
##          Full_Name Age Teenager height
## 1    Fatimah Nizam  24    FALSE    165
## 2     Basyir Nizam  15     TRUE    177
## 3    Adam Sinclair  30    FALSE    163
## 4     Harry Styles  29    FALSE    162
## 5   Maisarah Zairi  17     TRUE    157
## 6      Fateh Malik  16     TRUE    170
## 7    Henry Golding  35    FALSE    180
## 8   Kendall Jenner  25    FALSE    167
## 9  Michael Jackson  14     TRUE    175
## 10      Gigi Hadid  15     TRUE    171
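
The new column must fit the number of rows. The sketch below, on a hypothetical four-row people_demo dataframe, shows what happens when it does not: R raises an error, which is caught here with tryCatch() so the script keeps running. The returned strings are made up for the illustration.

```r
# Hypothetical four-row dataframe.
people_demo <- data.frame(Age = c(24, 15, 30, 29))

# Four values for four rows: the assignment works.
people_demo$height <- c(165, 177, 163, 162)

# Three values for four rows: R refuses, so the error is caught here.
result <- tryCatch(
  {
    people_demo$weight <- c(58, 63, 68)
    "assigned"
  },
  error = function(e) "length mismatch"
)
result
```

Checking length(height) == nrow(people) before assigning is a simple way to avoid this error.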

Solution 2 is to use the cbind() function.

weight <- c(58, 63, 68, 55, 56, 70, 64, 65, 75, 55)
cbind(people, weight)
##          Full_Name Age Teenager height weight
## 1    Fatimah Nizam  24    FALSE    165     58
## 2     Basyir Nizam  15     TRUE    177     63
## 3    Adam Sinclair  30    FALSE    163     68
## 4     Harry Styles  29    FALSE    162     55
## 5   Maisarah Zairi  17     TRUE    157     56
## 6      Fateh Malik  16     TRUE    170     70
## 7    Henry Golding  35    FALSE    180     64
## 8   Kendall Jenner  25    FALSE    167     65
## 9  Michael Jackson  14     TRUE    175     75
## 10      Gigi Hadid  15     TRUE    171     55

However, cbind() returns a new dataframe and leaves the original unchanged. The weight column has not been added to the people dataframe.

people
##          Full_Name Age Teenager height
## 1    Fatimah Nizam  24    FALSE    165
## 2     Basyir Nizam  15     TRUE    177
## 3    Adam Sinclair  30    FALSE    163
## 4     Harry Styles  29    FALSE    162
## 5   Maisarah Zairi  17     TRUE    157
## 6      Fateh Malik  16     TRUE    170
## 7    Henry Golding  35    FALSE    180
## 8   Kendall Jenner  25    FALSE    167
## 9  Michael Jackson  14     TRUE    175
## 10      Gigi Hadid  15     TRUE    171

The fix is to assign the result of cbind() back to people.

weight <- c(58, 63, 68, 55, 56, 70, 64, 65, 75, 55)
people <- cbind(people, weight)
people
##          Full_Name Age Teenager height weight
## 1    Fatimah Nizam  24    FALSE    165     58
## 2     Basyir Nizam  15     TRUE    177     63
## 3    Adam Sinclair  30    FALSE    163     68
## 4     Harry Styles  29    FALSE    162     55
## 5   Maisarah Zairi  17     TRUE    157     56
## 6      Fateh Malik  16     TRUE    170     70
## 7    Henry Golding  35    FALSE    180     64
## 8   Kendall Jenner  25    FALSE    167     65
## 9  Michael Jackson  14     TRUE    175     75
## 10      Gigi Hadid  15     TRUE    171     55

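Now we would like to add a row to the dataframe. The new row is built as a one-row dataframe with the same column names and then appended with rbind():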

tom <- data.frame(Full_Name = "Tom Riddle", Age= 37,
                  Teenager = FALSE, height = 183, weight = 80)
people <- rbind(people, tom)
people
##          Full_Name Age Teenager height weight
## 1    Fatimah Nizam  24    FALSE    165     58
## 2     Basyir Nizam  15     TRUE    177     63
## 3    Adam Sinclair  30    FALSE    163     68
## 4     Harry Styles  29    FALSE    162     55
## 5   Maisarah Zairi  17     TRUE    157     56
## 6      Fateh Malik  16     TRUE    170     70
## 7    Henry Golding  35    FALSE    180     64
## 8   Kendall Jenner  25    FALSE    167     65
## 9  Michael Jackson  14     TRUE    175     75
## 10      Gigi Hadid  15     TRUE    171     55
## 11      Tom Riddle  37    FALSE    183     80
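
rbind() expects the new row to have the same column names as the existing dataframe. The sketch below illustrates this with a hypothetical one-row people_demo dataframe; good_row, bad_row and the returned strings are names chosen for the example.

```r
# Hypothetical one-row dataframe to append to.
people_demo <- data.frame(Full_Name = "Fatimah Nizam", Age = 24,
                          stringsAsFactors = FALSE)

good_row <- data.frame(Full_Name = "Tom Riddle", Age = 37,
                       stringsAsFactors = FALSE)
bad_row  <- data.frame(name = "Tom Riddle", Age = 37,
                       stringsAsFactors = FALSE)

# Column names match, so the row is appended.
combined <- rbind(people_demo, good_row)

# "name" does not match "Full_Name", so rbind() raises an error.
outcome <- tryCatch(
  {
    rbind(people_demo, bad_row)
    "bound"
  },
  error = function(e) "names do not match"
)
outcome
```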

7. Sorting and re-ordering a dataframe

We can sort the elements of a column. For example, to sort the Age column in increasing order:

x <- sort(people$Age)
x
##  [1] 14 15 15 16 17 24 25 29 30 35 37

We can also use order(), which returns the row indices that would put the elements in sorted order. The dataframe itself is not changed.

ranks <- order(people$Age)
ranks
##  [1]  9  2 10  6  5  1  8  4  3  7 11
people
##          Full_Name Age Teenager height weight
## 1    Fatimah Nizam  24    FALSE    165     58
## 2     Basyir Nizam  15     TRUE    177     63
## 3    Adam Sinclair  30    FALSE    163     68
## 4     Harry Styles  29    FALSE    162     55
## 5   Maisarah Zairi  17     TRUE    157     56
## 6      Fateh Malik  16     TRUE    170     70
## 7    Henry Golding  35    FALSE    180     64
## 8   Kendall Jenner  25    FALSE    167     65
## 9  Michael Jackson  14     TRUE    175     75
## 10      Gigi Hadid  15     TRUE    171     55
## 11      Tom Riddle  37    FALSE    183     80

We can see that 14 is the lowest Age and it belongs to Michael Jackson, so his row index, 9, comes first in ranks. Tom Riddle is the oldest at 37, so his row index, 11, comes last.

To reorder the dataframe by Age, from the oldest to the youngest:

people[order(people$Age, decreasing = TRUE), ]
##          Full_Name Age Teenager height weight
## 11      Tom Riddle  37    FALSE    183     80
## 7    Henry Golding  35    FALSE    180     64
## 3    Adam Sinclair  30    FALSE    163     68
## 4     Harry Styles  29    FALSE    162     55
## 8   Kendall Jenner  25    FALSE    167     65
## 1    Fatimah Nizam  24    FALSE    165     58
## 5   Maisarah Zairi  17     TRUE    157     56
## 6      Fateh Malik  16     TRUE    170     70
## 2     Basyir Nizam  15     TRUE    177     63
## 10      Gigi Hadid  15     TRUE    171     55
## 9  Michael Jackson  14     TRUE    175     75
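
When several people share the same Age, a second sort key can decide their order. The sketch below is one possible way to do this, on a hypothetical four-row people_demo dataframe: it sorts from oldest to youngest and breaks ties alphabetically by name.

```r
# Hypothetical four-row dataframe with a tie on Age (two people aged 15).
people_demo <- data.frame(
  Full_Name = c("Basyir Nizam", "Gigi Hadid", "Fatimah Nizam", "Michael Jackson"),
  Age       = c(15, 15, 24, 14),
  stringsAsFactors = FALSE
)

# Negating Age sorts it in decreasing order; Full_Name breaks the ties.
ordering <- order(-people_demo$Age, people_demo$Full_Name)
sorted   <- people_demo[ordering, ]
sorted
```

Negating the key works for numeric columns only; for a single key, the decreasing = TRUE argument does the same job.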

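8. Inspecting the structure of the Dataframe

The str() function displays the structure of the dataframe: the number of observations and variables, plus the type and first few values of each column. It only prints this overview and does not convert or store anything.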

str(people)
## 'data.frame':    11 obs. of  5 variables:
##  $ Full_Name: chr  "Fatimah Nizam" "Basyir Nizam" "Adam Sinclair" "Harry Styles" ...
##  $ Age      : num  24 15 30 29 17 16 35 25 14 15 ...
##  $ Teenager : logi  FALSE TRUE FALSE FALSE TRUE TRUE ...
##  $ height   : num  165 177 163 162 157 170 180 167 175 171 ...
##  $ weight   : num  58 63 68 55 56 70 64 65 75 55 ...

9. Summarization of the DataFrame

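Finally, to summarize the dataframe. For numeric columns, summary() reports the minimum, quartiles, median, mean and maximum; for logical columns it counts the TRUE and FALSE values.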

summary(people)
##   Full_Name              Age         Teenager           height   
##  Length:11          Min.   :14.00   Mode :logical   Min.   :157  
##  Class :character   1st Qu.:15.50   FALSE:6         1st Qu.:164  
##  Mode  :character   Median :24.00   TRUE :5         Median :170  
##                     Mean   :23.36                   Mean   :170  
##                     3rd Qu.:29.50                   3rd Qu.:176  
##                     Max.   :37.00                   Max.   :183  
##      weight     
##  Min.   :55.00  
##  1st Qu.:57.00  
##  Median :64.00  
##  Mean   :64.45  
##  3rd Qu.:69.00  
##  Max.   :80.00
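
The same statistics can be computed one at a time for a single column. The sketch below recomputes a few of the Age figures from a plain vector holding the eleven ages; the variable names are chosen for this example.

```r
# The eleven Age values from the people dataframe, as a plain vector.
age <- c(24, 15, 30, 29, 17, 16, 35, 25, 14, 15, 37)

age_mean      <- mean(age)                      # arithmetic mean
age_median    <- median(age)                    # middle value
age_range     <- range(age)                     # smallest and largest value
age_quartiles <- quantile(age, c(0.25, 0.75))   # first and third quartiles

round(age_mean, 2)
age_quartiles
```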

Thank you, and I hope this covers everything you need to get started with R dataframes!