DataFrame

A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values, one from each column.

Characteristics of a Data Frame

The following are the characteristics of a Data Frame:

The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor, or character type.
Each column should contain the same number of data items.
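
Two of these rules can be checked directly. The sketch below is only an illustration: the data frame demo and the variables unequal and duplicate are hypothetical names, and tryCatch() is used so that the errors are captured as messages instead of stopping the script.

```r
# Columns of unequal length (3 items vs 2 items) cannot form a data frame.
unequal <- tryCatch(
    data.frame(x = 1:3, y = c("a", "b")),
    error = function(e) conditionMessage(e)
)
print(unequal)  # the captured error message

# Row names must be unique, so assigning duplicates raises an error.
demo <- data.frame(x = 1:3, y = c("a", "b", "c"))
duplicate <- suppressWarnings(tryCatch(
    {
        rownames(demo) <- c("r1", "r1", "r2")
        "assigned"
    },
    error = function(e) conditionMessage(e)
))
print(duplicate)  # the captured error message, not "assigned"
```

In both cases R refuses the operation, so the variable ends up holding the error message.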

Creating a Data Frame

To create a data frame, use the data.frame() function.

For example, this code creates a data frame holding information on five employees: their names, ages, salaries, and professions.

employees_info <- data.frame(
    emp_names = c("Frank", "Micheal", "Peter", "Melissa", "Esther"),
    emp_age = c(43, 32, 40, 41, 45),
    salaries = c(1e+05, 2e+05, 50000, 5e+05, 70000),
    emp_profession = c("Data Scientist", "Senior It Admin", "associate accountant",
        "Excutive director", "junior data scientist"),
    stringsAsFactors = FALSE
)
print(employees_info)
  emp_names emp_age salaries        emp_profession
1     Frank      43    1e+05        Data Scientist
2   Micheal      32    2e+05       Senior It Admin
3     Peter      40    5e+04  associate accountant
4   Melissa      41    5e+05     Excutive director
5    Esther      45    7e+04 junior data scientist

To find the class of an object, use the class() function. For example, the class of the employees_info object is data.frame.
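
As a minimal sketch of this, the code below calls class() on a small stand-in data frame. The name staff and its values are hypothetical, chosen only to keep the example self-contained.

```r
# Small stand-in data frame (hypothetical values)
staff <- data.frame(
    name = c("Ann", "Bob"),
    age = c(30, 41),
    stringsAsFactors = FALSE
)

class(staff)          # class of the whole object: "data.frame"
class(staff$name)     # class of one column: "character"
class(staff$age)      # class of one column: "numeric"
is.data.frame(staff)  # TRUE
```

Note that class() works on individual columns as well as on the data frame as a whole.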

Additionally, the structure of a data frame can be inspected using the str() function. For example, the structure of the data frame created in the previous example is:

str(employees_info)
'data.frame':   5 obs. of  4 variables:
 $ emp_names     : chr  "Frank" "Micheal" "Peter" "Melissa" ...
 $ emp_age       : num  43 32 40 41 45
 $ salaries      : num  1e+05 2e+05 5e+04 5e+05 7e+04
 $ emp_profession: chr  "Data Scientist" "Senior It Admin" "associate accountant" "Excutive director" ...

Summary of Variables in Data Frame

The summary() function summarizes each variable in a data frame. As an example, apply summary() to the employees_info data frame created earlier. Because emp_age and salaries are numeric variables, summary() reports their minimum, quartiles, median, mean, and maximum; emp_names and emp_profession are character variables, so only their length, class, and mode are reported.

summary(employees_info)
  emp_names            emp_age        salaries      emp_profession    
 Length:5           Min.   :32.0   Min.   : 50000   Length:5          
 Class :character   1st Qu.:40.0   1st Qu.: 70000   Class :character  
 Mode  :character   Median :41.0   Median :100000   Mode  :character  
                    Mean   :40.2   Mean   :184000                     
                    3rd Qu.:43.0   3rd Qu.:200000                     
                    Max.   :45.0   Max.   :500000
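
summary() can also be applied to a single column, and individual statistics can then be picked out by name. The following is a sketch under that assumption; pay and s are hypothetical names, and the salary values are the ones used above.

```r
# One-column stand-in holding the salaries from the example above
pay <- data.frame(salaries = c(1e+05, 2e+05, 50000, 5e+05, 70000))

# Summarize just this column
s <- summary(pay$salaries)
print(s)

# Individual statistics are available by name
s[["Median"]]  # 100000, matching the Median in the table above
s[["Mean"]]    # 184000, matching the Mean in the table above
```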

Manipulating column names and row names

To find the row names of a data frame, apply the rownames() function to it. For example, rownames(employees_info) returns the name of each row in the employees_info data frame created earlier.

rownames(employees_info)
[1] "1" "2" "3" "4" "5"

The rownames() function can also be used to add or change row names. For example, we can assign row names to employees_info as follows:

rownames(employees_info) <- c("row1", "row2", "row3", "row4", 
    "row5")
employees_info
     emp_names emp_age salaries        emp_profession
row1     Frank      43    1e+05        Data Scientist
row2   Micheal      32    2e+05       Senior It Admin
row3     Peter      40    5e+04  associate accountant
row4   Melissa      41    5e+05     Excutive director
row5    Esther      45    7e+04 junior data scientist
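
Once row names are set, they can be used to select rows. The sketch below shows this on a hypothetical stand-in data frame called staff; the names and values are assumptions made for the illustration.

```r
# Stand-in data frame with custom row names (hypothetical values)
staff <- data.frame(
    name = c("Ann", "Bob", "Cy"),
    age = c(30, 41, 25),
    stringsAsFactors = FALSE
)
rownames(staff) <- c("row1", "row2", "row3")

staff["row2", ]              # the entire second row, selected by name
staff["row2", "age"]         # a single value: 41
staff[c("row1", "row3"), ]   # several rows at once
```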

Furthermore, the colnames() function can be applied to any data frame to return the name of each variable. Accordingly, the following code returns the column names of the employees_info data frame. The dimnames() function returns both the row names and the column names as a list.

colnames(employees_info)
[1] "emp_names"      "emp_age"        "salaries"       "emp_profession"
dimnames(employees_info)
[[1]]
[1] "row1" "row2" "row3" "row4" "row5"

[[2]]
[1] "emp_names"      "emp_age"        "salaries"       "emp_profession"

Just as we assigned row names, we can change the column names using the colnames() function, as in the following example:

colnames(employees_info) <- c("employee_names", "employee_age", 
    "employee_salaries", "employee_profession")
employees_info
     employee_names employee_age employee_salaries   employee_profession
row1          Frank           43             1e+05        Data Scientist
row2        Micheal           32             2e+05       Senior It Admin
row3          Peter           40             5e+04  associate accountant
row4        Melissa           41             5e+05     Excutive director
row5         Esther           45             7e+04 junior data scientist
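
It is not necessary to replace every name at once. As a sketch, a single column can be renamed by indexing into colnames(); the data frame staff and the new names below are hypothetical.

```r
# Stand-in data frame (hypothetical values)
staff <- data.frame(
    name = c("Ann", "Bob"),
    age = c(30, 41),
    stringsAsFactors = FALSE
)

# Rename one column by position
colnames(staff)[2] <- "age_years"

# Rename one column by its current name, which does not depend on column order
colnames(staff)[colnames(staff) == "name"] <- "full_name"

colnames(staff)  # "full_name" "age_years"
```

Renaming by current name is usually the safer choice when the column order may change.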

Indexing Data Frames

Elements of a data frame are selected by indexing with square brackets, giving the rows first and the columns second. For instance, we could select the first and second rows of the first and second columns.

employees_info[1:2, 1:2]
     employee_names employee_age
row1          Frank           43
row2        Micheal           32

It is also possible to extract elements from a single column. In the following example, the first, second, and third elements of the first column are extracted from the employees_info data frame.

employees_info[[1]][1:3]
[1] "Frank"   "Micheal" "Peter"
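
Base R offers several other ways to index a data frame, and they differ in what they return. The sketch below compares them on a hypothetical stand-in data frame called staff; older is likewise an illustrative name.

```r
# Stand-in data frame (hypothetical values)
staff <- data.frame(
    name = c("Ann", "Bob", "Cy"),
    age = c(30, 41, 25),
    stringsAsFactors = FALSE
)

staff[["name"]]   # the column as a plain vector
staff$name        # the same vector, using the $ shorthand
staff[, "name"]   # the same vector (a single column is dropped to a vector)
staff["name"]     # a one-column data frame, not a vector

# Logical indexing keeps only the rows where the condition is TRUE
older <- staff[staff$age > 28, ]
older$name        # "Ann" "Bob"
```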

cbind() and rbind() with Data Frames

cbind() stands for column bind. The cbind() function combines vectors, matrices, and data frames by columns. As a case in point, the code that follows applies cbind() to combine data_info with the vector v3 by column to form a data frame.

# Create a data frame
data_info <- data.frame(col1 = 1:3, col2 = c(2, 3, 4), col3 = c(120, 
    130, 150))

# Create a vector of decimal numbers
v3 <- c(12.23, 22.2, 30.3)

# Combine the data frame with the vector using the cbind() function
cbind(data_info, v3)
  col1 col2 col3    v3
1    1    2  120 12.23
2    2    3  130 22.20
3    3    4  150 30.30
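
By default the new column takes the name of the vector. As a sketch, a different name can be supplied directly in the call; the column name score and the variable named are hypothetical choices for this illustration.

```r
data_info <- data.frame(col1 = 1:3, col2 = c(2, 3, 4), col3 = c(120, 130, 150))
v3 <- c(12.23, 22.2, 30.3)

# Name the new column while binding
named <- cbind(data_info, score = v3)
colnames(named)  # "col1" "col2" "col3" "score"

# Assigning with $ adds the same column to the original data frame in place
data_info$score <- v3
colnames(data_info)  # "col1" "col2" "col3" "score"
```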

In the same vein, rbind() stands for row bind. The rbind() function combines vectors, matrices, and/or data frames by rows. However, caution is needed when applying rbind(), because it can change the classes of the columns if not handled properly. For instance, combining data_info with the vector v3 by row would coerce the integer column col1 to numeric (double) so that it can hold the decimal value 12.23. To avoid this, bind a data frame whose columns have the same names and types as the original, rather than a vector:

# 4L is an integer literal, so col1 keeps its integer class
df <- data.frame(col1 = 4L, col2 = 5, col3 = 200)
rbind(data_info, df)
  col1 col2 col3
1    1    2  120
2    2    3  130
3    3    4  150
4    4    5  200
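
The effect on column classes can be inspected with sapply() and class(). The sketch below assumes the new row is given as a data frame whose col1 is the integer literal 4L; new_row is a hypothetical name used only here.

```r
data_info <- data.frame(col1 = 1:3, col2 = c(2, 3, 4), col3 = c(120, 130, 150))
v3 <- c(12.23, 22.2, 30.3)

# Before binding: col1 is "integer", col2 and col3 are "numeric"
sapply(data_info, class)

# Binding the decimal vector by row coerces col1 to "numeric"
sapply(rbind(data_info, v3), class)

# Binding a data frame whose col1 is an integer keeps col1 as "integer"
new_row <- data.frame(col1 = 4L, col2 = 5, col3 = 200)
sapply(rbind(data_info, new_row), class)
```

Writing 4 instead of 4L in new_row would create a numeric value, and col1 would again be coerced to numeric.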