Hello, everyone! In this walkthrough we demonstrate some useful features of data frames in R.

What is Data Frame?

A data frame is a two-dimensional data structure in R. Each column is a component of the data frame, and all columns must have the same length.
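
As a minimal sketch of the same-length rule (the object names `toy`, `col_lengths` and `bad` are hypothetical and not part of the data set used later), we can build a tiny data frame, confirm it is a list of equally long columns, and see that columns with incompatible lengths are rejected:

```r
# Hypothetical toy data, not the data set used in the steps below
toy <- data.frame(id = 1:3, label = c("a", "b", "c"))

is.list(toy)                        # a data frame is a list of columns
col_lengths <- sapply(toy, length)  # length of each column
col_lengths

# Columns whose lengths cannot be matched up are rejected with an error
bad <- tryCatch(
  data.frame(id = 1:3, label = c("a", "b")),
  error = function(e) "error"
)
bad
```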

Dataset source

We created the data set ourselves. It is a sample of 5 individuals (also called observations or records) and 4 demographic attributes (also called variables). In other words, the data frame has 5 rows and 4 columns. Let’s begin the adventure!

Loading Library

library(kableExtra)

Let’s begin!

Step 1: Creating our data frame

There are several ways to create a data frame. We will demonstrate 2 of them:

By using the function data.frame() directly
By creating vectors first, then combining them into a data frame

1. By using function data.frame()

mydata<-data.frame(Name=c("Ali","Bella","Chong","Deva","Elena"),Age=c(23,26,22,25,28),Sex=c("Male","Female","Male","Male","Female"),Salary=c(2500,3000,2800,4000,3700))
mydata

##    Name Age    Sex Salary
## 1   Ali  23   Male   2500
## 2 Bella  26 Female   3000
## 3 Chong  22   Male   2800
## 4  Deva  25   Male   4000
## 5 Elena  28 Female   3700

2. By creating vectors first, then combining them into a data frame

Name=c("Ali","Bella","Chong","Deva","Elena")
Age=c(23,26,22,25,28)
Sex=c("Male","Female","Male","Male","Female")
Salary=c(2500,3000,2800,4000,3700)
mydata=data.frame(Name,Age,Sex,Salary)
mydata

##    Name Age    Sex Salary
## 1   Ali  23   Male   2500
## 2 Bella  26 Female   3000
## 3 Chong  22   Male   2800
## 4  Deva  25   Male   4000
## 5 Elena  28 Female   3700
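
The str() output later in this walkthrough shows the text columns as chr, which suggests R 4.0 or later, where data.frame() keeps strings as character by default. As a hedged sketch (the object names are hypothetical), the stringsAsFactors argument controls this behaviour explicitly:

```r
# Hypothetical example vector, shorter than the tutorial data
sex <- c("Male", "Female", "Male")

as_text   <- data.frame(Sex = sex, stringsAsFactors = FALSE)
as_factor <- data.frame(Sex = sex, stringsAsFactors = TRUE)

class(as_text$Sex)     # character column
class(as_factor$Sex)   # factor column
levels(as_factor$Sex)  # factor levels are sorted alphabetically
```

Setting the argument explicitly can make a script behave the same way across R versions.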

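We will use the kable() function from knitr to print the table in a tidier layout that is easier to read. Styling options such as bootstrap_options, font_size and full_width belong to kableExtra::kable_styling() and apply to HTML tables, so they are left out of the "simple" format call here. The align string has one letter per column.

knitr::kable(head(mydata[, 1:4]), format = "simple", align = "lccr")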

Name	Age	Sex	Salary
Ali	23	Male	2500
Bella	26	Female	3000
Chong	22	Male	2800
Deva	25	Male	4000
Elena	28	Female	3700

Step 2: Let’s explore some functions

We are now going to explore 5 functions, namely: 1. names() 2. row.names() 3. nrow() 4. ncol() 5. length()

1. names()

names() returns the variable (column) names.

names(mydata)

## [1] "Name"   "Age"    "Sex"    "Salary"

2. row.names()

Similar to names(), but row.names() returns the names of the rows (observations). By default these are numbers, here 1 to 5. We will change them in the next step.

row.names(mydata)

## [1] "1" "2" "3" "4" "5"

3. nrow()

nrow() returns the number of rows (observations) in our data, which is 5.

nrow(mydata)

## [1] 5

4. ncol()

ncol() returns the number of columns (variables) in our data, which is 4.

ncol(mydata)

## [1] 4

5. length()

Like ncol(), length() returns the number of variables in our data, because a data frame is a list with one element per column.

length(mydata)

## [1] 4
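
To tie the five functions together, here is a small sketch on a hypothetical data frame called `demo` (not the tutorial's mydata). It checks that length() counts columns, that a single column is as long as the number of rows, and that colnames() and rownames() give the same answers as names() and row.names() for a data frame:

```r
# Hypothetical three-row data frame used only for this check
demo <- data.frame(Name = c("Ali", "Bella", "Chong"), Age = c(23, 26, 22))

length(demo) == ncol(demo)                  # length() counts columns
length(demo$Age) == nrow(demo)              # one column has nrow() elements
identical(names(demo), colnames(demo))      # same column names
identical(row.names(demo), rownames(demo))  # same row names
```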

Step 3: How to change row and column name?

Sometimes we want to change row or column names, for example to correct a mistake. This can be done as follows:

1. To change row names: we can assign new names with row.names(mydata) <- value. For instance, let’s change the row names 1 to 5 into A, B, C, D, E.

row.names(mydata)<-c("A","B","C","D","E")
mydata

##    Name Age    Sex Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   2800
## D  Deva  25   Male   4000
## E Elena  28 Female   3700

2. To change column names: there are a few ways to do this. For instance, suppose we want to rename Sex to Gender.

i) Method 1

names(mydata)[names(mydata) == 'Sex'] <- 'Gender'
mydata

##    Name Age Gender Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   2800
## D  Deva  25   Male   4000
## E Elena  28 Female   3700

ii) Method 2

names(mydata)<-c("Name","Age", "Gender","Salary")
mydata

##    Name Age Gender Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   2800
## D  Deva  25   Male   4000
## E Elena  28 Female   3700
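
A third option, sketched here on a hypothetical data frame `demo`, is to rename by column position. The sketch also shows that the name-matching approach from Method 1 simply does nothing when the old name is not present (the column name "Height" is made up for illustration):

```r
# Hypothetical two-column data frame
demo <- data.frame(Name = c("Ali", "Bella"), Sex = c("Male", "Female"))

# Rename by position: the second column becomes "Gender"
renamed_pos <- demo
names(renamed_pos)[2] <- "Gender"
names(renamed_pos)

# Rename by matching a name that does not exist: nothing changes
renamed_safe <- demo
names(renamed_safe)[names(renamed_safe) == "Height"] <- "Tall"
names(renamed_safe)
```

Renaming by name is usually less fragile than renaming by position, because it keeps working if the column order changes.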

Step 4: How to access certain component?

Sometimes we only want to access one component of our data frame. There are a few ways to do this. For instance, suppose we want to know the ages of the individuals.

Access in List style

1. Method 1: This method returns the result as a data frame (as shown below). [Question: What if we would like the result as a vector? (Refer to Method 2)]

mydata['Age']

##   Age
## A  23
## B  26
## C  22
## D  25
## E  28

2. Method 2: There are 3 ways to return the result as a vector.

A. First way

mydata$Age

## [1] 23 26 22 25 28

B. Second way

mydata[["Age"]]

## [1] 23 26 22 25 28

C. Third way

mydata[[2]]

## [1] 23 26 22 25 28

Going one step further, we can access a single element of a column by adding an index to the second or third way above. For instance, suppose we want the age of 25, which is at the 4th position.

From Second way

mydata[["Age"]][4]

## [1] 25

From Third way

mydata[[2]][4]

## [1] 25
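
One difference between the first and second way is worth a quick sketch (on a hypothetical data frame `demo`): `$` accepts an abbreviated column name through partial matching, whereas `[[ ]]` matches names exactly by default and returns NULL for a name that does not exist.

```r
# Hypothetical data frame with a Salary column
demo <- data.frame(Name = c("Ali", "Bella"), Salary = c(2500, 3000))

by_dollar  <- demo$Sal       # partial matching finds the Salary column
by_bracket <- demo[["Sal"]]  # exact matching finds nothing

by_dollar
by_bracket
```

Because of this, `[[ ]]` with the full column name is the stricter choice when a misspelled name should not pass silently.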

Access in Matrix style

mydata

##    Name Age Gender Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   2800
## D  Deva  25   Male   4000
## E Elena  28 Female   3700

Referring to the data above, we can use matrix-style indexing, mydata[row, column], to access specific parts of it, as in the scenarios below.

i) Scenario 1: To access row 2 of our data.

mydata[2,]

##    Name Age Gender Salary
## B Bella  26 Female   3000

ii) Scenario 2: To access the column ‘Salary’, which is the fourth column.

mydata[,4]

## [1] 2500 3000 2800 4000 3700

iii) Scenario 3: To access ‘Salary’ of Bella.

mydata[2,4]

## [1] 3000

Note that the output of Scenario 2 above is a vector. If we would like the output as a data frame instead, only one extra step is needed: add drop=FALSE.

By using scenario 2 above,

i) Result in vector form

checkvector<-mydata[,4]
class(checkvector)

## [1] "numeric"

ii) Result in data frame form

checkdataframe<-mydata[,4,drop=FALSE]
class(checkdataframe)

## [1] "data.frame"
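
drop=FALSE matters most when the selected columns are held in a variable, because the result type would otherwise depend on how many columns happen to be selected. A minimal sketch, where `demo` and the helper `pick()` are hypothetical names introduced for illustration:

```r
# Hypothetical data frame
demo <- data.frame(
  Name   = c("Ali", "Bella"),
  Age    = c(23, 26),
  Salary = c(2500, 3000)
)

# Hypothetical helper: always returns a data frame
pick <- function(df, cols) df[, cols, drop = FALSE]

one <- pick(demo, "Salary")            # one column, still a data frame
two <- pick(demo, c("Age", "Salary"))  # two columns, a data frame

class(one)
class(two)
class(demo[, "Salary"])  # without drop = FALSE this is a plain vector
```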

Also, extra care is needed when using [] and [[]] to access columns, because the two operators return different types of result. For instance,

i) [] returns the result as a data frame

mydata["Age"]

##   Age
## A  23
## B  26
## C  22
## D  25
## E  28

class(mydata["Age"])

## [1] "data.frame"

ii) [[]] returns the result as a vector

mydata[["Age"]]

## [1] 23 26 22 25 28

class(mydata[["Age"]])

## [1] "numeric"

Step 5: Add, modify and remove data

There will be cases where we need to add new data to our existing data, modify values that were entered incorrectly, or remove some data. This can be done using the methods below.

Before that, let’s have a glance once again at our data set.

mydata

##    Name Age Gender Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   2800
## D  Deva  25   Male   4000
## E Elena  28 Female   3700

1) To modify data

For instance, suppose Chong’s salary was entered incorrectly and should be 3300.

mydata[3,"Salary"]<-3300
mydata

##    Name Age Gender Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   3300
## D  Deva  25   Male   4000
## E Elena  28 Female   3700
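
The example above relies on knowing that Chong is in row 3. As a sketch of an alternative (using a hypothetical data frame `demo`), the row can be located by a condition on Name instead, so the code still works if the row order changes:

```r
# Hypothetical data frame
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong"),
  Salary = c(2500, 3000, 2800)
)

# Update the salary wherever Name equals "Chong"
demo$Salary[demo$Name == "Chong"] <- 3300
demo
```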

2) To add new data

A. rbind(): to add a new observation (row)

AddFariz<-rbind(mydata,list("Fariz",29,"Male",3700))
AddFariz

##    Name Age Gender Salary
## A   Ali  23   Male   2500
## B Bella  26 Female   3000
## C Chong  22   Male   3300
## D  Deva  25   Male   4000
## E Elena  28 Female   3700
## 6 Fariz  29   Male   3700

B. cbind(): to add a new variable (column)

AddRace<-cbind(mydata,Race=c("Malay","Chinese","Chinese","Indian","Malay"))
AddRace

##    Name Age Gender Salary    Race
## A   Ali  23   Male   2500   Malay
## B Bella  26 Female   3000 Chinese
## C Chong  22   Male   3300 Chinese
## D  Deva  25   Male   4000  Indian
## E Elena  28 Female   3700   Malay

C. $: to add a new variable (column) directly to the existing data frame

mydata$Race<-c("Malay","Chinese","Chinese","Indian","Malay")
mydata

##    Name Age Gender Salary    Race
## A   Ali  23   Male   2500   Malay
## B Bella  26 Female   3000 Chinese
## C Chong  22   Male   3300 Chinese
## D  Deva  25   Male   4000  Indian
## E Elena  28 Female   3700   Malay
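
When adding a row, passing a list relies on the values being in the same order as the columns. A hedged alternative, sketched with hypothetical objects `demo` and `new_row`, is to build the new row as a one-row data frame with named columns, since rbind() on data frames matches columns by name:

```r
# Hypothetical data frame
demo <- data.frame(
  Name   = c("Ali", "Bella"),
  Age    = c(23, 26),
  Gender = c("Male", "Female"),
  Salary = c(2500, 3000)
)

# Columns deliberately listed in a different order
new_row <- data.frame(Salary = 3700, Name = "Fariz", Gender = "Male", Age = 29)

combined <- rbind(demo, new_row)
combined
```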

3) To remove data

We can remove rows or columns. For instance, suppose we would like to remove the data of Deva, who is in row 4.

A. By using index style

mydata1<-mydata[-4,]
mydata1

##    Name Age Gender Salary    Race
## A   Ali  23   Male   2500   Malay
## B Bella  26 Female   3000 Chinese
## C Chong  22   Male   3300 Chinese
## E Elena  28 Female   3700   Malay

Next, suppose we would also like to remove the ‘Age’ column from the result above.

B. By using NULL in $

mydata1$Age<-NULL
mydata1

##    Name Gender Salary    Race
## A   Ali   Male   2500   Malay
## B Bella Female   3000 Chinese
## C Chong   Male   3300 Chinese
## E Elena Female   3700   Malay
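
Both removals can also be written by name rather than by position, which avoids having to know that Deva is in row 4. A minimal sketch on a hypothetical data frame `demo`; unlike assigning NULL, these expressions leave the original object unchanged and return a new one:

```r
# Hypothetical data frame
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva"),
  Age    = c(23, 26, 22, 25),
  Salary = c(2500, 3000, 3300, 4000)
)

# Keep every row whose Name is not "Deva"
no_deva <- demo[demo$Name != "Deva", ]

# Keep every column except "Age"
no_age <- demo[, setdiff(names(demo), "Age"), drop = FALSE]

no_deva
no_age
```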

Step 6: To check structure and summary of data

Lastly, we may wish to check the structure and summary of our data before proceeding to analysis. This can be done with:

1) str()

str(mydata)

## 'data.frame':    5 obs. of  5 variables:
##  $ Name  : chr  "Ali" "Bella" "Chong" "Deva" ...
##  $ Age   : num  23 26 22 25 28
##  $ Gender: chr  "Male" "Female" "Male" "Male" ...
##  $ Salary: num  2500 3000 3300 4000 3700
##  $ Race  : chr  "Malay" "Chinese" "Chinese" "Indian" ...

2) summary()

summary(mydata)

##      Name                Age          Gender              Salary    
##  Length:5           Min.   :22.0   Length:5           Min.   :2500  
##  Class :character   1st Qu.:23.0   Class :character   1st Qu.:3000  
##  Mode  :character   Median :25.0   Mode  :character   Median :3300  
##                     Mean   :24.8                      Mean   :3300  
##                     3rd Qu.:26.0                      3rd Qu.:3700  
##                     Max.   :28.0                      Max.   :4000  
##      Race          
##  Length:5          
##  Class :character  
##  Mode  :character  
##                    
##                    
##
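
summary() also works on a single column, and individual statistics can be pulled out of its result by name. A small sketch, using a hypothetical data frame `demo` that repeats the ages from this tutorial:

```r
# Hypothetical data frame holding the same ages as the tutorial data
demo <- data.frame(
  Name = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Age  = c(23, 26, 22, 25, 28)
)

age_summary <- summary(demo$Age)  # six-number summary of one column
age_summary

age_mean   <- age_summary[["Mean"]]    # extract one statistic by name
age_median <- age_summary[["Median"]]
```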

Step 7: Some additional fun features!

We can do much more with data frames. Let’s explore some further features below:

1) Retrieving the data of only 2 respondents, Bella and Deva.

mydata[c(2,4),]

##    Name Age Gender Salary    Race
## B Bella  26 Female   3000 Chinese
## D  Deva  25   Male   4000  Indian

2) Using logical flag vectors to subset our data

Firstly, let’s check which individuals in our data are male.

mydata$Gender=="Male"

## [1]  TRUE FALSE  TRUE  TRUE FALSE

As we can see, there are 3 males, at positions 1, 3 and 4. Next, we subset the data to keep only the male individuals.

mydata[mydata$Gender=="Male",]

##    Name Age Gender Salary    Race
## A   Ali  23   Male   2500   Malay
## C Chong  22   Male   3300 Chinese
## D  Deva  25   Male   4000  Indian

Furthermore, if we wish to drop the Age column (column 2) from the subset of male individuals, then:

mydata[mydata$Gender=="Male",-2]

##    Name Gender Salary    Race
## A   Ali   Male   2500   Malay
## C Chong   Male   3300 Chinese
## D  Deva   Male   4000  Indian
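
Instead of counting TRUE values by eye, the logical vector can be summarised directly. A short sketch with a hypothetical data frame `demo` that repeats the Gender values above:

```r
# Hypothetical data frame with the same Gender values as the tutorial data
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Gender = c("Male", "Female", "Male", "Male", "Female")
)

is_male <- demo$Gender == "Male"

n_male    <- sum(is_male)    # TRUE counts as 1, so this is the number of males
male_rows <- which(is_male)  # positions of the TRUE values

n_male
male_rows
table(demo$Gender)           # counts for every category at once
```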

3) To add value to a variable

For instance, suppose everyone’s salary is increased by 500. Adding 500 to the Salary column gives the new values. Note that this expression only displays the result; mydata itself is unchanged unless the result is assigned back.

mydata["Salary"]+500

##   Salary
## A   3000
## B   3500
## C   3800
## D   4500
## E   4200
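
To make the difference between displaying and saving concrete, here is a minimal sketch on a hypothetical data frame `demo`. The first expression computes a new data frame without touching `demo`; the assignment afterwards is what actually stores the increase:

```r
# Hypothetical data frame
demo <- data.frame(Name = c("Ali", "Bella"), Salary = c(2500, 3000))

raised_view <- demo["Salary"] + 500  # a new one-column data frame
before <- demo$Salary                # demo still holds the old values

demo$Salary <- demo$Salary + 500     # assign back to keep the change
after <- demo$Salary

before
after
```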

4) dim() function

dim() gives us the number of observations and variables in our data set, in that order.

dim(mydata)

## [1] 5 5

5) Adding new variable by using data on existing data

For instance, the salary in our existing data is in MYR. We would like to add a new variable that gives the salary in USD, using an exchange rate of 0.24 USD per MYR.

mydata$SalaryUSD<-mydata$Salary*0.24
mydata

##    Name Age Gender Salary    Race SalaryUSD
## A   Ali  23   Male   2500   Malay       600
## B Bella  26 Female   3000 Chinese       720
## C Chong  22   Male   3300 Chinese       792
## D  Deva  25   Male   4000  Indian       960
## E Elena  28 Female   3700   Malay       888
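
The same derived column can be sketched with transform(), which lets us refer to existing columns by name without repeating the data frame. The objects `demo` and `myr_to_usd` are hypothetical, and the rate is simply the illustrative 0.24 used above rather than a current exchange rate:

```r
# Hypothetical data frame
demo <- data.frame(Name = c("Ali", "Bella"), Salary = c(2500, 3000))

# Illustrative rate taken from the example above (an assumption, not live data)
myr_to_usd <- 0.24

# transform() returns a copy of demo with the new column added
demo <- transform(demo, SalaryUSD = Salary * myr_to_usd)
demo
```

Keeping the rate in a named variable makes it easier to update in one place.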

6) No record

In the cases shown above, the output always contained data. However, a query returns no rows when no record satisfies the stated condition. For instance, let’s find out whether anyone has a salary above 5000.

mydata[mydata$Salary>5000,]

## [1] Name      Age       Gender    Salary    Race      SalaryUSD
## <0 rows> (or 0-length row.names)
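
In a script, an empty result can be detected with nrow(). The sketch below also covers a case that does not occur in our data and is assumed purely for illustration: a missing salary (NA). A logical condition on NA yields NA, which produces a row of NAs when used directly as an index; wrapping the condition in which() avoids that.

```r
# Hypothetical data frame with one missing salary (an assumption for illustration)
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong"),
  Salary = c(2500, NA, 2800)
)

high      <- demo[demo$Salary > 5000, ]         # the NA condition adds an all-NA row
high_safe <- demo[which(demo$Salary > 5000), ]  # which() keeps only TRUE positions

nrow(high)
nrow(high_safe)

if (nrow(high_safe) == 0) message("No matching records")
```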

7) Having more than 1 condition

Suppose we would like to extract individuals who are male OR have a salary above 3500. This returns every row that satisfies at least one of the criteria.

mydata4<-mydata[mydata$Gender=="Male"|mydata$Salary>3500,]
mydata4

##    Name Age Gender Salary    Race SalaryUSD
## A   Ali  23   Male   2500   Malay       600
## C Chong  22   Male   3300 Chinese       792
## D  Deva  25   Male   4000  Indian       960
## E Elena  28 Female   3700   Malay       888

Suppose we would like to extract individuals who are male AND have a salary above 3500. This returns only the rows that satisfy BOTH criteria.

mydata5<-mydata[mydata$Gender=="Male"& mydata$Salary>3500,]
mydata5

##   Name Age Gender Salary   Race SalaryUSD
## D Deva  25   Male   4000 Indian       960
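
The same two filters can be sketched with subset(), which lets the conditions refer to columns by name without the data frame prefix. The data frame `demo` is hypothetical and repeats the Gender and Salary values used above:

```r
# Hypothetical data frame with the tutorial's Gender and Salary values
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Salary = c(2500, 3000, 3300, 4000, 3700)
)

either <- subset(demo, Gender == "Male" | Salary > 3500)  # OR: at least one holds
both   <- subset(demo, Gender == "Male" & Salary > 3500)  # AND: both must hold

either$Name
both$Name
```

subset() is convenient for interactive work; the bracket form used above is the more explicit choice inside functions.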

8) typeof() and class()

These 2 functions describe an object in different ways. class() tells us that our data is a data frame, whereas typeof() returns list, because a data frame is stored internally as a list of columns. class() can also be used to check the data type of a single variable.

A. typeof()

typeof(mydata)

## [1] "list"

B. class()

class(mydata)

## [1] "data.frame"
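
To check every variable at once rather than one at a time, both functions can be applied to each column with sapply(). A minimal sketch on a hypothetical data frame `demo`:

```r
# Hypothetical data frame
demo <- data.frame(Name = c("Ali", "Bella"), Age = c(23, 26))

col_types   <- sapply(demo, typeof)  # internal storage type of each column
col_classes <- sapply(demo, class)   # class of each column

col_types
col_classes

is.list(demo)                 # consistent with typeof(demo) being "list"
inherits(demo, "data.frame")  # consistent with class(demo)
```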

9) is.data.frame()

To confirm whether the data is a data frame, we can also check it with the function is.data.frame().

is.data.frame(mydata)

## [1] TRUE

10) head() & tail()

To retrieve only the first few or last few rows of our data, we can use head() and tail() respectively. For instance, let’s retrieve the first three rows and the last three rows using these 2 functions (as shown below).

A. head()

head(mydata,3)

##    Name Age Gender Salary    Race SalaryUSD
## A   Ali  23   Male   2500   Malay       600
## B Bella  26 Female   3000 Chinese       720
## C Chong  22   Male   3300 Chinese       792

B. tail()

tail(mydata,3)

##    Name Age Gender Salary    Race SalaryUSD
## C Chong  22   Male   3300 Chinese       792
## D  Deva  25   Male   4000  Indian       960
## E Elena  28 Female   3700   Malay       888
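
Both functions also accept a negative count, which means "everything except". A short sketch on a hypothetical five-row data frame `demo`:

```r
# Hypothetical five-row data frame
demo <- data.frame(
  Name = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Age  = c(23, 26, 22, 25, 28)
)

all_but_last2  <- head(demo, -2)  # drops the last 2 rows
all_but_first2 <- tail(demo, -2)  # drops the first 2 rows

all_but_last2
all_but_first2
```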

11) View()

View() is usually run at the start of a project to inspect the data set. It opens the data in a spreadsheet-style viewer, much like a table in Excel. Our data set is small, so we did not need it earlier, but it is very useful for large data sets. In RStudio, the viewer also lets us sort columns (smallest to largest, A to Z, and so on), search the data quickly, and filter to the rows we want. Note that the base R function is spelled with a capital V.

View(mydata)
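
Sorting in the viewer only changes what is displayed. As a sketch of a reproducible alternative (on a hypothetical data frame `demo` that repeats the salaries used above), order() can sort the rows in code:

```r
# Hypothetical data frame with the tutorial's salary values
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Salary = c(2500, 3000, 3300, 4000, 3700)
)

by_salary      <- demo[order(demo$Salary), ]                     # smallest to largest
by_salary_desc <- demo[order(demo$Salary, decreasing = TRUE), ]  # largest to smallest

by_salary
by_salary_desc
```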

Conclusion

All in all, it was a fun experience to explore data frames. They give us many ways to obtain the results we want. Happy learning!

What is Data Frame?

Lim Kevin, Matric Number 17140821

11/8/2020

Thank you!