Hello, everyone! We are going to demonstrate some useful features of data frames.
Hello, my name is Kevin. Nice to meet you!
A data frame is a two-dimensional data structure in R. Its components (columns) must all have the same length.
We created the sample data ourselves. It consists of 5 individuals (also called observations or records) and 4 demographic variables. In other words, the output will have 5 rows and 4 columns. Let’s begin the adventure!
library(kableExtra)
Step 1: Creating our data frame
There are a few ways to create a data frame. We will demonstrate 2 of them:
By calling the function data.frame() directly
By creating vectors first, then combining them into a data frame
mydata<-data.frame(Name=c("Ali","Bella","Chong","Deva","Elena"),Age=c(23,26,22,25,28),Sex=c("Male","Female","Male","Male","Female"),Salary=c(2500,3000,2800,4000,3700))
mydata
## Name Age Sex Salary
## 1 Ali 23 Male 2500
## 2 Bella 26 Female 3000
## 3 Chong 22 Male 2800
## 4 Deva 25 Male 4000
## 5 Elena 28 Female 3700
Name=c("Ali","Bella","Chong","Deva","Elena")
Age=c(23,26,22,25,28)
Sex=c("Male","Female","Male","Male","Female")
Salary=c(2500,3000,2800,4000,3700)
mydata=data.frame(Name,Age,Sex,Salary)
mydata
## Name Age Sex Salary
## 1 Ali 23 Male 2500
## 2 Bella 26 Female 3000
## 3 Chong 22 Male 2800
## 4 Deva 25 Male 4000
## 5 Elena 28 Female 3700
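As mentioned at the start, the columns of a data frame must have the same length. A minimal sketch of what happens otherwise, assuming a throwaway example (the column names `x` and `y` and the variable `result` are made up for illustration and are not part of our data):

```r
# data.frame() cannot combine a column of 3 values with a column of 2 values,
# so it raises an error. tryCatch() turns that error into a short message.
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: columns have different lengths"
)
result
```

Catching the error with tryCatch() is only a convenience so that the example runs to the end; in normal use R would simply stop and report the mismatch.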
We are going to use the function kable() to tidy up the layout of the table and make it easier to read.
knitr::kable(head(mydata[, 1:4]), format = "simple", align = "lccr")
| Name | Age | Sex | Salary |
|---|---|---|---|
| Ali | 23 | Male | 2500 |
| Bella | 26 | Female | 3000 |
| Chong | 22 | Male | 2800 |
| Deva | 25 | Male | 4000 |
| Elena | 28 | Female | 3700 |
Step 2: Let’s explore some functions
We are now going to explore 5 functions:
1. names()
2. row.names()
3. nrow()
4. ncol()
5. length()
names() returns the variable (column) names.
names(mydata)
## [1] "Name" "Age" "Sex" "Salary"
row.names() is similar to names(), but it returns the names of the rows (observations). By default these are numbers, here 1 to 5. We will change them in the next section.
row.names(mydata)
## [1] "1" "2" "3" "4" "5"
nrow() returns the number of rows (observations) in our data, which is 5.
nrow(mydata)
## [1] 5
ncol() returns the number of columns (variables) in our data, which is 4.
ncol(mydata)
## [1] 4
Like ncol(), length() returns the number of variables in our data, because a data frame is stored as a list of columns.
length(mydata)
## [1] 4
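To see why length() and ncol() agree, here is a small sketch. It assumes a hypothetical 3-row, 2-column data frame called `demo`, which is separate from mydata.

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella", "Chong"), Age = c(23, 26, 22))

# A data frame is stored as a list with one element per column,
# so length() counts columns, not rows.
is.list(demo)
length(demo) == ncol(demo)
nrow(demo)
```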
Step 3: How to change row and column names?
Sometimes we want to change row or column names, for example to correct them. We first change the row names, then show two methods for renaming a column:
row.names(mydata)<-c("A","B","C","D","E")
mydata
## Name Age Sex Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 2800
## D Deva 25 Male 4000
## E Elena 28 Female 3700
i) Method 1
names(mydata)[names(mydata) == 'Sex'] <- 'Gender'
mydata
## Name Age Gender Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 2800
## D Deva 25 Male 4000
## E Elena 28 Female 3700
ii) Method 2
names(mydata)<-c("Name","Age", "Gender","Salary")
mydata
## Name Age Gender Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 2800
## D Deva 25 Male 4000
## E Elena 28 Female 3700
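A column can also be renamed by its position. The sketch below assumes a small hypothetical data frame `demo`; colnames() is a base R function that behaves like names() for data frames.

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella"), Sex = c("Male", "Female"))

# Rename the second column by position
colnames(demo)[2] <- "Gender"
names(demo)
```

Renaming by position is short, but it silently renames the wrong column if the column order changes, so Method 1 above is safer when the order is not guaranteed.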
Step 4: How to access certain components?
Sometimes we only want to access a certain component of our data frame. There are a few easy ways to do this. For instance, suppose we would like to know the ages of the individuals.
Access in List style
mydata['Age']
## Age
## A 23
## B 26
## C 22
## D 25
## E 28
A. First way
mydata$Age
## [1] 23 26 22 25 28
B. Second way
mydata[["Age"]]
## [1] 23 26 22 25 28
C. Third way
mydata[[2]]
## [1] 23 26 22 25 28
Going further, the second and third ways above also let us access a single element of a column. For instance, suppose we want the age 25, which is in the 4th position.
From Second way
mydata[["Age"]][4]
## [1] 25
From Third way
mydata[[2]][4]
## [1] 25
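If we do not know the position in advance, we can look the element up by another column instead. A minimal sketch, assuming a hypothetical data frame `demo` that repeats the names and ages from our data (the variable `deva_age` is also made up for illustration):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(
  Name = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Age  = c(23, 26, 22, 25, 28)
)

# Pick the Age value from the row where Name equals "Deva"
deva_age <- demo$Age[demo$Name == "Deva"]
deva_age
```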
Access in Matrix style
mydata
## Name Age Gender Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 2800
## D Deva 25 Male 4000
## E Elena 28 Female 3700
Referring to the data above, matrix-style indexing lets us access specific parts of it, as the scenarios below show.
i) Scenario 1: To access row 2 of our data.
mydata[2,]
## Name Age Gender Salary
## B Bella 26 Female 3000
ii) Scenario 2: To access the column ‘Salary’ of our data, which is the fourth column.
mydata[,4]
## [1] 2500 3000 2800 4000 3700
iii) Scenario 3: To access ‘Salary’ of Bella.
mydata[2,4]
## [1] 3000
Note that the output above is a vector. To get the output as a data frame instead, only one extra step is needed: add drop=FALSE.
Using scenario 2 above:
i) Result in vector form
checkvector<-mydata[,4]
class(checkvector)
## [1] "numeric"
ii) Result in data frame form
checkdataframe<-mydata[,4,drop=FALSE]
class(checkdataframe)
## [1] "data.frame"
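Matrix-style access also works with row and column names instead of positions. A sketch under the assumption of a hypothetical data frame `demo` with row names "A" to "C" (the variable `bella_salary` is illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong"),
  Salary = c(2500, 3000, 2800),
  row.names = c("A", "B", "C")
)

# Select the row and the column by name instead of by position
bella_salary <- demo["B", "Salary"]
bella_salary
```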
Also, extra care is needed when using [] and [[]], because they give different results. For instance,
i) [] returns a data frame
mydata["Age"]
## Age
## A 23
## B 26
## C 22
## D 25
## E 28
class(mydata["Age"])
## [1] "data.frame"
ii) [[]] returns a vector
mydata[["Age"]]
## [1] 23 26 22 25 28
class(mydata[["Age"]])
## [1] "numeric"
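One practical consequence: [] can select several columns at once, while [[]] extracts exactly one. A small sketch, assuming a hypothetical data frame `demo` (the name `two_cols` is illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella"), Age = c(23, 26), Salary = c(2500, 3000))

# [] accepts several column names and still returns a data frame
two_cols <- demo[c("Name", "Salary")]
class(two_cols)
names(two_cols)
```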
Step 5: Add, modify and remove data
There are cases where we need to add new data to our existing data, modify values that were entered incorrectly, or remove some data. The methods below show how.
Before that, let’s take another look at our data set.
mydata
## Name Age Gender Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 2800
## D Deva 25 Male 4000
## E Elena 28 Female 3700
1) To modify data
For instance, Chong’s salary was entered incorrectly; it should be 3300.
mydata[3,"Salary"]<-3300
mydata
## Name Age Gender Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 3300
## D Deva 25 Male 4000
## E Elena 28 Female 3700
2) To add new data
AddFariz<-rbind(mydata,list("Fariz",29,"Male",3700))
AddFariz
## Name Age Gender Salary
## A Ali 23 Male 2500
## B Bella 26 Female 3000
## C Chong 22 Male 3300
## D Deva 25 Male 4000
## E Elena 28 Female 3700
## 6 Fariz 29 Male 3700
AddRace<-cbind(mydata,Race=c("Malay","Chinese","Chinese","Indian","Malay"))
AddRace
## Name Age Gender Salary Race
## A Ali 23 Male 2500 Malay
## B Bella 26 Female 3000 Chinese
## C Chong 22 Male 3300 Chinese
## D Deva 25 Male 4000 Indian
## E Elena 28 Female 3700 Malay
mydata$Race<-c("Malay","Chinese","Chinese","Indian","Malay")
mydata
## Name Age Gender Salary Race
## A Ali 23 Male 2500 Malay
## B Bella 26 Female 3000 Chinese
## C Chong 22 Male 3300 Chinese
## D Deva 25 Male 4000 Indian
## E Elena 28 Female 3700 Malay
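When adding a row, the new record can also be written as a one-row data frame with the same column names, which makes it clear which value belongs to which variable. A sketch, assuming hypothetical objects `demo`, `new_row` and `demo2`:

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella"), Age = c(23, 26))

# The new row is itself a one-row data frame with matching column names
new_row <- data.frame(Name = "Fariz", Age = 29)
demo2 <- rbind(demo, new_row)

nrow(demo2)
class(demo2$Age)  # Age is still numeric after the row is added
```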
3) To remove data
We can remove a row or a column. For instance, to remove Deva’s record (row 4):
mydata1<-mydata[-4,]
mydata1
## Name Age Gender Salary Race
## A Ali 23 Male 2500 Malay
## B Bella 26 Female 3000 Chinese
## C Chong 22 Male 3300 Chinese
## E Elena 28 Female 3700 Malay
Next, let’s say we would like to remove ‘Age’ from the data in the example above.
mydata1$Age<-NULL
mydata1
## Name Gender Salary Race
## A Ali Male 2500 Malay
## B Bella Female 3000 Chinese
## C Chong Male 3300 Chinese
## E Elena Female 3700 Malay
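Rows and columns can also be removed by value or by name, which avoids counting positions. A minimal sketch, assuming a hypothetical data frame `demo` (the names `no_deva` and `no_age` are illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva"),
  Age    = c(23, 26, 22, 25),
  Salary = c(2500, 3000, 3300, 4000)
)

# Drop Deva's row by value rather than by row number
no_deva <- demo[demo$Name != "Deva", ]

# Drop the Age column by name; drop = FALSE keeps the result a data frame
no_age <- no_deva[, setdiff(names(no_deva), "Age"), drop = FALSE]
no_age
```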
Step 6: To check structure and summary of data
Lastly, we may wish to check the structure and summary of our data before proceeding to analysis. This can be done with:
1) str()
str(mydata)
## 'data.frame': 5 obs. of 5 variables:
## $ Name : chr "Ali" "Bella" "Chong" "Deva" ...
## $ Age : num 23 26 22 25 28
## $ Gender: chr "Male" "Female" "Male" "Male" ...
## $ Salary: num 2500 3000 3300 4000 3700
## $ Race : chr "Malay" "Chinese" "Chinese" "Indian" ...
2) summary()
summary(mydata)
## Name Age Gender Salary
## Length:5 Min. :22.0 Length:5 Min. :2500
## Class :character 1st Qu.:23.0 Class :character 1st Qu.:3000
## Mode :character Median :25.0 Mode :character Median :3300
## Mean :24.8 Mean :3300
## 3rd Qu.:26.0 3rd Qu.:3700
## Max. :28.0 Max. :4000
## Race
## Length:5
## Class :character
## Mode :character
##
##
##
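summary() can also be applied to a single column when only one variable is of interest. A sketch, assuming a hypothetical data frame `demo` holding the corrected salaries from above (the name `salary_summary` is illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Salary = c(2500, 3000, 3300, 4000, 3700)
)

# Summary statistics for one numeric column
salary_summary <- summary(demo$Salary)
salary_summary

# Individual statistics can be picked out by name
salary_summary[["Median"]]
```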
Step 7: Some additional fun features!
We can do much more with a data frame. Let’s explore some examples below:
1) To retrieve the data of only 2 respondents, Bella and Deva:
mydata[c(2,4),]
## Name Age Gender Salary Race
## B Bella 26 Female 3000 Chinese
## D Deva 25 Male 4000 Indian
2) Using logical flag vectors to subset our data
Firstly, let’s check which individuals in our data are male.
mydata$Gender=="Male"
## [1] TRUE FALSE TRUE TRUE FALSE
As we can see, there are 3 males, at positions 1, 3 and 4. Next, we subset only the male records.
mydata[mydata$Gender=="Male",]
## Name Age Gender Salary Race
## A Ali 23 Male 2500 Malay
## C Chong 22 Male 3300 Chinese
## D Deva 25 Male 4000 Indian
Thus, we can subset the data as above. Furthermore, if we wish to remove the age column from the subset of male individuals:
mydata[mydata$Gender=="Male",-2]
## Name Gender Salary Race
## A Ali Male 2500 Malay
## C Chong Male 3300 Chinese
## D Deva Male 4000 Indian
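The same result can be obtained with the base R function subset(), which some readers may find easier to read. A sketch, assuming a hypothetical data frame `demo` (the name `males` is illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong"),
  Age    = c(23, 26, 22),
  Gender = c("Male", "Female", "Male")
)

# Keep the male rows and drop the Age column in one call
males <- subset(demo, Gender == "Male", select = -Age)
males
```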
3) To add a value to a variable
For instance, suppose everyone’s salary increases by 500. Adding 500 to the column gives the new salaries. Note that this only displays the result; mydata itself is not changed.
mydata["Salary"]+500
## Salary
## A 3000
## B 3500
## C 3800
## D 4500
## E 4200
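To keep the increased salaries, the result has to be assigned back to the column. A minimal sketch on a hypothetical data frame `demo`, so that mydata stays untouched:

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella"), Salary = c(2500, 3000))

# Assigning the result back stores the new salaries in the data frame
demo$Salary <- demo$Salary + 500
demo$Salary
```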
4) dim() function
The dim() function gives us the number of observations and variables in our data set.
dim(mydata)
## [1] 5 5
5) Adding a new variable computed from existing data
For instance, the salary in our existing data is in MYR. We would like a new variable that gives the salary in USD, using a conversion rate of 0.24.
mydata$SalaryUSD<-mydata$Salary*0.24
mydata
## Name Age Gender Salary Race SalaryUSD
## A Ali 23 Male 2500 Malay 600
## B Bella 26 Female 3000 Chinese 720
## C Chong 22 Male 3300 Chinese 792
## D Deva 25 Male 4000 Indian 960
## E Elena 28 Female 3700 Malay 888
6) No record
In the examples above, the output always contained data. However, there are cases where no rows are returned because none satisfies the condition. For instance, let’s find out whether anyone has a salary above 5000.
mydata[mydata$Salary>5000,]
## [1] Name Age Gender Salary Race SalaryUSD
## <0 rows> (or 0-length row.names)
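In a script, an empty result like this can be detected with nrow(). A sketch, assuming a hypothetical data frame `demo` (the name `high_paid` and the message text are illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella"), Salary = c(2500, 3000))

high_paid <- demo[demo$Salary > 5000, ]

# An empty result still has all the columns, but zero rows
if (nrow(high_paid) == 0) {
  message("No record matches the condition")
}
```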
7) Having more than 1 condition
Suppose we would like to extract individuals who are male OR have a salary above 3500. This returns every row that satisfies at least one of the criteria.
mydata4<-mydata[mydata$Gender=="Male"|mydata$Salary>3500,]
mydata4
## Name Age Gender Salary Race SalaryUSD
## A Ali 23 Male 2500 Malay 600
## C Chong 22 Male 3300 Chinese 792
## D Deva 25 Male 4000 Indian 960
## E Elena 28 Female 3700 Malay 888
Suppose we would like to extract individuals who are male AND have a salary above 3500. This returns only the rows that satisfy BOTH criteria.
mydata5<-mydata[mydata$Gender=="Male"& mydata$Salary>3500,]
mydata5
## Name Age Gender Salary Race SalaryUSD
## D Deva 25 Male 4000 Indian 960
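Storing each condition in its own logical vector can make combined conditions easier to read and to count. A sketch, assuming a hypothetical data frame `demo` with the same genders and salaries as our data (the names `is_male` and `high_paid` are illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(
  Name   = c("Ali", "Bella", "Chong", "Deva", "Elena"),
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Salary = c(2500, 3000, 3300, 4000, 3700)
)

is_male   <- demo$Gender == "Male"
high_paid <- demo$Salary > 3500

n_either   <- sum(is_male | high_paid)    # number of rows meeting at least one condition
n_both     <- sum(is_male & high_paid)    # number of rows meeting both conditions
pos_both   <- which(is_male & high_paid)  # positions of the rows meeting both
```

sum() works here because TRUE counts as 1 and FALSE as 0.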
8) typeof() and class()
These 2 functions give different information about the same object. class() tells us that the data is a data frame, whereas typeof() returns "list", the underlying storage type. class() can also be used to check the data type of a variable.
typeof(mydata)
## [1] "list"
class(mydata)
## [1] "data.frame"
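The same two functions can be applied to a single variable. A minimal sketch on a hypothetical data frame `demo`:

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella"), Age = c(23, 26))

# For a numeric column, class() reports "numeric"
# while typeof() reports the storage type "double"
age_class <- class(demo$Age)
age_type  <- typeof(demo$Age)
age_class
age_type
```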
9) is.data.frame()
To confirm whether the data is a data frame, we can also check it with the function is.data.frame().
is.data.frame(mydata)
## [1] TRUE
10) head() & tail()
To retrieve only the first few or the last few rows of our data, we can use the functions head() and tail() respectively. For instance, let’s retrieve the first three rows and the last three rows using these 2 functions (as shown below).
head(mydata,3)
## Name Age Gender Salary Race SalaryUSD
## A Ali 23 Male 2500 Malay 600
## B Bella 26 Female 3000 Chinese 720
## C Chong 22 Male 3300 Chinese 792
tail(mydata,3)
## Name Age Gender Salary Race SalaryUSD
## C Chong 22 Male 3300 Chinese 792
## D Deva 25 Male 4000 Indian 960
## E Elena 28 Female 3700 Malay 888
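Both functions also accept a negative number, which means "everything except that many rows". A sketch, assuming a hypothetical one-column data frame `demo` (the names `all_but_last` and `all_but_first` are illustrative):

```r
# Hypothetical data frame used only for this illustration
demo <- data.frame(Name = c("Ali", "Bella", "Chong", "Deva", "Elena"))

# head() with a negative n drops rows from the end
all_but_last <- head(demo, -1)

# tail() with a negative n drops rows from the start
all_but_first <- tail(demo, -1)

all_but_last
all_but_first
```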
11) View()
Usually, View() is run at the beginning of a project to study the data set. In RStudio it shows our data in a table, much like a spreadsheet in Excel. Since our data is small, we did not use it at the beginning. With a large amount of data, however, this function is very useful: it also lets us sort the data (smallest to largest, A to Z, etc.), search it quickly, and filter it down to the rows we want.
View(mydata)
All in all, it was a fun experience to explore data frames. They give us many ways to obtain the results we want. Happy learning!