What is a Data Frame?

How to Create a Data Frame?

# Option 1: read an existing CSV file into a data frame (adjust the path for your machine)
df1 <- read.csv('C:/Users/TM37075/Documents/UM Master of Data Science/WQD7004 Programming in Data Science/employee.csv')
df1
##   emp_id  emp_name emp_hometown
## 1  37075    Xin Li Pulau Pinang
## 2  13000   Fatimah     Selangor
## 3  12483 Zul Haziq        Sabah
## 4  79630    Danial        Perak
## 5  15125      Yoon        Johor
## 6  13398     Kalai       Pahang
## 7  18854  Sharveen     Kelantan
## 8  11044      Ryan Kuala Lumpur
# Option 2: build the data frame from vectors of equal length
emp_id <- c(37075, 13000, 12483, 79630, 15125, 13398, 18854, 11044)

emp_name <- c("Xin Li", "Fatimah", "Zul Haziq", "Danial", "Yoon", "Kalai", "Sharveen", "Ryan")

emp_hometown <- c("Pulau Pinang", "Selangor", "Sabah", "Perak", "Johor", "Pahang", "Kelantan", "Kuala Lumpur")

df2 <- data.frame(emp_id, emp_name, emp_hometown)

df2
##   emp_id  emp_name emp_hometown
## 1  37075    Xin Li Pulau Pinang
## 2  13000   Fatimah     Selangor
## 3  12483 Zul Haziq        Sabah
## 4  79630    Danial        Perak
## 5  15125      Yoon        Johor
## 6  13398     Kalai       Pahang
## 7  18854  Sharveen     Kelantan
## 8  11044      Ryan Kuala Lumpur
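
The two approaches above can be connected: a data frame built in code can be written to a CSV file and read back. The sketch below is only an illustration. The names emp, csv_path and emp_from_csv are my own, and a temporary file is used so that no path on your machine is assumed.

```r
# Build a small data frame with the same columns as the employee example
emp <- data.frame(
  emp_id = c(37075L, 13000L, 12483L),
  emp_name = c("Xin Li", "Fatimah", "Zul Haziq"),
  emp_hometown = c("Pulau Pinang", "Selangor", "Sabah"),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV file; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(emp, csv_path, row.names = FALSE)

# Read the file back into a new data frame
emp_from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# The column names and values survive the round trip
names(emp_from_csv)
emp_from_csv$emp_id
```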

Basic Functions Used with Data Frames

  1. str(): shows the structure of a data frame
str(df1)
## 'data.frame':    8 obs. of  3 variables:
##  $ emp_id      : int  37075 13000 12483 79630 15125 13398 18854 11044
##  $ emp_name    : chr  "Xin Li" "Fatimah" "Zul Haziq" "Danial" ...
##  $ emp_hometown: chr  "Pulau Pinang" "Selangor" "Sabah" "Perak" ...

The structure of the employee data set shows the number of rows (observations) and columns (variables), along with each column's name and data type.
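
The same pieces of information that str() prints can also be retrieved one at a time, which is handy when you need them as values in later code. The sketch below uses a small stand-in data frame named emp (an illustrative name, not part of the original example).

```r
# Small stand-in for the employee data
emp <- data.frame(
  emp_id = c(37075L, 13000L, 12483L),
  emp_name = c("Xin Li", "Fatimah", "Zul Haziq"),
  emp_hometown = c("Pulau Pinang", "Selangor", "Sabah"),
  stringsAsFactors = FALSE
)

dim(emp)             # number of rows and columns together
nrow(emp)            # number of rows (observations)
ncol(emp)            # number of columns (variables)
names(emp)           # column names
sapply(emp, class)   # data type of each column
```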

  2. head() / tail(): show the first or last rows of the data frame (6 rows by default).
head(df1)
##   emp_id  emp_name emp_hometown
## 1  37075    Xin Li Pulau Pinang
## 2  13000   Fatimah     Selangor
## 3  12483 Zul Haziq        Sabah
## 4  79630    Danial        Perak
## 5  15125      Yoon        Johor
## 6  13398     Kalai       Pahang
tail(df1, n = 4)
##   emp_id emp_name emp_hometown
## 5  15125     Yoon        Johor
## 6  13398    Kalai       Pahang
## 7  18854 Sharveen     Kelantan
## 8  11044     Ryan Kuala Lumpur

By default, head() and tail() return the first and last 6 rows of the data frame, respectively. However, we can use the n argument to specify how many rows we want to view. In this case, I requested only the last 4 rows of the data frame.
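
As a small extension of the n argument, a negative value tells head() or tail() how many rows to leave out instead of how many to keep. The sketch below assumes a stand-in data frame named emp with the same eight employees.

```r
# Stand-in data frame with eight rows
emp <- data.frame(
  emp_id = c(37075, 13000, 12483, 79630, 15125, 13398, 18854, 11044),
  emp_name = c("Xin Li", "Fatimah", "Zul Haziq", "Danial",
               "Yoon", "Kalai", "Sharveen", "Ryan"),
  stringsAsFactors = FALSE
)

first_default <- head(emp)          # first 6 rows (the default)
first_two     <- head(emp, n = 2)   # first 2 rows
all_but_last2 <- head(emp, n = -2)  # everything except the last 2 rows
all_but_first6 <- tail(emp, n = -6) # everything except the first 6 rows

nrow(first_default)
nrow(all_but_last2)
all_but_first6$emp_name
```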

Accessing the components of a data frame

  1. Accessing like a list
df1['emp_id']
##   emp_id
## 1  37075
## 2  13000
## 3  12483
## 4  79630
## 5  15125
## 6  13398
## 7  18854
## 8  11044
df1[['emp_id']]
## [1] 37075 13000 12483 79630 15125 13398 18854 11044
df1$emp_id
## [1] 37075 13000 12483 79630 15125 13398 18854 11044
  2. Accessing like a matrix
df1[2,]
##   emp_id emp_name emp_hometown
## 2  13000  Fatimah     Selangor
df1[,3]
## [1] "Pulau Pinang" "Selangor"     "Sabah"        "Perak"        "Johor"       
## [6] "Pahang"       "Kelantan"     "Kuala Lumpur"
df1[,3, drop = FALSE]
##   emp_hometown
## 1 Pulau Pinang
## 2     Selangor
## 3        Sabah
## 4        Perak
## 5        Johor
## 6       Pahang
## 7     Kelantan
## 8 Kuala Lumpur

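As you may have observed, different ways of accessing a data frame return different kinds of results: df1['emp_id'], df1[2,] and df1[,3, drop = FALSE] return a data frame, whereas df1[['emp_id']], the $ operator and df1[,3] return a vector.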
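
One way to check what each access method returns is to call class() on the result. The sketch below uses a stand-in data frame named emp (an illustrative name) rather than df1.

```r
# Stand-in data frame
emp <- data.frame(
  emp_id = c(37075L, 13000L, 12483L),
  emp_name = c("Xin Li", "Fatimah", "Zul Haziq"),
  emp_hometown = c("Pulau Pinang", "Selangor", "Sabah"),
  stringsAsFactors = FALSE
)

class(emp["emp_id"])            # "data.frame": single brackets keep the data frame
class(emp[["emp_id"]])          # "integer": double brackets extract the column vector
class(emp$emp_id)               # "integer": $ behaves like double brackets
class(emp[2, ])                 # "data.frame": a single row is still a data frame
class(emp[, 3])                 # "character": a single column is simplified to a vector
class(emp[, 3, drop = FALSE])   # "data.frame": drop = FALSE prevents the simplification
```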

Add, Remove, and Modify Components of a Data Frame

  1. Adding New Rows or Columns
rbind(df1,list(13332,"Alif","Perlis"))
##   emp_id  emp_name emp_hometown
## 1  37075    Xin Li Pulau Pinang
## 2  13000   Fatimah     Selangor
## 3  12483 Zul Haziq        Sabah
## 4  79630    Danial        Perak
## 5  15125      Yoon        Johor
## 6  13398     Kalai       Pahang
## 7  18854  Sharveen     Kelantan
## 8  11044      Ryan Kuala Lumpur
## 9  13332      Alif       Perlis
cbind(df1,Salary=c(4120.80,3870,5010.50,2800,3100,3365,4520.30,6100))
##   emp_id  emp_name emp_hometown Salary
## 1  37075    Xin Li Pulau Pinang 4120.8
## 2  13000   Fatimah     Selangor 3870.0
## 3  12483 Zul Haziq        Sabah 5010.5
## 4  79630    Danial        Perak 2800.0
## 5  15125      Yoon        Johor 3100.0
## 6  13398     Kalai       Pahang 3365.0
## 7  18854  Sharveen     Kelantan 4520.3
## 8  11044      Ryan Kuala Lumpur 6100.0
df1
##   emp_id  emp_name emp_hometown
## 1  37075    Xin Li Pulau Pinang
## 2  13000   Fatimah     Selangor
## 3  12483 Zul Haziq        Sabah
## 4  79630    Danial        Perak
## 5  15125      Yoon        Johor
## 6  13398     Kalai       Pahang
## 7  18854  Sharveen     Kelantan
## 8  11044      Ryan Kuala Lumpur
# Assign the new column directly so that df1 itself is modified
df1$Salary <- c(4120.80, 3870, 5010.50, 2800, 3100, 3365, 4520.30, 6100)

df1
##   emp_id  emp_name emp_hometown Salary
## 1  37075    Xin Li Pulau Pinang 4120.8
## 2  13000   Fatimah     Selangor 3870.0
## 3  12483 Zul Haziq        Sabah 5010.5
## 4  79630    Danial        Perak 2800.0
## 5  15125      Yoon        Johor 3100.0
## 6  13398     Kalai       Pahang 3365.0
## 7  18854  Sharveen     Kelantan 4520.3
## 8  11044      Ryan Kuala Lumpur 6100.0

We can use rbind() to add a row and cbind() to add a column. Besides, since data frames are implemented as lists, we can also add a new column through a simple list-like assignment.

However, notice that neither cbind() nor rbind() actually adds the new column or row to df1 unless we assign the result back to it.
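
The point about assignment can be checked by counting rows and columns before and after. The following is a minimal sketch with a stand-in data frame named emp. stringsAsFactors = FALSE is set explicitly so that the behaviour does not depend on the R version.

```r
# Stand-in data frame with two rows
emp <- data.frame(
  emp_id = c(37075, 13000),
  emp_name = c("Xin Li", "Fatimah"),
  emp_hometown = c("Pulau Pinang", "Selangor"),
  stringsAsFactors = FALSE
)

# Without assignment, the result is computed and then discarded
rbind(emp, list(13332, "Alif", "Perlis"))
rows_without_assignment <- nrow(emp)   # still 2

# With assignment, emp itself now holds the new row
emp <- rbind(emp, list(13332, "Alif", "Perlis"))
rows_with_assignment <- nrow(emp)      # now 3

# The same applies to cbind(): assign the result to keep the new column
emp <- cbind(emp, Salary = c(4120.80, 3870, 3500))
names(emp)
```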

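  2. Removing a Row or Column
# Remove a column by assigning NULL to it
df1$emp_hometown <- NULL
# Remove the third row with a negative index and reassign the result
df1 <- df1[-3, ]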

df1
##   emp_id emp_name Salary
## 1  37075   Xin Li 4120.8
## 2  13000  Fatimah 3870.0
## 4  79630   Danial 2800.0
## 5  15125     Yoon 3100.0
## 6  13398    Kalai 3365.0
## 7  18854 Sharveen 4520.3
## 8  11044     Ryan 6100.0

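The code above removes the emp_hometown column and the third row from the data frame. The column is removed by assigning NULL to it, and the row is removed by reassigning df1 with a negative index. Note that the remaining rows keep their original row names, which is why row 3 is missing from the output.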
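
After a row is removed, the row names keep a gap (1, 2, 4, ...). If consecutive row names are preferred, they can be reset. The sketch below, on a stand-in data frame named emp, also shows a column being dropped by name with setdiff(). This is just one of several possible ways.

```r
# Stand-in data frame with four rows
emp <- data.frame(
  emp_id = c(37075, 13000, 12483, 79630),
  emp_name = c("Xin Li", "Fatimah", "Zul Haziq", "Danial"),
  emp_hometown = c("Pulau Pinang", "Selangor", "Sabah", "Perak"),
  Salary = c(4120.80, 3870, 5010.50, 2800),
  stringsAsFactors = FALSE
)

# Remove the third row; the original row names are kept
emp <- emp[-3, ]
names_after_removal <- rownames(emp)   # "1" "2" "4"

# Reset the row names to 1, 2, 3
rownames(emp) <- NULL
names_after_reset <- rownames(emp)     # "1" "2" "3"

# Drop a column by name, keeping every other column
emp <- emp[, setdiff(names(emp), "emp_hometown")]
names(emp)
```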

  3. Modify a Component in a Data Frame
df1[1,'Salary']<-6888.80

df1
##   emp_id emp_name Salary
## 1  37075   Xin Li 6888.8
## 2  13000  Fatimah 3870.0
## 4  79630   Danial 2800.0
## 5  15125     Yoon 3100.0
## 6  13398    Kalai 3365.0
## 7  18854 Sharveen 4520.3
## 8  11044     Ryan 6100.0

The code above shows that we modify the first row’s salary to 6888.80.
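
Instead of a row number, a logical condition can select the cells to modify, which avoids having to know where a record sits. The sketch below uses a stand-in data frame named emp. The new salary of 4000 and the 10% raise are made-up values for illustration only.

```r
# Stand-in data frame
emp <- data.frame(
  emp_name = c("Xin Li", "Fatimah", "Danial"),
  Salary = c(4120.80, 3870, 2800),
  stringsAsFactors = FALSE
)

# Modify the salary of one employee, selected by name
emp$Salary[emp$emp_name == "Fatimah"] <- 4000

# Modify several rows at once: a 10% raise for salaries below 3500
low_paid <- emp$Salary < 3500
emp$Salary[low_paid] <- emp$Salary[low_paid] * 1.10

emp
```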

Sort a Data Frame

# order() returns row positions sorted by Salary; the minus sign reverses the order
df3 <- df1[order(-df1$Salary), ]
df3
##   emp_id emp_name Salary
## 1  37075   Xin Li 6888.8
## 8  11044     Ryan 6100.0
## 7  18854 Sharveen 4520.3
## 2  13000  Fatimah 3870.0
## 6  13398    Kalai 3365.0
## 5  15125     Yoon 3100.0
## 4  79630   Danial 2800.0

By using order(), we can sort our data. In this case, I sorted the salaries from highest to lowest. Note that the - sign is used for descending order.
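
The - sign only works on numeric columns. For a character column such as emp_name, the decreasing argument of order() can be used instead. The following sketch on a stand-in data frame named emp shows the ascending and descending variants side by side.

```r
# Stand-in data frame
emp <- data.frame(
  emp_name = c("Xin Li", "Fatimah", "Yoon", "Ryan"),
  Salary = c(6888.80, 3870, 3100, 6100),
  stringsAsFactors = FALSE
)

# Ascending order is the default
by_salary_asc <- emp[order(emp$Salary), ]

# Descending order with the decreasing argument (same result as the - sign)
by_salary_desc <- emp[order(emp$Salary, decreasing = TRUE), ]

# Descending order on a character column, where the - sign cannot be used
by_name_desc <- emp[order(emp$emp_name, decreasing = TRUE), ]

by_salary_asc$emp_name
by_salary_desc$emp_name
by_name_desc$emp_name
```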

Summary of Data Frame

summary(df1)
##      emp_id        emp_name             Salary    
##  Min.   :11044   Length:7           Min.   :2800  
##  1st Qu.:13199   Class :character   1st Qu.:3232  
##  Median :15125   Mode  :character   Median :3870  
##  Mean   :26875                      Mean   :4378  
##  3rd Qu.:27965                      3rd Qu.:5310  
##  Max.   :79630                      Max.   :6889

summary() provides summary statistics for each numeric column of the data frame, including the min, max, mean, median and quartiles; for character columns it reports only the length, class and mode. In this case, we can see that the mean salary is about RM 4378.00.
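
When only one statistic is needed, it can be computed directly on the column instead of reading it off the summary() table. The sketch below reuses the seven salaries shown in the output above in a stand-in data frame named emp.

```r
# The seven salaries from the example, in a stand-in data frame
emp <- data.frame(
  Salary = c(6888.80, 3870, 2800, 3100, 3365, 4520.30, 6100)
)

salary_mean   <- mean(emp$Salary)     # arithmetic mean
salary_median <- median(emp$Salary)   # middle value
salary_range  <- range(emp$Salary)    # minimum and maximum

round(salary_mean)   # 4378, matching the Mean shown by summary()
salary_median        # 3870
salary_range         # 2800.0 6888.8
```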

Simple Plotting of Data Frame

hist(df1$Salary,main = "Distribution of Salaries", xlab="Salary (RM)",ylab="Frequency",col="blue",ylim=c(0,5))

In this example, I use hist() to plot a histogram showing the distribution of employee salaries. From the graph, we can observe that 1 employee has a salary of RM 2000 - RM 3000, 3 employees have salaries of RM 3000 - RM 4000, and so on.
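
The counts read from the graph can also be checked numerically, because hist() returns the bin boundaries and counts. The sketch below sets plot = FALSE to skip the drawing. The explicit breaks of RM 1000 width are my own assumption, chosen to match the ranges described above.

```r
# The seven salaries from the example
salary <- c(6888.80, 3870, 2800, 3100, 3365, 4520.30, 6100)

# Compute the histogram without drawing it, using bins of width RM 1000
h <- hist(salary, breaks = seq(2000, 7000, by = 1000), plot = FALSE)

h$breaks   # bin boundaries: 2000 3000 4000 5000 6000 7000
h$counts   # employees per bin: 1 3 1 0 2
```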