df1 <- read.csv('C:/Users/TM37075/Documents/UM Master of Data Science/WQD7004 Programming in Data Science/employee.csv')
df1
## emp_id emp_name emp_hometown
## 1 37075 Xin Li Pulau Pinang
## 2 13000 Fatimah Selangor
## 3 12483 Zul Haziq Sabah
## 4 79630 Danial Perak
## 5 15125 Yoon Johor
## 6 13398 Kalai Pahang
## 7 18854 Sharveen Kelantan
## 8 11044 Ryan Kuala Lumpur
emp_id<-c(37075,13000,12483,79630,15125,13398,18854,11044)
emp_name<-c("Xin Li","Fatimah","Zul Haziq","Danial","Yoon","Kalai","Sharveen","Ryan")
emp_hometown<-c("Pulau Pinang","Selangor","Sabah","Perak","Johor","Pahang","Kelantan","Kuala Lumpur")
df2<-data.frame(emp_id,emp_name,emp_hometown)
df2
## emp_id emp_name emp_hometown
## 1 37075 Xin Li Pulau Pinang
## 2 13000 Fatimah Selangor
## 3 12483 Zul Haziq Sabah
## 4 79630 Danial Perak
## 5 15125 Yoon Johor
## 6 13398 Kalai Pahang
## 7 18854 Sharveen Kelantan
## 8 11044 Ryan Kuala Lumpur
str(df1)
## 'data.frame': 8 obs. of 3 variables:
## $ emp_id : int 37075 13000 12483 79630 15125 13398 18854 11044
## $ emp_name : chr "Xin Li" "Fatimah" "Zul Haziq" "Danial" ...
## $ emp_hometown: chr "Pulau Pinang" "Selangor" "Sabah" "Perak" ...
The structure of the employee data set tells us the number of rows (observations) and columns (variables). It also shows each column's name and data type.
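The individual pieces that str() reports can also be retrieved one at a time. The following is a minimal sketch on a three-row stand-in for the employee data; the name emp and the shortened data are assumptions made only so the example runs without the CSV file.

```r
# Small stand-in for the employee data (illustrative, not read from the CSV)
emp <- data.frame(
  emp_id       = c(37075L, 13000L, 12483L),
  emp_name     = c("Xin Li", "Fatimah", "Zul Haziq"),
  emp_hometown = c("Pulau Pinang", "Selangor", "Sabah"),
  stringsAsFactors = FALSE
)

dim(emp)            # number of rows, then number of columns
names(emp)          # column names
sapply(emp, class)  # data type of each column
```

This is handy when only one fact is needed, for example the row count inside a condition.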
head(df1)
## emp_id emp_name emp_hometown
## 1 37075 Xin Li Pulau Pinang
## 2 13000 Fatimah Selangor
## 3 12483 Zul Haziq Sabah
## 4 79630 Danial Perak
## 5 15125 Yoon Johor
## 6 13398 Kalai Pahang
tail(df1, n = 4)
## emp_id emp_name emp_hometown
## 5 15125 Yoon Johor
## 6 13398 Kalai Pahang
## 7 18854 Sharveen Kelantan
## 8 11044 Ryan Kuala Lumpur
By default, head() and tail() return the first and last 6 rows of the data frame, respectively. However, we can also use the n argument to specify how many rows we want to view. In this case, I requested only the last 4 rows of the data frame.
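The n argument also accepts negative values, which means "everything except that many rows". A short sketch, assuming a stand-in data frame called emp that holds only the eight employee IDs:

```r
# Stand-in with the eight employee IDs (illustrative)
emp <- data.frame(
  emp_id = c(37075, 13000, 12483, 79630, 15125, 13398, 18854, 11044)
)

head(emp, n = 3)    # first 3 rows
head(emp, n = -2)   # all rows except the last 2
tail(emp, n = -6)   # all rows except the first 6
```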
df1['emp_id']
## emp_id
## 1 37075
## 2 13000
## 3 12483
## 4 79630
## 5 15125
## 6 13398
## 7 18854
## 8 11044
df1[['emp_id']]
## [1] 37075 13000 12483 79630 15125 13398 18854 11044
df1$'emp_id'
## [1] 37075 13000 12483 79630 15125 13398 18854 11044
df1[2,]
## emp_id emp_name emp_hometown
## 2 13000 Fatimah Selangor
df1[,3]
## [1] "Pulau Pinang" "Selangor" "Sabah" "Perak" "Johor"
## [6] "Pahang" "Kelantan" "Kuala Lumpur"
df1[,3, drop = FALSE]
## emp_hometown
## 1 Pulau Pinang
## 2 Selangor
## 3 Sabah
## 4 Perak
## 5 Johor
## 6 Pahang
## 7 Kelantan
## 8 Kuala Lumpur
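As you may have observed, different ways of accessing a data frame return different results: some return a vector, while others remain a data frame. Single brackets with a column name (df1['emp_id']), row selection (df1[2,]) and drop = FALSE keep the data frame structure, whereas double brackets, $ and df1[,3] return a vector.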
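One way to confirm what each access form returns is to test the result with is.data.frame(). This is a minimal sketch on a two-row stand-in named emp (an assumption for illustration):

```r
# Two-row stand-in for the employee data (illustrative)
emp <- data.frame(
  emp_id       = c(37075, 13000),
  emp_name     = c("Xin Li", "Fatimah"),
  emp_hometown = c("Pulau Pinang", "Selangor"),
  stringsAsFactors = FALSE
)

is.data.frame(emp["emp_id"])           # TRUE: single brackets keep the data frame
is.data.frame(emp[["emp_id"]])         # FALSE: double brackets return the column vector
is.data.frame(emp$emp_id)              # FALSE: $ returns the column vector
is.data.frame(emp[2, ])                # TRUE: a row selection stays a data frame
is.data.frame(emp[, 3])                # FALSE: a single column is dropped to a vector
is.data.frame(emp[, 3, drop = FALSE])  # TRUE: drop = FALSE prevents the simplification
```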
rbind(df1,list(13332,"Alif","Perlis"))
## emp_id emp_name emp_hometown
## 1 37075 Xin Li Pulau Pinang
## 2 13000 Fatimah Selangor
## 3 12483 Zul Haziq Sabah
## 4 79630 Danial Perak
## 5 15125 Yoon Johor
## 6 13398 Kalai Pahang
## 7 18854 Sharveen Kelantan
## 8 11044 Ryan Kuala Lumpur
## 9 13332 Alif Perlis
cbind(df1,Salary=c(4120.80,3870,5010.50,2800,3100,3365,4520.30,6100))
## emp_id emp_name emp_hometown Salary
## 1 37075 Xin Li Pulau Pinang 4120.8
## 2 13000 Fatimah Selangor 3870.0
## 3 12483 Zul Haziq Sabah 5010.5
## 4 79630 Danial Perak 2800.0
## 5 15125 Yoon Johor 3100.0
## 6 13398 Kalai Pahang 3365.0
## 7 18854 Sharveen Kelantan 4520.3
## 8 11044 Ryan Kuala Lumpur 6100.0
df1
## emp_id emp_name emp_hometown
## 1 37075 Xin Li Pulau Pinang
## 2 13000 Fatimah Selangor
## 3 12483 Zul Haziq Sabah
## 4 79630 Danial Perak
## 5 15125 Yoon Johor
## 6 13398 Kalai Pahang
## 7 18854 Sharveen Kelantan
## 8 11044 Ryan Kuala Lumpur
df1$Salary<-c(4120.80,3870,5010.50,2800,3100,3365,4520.30,6100)
df1
## emp_id emp_name emp_hometown Salary
## 1 37075 Xin Li Pulau Pinang 4120.8
## 2 13000 Fatimah Selangor 3870.0
## 3 12483 Zul Haziq Sabah 5010.5
## 4 79630 Danial Perak 2800.0
## 5 15125 Yoon Johor 3100.0
## 6 13398 Kalai Pahang 3365.0
## 7 18854 Sharveen Kelantan 4520.3
## 8 11044 Ryan Kuala Lumpur 6100.0
We can use rbind() to add a row and cbind() to add a column. In addition, since data frames are implemented as lists, we can also add a new column through a simple list-like assignment such as df1$Salary <- c(...).
However, notice that neither cbind() nor rbind() changes the original data frame. Each returns a new data frame, so the result must be assigned to a variable to keep it. This is why df1 still has only three columns and eight rows when it is printed after the two calls.
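To make the point about assignment concrete, here is a small sketch. It assumes a two-row stand-in called emp; the added row reuses the "Alif" example from above.

```r
# Two-row stand-in for the employee data (illustrative)
emp <- data.frame(
  emp_id   = c(37075, 13000),
  emp_name = c("Xin Li", "Fatimah"),
  stringsAsFactors = FALSE
)

rbind(emp, list(13332, "Alif"))          # prints 3 rows, but emp itself is untouched
nrow(emp)                                # still 2

emp <- rbind(emp, list(13332, "Alif"))   # assign the result back to keep the new row
emp <- cbind(emp, Salary = c(4120.80, 3870, 3000))  # salary values are made up here
dim(emp)                                 # now 3 rows and 3 columns
```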
df1$emp_hometown<-NULL
df1<-df1[-3,]
df1
## emp_id emp_name Salary
## 1 37075 Xin Li 4120.8
## 2 13000 Fatimah 3870.0
## 4 79630 Danial 2800.0
## 5 15125 Yoon 3100.0
## 6 13398 Kalai 3365.0
## 7 18854 Sharveen 4520.3
## 8 11044 Ryan 6100.0
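The code above removes the emp_hometown column and the third row from the data frame. The column is removed by assigning NULL to it, while the row is removed by reassigning df1 to a copy that excludes row 3 (a negative index means "all rows except this one"). Note that the remaining rows keep their original labels, so the row labels now jump from 2 to 4.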
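If the gap in the row labels is unwanted, the labels can be reset after the deletion. A minimal sketch, assuming a four-row stand-in named emp:

```r
# Four-row stand-in for the employee data (illustrative)
emp <- data.frame(
  emp_id = c(37075, 13000, 12483, 79630),
  Salary = c(4120.80, 3870, 5010.50, 2800)
)

emp <- emp[-3, ]        # drop the third row
rownames(emp)           # "1" "2" "4": original labels are kept

rownames(emp) <- NULL   # reset to consecutive labels
rownames(emp)           # "1" "2" "3"
```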
df1[1,'Salary']<-6888.80
df1
## emp_id emp_name Salary
## 1 37075 Xin Li 6888.8
## 2 13000 Fatimah 3870.0
## 4 79630 Danial 2800.0
## 5 15125 Yoon 3100.0
## 6 13398 Kalai 3365.0
## 7 18854 Sharveen 4520.3
## 8 11044 Ryan 6100.0
The code above changes the salary in the first row to 6888.80 by indexing the row position and the column name.
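Because a row was deleted earlier, a numeric index refers to the row's current position rather than its printed label. As a hedged alternative, the row can be selected by a value instead. The sketch below uses a stand-in called emp and matches on emp_id; this is my own variation, not the approach used above.

```r
# Four-row stand-in for the employee data (illustrative)
emp <- data.frame(
  emp_id = c(37075, 13000, 12483, 79630),
  Salary = c(4120.80, 3870, 5010.50, 2800)
)
emp <- emp[-3, ]    # labels are now 1, 2, 4

emp[3, "emp_id"]    # position 3 is the row labelled "4", i.e. 79630

# Select the row by value rather than by position
emp$Salary[emp$emp_id == 37075] <- 6888.80
emp
```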
df3<-df1[order(-df1$Salary),]
df3
## emp_id emp_name Salary
## 1 37075 Xin Li 6888.8
## 8 11044 Ryan 6100.0
## 7 18854 Sharveen 4520.3
## 2 13000 Fatimah 3870.0
## 6 13398 Kalai 3365.0
## 5 15125 Yoon 3100.0
## 4 79630 Danial 2800.0
By using order(), we can sort our data. order() returns the row positions in sorted order, and we use them to index the data frame. In this case, I sort the salaries from highest to lowest. Note that the - sign is used for descending order.
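The - sign only works on numeric columns. For a character column such as emp_name, the decreasing argument of order() can be used instead. A short sketch on a three-row stand-in called emp (an assumption for illustration):

```r
# Three-row stand-in for the employee data (illustrative)
emp <- data.frame(
  emp_name = c("Danial", "Xin Li", "Ryan"),
  Salary   = c(2800, 6888.80, 6100),
  stringsAsFactors = FALSE
)

# For numeric data, both forms give the same ordering here
order(-emp$Salary)
order(emp$Salary, decreasing = TRUE)

# For character data, use decreasing = TRUE (negating a string is an error)
by_name_desc <- emp[order(emp$emp_name, decreasing = TRUE), ]
by_name_desc
```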
summary(df1)
## emp_id emp_name Salary
## Min. :11044 Length:7 Min. :2800
## 1st Qu.:13199 Class :character 1st Qu.:3232
## Median :15125 Mode :character Median :3870
## Mean :26875 Mean :4378
## 3rd Qu.:27965 3rd Qu.:5310
## Max. :79630 Max. :6889
The summary() function provides summary statistics for each column of the data frame, including the minimum, maximum, mean, median and quartiles for numeric columns. In this case, we can see that the mean salary is about RM 4378.
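The same figures can be computed one at a time, which also shows that summary() prints rounded values. The sketch below copies the seven salaries from the data frame above into a plain vector named salary (the name is an assumption):

```r
# The seven salaries after the modification above
salary <- c(6888.80, 3870, 2800, 3100, 3365, 4520.30, 6100)

mean(salary)                      # about 4377.73, shown as 4378 by summary()
median(salary)                    # 3870
quantile(salary, c(0.25, 0.75))   # 3232.5 and 5310.15
range(salary)                     # 2800 and 6888.8
```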
hist(df1$Salary,main = "Distribution of Salaries", xlab="Salary (RM)",ylab="Frequency",col="blue",ylim=c(0,5))
In this example, I use hist() to plot a histogram showing the distribution of employee salaries. From the graph, we can observe that 1 employee has a salary between RM 2000 and RM 3000, 3 employees have salaries between RM 3000 and RM 4000, and so on.
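The bin counts can also be read without drawing the plot by passing plot = FALSE. In this sketch the breaks are set explicitly to RM 1000 steps, which is my assumption; the call above relies on hist()'s default breaks.

```r
# The seven salaries from the data frame above
salary <- c(6888.80, 3870, 2800, 3100, 3365, 4520.30, 6100)

# Compute the histogram without plotting; breaks are chosen explicitly here
h <- hist(salary, breaks = seq(2000, 7000, by = 1000), plot = FALSE)

h$breaks   # bin edges: 2000, 3000, ..., 7000
h$counts   # employees per bin: 1 3 1 0 2
```

Each bin includes its upper edge, so a salary of exactly RM 3000 would be counted in the RM 2000 to RM 3000 bin.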