WQD 7004 Assignment 1

This assignment explains and demonstrates data frames in R.

Question: What is a data frame?

A data frame is a table, or two-dimensional array-like data structure, that stores a dataset. Each column contains the values of one variable, and each row contains one set of values, one from each column.
A data frame has the following characteristics:

1. The column names should be non-empty.

2. The row names should be unique.

3. The data stored in a data frame can be of numeric, factor or character type.

4. Each column should contain the same number of data items.
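
As a minimal sketch of characteristics 2 and 4, the snippet below uses small made-up vectors to show that data.frame() rejects columns of unequal length and duplicate row names. The names ok, bad_length and bad_rownames are illustrative only and are not used elsewhere in this assignment.

```r
# Illustrative data only; not the student data used later.
ok <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Columns of length 3 and 2 cannot be combined, so data.frame() raises an error.
# tryCatch() captures the error message instead of stopping the script.
bad_length <- tryCatch(
  data.frame(id = 1:3, label = c("a", "b")),
  error = function(e) conditionMessage(e)
)

# Repeating a row name ("r1") is rejected as well.
bad_rownames <- tryCatch(
  data.frame(id = 1:3, row.names = c("r1", "r1", "r2")),
  error = function(e) conditionMessage(e)
)

print(dim(ok))
print(bad_length)
print(bad_rownames)
```

Both failing calls end up holding an error message rather than a data frame, while the valid call returns a 3 by 2 data frame.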

Functions such as read.table(), read.csv(), read.delim() and read.fwf() can be used to read data from other sources and convert it into a data frame.

In this assignment, the data frame is created from vectors instead of an imported dataset.

The demonstration of data frame functions is organised into 8 parts:

Part 1: Creating a data frame.

Part 2: Returning values as a vector or a data frame.

Part 3: Subset in data frame.

Part 4: Sorting and reordering in data frame.

Part 5: Extracting specific data from data frame.

Part 6: Showing part of the data frame.

Part 7: Additional methods that can be used with a data frame.

Part 8: Export and import data frame to CSV.

Part 1: Creating a data frame.

The example below creates a data frame of students from 3 vectors.

x <-data.frame(
  std_id = c(1:10),
  std_name = c("Rick","Dan","Michelle","Ryan","Adde","Ali",
               "Bashri","Siti","Susan","Claris"),
  std_gender = c("M","M","F","M","F","M","M","F","F","F"))
print(x)

##    std_id std_name std_gender
## 1       1     Rick          M
## 2       2      Dan          M
## 3       3 Michelle          F
## 4       4     Ryan          M
## 5       5     Adde          F
## 6       6      Ali          M
## 7       7   Bashri          M
## 8       8     Siti          F
## 9       9    Susan          F
## 10     10   Claris          F

We can add new components to a data frame by using rbind() for rows and cbind() for columns.

In this example, I will demonstrate adding new columns. A second data frame, y, is first combined with x by passing both to data.frame(), which binds them column-wise as cbind() does, and a further column is then added with cbind().

Combining two data frames column-wise requires them to have the same number of rows, and in this example both data frames have 10 rows.

y <-data.frame( std_height = c(170,176,161,181,155,176,163,166,160,150),
                std_hometown = c("PENANG","KEDAH","SELANGOR","KUALA LUMPUR","JOHOR","PENANG","KEDAH","SELANGOR","KUALA LUMPUR","JOHOR"),
                std_grade = c("A","A","A","B","B","C","A","C","C","B"),
                std_age = c(14,15,16,16,15,17,15,13,15,16))
xy<-data.frame(x,y)
print(xy)

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 1       1     Rick          M        170       PENANG         A      14
## 2       2      Dan          M        176        KEDAH         A      15
## 3       3 Michelle          F        161     SELANGOR         A      16
## 4       4     Ryan          M        181 KUALA LUMPUR         B      16
## 5       5     Adde          F        155        JOHOR         B      15
## 6       6      Ali          M        176       PENANG         C      17
## 7       7   Bashri          M        163        KEDAH         A      15
## 8       8     Siti          F        166     SELANGOR         C      13
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 10     10   Claris          F        150        JOHOR         B      16

For this part, a new column is added into the combined data frame using cbind().

xy <- cbind (xy, stud_attitude= c("Good","Average","Bad","Good","Average","Bad","Good","Average","Bad","Good"))
print(xy)

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 1       1     Rick          M        170       PENANG         A      14
## 2       2      Dan          M        176        KEDAH         A      15
## 3       3 Michelle          F        161     SELANGOR         A      16
## 4       4     Ryan          M        181 KUALA LUMPUR         B      16
## 5       5     Adde          F        155        JOHOR         B      15
## 6       6      Ali          M        176       PENANG         C      17
## 7       7   Bashri          M        163        KEDAH         A      15
## 8       8     Siti          F        166     SELANGOR         C      13
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 10     10   Claris          F        150        JOHOR         B      16
##    stud_attitude
## 1           Good
## 2        Average
## 3            Bad
## 4           Good
## 5        Average
## 6            Bad
## 7           Good
## 8        Average
## 9            Bad
## 10          Good
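
The text mentions rbind() for adding rows but only demonstrates cbind(). The following is a minimal sketch of rbind(), assuming a shortened copy of the student data; the names students and new_student are illustrative and not part of the assignment.

```r
# Shortened, illustrative copy of the first three students.
students <- data.frame(
  std_id = 1:3,
  std_name = c("Rick", "Dan", "Michelle"),
  std_gender = c("M", "M", "F")
)

# The new row must have the same column names as the existing data frame.
new_student <- data.frame(std_id = 4L, std_name = "Ryan", std_gender = "M")

# rbind() appends the new row at the bottom.
students <- rbind(students, new_student)
print(students)
```

Where cbind() needs matching row counts, rbind() needs matching columns.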

By applying str(), we can get the structure of the data frame.

str(xy)

## 'data.frame':    10 obs. of  8 variables:
##  $ std_id       : int  1 2 3 4 5 6 7 8 9 10
##  $ std_name     : chr  "Rick" "Dan" "Michelle" "Ryan" ...
##  $ std_gender   : chr  "M" "M" "F" "M" ...
##  $ std_height   : num  170 176 161 181 155 176 163 166 160 150
##  $ std_hometown : chr  "PENANG" "KEDAH" "SELANGOR" "KUALA LUMPUR" ...
##  $ std_grade    : chr  "A" "A" "A" "B" ...
##  $ std_age      : num  14 15 16 16 15 17 15 13 15 16
##  $ stud_attitude: chr  "Good" "Average" "Bad" "Good" ...

By applying summary(), we can print a summary of the data frame. For each numeric variable it shows the minimum, first quartile, median, mean, third quartile and maximum; for each character variable it shows the length, class and mode.

print(summary(xy))

##      std_id        std_name          std_gender          std_height   
##  Min.   : 1.00   Length:10          Length:10          Min.   :150.0  
##  1st Qu.: 3.25   Class :character   Class :character   1st Qu.:160.2  
##  Median : 5.50   Mode  :character   Mode  :character   Median :164.5  
##  Mean   : 5.50                                         Mean   :165.8  
##  3rd Qu.: 7.75                                         3rd Qu.:174.5  
##  Max.   :10.00                                         Max.   :181.0  
##  std_hometown        std_grade            std_age     stud_attitude     
##  Length:10          Length:10          Min.   :13.0   Length:10         
##  Class :character   Class :character   1st Qu.:15.0   Class :character  
##  Mode  :character   Mode  :character   Median :15.0   Mode  :character  
##                                        Mean   :15.2                     
##                                        3rd Qu.:16.0                     
##                                        Max.   :17.0
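
For character columns, summary() reports only the length, class and mode. If counts per category are wanted, one option is to convert the column to a factor first. The sketch below copies the std_grade values from the example above; the data frame name grades and the variable grade_counts are assumptions for illustration.

```r
# Grades copied from the std_grade column of the example above.
std_grade <- c("A", "A", "A", "B", "B", "C", "A", "C", "C", "B")

# Storing the column as a factor makes summary() count each level.
grades <- data.frame(std_grade = factor(std_grade))
grade_counts <- summary(grades$std_grade)
print(grade_counts)
```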

The following functions can be applied to a data frame:

Function names() shows the names of all columns in the data frame.

names(xy)

## [1] "std_id"        "std_name"      "std_gender"    "std_height"   
## [5] "std_hometown"  "std_grade"     "std_age"       "stud_attitude"

Function row.names() shows the names of all rows in the data frame.

row.names(xy)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Function nrow() shows the number of rows available in the data frame.

nrow(xy)

## [1] 10

Function ncol() shows the number of columns available in the data frame.

ncol(xy)

## [1] 8

Function length() shows the number of columns in the data frame, because a data frame is stored as a list with one element per column.

length(xy)

## [1] 8

Function typeof() shows the internal storage type of an object. For a data frame it returns “list”.

typeof(xy)

## [1] "list"
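
The results of length() and typeof() follow from a data frame being a list of equal-length columns. A small sketch of that relationship, using a shortened illustrative data frame (students is an assumed name, and stringsAsFactors = FALSE is set only so the result is the same on older R versions):

```r
# Shortened, illustrative copy of the student data.
students <- data.frame(
  std_id = 1:3,
  std_name = c("Rick", "Dan", "Michelle"),
  std_height = c(170, 176, 161),
  stringsAsFactors = FALSE
)

print(is.list(students))                   # a data frame is stored as a list
print(length(students) == ncol(students))  # one list element per column
print(sapply(students, typeof))            # storage type of each column
print(dim(students))                       # number of rows and columns together
```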

Part 2: Returning values as a vector or a data frame.

To access a column of the data frame as a vector, we can use these 3 methods:

Method 1: Using “data frame name”$“variable name”

xy$std_name

##  [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Adde"     "Ali"     
##  [7] "Bashri"   "Siti"     "Susan"    "Claris"

Method 2: Using “data frame name”[[“variable name”]]

xy[["std_name"]]

##  [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Adde"     "Ali"     
##  [7] "Bashri"   "Siti"     "Susan"    "Claris"

Method 3: Using “data frame name”[[column number]]

xy[[2]]

##  [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Adde"     "Ali"     
##  [7] "Bashri"   "Siti"     "Susan"    "Claris"

class(xy[[2]])

## [1] "character"

But if we want the result as a data frame, we can write it as:

Method 1:

xy["std_name"]

##    std_name
## 1      Rick
## 2       Dan
## 3  Michelle
## 4      Ryan
## 5      Adde
## 6       Ali
## 7    Bashri
## 8      Siti
## 9     Susan
## 10   Claris

Method 2:

xy[2]

##    std_name
## 1      Rick
## 2       Dan
## 3  Michelle
## 4      Ryan
## 5      Adde
## 6       Ali
## 7    Bashri
## 8      Siti
## 9     Susan
## 10   Claris

class(xy[2])

## [1] "data.frame"

When indexing with [row, column], whether R returns a vector or a data frame is controlled by the drop argument.

For example, x2 is set to column 2 (std_name). Because the drop argument is not set, it defaults to TRUE and the column is returned as a vector.

x2<-xy[,2]
print(x2)

##  [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Adde"     "Ali"     
##  [7] "Bashri"   "Siti"     "Susan"    "Claris"

class(x2)

## [1] "character"

If we set drop=FALSE, the column is returned as a data frame.

x2<-xy[,2,drop=FALSE]
print(x2)

##    std_name
## 1      Rick
## 2       Dan
## 3  Michelle
## 4      Ryan
## 5      Adde
## 6       Ali
## 7    Bashri
## 8      Siti
## 9     Susan
## 10   Claris

class(x2)

## [1] "data.frame"

There are also different ways to return a vector or a data frame:

Methods that return a vector:

class(xy[["std_name"]])

## [1] "character"

class(xy[[2]])

## [1] "character"

class(xy$std_name)

## [1] "character"

Methods that return a data frame:

class(xy["std_name"])

## [1] "data.frame"

class(xy[2])

## [1] "data.frame"

If more than 1 column is retrieved, the result is returned as a data frame.

class(xy[c("std_name","std_id")])

## [1] "data.frame"

If we select only a row, the result is returned as a data frame.

print(xy[2,])

##   std_id std_name std_gender std_height std_hometown std_grade std_age
## 2      2      Dan          M        176        KEDAH         A      15
##   stud_attitude
## 2       Average

class(xy[2,])

## [1] "data.frame"

Part 3: Subset in data frame.

To get a subset of a data frame, we can use approaches such as the following:

Eg 1: How would you show only the first 6 rows of the data frame?

Method: Extract rows 1 to 6 and all columns.

xy[1:6,]

##   std_id std_name std_gender std_height std_hometown std_grade std_age
## 1      1     Rick          M        170       PENANG         A      14
## 2      2      Dan          M        176        KEDAH         A      15
## 3      3 Michelle          F        161     SELANGOR         A      16
## 4      4     Ryan          M        181 KUALA LUMPUR         B      16
## 5      5     Adde          F        155        JOHOR         B      15
## 6      6      Ali          M        176       PENANG         C      17
##   stud_attitude
## 1          Good
## 2       Average
## 3           Bad
## 4          Good
## 5       Average
## 6           Bad

Eg 2: How would you get rows 2 to 7 of the data frame?

Method: Extract rows 2 to 7 and all columns.

xy[2:7,]

##   std_id std_name std_gender std_height std_hometown std_grade std_age
## 2      2      Dan          M        176        KEDAH         A      15
## 3      3 Michelle          F        161     SELANGOR         A      16
## 4      4     Ryan          M        181 KUALA LUMPUR         B      16
## 5      5     Adde          F        155        JOHOR         B      15
## 6      6      Ali          M        176       PENANG         C      17
## 7      7   Bashri          M        163        KEDAH         A      15
##   stud_attitude
## 2       Average
## 3           Bad
## 4          Good
## 5       Average
## 6           Bad
## 7          Good

Eg 3: How would you get the value in row 3, column 5? (The answer should be SELANGOR.)

xy[3,"std_hometown"]

## [1] "SELANGOR"
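
The row ranges above are hard-coded. The sketch below shows how similar selections could be written relative to nrow(), so that they do not depend on knowing the number of rows in advance. The shortened data frame students and the variable names are assumptions for illustration.

```r
# Illustrative data frame with the same ids and heights as the example.
students <- data.frame(
  std_id = 1:10,
  std_height = c(170, 176, 161, 181, 155, 176, 163, 166, 160, 150)
)

n <- nrow(students)

all_but_last <- students[-n, ]       # a negative index drops row n
last_six <- students[(n - 5):n, ]    # rows 5 to 10 when n is 10
cell <- students[3, "std_height"]    # a single value from row 3

print(nrow(all_but_last))
print(last_six$std_id)
print(cell)
```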

Part 4: Sorting and reordering in data frame.

This example shows that sort() returns the std_height values from shortest to tallest. It sorts only that vector; the data frame itself is unchanged, as the print below shows.

sort(xy$std_height)

##  [1] 150 155 160 161 163 166 170 176 176 181

print(xy)

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 1       1     Rick          M        170       PENANG         A      14
## 2       2      Dan          M        176        KEDAH         A      15
## 3       3 Michelle          F        161     SELANGOR         A      16
## 4       4     Ryan          M        181 KUALA LUMPUR         B      16
## 5       5     Adde          F        155        JOHOR         B      15
## 6       6      Ali          M        176       PENANG         C      17
## 7       7   Bashri          M        163        KEDAH         A      15
## 8       8     Siti          F        166     SELANGOR         C      13
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 10     10   Claris          F        150        JOHOR         B      16
##    stud_attitude
## 1           Good
## 2        Average
## 3            Bad
## 4           Good
## 5        Average
## 6            Bad
## 7           Good
## 8        Average
## 9            Bad
## 10          Good

The order() function returns the indices of the elements in sorted order; here they are stored in the vector ranks.

ranks <- order(xy$std_height)
print(ranks)

##  [1] 10  5  9  3  7  8  1  2  6  4

xy$std_height

##  [1] 170 176 161 181 155 176 163 166 160 150

Comparing ranks with the values of std_height shows that:

The lowest value is 150: its index is 10, which comes first in ranks.

The second lowest value is 155: its index is 5, which comes second in ranks.

The highest value is 181: its index is 4, which comes last in ranks.

The ranks vector therefore contains the indices that can be used to order the data frame by std_height.

xy[ranks,]

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 10     10   Claris          F        150        JOHOR         B      16
## 5       5     Adde          F        155        JOHOR         B      15
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 3       3 Michelle          F        161     SELANGOR         A      16
## 7       7   Bashri          M        163        KEDAH         A      15
## 8       8     Siti          F        166     SELANGOR         C      13
## 1       1     Rick          M        170       PENANG         A      14
## 2       2      Dan          M        176        KEDAH         A      15
## 6       6      Ali          M        176       PENANG         C      17
## 4       4     Ryan          M        181 KUALA LUMPUR         B      16
##    stud_attitude
## 10          Good
## 5        Average
## 9            Bad
## 3            Bad
## 7           Good
## 8        Average
## 1           Good
## 2        Average
## 6            Bad
## 4           Good

To get the data frame in descending order, set the decreasing argument to TRUE.

xy[order(xy$std_height,decreasing = TRUE),]

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 4       4     Ryan          M        181 KUALA LUMPUR         B      16
## 2       2      Dan          M        176        KEDAH         A      15
## 6       6      Ali          M        176       PENANG         C      17
## 1       1     Rick          M        170       PENANG         A      14
## 8       8     Siti          F        166     SELANGOR         C      13
## 7       7   Bashri          M        163        KEDAH         A      15
## 3       3 Michelle          F        161     SELANGOR         A      16
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 5       5     Adde          F        155        JOHOR         B      15
## 10     10   Claris          F        150        JOHOR         B      16
##    stud_attitude
## 4           Good
## 2        Average
## 6            Bad
## 1           Good
## 8        Average
## 7           Good
## 3            Bad
## 9            Bad
## 5        Average
## 10          Good
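
order() also accepts more than one key, which is useful when values tie or when rows should be grouped first. The following sketch sorts by grade and then by height from tallest to shortest within each grade; the shortened data frame students and the result name are assumptions for illustration.

```r
# Shortened, illustrative copy of the first six students.
students <- data.frame(
  std_name = c("Rick", "Dan", "Michelle", "Ryan", "Adde", "Ali"),
  std_height = c(170, 176, 161, 181, 155, 176),
  std_grade = c("A", "A", "A", "B", "B", "C")
)

# First key: grade ascending. Second key: height descending
# (negating a numeric key reverses its order).
by_grade_then_height <- students[order(students$std_grade, -students$std_height), ]
print(by_grade_then_height)
```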

Part 5: Extracting specific data from data frame.

This example shows how to extract specific data from a data frame.

A new data frame, x1, is defined, and columns 1 to 3 of data frame xy are extracted into it.

This method lets us pull just the data we want into a separate data frame. Note that the new column names are prefixed with xy. because the columns are passed as xy$ expressions.

x1 <- data.frame(xy$std_id,xy$std_name,xy$std_gender)
print(x1)

##    xy.std_id xy.std_name xy.std_gender
## 1          1        Rick             M
## 2          2         Dan             M
## 3          3    Michelle             F
## 4          4        Ryan             M
## 5          5        Adde             F
## 6          6         Ali             M
## 7          7      Bashri             M
## 8          8        Siti             F
## 9          9       Susan             F
## 10        10      Claris             F
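
If the original column names should be kept, one alternative is to select the columns by name or by position with square brackets. This is a sketch on a shortened illustrative data frame; xy_small and x1_by_position are assumed names, not part of the assignment.

```r
# Shortened, illustrative copy of the combined data frame.
xy_small <- data.frame(
  std_id = 1:3,
  std_name = c("Rick", "Dan", "Michelle"),
  std_gender = c("M", "M", "F"),
  std_height = c(170, 176, 161)
)

# Selecting by name keeps the column names unchanged.
x1 <- xy_small[, c("std_id", "std_name", "std_gender")]

# Selecting by position gives the same result here.
x1_by_position <- xy_small[, 1:3]

print(names(x1))
```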

Part 6: Showing part of the data frame.

For a large dataset, we can use functions such as head() and tail() to get a quick view of its elements and structure.

The function head() shows the first 6 rows of the data frame by default.

print(head(xy))

##   std_id std_name std_gender std_height std_hometown std_grade std_age
## 1      1     Rick          M        170       PENANG         A      14
## 2      2      Dan          M        176        KEDAH         A      15
## 3      3 Michelle          F        161     SELANGOR         A      16
## 4      4     Ryan          M        181 KUALA LUMPUR         B      16
## 5      5     Adde          F        155        JOHOR         B      15
## 6      6      Ali          M        176       PENANG         C      17
##   stud_attitude
## 1          Good
## 2       Average
## 3           Bad
## 4          Good
## 5       Average
## 6           Bad

The function tail() shows the last 6 rows of the data frame by default.

print(tail(xy))

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 5       5     Adde          F        155        JOHOR         B      15
## 6       6      Ali          M        176       PENANG         C      17
## 7       7   Bashri          M        163        KEDAH         A      15
## 8       8     Siti          F        166     SELANGOR         C      13
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 10     10   Claris          F        150        JOHOR         B      16
##    stud_attitude
## 5        Average
## 6            Bad
## 7           Good
## 8        Average
## 9            Bad
## 10          Good
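
Both functions take an n argument when a number of rows other than 6 is wanted. A brief sketch on an illustrative one-column data frame (students and the result names are assumptions):

```r
# Illustrative data frame with ten rows.
students <- data.frame(std_id = 1:10)

first_three <- head(students, n = 3)        # rows 1 to 3
last_two <- tail(students, n = 2)           # rows 9 and 10
all_but_last_two <- head(students, n = -2)  # a negative n drops rows from the end

print(first_three)
print(last_two)
print(nrow(all_but_last_two))
```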

Part 7: Additional methods that can be used with a data frame.

Method: Accessing a data frame using logical vectors as indices.

Eg 1: Use 2 logical vectors (TRUE and FALSE values), one for the rows and one for the columns, to select components of the data frame.

print(xy[c(T,T,F,T,T,F,T,F,T,F),c(T,T,F,F,F,T,T,T)])

##   std_id std_name std_grade std_age stud_attitude
## 1      1     Rick         A      14          Good
## 2      2      Dan         A      15       Average
## 4      4     Ryan         B      16          Good
## 5      5     Adde         B      15       Average
## 7      7   Bashri         A      15          Good
## 9      9    Susan         C      15           Bad

Eg 2: Select all rows, with a 2-element logical vector as the column index.

The output shows that the vector is recycled across all 8 columns, so every second column is kept, giving 4 columns.

print(xy[,c(T,F)])

##    std_id std_gender std_hometown std_age
## 1       1          M       PENANG      14
## 2       2          M        KEDAH      15
## 3       3          F     SELANGOR      16
## 4       4          M KUALA LUMPUR      16
## 5       5          F        JOHOR      15
## 6       6          M       PENANG      17
## 7       7          M        KEDAH      15
## 8       8          F     SELANGOR      13
## 9       9          F KUALA LUMPUR      15
## 10     10          F        JOHOR      16
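
The recycling in Eg 2 can be made visible with rep_len(), which repeats a vector up to a given length. This sketch works on the column names alone rather than on the data frame; the vector names cols and recycled are illustrative.

```r
# Column names of the combined data frame, copied from the names() output.
cols <- c("std_id", "std_name", "std_gender", "std_height",
          "std_hometown", "std_grade", "std_age", "stud_attitude")

# Spell out the 8-element logical vector that c(TRUE, FALSE) is recycled into.
recycled <- rep_len(c(TRUE, FALSE), length(cols))
print(recycled)

# The positions marked TRUE are the columns that are kept.
print(cols[recycled])
```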

Eg 3: Select the rows where std_grade is “A” and print them.

print(xy[xy$std_grade=="A",])

##   std_id std_name std_gender std_height std_hometown std_grade std_age
## 1      1     Rick          M        170       PENANG         A      14
## 2      2      Dan          M        176        KEDAH         A      15
## 3      3 Michelle          F        161     SELANGOR         A      16
## 7      7   Bashri          M        163        KEDAH         A      15
##   stud_attitude
## 1          Good
## 2       Average
## 3           Bad
## 7          Good
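
Conditions can also be combined with & (and) or | (or). The sketch below selects students who have grade A and are taller than 165, first with a logical index and then with subset(), which is another base R way to write the same filter. The data frame students and the result names are assumptions for illustration; the values are copied from the example data.

```r
# Names, heights and grades copied from the example data.
students <- data.frame(
  std_name = c("Rick", "Dan", "Michelle", "Ryan", "Adde", "Ali",
               "Bashri", "Siti", "Susan", "Claris"),
  std_height = c(170, 176, 161, 181, 155, 176, 163, 166, 160, 150),
  std_grade = c("A", "A", "A", "B", "B", "C", "A", "C", "C", "B")
)

# Logical index: both conditions must be TRUE for a row to be kept.
tall_a <- students[students$std_grade == "A" & students$std_height > 165, ]

# subset() evaluates the condition using the column names directly.
tall_a_subset <- subset(students, std_grade == "A" & std_height > 165)

print(tall_a)
```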

Part 8: Export and import data frame to CSV.

Exporting and importing a data frame in R can be done with the functions write.csv() and read.csv().

In the example below, the data frame created in this assignment is exported in CSV format to a file named “assignment 1.csv”.

write.csv(xy,"assignment 1.csv",row.names = F)

To re-import “assignment 1.csv” into RStudio, use read.csv(), as shown below:

assignment1 = read.csv("assignment 1.csv")
assignment1

##    std_id std_name std_gender std_height std_hometown std_grade std_age
## 1       1     Rick          M        170       PENANG         A      14
## 2       2      Dan          M        176        KEDAH         A      15
## 3       3 Michelle          F        161     SELANGOR         A      16
## 4       4     Ryan          M        181 KUALA LUMPUR         B      16
## 5       5     Adde          F        155        JOHOR         B      15
## 6       6      Ali          M        176       PENANG         C      17
## 7       7   Bashri          M        163        KEDAH         A      15
## 8       8     Siti          F        166     SELANGOR         C      13
## 9       9    Susan          F        160 KUALA LUMPUR         C      15
## 10     10   Claris          F        150        JOHOR         B      16
##    stud_attitude
## 1           Good
## 2        Average
## 3            Bad
## 4           Good
## 5        Average
## 6            Bad
## 7           Good
## 8        Average
## 9            Bad
## 10          Good
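
To check that an export and import round trip preserves the data, the two data frames can be compared. The sketch below writes to a temporary file created with tempfile() instead of the working directory; the shortened data frame students and the other names are assumptions for illustration.

```r
# Shortened, illustrative copy of the student data.
students <- data.frame(
  std_id = 1:3,
  std_name = c("Rick", "Dan", "Michelle"),
  std_grade = c("A", "A", "A")
)

# Write to a temporary file so nothing is left in the working directory.
csv_path <- tempfile(fileext = ".csv")
write.csv(students, csv_path, row.names = FALSE)

# Read the file back and remove it.
students_back <- read.csv(csv_path)
unlink(csv_path)

# Compare column names and row counts of the original and re-imported data.
print(identical(names(students), names(students_back)))
print(nrow(students_back) == nrow(students))
```

Setting row.names = FALSE on export, as in the example above, avoids an extra column of row numbers appearing on re-import.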

Sunny Chan Zi Yang_S2022037

11/11/2020
