Assignment 1

1.What is a data frame?

A data frame is a two-dimensional data structure used to store tabular data in R. Its columns are the variables of a data set and its rows are the observations. Each column can hold a different class of object.

You can regard a data frame as a special case of a list in which every component has the same length.
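The list view of a data frame can be checked directly. The following is a minimal sketch using a made-up data frame called people (the name and its values are assumptions for illustration only); it shows that the object is a list, that all components have the same length, and that each column keeps its own class.

```r
# A small, hypothetical data frame with columns of different classes
people <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  age    = c(28, 35, 41),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text as character in every R version
)

# A data frame is stored as a list of columns
print(is.list(people))

# Every component (column) has the same length
print(lengths(people))

# Each column can have a different class
print(sapply(people, class))
```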

The first six rows of the built-in example data frame mtcars are printed below.

# Load the data
data(mtcars)
# Print the first six rows of the built-in data frame mtcars
print(head(mtcars))

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2.Have a look at the data frame

It is common to work with a large data set in the form of a data frame, so you will first want a clear understanding of its elements and structure.

Showing a small part of the data set is often useful. The functions head(), tail(), str() and summary() help with this.

The function head() shows the first 6 rows of a data frame by default. Here we print the first 6 rows of the data frame mtcars using head().

print(head(mtcars))

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The function tail() shows the last 6 rows of a data frame by default. Here we print the last 6 rows of mtcars using tail().

print(tail(mtcars))

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
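Six rows is only the default. As a small sketch, both head() and tail() accept an argument n that sets how many rows to return; the variable names first_three and last_two are chosen here for illustration.

```r
# Load the built-in data set
data(mtcars)

# Ask for a specific number of rows with the n argument
first_three <- head(mtcars, n = 3)
last_two    <- tail(mtcars, n = 2)

print(first_three)
print(last_two)

# dim() gives the number of rows and columns of the full data frame
print(dim(mtcars))
```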

The function str() shows the structure of a data frame, including the following information:

The total number of observations (rows)
The total number of variables (columns)
A full list of the column names
The data type of each column
The first few observations of each column

Here we print the structure of the data frame mtcars.

# str() prints the structure itself and returns NULL, so it is not wrapped in print()
str(mtcars)

## 'data.frame':  32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The function summary() shows the minimum, first quartile, median, mean, third quartile and maximum of each numeric variable. Here we print the summary of the first four columns of the data frame mtcars.

print(summary(mtcars[,1:4]))

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0

3.Creating a data frame

Instead of using a built-in data frame, you will probably want to create your own. Here are three methods:

Method one: read.csv() or read.table() reads a file and returns a data frame. We use read.csv() to read a local file HelloWorld.csv into a data frame dataframe_1 and then print it.
```
dataframe_1 <- read.csv("C:/WQD7004/HelloWorld.csv")
print(dataframe_1)
```
```
##       X col_1   col_2
## 1 row_1     1 "hello"
## 2 row_2     2 "world"
```
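A column named X typically appears when the first header field of the CSV file is empty, which is common when the first column holds row labels. The sketch below is an assumption-laden illustration: it writes a small temporary file (not the author's HelloWorld.csv) and shows how the row.names argument of read.csv() turns that first column into row names instead.

```r
# Write a small, hypothetical CSV whose first header field is empty
csv_path <- tempfile(fileext = ".csv")
writeLines(c(",col_1,col_2",
             "row_1,1,hello",
             "row_2,2,world"), csv_path)

# Default: the unnamed first column is read as a regular column called "X"
with_x <- read.csv(csv_path)
print(names(with_x))

# row.names = 1 uses the first column as row names instead
with_names <- read.csv(csv_path, row.names = 1)
print(with_names)

# Remove the temporary file
unlink(csv_path)
```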
Method two: A data frame can also be coerced from other types of objects, such as lists. Here we coerce a list whose components all have the same length to a data frame dataframe_2 and then print it.
```
dataframe_2 <- as.data.frame(list(a = 1:4, b = 100:103))
print(dataframe_2)
```
```
##   a   b
## 1 1 100
## 2 2 101
## 3 3 102
## 4 4 103
```
Method three: Explicitly create a data frame with the function data.frame(). The vectors you pass as arguments become the columns of the data frame. The columns all have the same length, but they can contain different types of data. Here we create a data frame dataframe_3 using data.frame() and then print it.
```
dataframe_3 <- data.frame(name = c('Andrew','Zada', 'Fanny','Iris', 'Jack'), ID = 10:14, Hobby = c('movie', 'music','dance','swim','reading'),From_China = c(TRUE, FALSE, TRUE, FALSE, TRUE))
print(dataframe_3)
```
```
##     name ID   Hobby From_China
## 1 Andrew 10   movie       TRUE
## 2   Zada 11   music      FALSE
## 3  Fanny 12   dance       TRUE
## 4   Iris 13    swim      FALSE
## 5   Jack 14 reading       TRUE
```
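How text columns are stored depends on the stringsAsFactors argument of data.frame(). Since R 4.0.0 the default is FALSE, so text stays as character; in earlier versions the default was TRUE and text became a factor. The sketch below sets the argument explicitly so that it behaves the same in any version; the data frame names as_text and as_factor are hypothetical.

```r
hobbies <- c("movie", "music", "movie")

# Text kept as plain character strings
as_text <- data.frame(hobby = hobbies, stringsAsFactors = FALSE)
print(class(as_text$hobby))

# Text converted to a factor with one level per distinct value
as_factor <- data.frame(hobby = hobbies, stringsAsFactors = TRUE)
print(class(as_factor$hobby))
print(levels(as_factor$hobby))
```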

4.Explore your own created data frame

Apart from head(), tail(), str() and summary(), you may want to explore your own data frame further, and the following functions are helpful.

names() returns the names of all columns. Here we print the column names of dataframe_3 using names().
```
print(names(dataframe_3))
```
```
## [1] "name"       "ID"         "Hobby"      "From_China"
```
row.names() returns the names of all rows. Here we print the row names of dataframe_3 using row.names().
```
print(row.names(dataframe_3))
```
```
## [1] "1" "2" "3" "4" "5"
```
ncol() returns the number of columns. Here we print the number of columns of dataframe_3 using ncol().
```
print(ncol(dataframe_3))
```
```
## [1] 4
```
nrow() returns the number of rows. Here we print the number of rows of dataframe_3 using nrow().
```
print(nrow(dataframe_3))
```
```
## [1] 5
```
typeof() returns the (R internal) type or storage mode of any object. Here we print typeof(dataframe_3); the result is "list" because a data frame is stored as a list of columns.
```
print(typeof(dataframe_3))
```
```
## [1] "list"
```
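typeof() reports how an object is stored, while class() reports how R treats it. As a brief sketch on a made-up data frame called scores (an assumption for illustration), the two can be compared, together with dim() and the per-column storage types.

```r
# A small, hypothetical data frame
scores <- data.frame(id = 1:3, score = c(90.5, 72, 88))

print(typeof(scores))          # storage mode of the whole object
print(class(scores))           # class used for printing, indexing, etc.
print(dim(scores))             # rows and columns in one call
print(sapply(scores, typeof))  # storage mode of each column
```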

5.Change the names of rows or columns of a data frame

For dataframe_3, you may notice that the row names are just the automatic numbers 1 to 5 rather than descriptive strings, and that the column names are formatted inconsistently, such as name and ID. You may want to change them.

Change the names of rows: Assign new row names to a data frame using the function row.names(). We reassign the row names of dataframe_3 using row.names() and then print dataframe_3.

row.names(dataframe_3) <- c('row_1', 'row_2', 'row_3', 'row_4', 'row_5')
print(dataframe_3)

##         name ID   Hobby From_China
## row_1 Andrew 10   movie       TRUE
## row_2   Zada 11   music      FALSE
## row_3  Fanny 12   dance       TRUE
## row_4   Iris 13    swim      FALSE
## row_5   Jack 14 reading       TRUE

Change the names of columns: Assign new column names to a data frame using the function names(). We reassign the column names of dataframe_3 using names() and then print dataframe_3.

names(dataframe_3) <- c("NAME","ID","HOBBY","FROM_CHINA")
print(dataframe_3)

##         NAME ID   HOBBY FROM_CHINA
## row_1 Andrew 10   movie       TRUE
## row_2   Zada 11   music      FALSE
## row_3  Fanny 12   dance       TRUE
## row_4   Iris 13    swim      FALSE
## row_5   Jack 14 reading       TRUE

Change a single name using a logical vector as the index. We rename the column HOBBY to INTEREST and then print dataframe_3.

names(dataframe_3)[names(dataframe_3) == 'HOBBY'] <- 'INTEREST'
print(dataframe_3)

##         NAME ID INTEREST FROM_CHINA
## row_1 Andrew 10    movie       TRUE
## row_2   Zada 11    music      FALSE
## row_3  Fanny 12    dance       TRUE
## row_4   Iris 13     swim      FALSE
## row_5   Jack 14  reading       TRUE

6.Accessing the components of a data frame

Method 1: Access the columns of a data frame as you would access the components of a list

Example 1: Use the [ operator to select the column NAME of dataframe_3. We print it and its class, and see that the result is a data frame.
```
print(dataframe_3['NAME'])
```
```
##         NAME
## row_1 Andrew
## row_2   Zada
## row_3  Fanny
## row_4   Iris
## row_5   Jack
```
```
print(class(dataframe_3['NAME']))
```
```
## [1] "data.frame"
```
Example 2: Use the [[ operator to select the column NAME. We print it and its class, and find that the result is a vector.
```
print(dataframe_3[['NAME']])
```
```
## [1] "Andrew" "Zada"   "Fanny"  "Iris"   "Jack"
```
```
print(class(dataframe_3[['NAME']]))
```
```
## [1] "character"
```
Example 3: Use the $ operator to select the column NAME and then print it and its class. Again the result is a vector.
```
print(dataframe_3$NAME)
```
```
## [1] "Andrew" "Zada"   "Fanny"  "Iris"   "Jack"
```
```
print(class(dataframe_3$NAME))
```
```
## [1] "character"
```
Example 4: Here we access the value in the second row of column NAME and then print it.
```
print(dataframe_3$NAME[2])
```
```
## [1] "Zada"
```
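One difference between $ and [[ is worth knowing: on a data frame, $ falls back to partial matching of column names, while [[ requires an exact name unless exact = FALSE is given. The sketch below uses a hypothetical data frame staff to illustrate this; it is an illustration of the behaviour, not a recommendation to rely on it.

```r
# A small, hypothetical data frame
staff <- data.frame(name = c("Ann", "Bob"), id = 1:2,
                    stringsAsFactors = FALSE)

# "na" is not a column, but $ partially matches it to "name"
print(staff$na)

# [[ needs the exact name by default, so this returns NULL
print(staff[["na"]])

# Partial matching with [[ has to be requested explicitly
print(staff[["na", exact = FALSE]])
```

Writing the full column name avoids surprises if a similarly named column is added later.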
Method 2: Access like a matrix by providing indices for rows and columns

Example 1: Select the second row of dataframe_3 and print it and its class. The result is a data frame.
```
print(dataframe_3[2,])
```
```
##       NAME ID INTEREST FROM_CHINA
## row_2 Zada 11    music      FALSE
```
```
print(class(dataframe_3[2,]))
```
```
## [1] "data.frame"
```
Example 2: Select the second column of dataframe_3 and print it and its class. The result is a vector.
```
print(dataframe_3[,2])
```
```
## [1] 10 11 12 13 14
```
```
print(class(dataframe_3[,2]))
```
```
## [1] "integer"
```
To get a data frame back instead of a vector, set drop = FALSE. Here we set the argument drop to FALSE in dataframe_3[,2], print the result and its class, and see that it is a data frame.
```
print(dataframe_3[,2,drop = FALSE])
```
```
##       ID
## row_1 10
## row_2 11
## row_3 12
## row_4 13
## row_5 14
```
```
print(class(dataframe_3[,2,drop = FALSE]))
```
```
## [1] "data.frame"
```
Example 3: Select the value in the second row and column NAME of dataframe_3 and then print it.
```
print(dataframe_3[2,'NAME'])
```
```
## [1] "Zada"
```

Advanced method 1: Access using logical vectors as indices

Example 1: We use two logical vectors, one for the rows and one for the columns, to select components of dataframe_3 and then print the result.

print(dataframe_3[c(TRUE, FALSE, TRUE, TRUE, FALSE),c(TRUE, TRUE, TRUE, FALSE)])

##         NAME ID INTEREST
## row_1 Andrew 10    movie
## row_3  Fanny 12    dance
## row_4   Iris 13     swim

Example 2: We select all rows and give a 2-element logical vector as the column index. It is recycled to a 4-element logical vector, so columns 1 and 3 are selected.

print(dataframe_3[,c(TRUE, FALSE)])

##         NAME INTEREST
## row_1 Andrew    movie
## row_2   Zada    music
## row_3  Fanny    dance
## row_4   Iris     swim
## row_5   Jack  reading

Example 3: We select all rows where ID > 12 and print them.

print(dataframe_3[dataframe_3$ID > 12,])

##       NAME ID INTEREST FROM_CHINA
## row_4 Iris 13     swim      FALSE
## row_5 Jack 14  reading       TRUE
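Logical conditions can be combined with & and |, but missing values need care: an NA in the logical index produces a row filled with NA. The following sketch, on a hypothetical data frame members, shows the effect and one common way around it, wrapping the condition in which() so that only positions that are TRUE are kept.

```r
# A small, hypothetical data frame with one missing id
members <- data.frame(name  = c("Ann", "Bob", "Cy", "Di"),
                      id    = c(10L, 11L, NA, 13L),
                      local = c(TRUE, FALSE, TRUE, TRUE),
                      stringsAsFactors = FALSE)

# Combine two conditions; the NA id makes the third index value NA
both <- members[members$id > 10 & members$local, ]
print(both)   # contains an extra row made entirely of NA

# which() returns only the positions where the condition is TRUE
safe <- members[which(members$id > 10 & members$local), ]
print(safe)
```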

Advanced method 2: Use the function subset() to select the rows of a data frame for which a condition is true. Here we subset dataframe_3 to the rows where ID is above 12 and then print the result.
```
print(subset(dataframe_3, subset = ID > 12))
```
```
##       NAME ID INTEREST FROM_CHINA
## row_4 Iris 13     swim      FALSE
## row_5 Jack 14  reading       TRUE
```
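subset() can also choose columns through its select argument, and column names can be written without the data frame prefix because the condition is evaluated inside the data frame. A short sketch on the built-in mtcars data (the variable name cars_small is an assumption):

```r
# Load the built-in data set
data(mtcars)

# Keep the rows with mpg above 30 and only the columns mpg and cyl
cars_small <- subset(mtcars, mpg > 30, select = c(mpg, cyl))
print(cars_small)
```

Unlike logical indexing with [, subset() drops rows where the condition is NA.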

Now that you understand how to access the components of a data frame, here are some exercises for practice.

Exercise 1: Retrieve the value in row 1, column 2 and then print it

print(dataframe_3[1,2])

## [1] 10

Exercise 2: Retrieve the value in row 1, column 2 using the row and column names and print it

print(dataframe_3['row_1','ID'])

## [1] 10

Exercise 3: Retrieve a single row, e.g. row_4, and print it

print(dataframe_3['row_4',])

##       NAME ID INTEREST FROM_CHINA
## row_4 Iris 13     swim      FALSE

Exercise 4: Retrieve two rows and then print them

print(dataframe_3[c('row_1','row_3'),])

##         NAME ID INTEREST FROM_CHINA
## row_1 Andrew 10    movie       TRUE
## row_3  Fanny 12    dance       TRUE

Exercise 5: Retrieve a column as a vector and print it and its class

print(dataframe_3[['INTEREST']])

## [1] "movie"   "music"   "dance"   "swim"    "reading"

print(class(dataframe_3[['INTEREST']]))

## [1] "character"

Exercise 6: Use drop = FALSE when retrieving a column by name, then print the result and its class

print(dataframe_3[,'INTEREST', drop = FALSE])

##       INTEREST
## row_1    movie
## row_2    music
## row_3    dance
## row_4     swim
## row_5  reading

print(class(dataframe_3[,'INTEREST', drop = FALSE]))

## [1] "data.frame"

7.Modifying, adding and removing a component of a data frame

Modify a component of a data frame: Reassign it, just as you would modify an element of a matrix. Here we change the value in the first row of column NAME to the string ZHONGLIANG and then print dataframe_3.

dataframe_3[1, 'NAME'] <- 'ZHONGLIANG'
print(dataframe_3)

##             NAME ID INTEREST FROM_CHINA
## row_1 ZHONGLIANG 10    movie       TRUE
## row_2       Zada 11    music      FALSE
## row_3      Fanny 12    dance       TRUE
## row_4       Iris 13     swim      FALSE
## row_5       Jack 14  reading       TRUE

Add a component of a data frame: Use rbind() to add a row and cbind() to add a column

Add a new row to dataframe_3 using rbind() and then print it

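# Build the new row as a one-row data frame so that each column keeps its own type
print(rbind(dataframe_3, data.frame(NAME = 'Trump', ID = 100L, INTEREST = 'joke', FROM_CHINA = FALSE, row.names = 'row_6')))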

##             NAME  ID INTEREST FROM_CHINA
## row_1 ZHONGLIANG  10    movie       TRUE
## row_2       Zada  11    music      FALSE
## row_3      Fanny  12    dance       TRUE
## row_4       Iris  13     swim      FALSE
## row_5       Jack  14  reading       TRUE
## row_6      Trump 100     joke      FALSE
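A caveat when adding rows: a vector built with c() can hold only one type, so mixing text, numbers and logical values turns everything into character, and rbind() then converts the affected columns of the result to character as well. The printed table looks the same, but the column types differ. The sketch below demonstrates this on a hypothetical data frame team and contrasts it with passing a one-row data frame.

```r
# A small, hypothetical data frame with an integer id column
team <- data.frame(name = c("Ann", "Bob"), id = c(1L, 2L),
                   stringsAsFactors = FALSE)

# c("Cy", 3) is a character vector, so id silently becomes character
coerced <- rbind(team, c("Cy", 3))
print(class(coerced$id))

# A one-row data frame keeps each column's type
typed <- rbind(team, data.frame(name = "Cy", id = 3L,
                                stringsAsFactors = FALSE))
print(class(typed$id))
```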

Add a new column to dataframe_3 using cbind() and print it

print(cbind(dataframe_3, GRADE = c(100, 87,67,22, 99)))

##             NAME ID INTEREST FROM_CHINA GRADE
## row_1 ZHONGLIANG 10    movie       TRUE   100
## row_2       Zada 11    music      FALSE    87
## row_3      Fanny 12    dance       TRUE    67
## row_4       Iris 13     swim      FALSE    22
## row_5       Jack 14  reading       TRUE    99

Print dataframe_3 again to check whether rbind() and cbind() changed it

print(dataframe_3)

##             NAME ID INTEREST FROM_CHINA
## row_1 ZHONGLIANG 10    movie       TRUE
## row_2       Zada 11    music      FALSE
## row_3      Fanny 12    dance       TRUE
## row_4       Iris 13     swim      FALSE
## row_5       Jack 14  reading       TRUE

Note: Adding a new column through list-like assignment also works, since a data frame is implemented as a list. We add a new column AGE and print dataframe_3.

dataframe_3$AGE <- c(24, 31, 16, 42, 50)
print(dataframe_3)

##             NAME ID INTEREST FROM_CHINA AGE
## row_1 ZHONGLIANG 10    movie       TRUE  24
## row_2       Zada 11    music      FALSE  31
## row_3      Fanny 12    dance       TRUE  16
## row_4       Iris 13     swim      FALSE  42
## row_5       Jack 14  reading       TRUE  50

From the code above, we can see that rbind() and cbind() return a new data frame and leave the original unchanged, whereas the assignment operation modifies dataframe_3 itself.
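To keep a column added with cbind(), the result has to be assigned back to a variable. A minimal sketch on a hypothetical data frame grades, also showing the equivalent [[ assignment form:

```r
# A small, hypothetical data frame
grades <- data.frame(name = c("Ann", "Bob"), stringsAsFactors = FALSE)

# cbind() returns a new object; grades itself still has one column
with_grade  <- cbind(grades, grade = c(90, 80))
cols_before <- ncol(grades)
print(cols_before)

# Assigning the result back makes the change permanent
grades <- cbind(grades, grade = c(90, 80))

# [[<- adds a column in place, like $<-
grades[["pass"]] <- grades$grade >= 85
print(grades)
```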

Remove a component of a data frame: You could also remove a column or row of a data frame.

Removing a column from a data frame works the same way as removing a component from a list: assign NULL to it. Here we remove column ID and then print dataframe_3.

dataframe_3$ID <- NULL
print(dataframe_3)

##             NAME INTEREST FROM_CHINA AGE
## row_1 ZHONGLIANG    movie       TRUE  24
## row_2       Zada    music      FALSE  31
## row_3      Fanny    dance       TRUE  16
## row_4       Iris     swim      FALSE  42
## row_5       Jack  reading       TRUE  50

Remove a row from a data frame through reassignment. Here we select all rows except the first, assign the result back to dataframe_3, and then print dataframe_3.

dataframe_3 <- dataframe_3[-1,]
print(dataframe_3)

##        NAME INTEREST FROM_CHINA AGE
## row_2  Zada    music      FALSE  31
## row_3 Fanny    dance       TRUE  16
## row_4  Iris     swim      FALSE  42
## row_5  Jack  reading       TRUE  50
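The same ideas extend to removing several columns by name or removing rows by condition. The sketch below works on a hypothetical data frame roster; setdiff() is used to build the list of columns to keep, and drop = FALSE keeps the result a data frame even when a single column remains.

```r
# A small, hypothetical data frame
roster <- data.frame(name = c("Ann", "Bob", "Cy"),
                     age  = c(28, 35, 41),
                     city = c("KL", "Ipoh", "Penang"),
                     stringsAsFactors = FALSE)

# Drop the columns age and city by keeping every other column
kept_cols <- roster[, setdiff(names(roster), c("age", "city")), drop = FALSE]
print(kept_cols)

# Drop rows by keeping only those that satisfy a condition
kept_rows <- roster[roster$age <= 40, ]
print(kept_rows)
```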

8.Sorting a data frame

You have probably noticed that all values in column AGE are numeric, so you can sort dataframe_3 by the column AGE.
The function order() returns the positions of the elements in sorted order, that is, the index vector that would sort its argument from smallest to largest.

Print the column AGE of dataframe_3
```
print(dataframe_3$AGE)
```
```
## [1] 31 16 42 50
```
Compute the ordering of the values in column AGE of dataframe_3 and then print it
```
AGE_order <- order(dataframe_3$AGE)
print(AGE_order)
```
```
## [1] 2 1 3 4
```
The first element is 2 because the smallest age, 16, sits at position 2 of the column AGE; the remaining elements give the positions of the next-smallest ages in turn. Using this vector as the row index rearranges dataframe_3 so that it begins with the youngest person and ends with the oldest. We then print the ordered dataframe_3.
```
dataframe_3 <- dataframe_3[AGE_order,]
print(dataframe_3)
```
```
##        NAME INTEREST FROM_CHINA AGE
## row_3 Fanny    dance       TRUE  16
## row_2  Zada    music      FALSE  31
## row_4  Iris     swim      FALSE  42
## row_5  Jack  reading       TRUE  50
```
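order() is easy to confuse with rank(): order() gives the positions to pick, in sorted order, while rank() gives the rank of each element. The sketch below uses a hypothetical vector ages and data frame people where the two results differ, and then shows descending order and sorting by two keys.

```r
# A hypothetical vector where order() and rank() give different results
ages <- c(42, 16, 50, 31)
print(order(ages))  # positions of the smallest, second smallest, ... value
print(rank(ages))   # rank of each value in its original position

people <- data.frame(name  = c("Ann", "Bob", "Cy", "Di"),
                     age   = ages,
                     group = c("b", "a", "a", "b"),
                     stringsAsFactors = FALSE)

# Oldest first
oldest_first <- people[order(people$age, decreasing = TRUE), ]
print(oldest_first)

# Sort by group, then by age from oldest to youngest within each group
by_group <- people[order(people$group, -people$age), ]
print(by_group)
```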