Homework 1

Use the dataset iris to answer the following:

Create two subsets (df1 and df2) of iris so that df1 contains flowers with above average Sepal.Length. df2 contains the flowers which are not in df1.

View(iris)
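# Mean sepal length across all 150 flowers
sepal_length_mean <- mean(iris$Sepal.Length)
# df1: flowers with above-average Sepal.Length
df1 <- subset(iris, Sepal.Length > sepal_length_mean)
View(df1)
# df2: the remaining flowers (at or below the mean)
df2 <- subset(iris, Sepal.Length <= sepal_length_mean)
View(df2)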
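
As a quick sanity check, the two subsets should partition iris: together they cover every row, and no row appears in both. The sketch below uses a logical index as an equivalent alternative to subset(); the names above_mean, df1_alt and df2_alt are introduced here only for illustration.

```r
# Split iris on the mean Sepal.Length using a logical index
sepal_length_mean <- mean(iris$Sepal.Length)
above_mean <- iris$Sepal.Length > sepal_length_mean

df1_alt <- iris[above_mean, ]    # above-average flowers
df2_alt <- iris[!above_mean, ]   # everything else

# The two subsets should cover all rows without overlap
nrow(df1_alt) + nrow(df2_alt) == nrow(iris)
length(intersect(rownames(df1_alt), rownames(df2_alt))) == 0
```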

For each Species, create a plot to show the relationship between Sepal.Length and Petal.Length. What is your conclusion?

plot(iris$Sepal.Length, iris$Petal.Length)

Figure 1: The points fall into two clear clusters. In the lower-left cluster, Petal.Length hardly changes as Sepal.Length increases. In the upper-right cluster, Petal.Length rises with Sepal.Length, and the relationship appears roughly linear. Because the plot pools all three species, the clusters are not labelled; the Petal.Length ranges reported later in this document (1.0 to 1.9 for setosa, 3.0 and above for versicolor and virginica) indicate that the lower-left cluster is setosa.
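
Since the question asks for one plot per species, a per-species version might look like the following sketch. It draws three panels side by side and adds a per-species correlation as a numeric companion; the name cor_by_species is illustrative and not part of the original answer.

```r
# One scatter plot per species, drawn side by side
op <- par(mfrow = c(1, 3))
for (sp in levels(iris$Species)) {
  d <- iris[iris$Species == sp, ]
  plot(d$Sepal.Length, d$Petal.Length,
       main = sp, xlab = "Sepal.Length", ylab = "Petal.Length")
}
par(op)  # restore the previous plotting layout

# Pearson correlation between the two measurements within each species
cor_by_species <- sapply(split(iris, iris$Species),
                         function(d) cor(d$Sepal.Length, d$Petal.Length))
cor_by_species
```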

For each Species, create a plot to show the relationship between Sepal.Width and Petal.Width. What is your conclusion?

plot(iris$Sepal.Width, iris$Petal.Width)

Figure 2: The points again fall into two clusters, one in the top left and one in the bottom right. In the bottom-right cluster, Sepal.Width is much larger than Petal.Width, and Petal.Width hardly increases as Sepal.Width increases.
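
One way to tie the clusters to species is to colour the points, sketched below. The colour choices and the names species_cols and group_means are assumptions made for illustration; the per-species means simply locate each group on the plot.

```r
# Single scatter plot with one colour per species
species_cols <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")
plot(iris$Sepal.Width, iris$Petal.Width,
     col = species_cols[as.character(iris$Species)], pch = 19,
     xlab = "Sepal.Width", ylab = "Petal.Width")
legend("topright", legend = names(species_cols),
       col = species_cols, pch = 19)

# Mean Sepal.Width and Petal.Width for each species
group_means <- aggregate(cbind(Sepal.Width, Petal.Width) ~ Species,
                         data = iris, FUN = mean)
group_means
```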

Using the iris dataset, explain the use of these functions: cbind, rbind, and merge.

The use of the function cbind():

df3 <- cbind(iris[, 1:2], iris[, 3:5])
df3

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

cbind() combines two data sets side by side, column by column. The inputs must have the same number of rows; the number of columns may be the same or different. In the example, iris[, 1:2] and iris[, 3:5] both have 150 rows but different numbers of columns (2 and 3), so the result has 150 rows and 5 columns.
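
The row-count requirement can be checked directly. In this sketch the names left, right and mismatch are illustrative. The failing case uses 100 rows on purpose: R can recycle a shorter input when the longer length is a whole multiple of it, and 150 is not a multiple of 100.

```r
left  <- iris[, 1:2]   # 150 rows, 2 columns
right <- iris[, 3:5]   # 150 rows, 3 columns
dim(cbind(left, right))   # rows unchanged, columns added together

# 150 rows next to 100 rows cannot be lined up, so cbind() signals an error;
# tryCatch() captures the message instead of stopping the script
mismatch <- tryCatch(cbind(left, iris[1:100, 3:5]),
                     error = function(e) conditionMessage(e))
mismatch
```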

The use of the function rbind():

df4 <- subset(iris, Species == "setosa")
df5 <- subset(iris, Species == "versicolor")
result1 <- rbind(df4, df5)
result1

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor

rbind() stacks two data sets on top of each other, row by row. The inputs must have the same columns; the number of rows may be the same or different. In the example, df4 and df5 have the same 5 columns (and, as it happens, 50 rows each), so the result has 100 rows and 5 columns.
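
The column requirement can be checked in the same way. This is a minimal sketch; the names setosa, versicolor and mismatch are illustrative.

```r
setosa     <- subset(iris, Species == "setosa")
versicolor <- subset(iris, Species == "versicolor")
dim(rbind(setosa, versicolor))   # rows added together, columns unchanged

# Dropping a column from one input breaks the column match, so rbind()
# signals an error; tryCatch() captures the message
mismatch <- tryCatch(rbind(setosa, versicolor[, 1:4]),
                     error = function(e) conditionMessage(e))
mismatch
```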

The use of the function merge():

# ?merge()

Merge two data frames by common columns or row names, or do other versions of database join operations.

# Take two subsets of iris that share the column Petal.Width.
df6 <- iris[c(1, 7, 6, 10, 24), 1:4]
df7 <- iris[c(2, 13, 18, 44), 4:5]
# Inner join with merge(): keep only rows whose Petal.Width appears in both data frames.
df8 <- merge(df6, df7, by.x="Petal.Width", by.y="Petal.Width")
df8

##   Petal.Width Sepal.Length Sepal.Width Petal.Length Species
## 1         0.1          4.9         3.1          1.5  setosa
## 2         0.2          5.1         3.5          1.4  setosa
## 3         0.3          4.6         3.4          1.4  setosa

by.x and by.y tell merge() which column of df6 (by.x) and which column of df7 (by.y) to use as the key. Rows whose key values match are combined into one row; by default, rows without a match are discarded. If only one of the two is given, the other falls back to by, which defaults to the column names the two data frames share. An error is raised only when by.x and by.y end up naming different numbers of columns.
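
Because Petal.Width has the same name in both data frames, the key can also be given once through by, or left for merge() to infer from the shared column names. A minimal sketch (the names by_both, by_short and by_auto are illustrative):

```r
df6 <- iris[c(1, 7, 6, 10, 24), 1:4]
df7 <- iris[c(2, 13, 18, 44), 4:5]

by_both  <- merge(df6, df7, by.x = "Petal.Width", by.y = "Petal.Width")
by_short <- merge(df6, df7, by = "Petal.Width")  # same key name on both sides
by_auto  <- merge(df6, df7)                      # key inferred from shared names

# All three calls should describe the same inner join
isTRUE(all.equal(by_both, by_short))
isTRUE(all.equal(by_both, by_auto))
```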

# Full outer join (all = TRUE): also keep unmatched rows from both data frames, sorted by the key column.
df9 <- merge(df6, df7, by.x="Petal.Width", by.y="Petal.Width", all=TRUE, sort=TRUE)
df9

##   Petal.Width Sepal.Length Sepal.Width Petal.Length Species
## 1         0.1          4.9         3.1          1.5  setosa
## 2         0.2          5.1         3.5          1.4  setosa
## 3         0.3          4.6         3.4          1.4  setosa
## 4         0.4          5.4         3.9          1.7    <NA>
## 5         0.5          5.1         3.3          1.7    <NA>
## 6         0.6           NA          NA           NA  setosa

# Full outer join (all = TRUE) again, this time without sorting by the key column.
df10 <- merge(df6, df7, by.x="Petal.Width", by.y="Petal.Width", all=TRUE, sort=FALSE)
df10

##   Petal.Width Sepal.Length Sepal.Width Petal.Length Species
## 1         0.2          5.1         3.5          1.4  setosa
## 2         0.3          4.6         3.4          1.4  setosa
## 3         0.1          4.9         3.1          1.5  setosa
## 4         0.4          5.4         3.9          1.7    <NA>
## 5         0.5          5.1         3.3          1.7    <NA>
## 6         0.6           NA          NA           NA  setosa
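
Between the inner join and the full outer join sit the one-sided joins, which merge() exposes through all.x and all.y. The sketch below reuses the same two subsets; keep_df6 and keep_df7 are illustrative names.

```r
df6 <- iris[c(1, 7, 6, 10, 24), 1:4]
df7 <- iris[c(2, 13, 18, 44), 4:5]

# Keep every row of df6; Species is NA where df7 has no matching Petal.Width
keep_df6 <- merge(df6, df7, by = "Petal.Width", all.x = TRUE)

# Keep every row of df7; the df6 columns are NA where df6 has no match
keep_df7 <- merge(df6, df7, by = "Petal.Width", all.y = TRUE)

nrow(keep_df6)
nrow(keep_df7)
```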

Explain the use of apply, lapply, tapply, and aggregate functions using iris dataset. Use the help available in RStudio to answer this question.

# ?apply()

Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.

The use of the function apply():

result2 <- apply(iris[1:50, 1:4], 2, mean)
result2

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.006        3.428        1.462        0.246

This takes the four numeric columns for the first species (setosa, rows 1 to 50) and computes the mean of each column. The fifth column, Species, is left out because it is a factor rather than a numeric column.
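
The reason the factor column gets in the way is that apply() works on a matrix, and a matrix holds a single type. The sketch below shows that coercion and, as a side note, that colMeans() gives the same column means; setosa_num, via_apply and via_colMeans are illustrative names.

```r
setosa_num <- iris[1:50, 1:4]

via_apply    <- apply(setosa_num, 2, mean)
via_colMeans <- colMeans(setosa_num)   # dedicated helper for column means
all.equal(via_apply, via_colMeans)

# With Species included, the whole matrix becomes character strings,
# so a numeric summary such as mean() no longer applies
is.character(as.matrix(iris[1:50, ]))
```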

# For the first 10 versicolor rows (51 to 60), find the largest value in each row across columns 2 to 4.
result3 <- apply(iris[51:60, 2:4], 1, max)
result3

##  51  52  53  54  55  56  57  58  59  60 
## 4.7 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9

# Standard deviation of each numeric column for setosa (rows 1 to 50).
result4 <- apply(iris[1:50, 1:4], 2, sd)
result4

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.3524897    0.3790644    0.1736640    0.1053856

# ?lapply()

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

The use of the function lapply():

# lapply() returns its results as a list.
result5 <- lapply(iris[1:50, 1:4], mean)
result5

## $Sepal.Length
## [1] 5.006
## 
## $Sepal.Width
## [1] 3.428
## 
## $Petal.Length
## [1] 1.462
## 
## $Petal.Width
## [1] 0.246

# Variance of columns 1, 3, and 4 for versicolor (rows 51 to 100), returned as a list.
result6 <- lapply(iris[51:100, c(1, 3, 4)], var)
result6

## $Sepal.Length
## [1] 0.2664327
## 
## $Petal.Length
## [1] 0.2208163
## 
## $Petal.Width
## [1] 0.03910612
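
When a list is less convenient than a vector, sapply() is the closely related function that tries to simplify the result. A minimal comparison (as_list and as_vector are illustrative names):

```r
as_list   <- lapply(iris[1:50, 1:4], mean)   # list with one element per column
as_vector <- sapply(iris[1:50, 1:4], mean)   # simplified to a named numeric vector

is.list(as_list)
is.numeric(as_vector)
all.equal(unlist(as_list), as_vector)   # same values, different container
```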

# ?tapply()

Apply a function to each cell of a ragged array, that is to each (non-empty) group of values or data rows given by a unique combination of the levels of certain factors.

The use of the function tapply():

# For Petal.Width of setosa (rows 1 to 50), rep(1:10, 5) assigns the rows to 10 groups of 5 (rows 1, 11, 21, 31, 41 form group 1, and so on); compute the mean of each group.
result7 <- tapply(iris[1:50, 4], rep(1:10, 5), mean)
result7

##    1    2    3    4    5    6    7    8    9   10 
## 0.22 0.30 0.16 0.32 0.24 0.30 0.30 0.20 0.22 0.20
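
The grouping vector alone decides which rows are averaged together, so it is worth looking at it directly. rep(1:10, 5) cycles through the labels 1 to 10 and yields ten groups of five rows, whereas rep(1:5, each = 10) would yield five blocks of ten consecutive rows. The names petal_width, cyclic and blocks below are illustrative.

```r
petal_width <- iris[1:50, 4]

cyclic <- rep(1:10, 5)          # 1, 2, ..., 10, 1, 2, ...: ten groups of five rows
blocks <- rep(1:5, each = 10)   # ten 1s, ten 2s, ...: five groups of ten rows

table(cyclic)   # group sizes for the cyclic labelling
table(blocks)   # group sizes for the block labelling

tapply(petal_width, cyclic, mean)   # ten means, the grouping used for result7
tapply(petal_width, blocks, mean)   # five means, one per block of ten rows
```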

# Median Sepal.Width for each of the three species.
result8 <- tapply(iris$Sepal.Width, iris$Species, median)
result8

##     setosa versicolor  virginica 
##        3.4        2.8        3.0

# Quantiles of Petal.Length for each of the three species.
result9 <- tapply(iris$Petal.Length, iris$Species, quantile)
result9

## $setosa
##    0%   25%   50%   75%  100% 
## 1.000 1.400 1.500 1.575 1.900 
## 
## $versicolor
##   0%  25%  50%  75% 100% 
## 3.00 4.00 4.35 4.60 5.10 
## 
## $virginica
##    0%   25%   50%   75%  100% 
## 4.500 5.100 5.550 5.875 6.900
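
Because quantile() returns five values per group, tapply() hands back a list here. If a compact table is preferred, one option is to split the column by species and let sapply() assemble a matrix, as in this sketch (quant_matrix is an illustrative name):

```r
# Rows are quantiles, columns are species
quant_matrix <- sapply(split(iris$Petal.Length, iris$Species), quantile)
quant_matrix

# Medians for the three species, read off one row
quant_matrix["50%", ]
```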

# ?aggregate()

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

The use of the function aggregate():

# compute mean Sepal.Length per Species
sepallength.per.species <- aggregate(iris$Sepal.Length, by=list(iris$Species), FUN=mean)
sepallength.per.species

##      Group.1     x
## 1     setosa 5.006
## 2 versicolor 5.936
## 3  virginica 6.588
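
The generic column names Group.1 and x come from passing bare vectors. aggregate() also accepts a formula, which keeps the original column names; the sketch below shows that form (mean_by_species is an illustrative name).

```r
# Formula interface: response on the left, grouping variable on the right
mean_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_by_species
names(mean_by_species)   # column names carried over from iris
```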

# compute sum Sepal.Width per Species
sepalwidth.per.species <- aggregate(iris$Sepal.Width, by=list(iris$Species), FUN=sum)
sepalwidth.per.species

##      Group.1     x
## 1     setosa 171.4
## 2 versicolor 138.5
## 3  virginica 148.7

# compute range Petal.Length per Species
Petallength.per.species <- aggregate(iris$Petal.Length, by=list(iris$Species), FUN=range)
Petallength.per.species

##      Group.1 x.1 x.2
## 1     setosa 1.0 1.9
## 2 versicolor 3.0 5.1
## 3  virginica 4.5 6.9
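
One detail worth knowing about this output: when FUN returns more than one value, aggregate() stores the results in a single matrix column, so x.1 and x.2 are printed as two columns but belong to one column named x. The sketch below checks this and shows one possible way to flatten it; petal_range and petal_range_flat are illustrative names.

```r
petal_range <- aggregate(iris$Petal.Length, by = list(iris$Species), FUN = range)

ncol(petal_range)          # Group.1 plus a single matrix column x
is.matrix(petal_range$x)   # the min and max live inside that matrix

# Rebuild the data frame so the matrix columns become ordinary columns
petal_range_flat <- do.call(data.frame, petal_range)
ncol(petal_range_flat)
```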

# compute quantile Petal.Width per Species
Petalwidth.per.species <- aggregate(iris$Petal.Width, by=list(iris$Species), FUN=quantile)
Petalwidth.per.species

##      Group.1 x.0% x.25% x.50% x.75% x.100%
## 1     setosa  0.1   0.2   0.2   0.3    0.6
## 2 versicolor  1.0   1.2   1.3   1.5    1.8
## 3  virginica  1.4   1.8   2.0   2.3    2.5

Haiyuan Gui’s Document

ghy

2023-07-06