Use the dataset iris to answer the following:
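# Open the full data set in the data viewer (interactive sessions only).
View(iris)
# Split the rows at the mean sepal length.
sepal_length_mean <- mean(iris$Sepal.Length)
df1 <- subset(iris, Sepal.Length > sepal_length_mean)   # above the mean
View(df1)
df2 <- subset(iris, Sepal.Length <= sepal_length_mean)  # at or below the mean
View(df2)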
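As a quick sanity check on the split above, the two subsets should be disjoint and together cover every row of iris. The sketch below recomputes them under the illustrative names `above` and `below` (these names are not from the original) and compares row counts.

```r
# Recompute the split; `above` and `below` are illustrative names.
m <- mean(iris$Sepal.Length)
above <- subset(iris, Sepal.Length > m)
below <- subset(iris, Sepal.Length <= m)

# Every row lands in exactly one subset, so the counts add up to nrow(iris).
nrow(above) + nrow(below) == nrow(iris)

# No row name appears in both subsets.
length(intersect(rownames(above), rownames(below))) == 0
```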
plot(iris$Sepal.Length, iris$Petal.Length)
Figure 1: The scatter plot shows two clusters. In the lower-left cluster, Petal.Length barely changes as Sepal.Length increases. In the upper-right cluster, Petal.Length increases with Sepal.Length, and the relationship appears to be approximately linear.
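One way to check this reading of Figure 1 is to color the points by species and compute the correlation within each species. This is only a sketch: the name `cor_by_species` is illustrative, and it assumes the visual clusters line up with species, which the original does not state.

```r
# Color the points by species to see which species form each cluster.
species_code <- as.integer(iris$Species)
plot(iris$Sepal.Length, iris$Petal.Length, col = species_code, pch = 19)
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)

# Correlation between Sepal.Length and Petal.Length within each species.
cor_by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Petal.Length)
)
cor_by_species
```

A low within-species correlation would match the flat cluster, and a high one would match the cluster with the roughly linear trend.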
plot(iris$Sepal.Width, iris$Petal.Width)
Figure 2: The points fall into two clusters, one in the top left and one in the bottom right. In the bottom-right cluster, Sepal.Width is much larger than Petal.Width, and Petal.Width barely increases as Sepal.Width increases.
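The gap between the two widths can also be put into numbers. The sketch below computes the mean of Sepal.Width minus Petal.Width for each species; `width_gap` is an illustrative name, and linking the bottom-right cluster to a particular species is an assumption to be checked against the output.

```r
# Mean of (Sepal.Width - Petal.Width) within each species.
width_gap <- tapply(iris$Sepal.Width - iris$Petal.Width, iris$Species, mean)
width_gap

# Species with the largest gap between the two widths.
names(which.max(width_gap))
```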
df3 <- cbind(iris[, 1:2], iris[, 3:5])
df3
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
When two data sets have the same number of rows, whether or not they have the same number of columns, cbind() combines them side by side (column-wise). For example, iris[, 1:2] and iris[, 3:5] have the same number of rows (150) and different numbers of columns (2 and 3).
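The row-count requirement can be seen in a small experiment. The sketch below uses the illustrative names `left`, `right`, `combined`, and `mismatch`; the second call deliberately passes 3 rows and 2 rows, and tryCatch() captures the resulting error message instead of stopping the script.

```r
left  <- iris[, 1:2]   # 150 rows, 2 columns
right <- iris[, 3:5]   # 150 rows, 3 columns
combined <- cbind(left, right)
dim(combined)          # row count is unchanged; column counts add up

# Row counts that do not match (3 vs 2) make cbind() fail for data frames.
mismatch <- tryCatch(
  cbind(iris[1:3, 1:2], iris[1:2, 3:5]),
  error = function(e) conditionMessage(e)
)
mismatch
```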
df4 <- subset(iris, Species == "setosa")
df5 <- subset(iris, Species == "versicolor")
result1 <- rbind(df4, df5)
result1
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
When two data sets have the same columns, whether or not they have the same number of rows, rbind() stacks them on top of each other (row-wise). For example, df4 and df5 have the same five columns; here they also happen to have the same number of rows (50 each).
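A short sketch of the column requirement follows. The names `setosa`, `versicolor`, `stacked`, and `mismatch` are illustrative; the first call stacks 50 and 10 rows, and the second deliberately passes data frames with 3 and 2 columns so that tryCatch() can capture the error message.

```r
setosa     <- subset(iris, Species == "setosa")
versicolor <- subset(iris, Species == "versicolor")

# Different row counts are fine as long as the columns match.
stacked <- rbind(setosa, versicolor[1:10, ])
dim(stacked)

# Different column counts make rbind() fail for data frames.
mismatch <- tryCatch(
  rbind(iris[1:2, 1:3], iris[3:4, 1:2]),
  error = function(e) conditionMessage(e)
)
mismatch
```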
# ?merge()
Merge two data frames by common columns or row names, or do other versions of database join operations.
# Take two subsets of iris that share exactly one column, Petal.Width.
df6 <- iris[c(1, 7, 6, 10, 24), 1:4]
df7 <- iris[c(2, 13, 18, 44), 4:5]
# Keep only the rows whose Petal.Width value appears in both data frames (an inner join, the merge() default):
df8 <- merge(df6, df7, by.x="Petal.Width", by.y="Petal.Width")
df8
## Petal.Width Sepal.Length Sepal.Width Petal.Length Species
## 1 0.1 4.9 3.1 1.5 setosa
## 2 0.2 5.1 3.5 1.4 setosa
## 3 0.3 4.6 3.4 1.4 setosa
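by.x and by.y name the key column in df6 and in df7, respectively. merge() pairs rows whose key values are equal and, by default, discards rows that have no match in the other data frame. If only one of the two is given, the other falls back to by, which defaults to the column names the two data frames share; merge() reports an error only when by.x and by.y name different numbers of columns. Because the key has the same name in both data frames here, by = "Petal.Width" would work as well.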
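The different ways of naming the key column can be compared directly. This sketch rebuilds df6 and df7 so that it runs on its own; `explicit`, `shorthand`, and `default` are illustrative names.

```r
df6 <- iris[c(1, 7, 6, 10, 24), 1:4]
df7 <- iris[c(2, 13, 18, 44), 4:5]

explicit  <- merge(df6, df7, by.x = "Petal.Width", by.y = "Petal.Width")
shorthand <- merge(df6, df7, by = "Petal.Width")
default   <- merge(df6, df7)   # by defaults to the shared column names

# Compare the three results.
all.equal(explicit, shorthand)
all.equal(explicit, default)
```

by.x and by.y are mainly useful when the key column has different names in the two data frames.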
# Keep every row from both data frames (all = TRUE, a full outer join), sorted by the key column.
df9 <- merge(df6, df7, by.x="Petal.Width", by.y="Petal.Width", all=TRUE, sort=TRUE)
df9
## Petal.Width Sepal.Length Sepal.Width Petal.Length Species
## 1 0.1 4.9 3.1 1.5 setosa
## 2 0.2 5.1 3.5 1.4 setosa
## 3 0.3 4.6 3.4 1.4 setosa
## 4 0.4 5.4 3.9 1.7 <NA>
## 5 0.5 5.1 3.3 1.7 <NA>
## 6 0.6 NA NA NA setosa
# Keep every row from both data frames (all = TRUE, a full outer join), without sorting by the key column.
df10 <- merge(df6, df7, by.x="Petal.Width", by.y="Petal.Width", all=TRUE, sort=FALSE)
df10
## Petal.Width Sepal.Length Sepal.Width Petal.Length Species
## 1 0.2 5.1 3.5 1.4 setosa
## 2 0.3 4.6 3.4 1.4 setosa
## 3 0.1 4.9 3.1 1.5 setosa
## 4 0.4 5.4 3.9 1.7 <NA>
## 5 0.5 5.1 3.3 1.7 <NA>
## 6 0.6 NA NA NA setosa
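The all, all.x, and all.y arguments of merge() select the join type. The sketch below rebuilds df6 and df7 and counts the rows each variant returns; `inner`, `left`, `right`, and `full` are illustrative names.

```r
df6 <- iris[c(1, 7, 6, 10, 24), 1:4]
df7 <- iris[c(2, 13, 18, 44), 4:5]

inner <- merge(df6, df7, by = "Petal.Width")                # keys present in both
left  <- merge(df6, df7, by = "Petal.Width", all.x = TRUE)  # every row of df6
right <- merge(df6, df7, by = "Petal.Width", all.y = TRUE)  # every row of df7
full  <- merge(df6, df7, by = "Petal.Width", all = TRUE)    # every row of either

# Number of rows returned by each join type.
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

With these two subsets the counts should come out as 3, 5, 4, and 6; unmatched rows are padded with NA in the columns that come from the other data frame.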
# ?apply()
Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
result2 <- apply(iris[1:50, 1:4], 2, mean)
result2
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.006 3.428 1.462 0.246
Take the four numeric columns for the first species (setosa, rows 1-50) and compute the mean of each column. The fifth column, Species, is left out because it is a factor rather than a numeric column.
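The reason the factor column gets in the way is that apply() works on a matrix, and a matrix holds a single type. The sketch below shows the coercion and also compares apply() with colMeans(), a base R shortcut for per-column means; `col_means_apply` and `col_means_short` are illustrative names.

```r
# With the factor column included, the converted matrix is character,
# so a numeric summary such as mean() cannot be computed on it.
is.character(as.matrix(iris[1:3, ]))

# With only the numeric columns, the matrix is numeric.
is.numeric(as.matrix(iris[1:3, 1:4]))

# colMeans() computes the same per-column means as apply(..., 2, mean).
col_means_apply <- apply(iris[1:50, 1:4], 2, mean)
col_means_short <- colMeans(iris[1:50, 1:4])
all.equal(col_means_apply, col_means_short)
```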
# For the first 10 rows of the second species (versicolor), take columns 2-4 and find the largest value in each row.
result3 <- apply(iris[51:60, 2:4], 1, max)
result3
## 51 52 53 54 55 56 57 58 59 60
## 4.7 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9
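max() reports the largest value in each row but not the column it came from. As a sketch, which.max() can be applied the same way to recover the column; `block`, `row_max`, and `max_col` are illustrative names.

```r
block <- iris[51:60, 2:4]

row_max <- apply(block, 1, max)                      # largest value per row
max_col <- names(block)[apply(block, 1, which.max)]  # column holding that value

data.frame(row = rownames(block), max = row_max, column = max_col)
```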
# Find the standard deviation of each column for the first species (setosa).
result4 <- apply(iris[1:50, 1:4], 2, sd)
result4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.3524897 0.3790644 0.1736640 0.1053856
# ?lapply()
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
# lapply() returns its results as a list, one element per column.
result5 <- lapply(iris[1:50, 1:4], mean)
result5
## $Sepal.Length
## [1] 5.006
##
## $Sepal.Width
## [1] 3.428
##
## $Petal.Length
## [1] 1.462
##
## $Petal.Width
## [1] 0.246
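When a list is not the most convenient shape, sapply() runs the same computation and simplifies the result to a named vector where possible. A minimal comparison, with the illustrative names `as_list` and `as_vector`:

```r
as_list   <- lapply(iris[1:50, 1:4], mean)  # list with one element per column
as_vector <- sapply(iris[1:50, 1:4], mean)  # simplified to a named numeric vector

class(as_list)
class(as_vector)

# Both hold the same values once the list is flattened.
all.equal(unlist(as_list), as_vector)
```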
# Find the variance of columns 1, 3, and 4 for the second species (versicolor); lapply() returns the results as a list.
result6 <- lapply(iris[51:100, c(1, 3, 4)], var)
result6
## $Sepal.Length
## [1] 0.2664327
##
## $Petal.Length
## [1] 0.2208163
##
## $Petal.Width
## [1] 0.03910612
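If a table is wanted rather than a list, the list can be reshaped into a data frame afterwards. This is one possible sketch; `variances` and `variance_table` are illustrative names.

```r
variances <- lapply(iris[51:100, c(1, 3, 4)], var)

# One row per column of the input, with the variance alongside its name.
variance_table <- data.frame(
  column    = names(variances),
  variance  = unlist(variances),
  row.names = NULL
)
variance_table
```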
# ?tapply()
Apply a function to each cell of a ragged array, that is to each (non-empty) group of values or data rows given by a unique combination of the levels of certain factors.
# Take Petal.Width for the first species (setosa, rows 1-50). rep(1:10, 5) assigns the rows to 10 groups of 5 each (rows 1, 11, 21, 31, and 41 form group 1, and so on); find the mean of each group.
result7 <- tapply(iris[1:50, 4], rep(1:10, 5), mean)
result7
## 1 2 3 4 5 6 7 8 9 10
## 0.22 0.30 0.16 0.32 0.24 0.30 0.30 0.20 0.22 0.20
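The grouping vector passed to tapply() decides how rows are assigned to groups, and rep() can build it in two quite different ways. The sketch below compares them; `petal_width`, `interleaved`, and `consecutive` are illustrative names.

```r
petal_width <- iris[1:50, 4]

interleaved <- rep(1:10, 5)          # 1, 2, ..., 10, 1, 2, ...: 10 groups of 5
consecutive <- rep(1:5, each = 10)   # ten 1s, ten 2s, ...: 5 groups of 10

table(interleaved)   # group sizes for the interleaved labels
table(consecutive)   # group sizes for the consecutive labels

means_interleaved <- tapply(petal_width, interleaved, mean)
means_consecutive <- tapply(petal_width, consecutive, mean)
means_interleaved
means_consecutive
```

Use the `each` form when the groups should be blocks of consecutive rows.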
# Calculate the median Sepal.Width for each of the three species.
result8 <- tapply(iris$Sepal.Width, iris$Species, median)
result8
## setosa versicolor virginica
## 3.4 2.8 3.0
# Calculate the quantiles of Petal.Length for each of the three species.
result9 <- tapply(iris$Petal.Length, iris$Species, quantile)
result9
## $setosa
## 0% 25% 50% 75% 100%
## 1.000 1.400 1.500 1.575 1.900
##
## $versicolor
## 0% 25% 50% 75% 100%
## 3.00 4.00 4.35 4.60 5.10
##
## $virginica
## 0% 25% 50% 75% 100%
## 4.500 5.100 5.550 5.875 6.900
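tapply() passes any extra arguments on to the function it applies, so other quantiles can be requested through the probs argument of quantile(). A small sketch, with the illustrative name `tails`:

```r
# 10th and 90th percentiles of Petal.Length for each species.
tails <- tapply(iris$Petal.Length, iris$Species, quantile, probs = c(0.1, 0.9))
tails[["setosa"]]
tails[["virginica"]]
```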
# ?aggregate()
Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
# compute mean Sepal.Length per Species
sepallength.per.species <- aggregate(iris$Sepal.Length, by=list(iris$Species), FUN=mean)
sepallength.per.species
## Group.1 x
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
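The Group.1 and x column names come from passing unnamed vectors. As an alternative sketch, the formula interface of aggregate() keeps the original variable names; `mean_by_species` is an illustrative name.

```r
# Formula interface: summarized variable on the left, grouping variable on the right.
mean_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_by_species
names(mean_by_species)
```

Naming the grouping list, as in by = list(Species = iris$Species), is another way to avoid Group.1.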
# compute sum Sepal.Width per Species
sepalwidth.per.species <- aggregate(iris$Sepal.Width, by=list(iris$Species), FUN=sum)
sepalwidth.per.species
## Group.1 x
## 1 setosa 171.4
## 2 versicolor 138.5
## 3 virginica 148.7
# compute range Petal.Length per Species
Petallength.per.species <- aggregate(iris$Petal.Length, by=list(iris$Species), FUN=range)
Petallength.per.species
## Group.1 x.1 x.2
## 1 setosa 1.0 1.9
## 2 versicolor 3.0 5.1
## 3 virginica 4.5 6.9
# compute quantile Petal.Width per Species
Petalwidth.per.species <- aggregate(iris$Petal.Width, by=list(iris$Species), FUN=quantile)
Petalwidth.per.species
## Group.1 x.0% x.25% x.50% x.75% x.100%
## 1 setosa 0.1 0.2 0.2 0.3 0.6
## 2 versicolor 1.0 1.2 1.3 1.5 1.8
## 3 virginica 1.4 1.8 2.0 2.3 2.5
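When FUN returns several values, as quantile() and range() do, aggregate() stores them in a single matrix column, which is why the headers print as x.0%, x.25%, and so on. The sketch below shows this and one common way to flatten the result into ordinary columns; `agg` and `flat` are illustrative names.

```r
agg <- aggregate(iris$Petal.Width, by = list(Species = iris$Species), FUN = quantile)

ncol(agg)          # the grouping column plus one matrix column named x
is.matrix(agg$x)   # the five quantiles live inside that matrix

# Rebuild the data frame so each quantile becomes its own column.
flat <- do.call(data.frame, agg)
ncol(flat)
names(flat)
```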