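Extracting part of a data set is called subsetting. Subsetting supports later analysis, because often only a portion of a data set is what we actually need.
In the DSC2014Tutorial package, running slides('Basic') walks through many subsetting techniques.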
Using the `which()` function:
## [1] "Alabama"
## [1] "AL" "AK" "AZ" "AR"
## [1] 3.4
`which()` sets a condition to filter the elements of a vector. For example, to find which US state abbreviations begin with B or C, we can write:
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA"
## character(0)
## [1] "CA" "CO" "CT"
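A minimal sketch of a filter that would produce output like the above, using the built-in `state.abb` vector. The variable name `state.abb.abb` follows the text; the exact statements are an assumption, since the original code is not shown.

```r
# First letter of every state abbreviation
state.abb.abb <- substr(state.abb, 1, 1)

# No abbreviation starts with "B", so this yields character(0)
state.abb[which(state.abb.abb == "B")]

# Positions whose first letter is "B" or "C"
idx <- which(state.abb.abb == "B" | state.abb.abb == "C")
state.abb[idx]
```

`which()` returns positions rather than values, so the result is used as an index into the original vector.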
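In `substr(A, i, j)`, `A` is a character vector, `i` is the position where extraction starts, and `j` is the position where it ends. We extract the first letter of each state abbreviation, store the result in the vector `state.abb.abb`, and then use `which()` to match against the original `state.abb` vector.
## [1] "list"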
## $height
## [1] 90
##
## $width
## [1] 120
## [1] "AL" "AK"
## $data
## [1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876
## [11] 6450 83557 56400 36291 56290 82264 40395 48523 33215 10577
## [21] 8257 58216 84068 47716 69686 147138 77227 110540 9304 7836
## [31] 121666 49576 52586 70665 41222 69919 96981 45333 1214 31055
## [41] 77047 42244 267339 84916 9609 40815 68192 24181 56154 97914
## $height
## [1] 90
## [1] 120
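Before filtering list data with `which()`, first convert the list to a data frame and then convert factors to numeric values. Note that R numbers factors by level order, starting from 1:
## [1] "90" "120" "AL" "AK" "51609" "589757" "113909" "53104"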
## [9] "158693" "104247" "5009" "2057" "58560" "58876" "6450" "83557"
## [17] "56400" "36291" "56290" "82264" "40395" "48523" "33215" "10577"
## [25] "8257" "58216" "84068" "47716" "69686" "147138" "77227" "110540"
## [33] "9304" "7836" "121666" "49576" "52586" "70665" "41222" "69919"
## [41] "96981" "45333" "1214" "31055" "77047" "42244" "267339" "84916"
## [49] "9609" "40815" "68192" "24181" "56154" "97914"
## [1] 90 120 NA NA 51609 589757
## [1] value data
## <0 rows> (or 0-length row.names)
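The conversion steps described above can be sketched as follows. The list `dim_list` and its contents are hypothetical stand-ins, not the original data; the sketch only shows why a factor has to pass through `as.character()` before `as.numeric()`.

```r
# Hypothetical list mixing numbers and strings
dim_list <- list(height = 90, width = 120, abb = c("AL", "AK"))

# unlist() flattens the list; mixed types are coerced to character
flat <- unlist(dim_list)

# Store the flattened values as a factor column in a data frame
df <- data.frame(value = factor(flat))

# as.numeric() on a factor returns level codes (1, 2, ...), not the values
codes <- as.numeric(df$value)

# Going through as.character() recovers the numbers; text becomes NA
nums <- suppressWarnings(as.numeric(as.character(df$value)))

# which() skips NA, so only the positions of real matches are returned
which(nums > 100)
```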
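A matrix can be written as:
\[
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1c} \\
x_{21} & x_{22} & \cdots & x_{2c} \\
\vdots & \vdots & \ddots & \vdots \\
x_{r1} & x_{r2} & \cdots & x_{rc}
\end{bmatrix}
\]
Here \(x_{11},\ldots, x_{r1}\) come from the same column, while \(x_{11},\ldots, x_{1c}\) come from the same row, so the former can be written as \(x_{,1}\) and the latter as \(x_{1,}\).
Suppose we have a \(3\times 3\) matrix. We use brackets to extract one or more values, or to replace them: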
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [1] 5
## [1] 1 2
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [,1] [,2]
## [1,] 1 7
## [2,] 3 9
## [1] 1 2 3
## [,1] [,2] [,3]
## [1,] "1" "4" "7"
## [2,] "2" "5" "8"
## [3,] "3" "6" "Hello"
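Bracket calls along the following lines would give results like those printed above. This is a sketch on a 3 x 3 matrix filled with 1 to 9; the exact calls in the original are not shown, so these are assumptions.

```r
# 3 x 3 matrix filled column by column with 1..9
m <- matrix(1:9, nrow = 3)

m[2, 2]              # single element: row 2, column 2
m[1:2, 1]            # rows 1-2 of column 1, returned as a vector
m[1:2, 1:2]          # top-left 2 x 2 block
m[c(1, 3), c(1, 3)]  # the four corner elements
m[, 1]               # entire first column

# Replacing one cell with text coerces the whole matrix to character
m2 <- m
m2[3, 3] <- "Hello"
m2
```

Because a matrix holds a single type, the replacement in the last step turns every number into a string.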
R can return the row and column indices of every element that meets a condition, for example:
## [,1] [,2]
## [1,] 2 4
## [2,] 3 4
## [3,] 4 4
## [4,] 5 4
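One way to get row and column indices is `which()` with `arr.ind = TRUE`. The sketch below uses a small example matrix of its own, not the data behind the output above.

```r
# 3 x 3 matrix filled column by column with 1..9
m <- matrix(1:9, nrow = 3)

# arr.ind = TRUE returns one (row, col) pair per matching element
idx <- which(m > 6, arr.ind = TRUE)
idx

# A two-column index matrix can be used directly to pull the values
m[idx]
```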
# Create a 3-dimensional array (2 x 2 x 3 = 12 cells, so draw 12 values)
arr <- array(sample(1:30, 12), dim = c(2, 2, 3))
# Print the array
print(arr)
## , , 1
##
## [,1] [,2]
## [1,] 1 14
## [2,] 7 11
##
## , , 2
##
## [,1] [,2]
## [1,] 30 12
## [2,] 2 23
##
## , , 3
##
## [,1] [,2]
## [1,] 8 9
## [2,] 26 15
## [,1] [,2] [,3]
## [1,] 1 2 1
## [2,] 2 2 1
## [3,] 1 2 2
## [4,] 2 2 3
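The same idea extends to arrays with more than two dimensions. The sketch below rebuilds the array from the values printed above, so it does not depend on `sample()`, and applies an assumed condition (values between 10 and 20); the condition used in the original is not shown.

```r
# Rebuild the printed 2 x 2 x 3 array (filled column by column, slice by slice)
arr <- array(c(1, 7, 14, 11, 30, 2, 12, 23, 8, 26, 9, 15), dim = c(2, 2, 3))

# One row of indices (dim1, dim2, dim3) per matching element
hits <- which(arr > 10 & arr < 20, arr.ind = TRUE)
hits

# The index matrix extracts the matching values
arr[hits]
```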
## [1] "extra" "group" "ID"
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
## [16] 4.4 5.5 1.6 4.6 3.4
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
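`which()` accepts more than one condition, and conditions can be joined with "and" or "or". For a data frame, it finds the rows that satisfy the combined conditions and returns their positions as a vector, which can then be used to extract the corresponding rows.
## extra group ID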
## 1 0.7 1 1
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 10 2.0 1 10
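A sketch of combining conditions on the built-in `sleep` data. The "and" condition below is an assumption that happens to reproduce the rows printed above; the "or" condition is purely illustrative.

```r
# "and": positive extra sleep AND group 1
rows <- which(sleep$extra > 0 & sleep$group == "1")
rows
sleep[rows, ]

# "or": very large gain OR a loss of more than one hour
either <- which(sleep$extra > 4 | sleep$extra < -1)
sleep[either, ]
```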
With or without `which()`, the result is the same:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
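A sketch of two equivalent ways to get such a table from the built-in `mtcars` data. The condition (dropping 6-cylinder cars) is inferred from the rows printed above and is an assumption.

```r
# Logical condition: keep cars that do not have 6 cylinders
keep <- mtcars$cyl != 6

# Index by position via which() ...
by_which <- mtcars[which(keep), ]

# ... or index by the logical vector directly
by_logical <- mtcars[keep, ]

identical(by_which, by_logical)
```

The two forms differ only when the condition contains `NA`: `which()` drops those positions, while logical indexing returns rows of `NA`.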
The `dplyr` package provides two functions, `select` and `filter`, which pick out particular variables or the observations that meet specific conditions.
## Species
## 1 setosa
## 2 setosa
## 3 setosa
## 4 setosa
## 5 setosa
## 6 setosa
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
## ALPSeats LPSeats
## 1 47 55
## 2 52 52
## 3 57 47
## 4 47 57
## 5 45 58
## 6 60 45
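A brief sketch of `select` and `filter` on the built-in `iris` data. The object names and the filter threshold are illustrative assumptions, not taken from the original.

```r
library(dplyr)

# select(): keep only the named variable(s)
species_only <- iris %>% select(Species)
head(species_only)

# filter(): keep only rows meeting every condition (commas act as "and")
long_setosa <- iris %>% filter(Species == "setosa", Sepal.Length > 5.5)
long_setosa
```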
`select_if` selects variables of a particular data type, for example:
## Species
## 1 setosa
## 2 setosa
## 3 setosa
## 4 setosa
## 5 setosa
## 6 setosa
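A sketch of `select_if` with type predicates on `iris`. That the output above came from `is.factor` is an assumption; the object names are illustrative.

```r
library(dplyr)

# Keep only factor columns (in iris, that is Species)
factor_cols <- iris %>% select_if(is.factor)
head(factor_cols)

# Keep only numeric columns
numeric_cols <- iris %>% select_if(is.numeric)
names(numeric_cols)
```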
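library(dplyr)  # provides %>% and select_if()
# Scores of five teams over three games
baseketball <- data.frame(team = c('Crab', 'Beaver', 'Apollo', 'Easy', 'Falcon'),
                          conference = as.factor(c('W', 'S', 'S', 'E', 'E')),
                          game.1 = c(79, 77, 86, 88, 95),
                          game.2 = c(91, 85, 88, 86, 83),
                          game.3 = c(71, 105, 82, 85, 103))
# Keep character columns, plus any column whose values all equal game.3
df.short <- baseketball %>% select_if(function(x) is.character(x) | all(x == .$game.3))
head(df.short)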
## team game.3
## 1 Crab 71
## 2 Beaver 105
## 3 Apollo 82
## 4 Easy 85
## 5 Falcon 103
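`select_at` selects variables whose names match a string. For example, adding `vars(starts_with())` selects, in one step, every variable whose name begins with the given string:
## game.1 game.2 game.3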
## 1 79 91 71
## 2 77 85 105
## 3 86 88 82
## 4 88 86 85
## 5 95 83 103
Adding `vars(ends_with())` selects, in one step, every variable whose name ends with the given string:
## Sepal.Length Petal.Length
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## 4 4.6 1.5
## 5 5.0 1.4
## 6 5.4 1.7
Adding `vars(contains())` selects, in one step, every variable whose name contains the given string:
## Seats ALPSeats LPSeats NPSeats OtherSeats
## 1 121 47 55 19 0
## 2 121 52 52 17 0
## 3 122 57 47 17 0
## 4 122 47 57 18 0
## 5 122 45 58 19 0
## 6 122 60 45 17 0
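The three name-matching helpers can be sketched together on `iris`. The strings searched for are illustrative assumptions. Recent dplyr releases mark `select_at` as superseded, so the last line shows the same helper used directly inside `select`.

```r
library(dplyr)

# Names beginning with "Petal"
petal_cols <- iris %>% select_at(vars(starts_with("Petal")))

# Names ending with "Length"
length_cols <- iris %>% select_at(vars(ends_with("Length")))

# Names containing "al" (matches Sepal.* and Petal.*, but not Species)
al_cols <- iris %>% select_at(vars(contains("al")))

# The helper also works inside select() itself
same_cols <- iris %>% select(ends_with("Length"))
```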
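As an alternative to `which`, we can use the `filter` function. For example, from the `Adler` data in `carData`:
library(carData)
library(dplyr)
# Keep rows with no instruction, high expectation, and a positive rating
Adler.filter <- Adler %>% filter(instruction == 'none' &
                                 expectation == 'high' &
                                 rating > 0)
head(Adler.filter)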
## instruction expectation rating
## 1 none high 22
## 2 none high 3
## 3 none high 4
## 4 none high 9
Read
Read
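Arrange the US state names in an array, then find the states whose names are 13 or more characters long (hint: `nchar()` returns the length of a string).
Read the file studentsfull.txt, then extract the records of students in the Economics and Chemistry departments.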
From the `dplyr` package's
From the `kmed` package's
From the `kmed` package's
Using the `dplyr` package's
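Download the `nycflights13` package, then filter the `flights` data for flights whose scheduled departure time (`time_hour`) falls between 1 p.m. and 5 p.m. on October 1, 2013. (Hint: use the `as.POSIXct` function, and pay attention to the time zone.)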
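To show why the time zone matters in the hint, here is a small sketch that does not use `flights` itself. It assumes the times of interest are in "America/New_York"; the vector `x` is a hypothetical stand-in for a date-time column.

```r
# The same clock time is a different instant in different time zones
ny  <- as.POSIXct("2013-10-01 13:00:00", tz = "America/New_York")
utc <- as.POSIXct("2013-10-01 13:00:00", tz = "UTC")
difftime(ny, utc, units = "hours")

# Hypothetical date-times to filter
x <- as.POSIXct(c("2013-10-01 12:00:00", "2013-10-01 14:00:00",
                  "2013-10-01 18:00:00"), tz = "America/New_York")

# Window bounds built in the same time zone as the data
lower <- as.POSIXct("2013-10-01 13:00:00", tz = "America/New_York")
upper <- as.POSIXct("2013-10-01 17:00:00", tz = "America/New_York")
inside <- x[x >= lower & x <= upper]
inside
```

If the bounds were built in UTC instead, the window would shift by the offset shown by `difftime()` and could silently select the wrong flights.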
## Last updated: 2025-03-10 20:38:17