Data Manipulation in R

Nathan Byers
May 30, 2014

Topics

Subsetting

Subsetting

  • We'll begin by loading the airquality data frame
data(airquality)
  • For a description of the data frame, type ?airquality

Subsetting

First, the best way to take a quick look at the top of a data frame is to use the head() function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Subsetting

You can specify the number of lines to display by using the n = parameter

head(airquality, n = 3)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3

Subsetting

You can also look at the bottom of the data frame by using tail()

tail(airquality)
    Ozone Solar.R Wind Temp Month Day
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

Subsetting

  • First we'll subset this data frame by using the [ function, i.e. brackets
airquality[ , ]
  • Remember, the convention is [rows, columns]

Subsetting

To get one row of the data frame, specify the row number you would like in the brackets, on the left side of the comma

airquality[1, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1

Subsetting

If you want more than one row, you can supply a vector of row numbers

airquality[c(1, 2, 3), ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3

Subsetting

To get a column from the data frame, specify the column number in the brackets, to the right of the comma (it's returned as a vector)

airquality[, 1]
  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34
 [18]   6  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA
 [35]  NA  NA  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13
 [52]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77
 [69]  97  97  85  NA  10  27  NA   7  48  35  61  79  63  16  NA  NA  80
 [86] 108  20  52  82  50  64  59  39   9  16  78  35  66 122  89 110  NA
[103]  NA  44  28  65  NA  22  59  23  31  44  21   9  NA  45 168  73  NA
[120]  76 118  84  85  96  78  73  91  47  32  20  23  21  24  44  21  28
[137]   9  13  46  18  13  24  16  13  23  36   7  14  30  NA  14  18  20

Subsetting

You can obtain more than one column by supplying a vector of column numbers

airquality[, c(3, 4, 5)]
    Wind Temp Month
1    7.4   67     5
2    8.0   72     5
3   12.6   74     5
4   11.5   62     5
5   14.3   56     5
6   14.9   66     5
7    8.6   65     5
8   13.8   59     5
9   20.1   61     5
10   8.6   69     5
11   6.9   74     5
12   9.7   69     5
13   9.2   66     5
14  10.9   68     5
15  13.2   58     5
16  11.5   64     5
17  12.0   66     5
18  18.4   57     5
19  11.5   68     5
20   9.7   62     5
21   9.7   59     5
22  16.6   73     5
23   9.7   61     5
24  12.0   61     5
25  16.6   57     5
26  14.9   58     5
27   8.0   57     5
28  12.0   67     5
29  14.9   81     5
30   5.7   79     5
31   7.4   76     5
32   8.6   78     6
33   9.7   74     6
34  16.1   67     6
35   9.2   84     6
36   8.6   85     6
37  14.3   79     6
38   9.7   82     6
39   6.9   87     6
40  13.8   90     6
41  11.5   87     6
42  10.9   93     6
43   9.2   92     6
44   8.0   82     6
45  13.8   80     6
46  11.5   79     6
47  14.9   77     6
48  20.7   72     6
49   9.2   65     6
50  11.5   73     6
51  10.3   76     6
52   6.3   77     6
53   1.7   76     6
54   4.6   76     6
55   6.3   76     6
56   8.0   75     6
57   8.0   78     6
58  10.3   73     6
59  11.5   80     6
60  14.9   77     6
61   8.0   83     6
62   4.1   84     7
63   9.2   85     7
64   9.2   81     7
65  10.9   84     7
66   4.6   83     7
67  10.9   83     7
68   5.1   88     7
69   6.3   92     7
70   5.7   92     7
71   7.4   89     7
72   8.6   82     7
73  14.3   73     7
74  14.9   81     7
75  14.9   91     7
76  14.3   80     7
77   6.9   81     7
78  10.3   82     7
79   6.3   84     7
80   5.1   87     7
81  11.5   85     7
82   6.9   74     7
83   9.7   81     7
84  11.5   82     7
85   8.6   86     7
86   8.0   85     7
87   8.6   82     7
88  12.0   86     7
89   7.4   88     7
90   7.4   86     7
91   7.4   83     7
92   9.2   81     7
93   6.9   81     8
94  13.8   81     8
95   7.4   82     8
96   6.9   86     8
97   7.4   85     8
98   4.6   87     8
99   4.0   89     8
100 10.3   90     8
101  8.0   90     8
102  8.6   92     8
103 11.5   86     8
104 11.5   86     8
105 11.5   82     8
106  9.7   80     8
107 11.5   79     8
108 10.3   77     8
109  6.3   79     8
110  7.4   76     8
111 10.9   78     8
112 10.3   78     8
113 15.5   77     8
114 14.3   72     8
115 12.6   75     8
116  9.7   79     8
117  3.4   81     8
118  8.0   86     8
119  5.7   88     8
120  9.7   97     8
121  2.3   94     8
122  6.3   96     8
123  6.3   94     8
124  6.9   91     9
125  5.1   92     9
126  2.8   93     9
127  4.6   93     9
128  7.4   87     9
129 15.5   84     9
130 10.9   80     9
131 10.3   78     9
132 10.9   75     9
133  9.7   73     9
134 14.9   81     9
135 15.5   76     9
136  6.3   77     9
137 10.9   71     9
138 11.5   71     9
139  6.9   78     9
140 13.8   67     9
141 10.3   76     9
142 10.3   68     9
143  8.0   82     9
144 12.6   64     9
145  9.2   71     9
146 10.3   81     9
147 10.3   69     9
148 16.6   63     9
149  6.9   70     9
150 13.2   77     9
151 14.3   75     9
152  8.0   76     9
153 11.5   68     9

Subsetting

Column names can also be used

airquality[, "Solar.R"]
  [1] 190 118 149 313  NA  NA 299  99  19 194  NA 256 290 274  65 334 307
 [18]  78 322  44   8 320  25  92  66 266  NA  13 252 223 279 286 287 242
 [35] 186 220 264 127 273 291 323 259 250 148 332 322 191 284  37 120 137
 [52] 150  59  91 250 135 127  47  98  31 138 269 248 236 101 175 314 276
 [69] 267 272 175 139 264 175 291  48 260 274 285 187 220   7 258 295 294
 [86] 223  81  82 213 275 253 254  83  24  77  NA  NA  NA 255 229 207 222
[103] 137 192 273 157  64  71  51 115 244 190 259  36 255 212 238 215 153
[120] 203 225 237 188 167 197 183 189  95  92 252 220 230 259 236 259 238
[137]  24 112 237 224  27 238 201 238  14 139  49  20 193 145 191 131 223

Subsetting

Or a vector of column names

airquality[, c("Ozone", "Temp", "Day")]
    Ozone Temp Day
1      41   67   1
2      36   72   2
3      12   74   3
4      18   62   4
5      NA   56   5
6      28   66   6
7      23   65   7
8      19   59   8
9       8   61   9
10     NA   69  10
11      7   74  11
12     16   69  12
13     11   66  13
14     14   68  14
15     18   58  15
16     14   64  16
17     34   66  17
18      6   57  18
19     30   68  19
20     11   62  20
21      1   59  21
22     11   73  22
23      4   61  23
24     32   61  24
25     NA   57  25
26     NA   58  26
27     NA   57  27
28     23   67  28
29     45   81  29
30    115   79  30
31     37   76  31
32     NA   78   1
33     NA   74   2
34     NA   67   3
35     NA   84   4
36     NA   85   5
37     NA   79   6
38     29   82   7
39     NA   87   8
40     71   90   9
41     39   87  10
42     NA   93  11
43     NA   92  12
44     23   82  13
45     NA   80  14
46     NA   79  15
47     21   77  16
48     37   72  17
49     20   65  18
50     12   73  19
51     13   76  20
52     NA   77  21
53     NA   76  22
54     NA   76  23
55     NA   76  24
56     NA   75  25
57     NA   78  26
58     NA   73  27
59     NA   80  28
60     NA   77  29
61     NA   83  30
62    135   84   1
63     49   85   2
64     32   81   3
65     NA   84   4
66     64   83   5
67     40   83   6
68     77   88   7
69     97   92   8
70     97   92   9
71     85   89  10
72     NA   82  11
73     10   73  12
74     27   81  13
75     NA   91  14
76      7   80  15
77     48   81  16
78     35   82  17
79     61   84  18
80     79   87  19
81     63   85  20
82     16   74  21
83     NA   81  22
84     NA   82  23
85     80   86  24
86    108   85  25
87     20   82  26
88     52   86  27
89     82   88  28
90     50   86  29
91     64   83  30
92     59   81  31
93     39   81   1
94      9   81   2
95     16   82   3
96     78   86   4
97     35   85   5
98     66   87   6
99    122   89   7
100    89   90   8
101   110   90   9
102    NA   92  10
103    NA   86  11
104    44   86  12
105    28   82  13
106    65   80  14
107    NA   79  15
108    22   77  16
109    59   79  17
110    23   76  18
111    31   78  19
112    44   78  20
113    21   77  21
114     9   72  22
115    NA   75  23
116    45   79  24
117   168   81  25
118    73   86  26
119    NA   88  27
120    76   97  28
121   118   94  29
122    84   96  30
123    85   94  31
124    96   91   1
125    78   92   2
126    73   93   3
127    91   93   4
128    47   87   5
129    32   84   6
130    20   80   7
131    23   78   8
132    21   75   9
133    24   73  10
134    44   81  11
135    21   76  12
136    28   77  13
137     9   71  14
138    13   71  15
139    46   78  16
140    18   67  17
141    13   76  18
142    24   68  19
143    16   82  20
144    13   64  21
145    23   71  22
146    36   81  23
147     7   69  24
148    14   63  25
149    30   70  26
150    NA   77  27
151    14   75  28
152    18   76  29
153    20   68  30

Subsetting

Both rows and columns can be specified

airquality[1:5, 3:5]
  Wind Temp Month
1  7.4   67     5
2  8.0   72     5
3 12.6   74     5
4 11.5   62     5
5 14.3   56     5

Subsetting

  • A data frame can also be subset by using logical expressions
  • The logical expression is used to specify rows that you want to keep
airquality[(logical expression), ]

Subsetting

  • For example, let's say we only want rows in this data frame where ozone was above 15ppb
  • (For convenience, and to save space, I rename the data frame)
aq <- airquality
aq[(aq$Ozone > 15), ]
      Ozone Solar.R Wind Temp Month Day
1        41     190  7.4   67     5   1
2        36     118  8.0   72     5   2
4        18     313 11.5   62     5   4
NA       NA      NA   NA   NA    NA  NA
6        28      NA 14.9   66     5   6
7        23     299  8.6   65     5   7
8        19      99 13.8   59     5   8
NA.1     NA      NA   NA   NA    NA  NA
12       16     256  9.7   69     5  12
15       18      65 13.2   58     5  15
17       34     307 12.0   66     5  17
19       30     322 11.5   68     5  19
24       32      92 12.0   61     5  24
NA.2     NA      NA   NA   NA    NA  NA
NA.3     NA      NA   NA   NA    NA  NA
NA.4     NA      NA   NA   NA    NA  NA
28       23      13 12.0   67     5  28
29       45     252 14.9   81     5  29
30      115     223  5.7   79     5  30
31       37     279  7.4   76     5  31
NA.5     NA      NA   NA   NA    NA  NA
NA.6     NA      NA   NA   NA    NA  NA
NA.7     NA      NA   NA   NA    NA  NA
NA.8     NA      NA   NA   NA    NA  NA
NA.9     NA      NA   NA   NA    NA  NA
NA.10    NA      NA   NA   NA    NA  NA
38       29     127  9.7   82     6   7
NA.11    NA      NA   NA   NA    NA  NA
40       71     291 13.8   90     6   9
41       39     323 11.5   87     6  10
NA.12    NA      NA   NA   NA    NA  NA
NA.13    NA      NA   NA   NA    NA  NA
44       23     148  8.0   82     6  13
NA.14    NA      NA   NA   NA    NA  NA
NA.15    NA      NA   NA   NA    NA  NA
47       21     191 14.9   77     6  16
48       37     284 20.7   72     6  17
49       20      37  9.2   65     6  18
NA.16    NA      NA   NA   NA    NA  NA
NA.17    NA      NA   NA   NA    NA  NA
NA.18    NA      NA   NA   NA    NA  NA
NA.19    NA      NA   NA   NA    NA  NA
NA.20    NA      NA   NA   NA    NA  NA
NA.21    NA      NA   NA   NA    NA  NA
NA.22    NA      NA   NA   NA    NA  NA
NA.23    NA      NA   NA   NA    NA  NA
NA.24    NA      NA   NA   NA    NA  NA
NA.25    NA      NA   NA   NA    NA  NA
62      135     269  4.1   84     7   1
63       49     248  9.2   85     7   2
64       32     236  9.2   81     7   3
NA.26    NA      NA   NA   NA    NA  NA
66       64     175  4.6   83     7   5
67       40     314 10.9   83     7   6
68       77     276  5.1   88     7   7
69       97     267  6.3   92     7   8
70       97     272  5.7   92     7   9
71       85     175  7.4   89     7  10
NA.27    NA      NA   NA   NA    NA  NA
74       27     175 14.9   81     7  13
NA.28    NA      NA   NA   NA    NA  NA
77       48     260  6.9   81     7  16
78       35     274 10.3   82     7  17
79       61     285  6.3   84     7  18
80       79     187  5.1   87     7  19
81       63     220 11.5   85     7  20
82       16       7  6.9   74     7  21
NA.29    NA      NA   NA   NA    NA  NA
NA.30    NA      NA   NA   NA    NA  NA
85       80     294  8.6   86     7  24
86      108     223  8.0   85     7  25
87       20      81  8.6   82     7  26
88       52      82 12.0   86     7  27
89       82     213  7.4   88     7  28
90       50     275  7.4   86     7  29
91       64     253  7.4   83     7  30
92       59     254  9.2   81     7  31
93       39      83  6.9   81     8   1
95       16      77  7.4   82     8   3
96       78      NA  6.9   86     8   4
97       35      NA  7.4   85     8   5
98       66      NA  4.6   87     8   6
99      122     255  4.0   89     8   7
100      89     229 10.3   90     8   8
101     110     207  8.0   90     8   9
NA.31    NA      NA   NA   NA    NA  NA
NA.32    NA      NA   NA   NA    NA  NA
104      44     192 11.5   86     8  12
105      28     273 11.5   82     8  13
106      65     157  9.7   80     8  14
NA.33    NA      NA   NA   NA    NA  NA
108      22      71 10.3   77     8  16
109      59      51  6.3   79     8  17
110      23     115  7.4   76     8  18
111      31     244 10.9   78     8  19
112      44     190 10.3   78     8  20
113      21     259 15.5   77     8  21
NA.34    NA      NA   NA   NA    NA  NA
116      45     212  9.7   79     8  24
117     168     238  3.4   81     8  25
118      73     215  8.0   86     8  26
NA.35    NA      NA   NA   NA    NA  NA
120      76     203  9.7   97     8  28
121     118     225  2.3   94     8  29
122      84     237  6.3   96     8  30
123      85     188  6.3   94     8  31
124      96     167  6.9   91     9   1
125      78     197  5.1   92     9   2
126      73     183  2.8   93     9   3
127      91     189  4.6   93     9   4
128      47      95  7.4   87     9   5
129      32      92 15.5   84     9   6
130      20     252 10.9   80     9   7
131      23     220 10.3   78     9   8
132      21     230 10.9   75     9   9
133      24     259  9.7   73     9  10
134      44     236 14.9   81     9  11
135      21     259 15.5   76     9  12
136      28     238  6.3   77     9  13
139      46     237  6.9   78     9  16
140      18     224 13.8   67     9  17
142      24     238 10.3   68     9  19
143      16     201  8.0   82     9  20
145      23      14  9.2   71     9  22
146      36     139 10.3   81     9  23
149      30     193  6.9   70     9  26
NA.36    NA      NA   NA   NA    NA  NA
152      18     131  8.0   76     9  29
153      20     223 11.5   68     9  30

Subsetting

  • The logical expression actually returns a logical vector
logical.vector <- aq$Ozone > 15
class(logical.vector)
[1] "logical"
logical.vector[1:5] # return the first 5
[1]  TRUE  TRUE FALSE  TRUE    NA

Subsetting

  • So, the way the logical vector subsets the data frame is by providing a vector that indicates if a row should be kept (TRUE) or dropped (FALSE)
  • We can use the variable logical.vector to do the same thing
aq[logical.vector, ]
      Ozone Solar.R Wind Temp Month Day
1        41     190  7.4   67     5   1
2        36     118  8.0   72     5   2
4        18     313 11.5   62     5   4
NA       NA      NA   NA   NA    NA  NA
6        28      NA 14.9   66     5   6
7        23     299  8.6   65     5   7
8        19      99 13.8   59     5   8
NA.1     NA      NA   NA   NA    NA  NA
12       16     256  9.7   69     5  12
15       18      65 13.2   58     5  15
17       34     307 12.0   66     5  17
19       30     322 11.5   68     5  19
24       32      92 12.0   61     5  24
NA.2     NA      NA   NA   NA    NA  NA
NA.3     NA      NA   NA   NA    NA  NA
NA.4     NA      NA   NA   NA    NA  NA
28       23      13 12.0   67     5  28
29       45     252 14.9   81     5  29
30      115     223  5.7   79     5  30
31       37     279  7.4   76     5  31
NA.5     NA      NA   NA   NA    NA  NA
NA.6     NA      NA   NA   NA    NA  NA
NA.7     NA      NA   NA   NA    NA  NA
NA.8     NA      NA   NA   NA    NA  NA
NA.9     NA      NA   NA   NA    NA  NA
NA.10    NA      NA   NA   NA    NA  NA
38       29     127  9.7   82     6   7
NA.11    NA      NA   NA   NA    NA  NA
40       71     291 13.8   90     6   9
41       39     323 11.5   87     6  10
NA.12    NA      NA   NA   NA    NA  NA
NA.13    NA      NA   NA   NA    NA  NA
44       23     148  8.0   82     6  13
NA.14    NA      NA   NA   NA    NA  NA
NA.15    NA      NA   NA   NA    NA  NA
47       21     191 14.9   77     6  16
48       37     284 20.7   72     6  17
49       20      37  9.2   65     6  18
NA.16    NA      NA   NA   NA    NA  NA
NA.17    NA      NA   NA   NA    NA  NA
NA.18    NA      NA   NA   NA    NA  NA
NA.19    NA      NA   NA   NA    NA  NA
NA.20    NA      NA   NA   NA    NA  NA
NA.21    NA      NA   NA   NA    NA  NA
NA.22    NA      NA   NA   NA    NA  NA
NA.23    NA      NA   NA   NA    NA  NA
NA.24    NA      NA   NA   NA    NA  NA
NA.25    NA      NA   NA   NA    NA  NA
62      135     269  4.1   84     7   1
63       49     248  9.2   85     7   2
64       32     236  9.2   81     7   3
NA.26    NA      NA   NA   NA    NA  NA
66       64     175  4.6   83     7   5
67       40     314 10.9   83     7   6
68       77     276  5.1   88     7   7
69       97     267  6.3   92     7   8
70       97     272  5.7   92     7   9
71       85     175  7.4   89     7  10
NA.27    NA      NA   NA   NA    NA  NA
74       27     175 14.9   81     7  13
NA.28    NA      NA   NA   NA    NA  NA
77       48     260  6.9   81     7  16
78       35     274 10.3   82     7  17
79       61     285  6.3   84     7  18
80       79     187  5.1   87     7  19
81       63     220 11.5   85     7  20
82       16       7  6.9   74     7  21
NA.29    NA      NA   NA   NA    NA  NA
NA.30    NA      NA   NA   NA    NA  NA
85       80     294  8.6   86     7  24
86      108     223  8.0   85     7  25
87       20      81  8.6   82     7  26
88       52      82 12.0   86     7  27
89       82     213  7.4   88     7  28
90       50     275  7.4   86     7  29
91       64     253  7.4   83     7  30
92       59     254  9.2   81     7  31
93       39      83  6.9   81     8   1
95       16      77  7.4   82     8   3
96       78      NA  6.9   86     8   4
97       35      NA  7.4   85     8   5
98       66      NA  4.6   87     8   6
99      122     255  4.0   89     8   7
100      89     229 10.3   90     8   8
101     110     207  8.0   90     8   9
NA.31    NA      NA   NA   NA    NA  NA
NA.32    NA      NA   NA   NA    NA  NA
104      44     192 11.5   86     8  12
105      28     273 11.5   82     8  13
106      65     157  9.7   80     8  14
NA.33    NA      NA   NA   NA    NA  NA
108      22      71 10.3   77     8  16
109      59      51  6.3   79     8  17
110      23     115  7.4   76     8  18
111      31     244 10.9   78     8  19
112      44     190 10.3   78     8  20
113      21     259 15.5   77     8  21
NA.34    NA      NA   NA   NA    NA  NA
116      45     212  9.7   79     8  24
117     168     238  3.4   81     8  25
118      73     215  8.0   86     8  26
NA.35    NA      NA   NA   NA    NA  NA
120      76     203  9.7   97     8  28
121     118     225  2.3   94     8  29
122      84     237  6.3   96     8  30
123      85     188  6.3   94     8  31
124      96     167  6.9   91     9   1
125      78     197  5.1   92     9   2
126      73     183  2.8   93     9   3
127      91     189  4.6   93     9   4
128      47      95  7.4   87     9   5
129      32      92 15.5   84     9   6
130      20     252 10.9   80     9   7
131      23     220 10.3   78     9   8
132      21     230 10.9   75     9   9
133      24     259  9.7   73     9  10
134      44     236 14.9   81     9  11
135      21     259 15.5   76     9  12
136      28     238  6.3   77     9  13
139      46     237  6.9   78     9  16
140      18     224 13.8   67     9  17
142      24     238 10.3   68     9  19
143      16     201  8.0   82     9  20
145      23      14  9.2   71     9  22
146      36     139 10.3   81     9  23
149      30     193  6.9   70     9  26
NA.36    NA      NA   NA   NA    NA  NA
152      18     131  8.0   76     9  29
153      20     223 11.5   68     9  30

Subsetting

If we wanted all of the days in the 7th month, we could use ==

aq[(aq$Month == 7), ]
   Ozone Solar.R Wind Temp Month Day
62   135     269  4.1   84     7   1
63    49     248  9.2   85     7   2
64    32     236  9.2   81     7   3
65    NA     101 10.9   84     7   4
66    64     175  4.6   83     7   5
67    40     314 10.9   83     7   6
68    77     276  5.1   88     7   7
69    97     267  6.3   92     7   8
70    97     272  5.7   92     7   9
71    85     175  7.4   89     7  10
72    NA     139  8.6   82     7  11
73    10     264 14.3   73     7  12
74    27     175 14.9   81     7  13
75    NA     291 14.9   91     7  14
76     7      48 14.3   80     7  15
77    48     260  6.9   81     7  16
78    35     274 10.3   82     7  17
79    61     285  6.3   84     7  18
80    79     187  5.1   87     7  19
81    63     220 11.5   85     7  20
82    16       7  6.9   74     7  21
83    NA     258  9.7   81     7  22
84    NA     295 11.5   82     7  23
85    80     294  8.6   86     7  24
86   108     223  8.0   85     7  25
87    20      81  8.6   82     7  26
88    52      82 12.0   86     7  27
89    82     213  7.4   88     7  28
90    50     275  7.4   86     7  29
91    64     253  7.4   83     7  30
92    59     254  9.2   81     7  31

Subsetting

Or if we want all days except the 6th day, use !=

aq[(aq$Day != 6), ]
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
15     18      65 13.2   58     5  15
16     14     334 11.5   64     5  16
17     34     307 12.0   66     5  17
18      6      78 18.4   57     5  18
19     30     322 11.5   68     5  19
20     11      44  9.7   62     5  20
21      1       8  9.7   59     5  21
22     11     320 16.6   73     5  22
23      4      25  9.7   61     5  23
24     32      92 12.0   61     5  24
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26
27     NA      NA  8.0   57     5  27
28     23      13 12.0   67     5  28
29     45     252 14.9   81     5  29
30    115     223  5.7   79     5  30
31     37     279  7.4   76     5  31
32     NA     286  8.6   78     6   1
33     NA     287  9.7   74     6   2
34     NA     242 16.1   67     6   3
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
38     29     127  9.7   82     6   7
39     NA     273  6.9   87     6   8
40     71     291 13.8   90     6   9
41     39     323 11.5   87     6  10
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
44     23     148  8.0   82     6  13
45     NA     332 13.8   80     6  14
46     NA     322 11.5   79     6  15
47     21     191 14.9   77     6  16
48     37     284 20.7   72     6  17
49     20      37  9.2   65     6  18
50     12     120 11.5   73     6  19
51     13     137 10.3   76     6  20
52     NA     150  6.3   77     6  21
53     NA      59  1.7   76     6  22
54     NA      91  4.6   76     6  23
55     NA     250  6.3   76     6  24
56     NA     135  8.0   75     6  25
57     NA     127  8.0   78     6  26
58     NA      47 10.3   73     6  27
59     NA      98 11.5   80     6  28
60     NA      31 14.9   77     6  29
61     NA     138  8.0   83     6  30
62    135     269  4.1   84     7   1
63     49     248  9.2   85     7   2
64     32     236  9.2   81     7   3
65     NA     101 10.9   84     7   4
66     64     175  4.6   83     7   5
68     77     276  5.1   88     7   7
69     97     267  6.3   92     7   8
70     97     272  5.7   92     7   9
71     85     175  7.4   89     7  10
72     NA     139  8.6   82     7  11
73     10     264 14.3   73     7  12
74     27     175 14.9   81     7  13
75     NA     291 14.9   91     7  14
76      7      48 14.3   80     7  15
77     48     260  6.9   81     7  16
78     35     274 10.3   82     7  17
79     61     285  6.3   84     7  18
80     79     187  5.1   87     7  19
81     63     220 11.5   85     7  20
82     16       7  6.9   74     7  21
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
85     80     294  8.6   86     7  24
86    108     223  8.0   85     7  25
87     20      81  8.6   82     7  26
88     52      82 12.0   86     7  27
89     82     213  7.4   88     7  28
90     50     275  7.4   86     7  29
91     64     253  7.4   83     7  30
92     59     254  9.2   81     7  31
93     39      83  6.9   81     8   1
94      9      24 13.8   81     8   2
95     16      77  7.4   82     8   3
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
99    122     255  4.0   89     8   7
100    89     229 10.3   90     8   8
101   110     207  8.0   90     8   9
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
104    44     192 11.5   86     8  12
105    28     273 11.5   82     8  13
106    65     157  9.7   80     8  14
107    NA      64 11.5   79     8  15
108    22      71 10.3   77     8  16
109    59      51  6.3   79     8  17
110    23     115  7.4   76     8  18
111    31     244 10.9   78     8  19
112    44     190 10.3   78     8  20
113    21     259 15.5   77     8  21
114     9      36 14.3   72     8  22
115    NA     255 12.6   75     8  23
116    45     212  9.7   79     8  24
117   168     238  3.4   81     8  25
118    73     215  8.0   86     8  26
119    NA     153  5.7   88     8  27
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
128    47      95  7.4   87     9   5
130    20     252 10.9   80     9   7
131    23     220 10.3   78     9   8
132    21     230 10.9   75     9   9
133    24     259  9.7   73     9  10
134    44     236 14.9   81     9  11
135    21     259 15.5   76     9  12
136    28     238  6.3   77     9  13
137     9      24 10.9   71     9  14
138    13     112 11.5   71     9  15
139    46     237  6.9   78     9  16
140    18     224 13.8   67     9  17
141    13      27 10.3   76     9  18
142    24     238 10.3   68     9  19
143    16     201  8.0   82     9  20
144    13     238 12.6   64     9  21
145    23      14  9.2   71     9  22
146    36     139 10.3   81     9  23
147     7      49 10.3   69     9  24
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

Subsetting

  • We can combine logical conditions with & (and operator)
  • If we wanted only rows where the temperature was between 65 and 70 (including those numbers)
aq[(aq$Temp >= 65 & aq$Temp <= 70), ]
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
10     NA     194  8.6   69     5  10
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
17     34     307 12.0   66     5  17
19     30     322 11.5   68     5  19
28     23      13 12.0   67     5  28
34     NA     242 16.1   67     6   3
49     20      37  9.2   65     6  18
140    18     224 13.8   67     9  17
142    24     238 10.3   68     9  19
147     7      49 10.3   69     9  24
149    30     193  6.9   70     9  26
153    20     223 11.5   68     9  30

Subsetting

  • We can also use the or operator, |
  • If we only wanted rows on days 3 or 5
aq[(aq$Day == 3 | aq$Day == 5),]
    Ozone Solar.R Wind Temp Month Day
3      12     149 12.6   74     5   3
5      NA      NA 14.3   56     5   5
34     NA     242 16.1   67     6   3
36     NA     220  8.6   85     6   5
64     32     236  9.2   81     7   3
66     64     175  4.6   83     7   5
95     16      77  7.4   82     8   3
97     35      NA  7.4   85     8   5
126    73     183  2.8   93     9   3
128    47      95  7.4   87     9   5

Subsetting

  • You can also use the subset() function
  • The first argument is the data frame and the second argument is the logical expression
subset(aq, Temp > 80)
    Ozone Solar.R Wind Temp Month Day
29     45     252 14.9   81     5  29
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
38     29     127  9.7   82     6   7
39     NA     273  6.9   87     6   8
40     71     291 13.8   90     6   9
41     39     323 11.5   87     6  10
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
44     23     148  8.0   82     6  13
61     NA     138  8.0   83     6  30
62    135     269  4.1   84     7   1
63     49     248  9.2   85     7   2
64     32     236  9.2   81     7   3
65     NA     101 10.9   84     7   4
66     64     175  4.6   83     7   5
67     40     314 10.9   83     7   6
68     77     276  5.1   88     7   7
69     97     267  6.3   92     7   8
70     97     272  5.7   92     7   9
71     85     175  7.4   89     7  10
72     NA     139  8.6   82     7  11
74     27     175 14.9   81     7  13
75     NA     291 14.9   91     7  14
77     48     260  6.9   81     7  16
78     35     274 10.3   82     7  17
79     61     285  6.3   84     7  18
80     79     187  5.1   87     7  19
81     63     220 11.5   85     7  20
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
85     80     294  8.6   86     7  24
86    108     223  8.0   85     7  25
87     20      81  8.6   82     7  26
88     52      82 12.0   86     7  27
89     82     213  7.4   88     7  28
90     50     275  7.4   86     7  29
91     64     253  7.4   83     7  30
92     59     254  9.2   81     7  31
93     39      83  6.9   81     8   1
94      9      24 13.8   81     8   2
95     16      77  7.4   82     8   3
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
98     66      NA  4.6   87     8   6
99    122     255  4.0   89     8   7
100    89     229 10.3   90     8   8
101   110     207  8.0   90     8   9
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
104    44     192 11.5   86     8  12
105    28     273 11.5   82     8  13
117   168     238  3.4   81     8  25
118    73     215  8.0   86     8  26
119    NA     153  5.7   88     8  27
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
128    47      95  7.4   87     9   5
129    32      92 15.5   84     9   6
134    44     236 14.9   81     9  11
143    16     201  8.0   82     9  20
146    36     139 10.3   81     9  23

Subsetting

By using the select = parameter you can specify which columns to keep

subset(aq, Temp > 80, select = c(Ozone, Temp))
    Ozone Temp
29     45   81
35     NA   84
36     NA   85
38     29   82
39     NA   87
40     71   90
41     39   87
42     NA   93
43     NA   92
44     23   82
61     NA   83
62    135   84
63     49   85
64     32   81
65     NA   84
66     64   83
67     40   83
68     77   88
69     97   92
70     97   92
71     85   89
72     NA   82
74     27   81
75     NA   91
77     48   81
78     35   82
79     61   84
80     79   87
81     63   85
83     NA   81
84     NA   82
85     80   86
86    108   85
87     20   82
88     52   86
89     82   88
90     50   86
91     64   83
92     59   81
93     39   81
94      9   81
95     16   82
96     78   86
97     35   85
98     66   87
99    122   89
100    89   90
101   110   90
102    NA   92
103    NA   86
104    44   86
105    28   82
117   168   81
118    73   86
119    NA   88
120    76   97
121   118   94
122    84   96
123    85   94
124    96   91
125    78   92
126    73   93
127    91   93
128    47   87
129    32   84
134    44   81
143    16   82
146    36   81

Combining

Combining

  • There are two basic ways to combine data frames
  • The first is by sticking (or binding) two data frames together
  • The second is by merging

Combining

  • rbind() is a way to add rows to a data frame
  • You can think of it as stacking two data frames on top of each other
  • The restrictions are that both data frames must have the same number of columns, and the column names must match

Combining

Lets take two subsets of the air quality data frame and rbind them

aq1 <- aq[1:2, ]
aq2 <- aq[35:37, ]
aq.stacked <- rbind(aq1, aq2)
aq.stacked
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
35    NA     186  9.2   84     6   4
36    NA     220  8.6   85     6   5
37    NA     264 14.3   79     6   6

Combining

If the number of columns aren't the same, it won't work

aq1.chopped <- aq1[, 1:3]
aq1.chopped
  Ozone Solar.R Wind
1    41     190  7.4
2    36     118  8.0
rbind(aq1.chopped, aq2) # gives an error

Combining

If the column names aren't the same, it won't work

names(aq1)[1] <- "nuhuh"
aq1
  nuhuh Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
rbind(aq1, aq2) # gives an error

Combining

  • cbind() is a way to add columns to a data frame
  • The restriction is that both data frames must have the same number of rows

Combining

Now we'll split the data frame by columns and cbind them

aq.L <- aq[, 1:2]
aq.R <- aq[, 4:5]
aq.bound <- cbind(aq.L, aq.R)
aq.bound
    Ozone Solar.R Temp Month
1      41     190   67     5
2      36     118   72     5
3      12     149   74     5
4      18     313   62     5
5      NA      NA   56     5
6      28      NA   66     5
7      23     299   65     5
8      19      99   59     5
9       8      19   61     5
10     NA     194   69     5
11      7      NA   74     5
12     16     256   69     5
13     11     290   66     5
14     14     274   68     5
15     18      65   58     5
16     14     334   64     5
17     34     307   66     5
18      6      78   57     5
19     30     322   68     5
20     11      44   62     5
21      1       8   59     5
22     11     320   73     5
23      4      25   61     5
24     32      92   61     5
25     NA      66   57     5
26     NA     266   58     5
27     NA      NA   57     5
28     23      13   67     5
29     45     252   81     5
30    115     223   79     5
31     37     279   76     5
32     NA     286   78     6
33     NA     287   74     6
34     NA     242   67     6
35     NA     186   84     6
36     NA     220   85     6
37     NA     264   79     6
38     29     127   82     6
39     NA     273   87     6
40     71     291   90     6
41     39     323   87     6
42     NA     259   93     6
43     NA     250   92     6
44     23     148   82     6
45     NA     332   80     6
46     NA     322   79     6
47     21     191   77     6
48     37     284   72     6
49     20      37   65     6
50     12     120   73     6
51     13     137   76     6
52     NA     150   77     6
53     NA      59   76     6
54     NA      91   76     6
55     NA     250   76     6
56     NA     135   75     6
57     NA     127   78     6
58     NA      47   73     6
59     NA      98   80     6
60     NA      31   77     6
61     NA     138   83     6
62    135     269   84     7
63     49     248   85     7
64     32     236   81     7
65     NA     101   84     7
66     64     175   83     7
67     40     314   83     7
68     77     276   88     7
69     97     267   92     7
70     97     272   92     7
71     85     175   89     7
72     NA     139   82     7
73     10     264   73     7
74     27     175   81     7
75     NA     291   91     7
76      7      48   80     7
77     48     260   81     7
78     35     274   82     7
79     61     285   84     7
80     79     187   87     7
81     63     220   85     7
82     16       7   74     7
83     NA     258   81     7
84     NA     295   82     7
85     80     294   86     7
86    108     223   85     7
87     20      81   82     7
88     52      82   86     7
89     82     213   88     7
90     50     275   86     7
91     64     253   83     7
92     59     254   81     7
93     39      83   81     8
94      9      24   81     8
95     16      77   82     8
96     78      NA   86     8
97     35      NA   85     8
98     66      NA   87     8
99    122     255   89     8
100    89     229   90     8
101   110     207   90     8
102    NA     222   92     8
103    NA     137   86     8
104    44     192   86     8
105    28     273   82     8
106    65     157   80     8
107    NA      64   79     8
108    22      71   77     8
109    59      51   79     8
110    23     115   76     8
111    31     244   78     8
112    44     190   78     8
113    21     259   77     8
114     9      36   72     8
115    NA     255   75     8
116    45     212   79     8
117   168     238   81     8
118    73     215   86     8
119    NA     153   88     8
120    76     203   97     8
121   118     225   94     8
122    84     237   96     8
123    85     188   94     8
124    96     167   91     9
125    78     197   92     9
126    73     183   93     9
127    91     189   93     9
128    47      95   87     9
129    32      92   84     9
130    20     252   80     9
131    23     220   78     9
132    21     230   75     9
133    24     259   73     9
134    44     236   81     9
135    21     259   76     9
136    28     238   77     9
137     9      24   71     9
138    13     112   71     9
139    46     237   78     9
140    18     224   67     9
141    13      27   76     9
142    24     238   68     9
143    16     201   82     9
144    13     238   64     9
145    23      14   71     9
146    36     139   81     9
147     7      49   69     9
148    14      20   63     9
149    30     193   70     9
150    NA     145   77     9
151    14     191   75     9
152    18     131   76     9
153    20     223   68     9

Combining

If both data frames don't have the same number of rows, it won't work

cbind(aq.L[1:10, ], aq.R) # gives an error

Combining

  • merge() allows you to combine two data frames that may not have the same dimensions or column names
  • The two data frames are merged on the column names that both data frames have, or the columns to be merged on can be specified
  • See ?merge() for details

Combining

Here's a data frame with three air monitor locations

monitors <- read.table(header=T, text='
  monitorid        lat       long                 
          1  42.467573 -87.810047     
          2  42.049148 -88.273029
          3  39.110539 -90.324080
                       ')

Combining

And here's a data frame with pollutant and monitor information

pollutants <- read.table(header=T, text='
  pollutant   duration    monitorid                 
      ozone         1h            1
        so2         1h            1
      ozone         8h            2
        no2         1h            4
                       ')

Combining

  • Suppose you want to merge both data frames
merge(monitors, pollutants)
  monitorid   lat   long pollutant duration
1         1 42.47 -87.81     ozone       1h
2         1 42.47 -87.81       so2       1h
3         2 42.05 -88.27     ozone       8h
  • The default operation is an “inner join” on the two data frames
  • Only rows that have matched monitorid values are kept

Combining

If you wanted to keep all of the monitor locations (“left join”), you would use the all.x = T parameter

merge(monitors, pollutants, all.x = T)
  monitorid   lat   long pollutant duration
1         1 42.47 -87.81     ozone       1h
2         1 42.47 -87.81       so2       1h
3         2 42.05 -88.27     ozone       8h
4         3 39.11 -90.32      <NA>     <NA>

Combining

If you want to keep all of the rows in the pollutant data frame (“right join”), you can use all.y =

merge(monitors, pollutants, all.y = T)
  monitorid   lat   long pollutant duration
1         1 42.47 -87.81     ozone       1h
2         1 42.47 -87.81       so2       1h
3         2 42.05 -88.27     ozone       8h
4         4    NA     NA       no2       1h

Combining

If you wanted to keep all of the information, you can do an “outer join” using all = T

merge(monitors, pollutants, all = T)
  monitorid   lat   long pollutant duration
1         1 42.47 -87.81     ozone       1h
2         1 42.47 -87.81       so2       1h
3         2 42.05 -88.27     ozone       8h
4         3 39.11 -90.32      <NA>     <NA>
5         4    NA     NA       no2       1h

Reshaping

Reshaping

  • Sometimes a data frame is in a “wide” format, but we want it in a “long”“ format (or vice versa)
  • The easiest way to do this is to use the reshape2 package

Reshaping

  • Here's a data frame in a wide format
  • We have a few days of monitor values for sevaral pollutants (and temperature) from one monitor
wide <- read.table(header=T, text='
  day  temp  ozone  so2  no2
    1    75     50   10    2
    2    81     61   25   12
    3    68     42   22    3           
                         ')

Reshaping

  • We may not want all of the measurements to be in separate columns
  • If we want all of the values to be in one column, with another column indicating what pollutant or met parameter is being measured, we would use the melt() function from the reshape2 package
library(reshape2)
melt(wide, id.vars = c("day"))

Reshaping

   day variable value
1    1     temp    75
2    2     temp    81
3    3     temp    68
4    1    ozone    50
5    2    ozone    61
6    3    ozone    42
7    1      so2    10
8    2      so2    25
9    3      so2    22
10   1      no2     2
11   2      no2    12
12   3      no2     3

Reshaping

If we had a table with more than one monitor, with a separate monitor column, we would include it as an id variable

wide <- read.table(header=T, text='
  day  monitor temp  ozone  so2  no2
    1        1   75     50   10    2
    2        1   81     61   25   12
    3        1   68     42   22    3  
    1        2   72     51   10    3
    2        2   85     63   25   13
    3        2   66     45   22    2 
                         ')

Reshaping

melt(wide, id.vars = c("day", "monitor"))
   day monitor variable value
1    1       1     temp    75
2    2       1     temp    81
3    3       1     temp    68
4    1       2     temp    72
5    2       2     temp    85
6    3       2     temp    66
7    1       1    ozone    50
8    2       1    ozone    61
9    3       1    ozone    42
10   1       2    ozone    51
11   2       2    ozone    63
12   3       2    ozone    45
13   1       1      so2    10
14   2       1      so2    25
15   3       1      so2    22
16   1       2      so2    10
17   2       2      so2    25
18   3       2      so2    22
19   1       1      no2     2
20   2       1      no2    12
21   3       1      no2     3
22   1       2      no2     3
23   2       2      no2    13
24   3       2      no2     2

Reshaping

  • To take a long data frame and convert it to a wide data frame, we use the dcast() function
  • We'll use the long data frame we just created
long <- melt(wide, id.vars = c("day", "monitor"))

Reshaping

  • We want to keep the columns day and monitor the same
  • variable is the column that will contain the names of the new columns
  • value contains the measurements

Reshaping

dcast(long, day + monitor ~ variable, value.var="value")
  day monitor temp ozone so2 no2
1   1       1   75    50  10   2
2   1       2   72    51  10   3
3   2       1   81    61  25  12
4   2       2   85    63  25  13
5   3       1   68    42  22   3
6   3       2   66    45  22   2

R Slides