In R there are usually several ways to perform a task. The more recently developed ones are generally easier, since they were designed to avoid the difficulties of the earlier ones. Subsetting a dataframe is a common task that illustrates this.

The bracket operator

The original method is to use the bracket operator.

NewDF = OldDF[Rows I want, Columns I want]

Columns I want can be specified as a vector of column numbers or as a vector of column names enclosed in quotes.

Rows I want is usually expressed as a logical expression. Let’s take a look at the mtcars dataframe, which ships with every R installation in the datasets package. First run str() and summary() to look at it.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Suppose I want a smaller dataframe that contains only the cars with mpg > 20 and only the columns mpg, cyl and disp.

SmallDF = mtcars[mtcars$mpg > 20, c("mpg","cyl","disp")]
# let's look at it
SmallDF
##                 mpg cyl  disp
## Mazda RX4      21.0   6 160.0
## Mazda RX4 Wag  21.0   6 160.0
## Datsun 710     22.8   4 108.0
## Hornet 4 Drive 21.4   6 258.0
## Merc 240D      24.4   4 146.7
## Merc 230       22.8   4 140.8
## Fiat 128       32.4   4  78.7
## Honda Civic    30.4   4  75.7
## Toyota Corolla 33.9   4  71.1
## Toyota Corona  21.5   4 120.1
## Fiat X1-9      27.3   4  79.0
## Porsche 914-2  26.0   4 120.3
## Lotus Europa   30.4   4  95.1
## Volvo 142E     21.4   4 121.0

You can use column numbers instead of names.

SmallDF = mtcars[mtcars$mpg > 20, c(1,2,3)]
# let's look at it
SmallDF
##                 mpg cyl  disp
## Mazda RX4      21.0   6 160.0
## Mazda RX4 Wag  21.0   6 160.0
## Datsun 710     22.8   4 108.0
## Hornet 4 Drive 21.4   6 258.0
## Merc 240D      24.4   4 146.7
## Merc 230       22.8   4 140.8
## Fiat 128       32.4   4  78.7
## Honda Civic    30.4   4  75.7
## Toyota Corolla 33.9   4  71.1
## Toyota Corona  21.5   4 120.1
## Fiat X1-9      27.3   4  79.0
## Porsche 914-2  26.0   4 120.3
## Lotus Europa   30.4   4  95.1
## Volvo 142E     21.4   4 121.0
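If you want to convince yourself that the two forms select the same data, you can compare them with identical(). This is only a small sketch; ByName and ByNumber are names made up here for illustration. If both approaches pick out the same rows and columns, identical() returns TRUE.

```r
# Select the same rows and columns two ways
ByName   = mtcars[mtcars$mpg > 20, c("mpg", "cyl", "disp")]
ByNumber = mtcars[mtcars$mpg > 20, c(1, 2, 3)]

# identical() returns TRUE only if the two objects are exactly the same
identical(ByName, ByNumber)
```

Column names are usually the safer choice, since a column number silently points at a different variable if the columns are ever reordered.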

Note that the logical expression mtcars$mpg > 20 defines a vector of logical values. You could do this separately in advance and use the logical vector to specify the rows you want.

RowsIWant = mtcars$mpg > 20
# Let's look at it
RowsIWant
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
# Now use it to get the rows you want
SmallDF = mtcars[RowsIWant,c(1,2,3)]
# Look at it
SmallDF
##                 mpg cyl  disp
## Mazda RX4      21.0   6 160.0
## Mazda RX4 Wag  21.0   6 160.0
## Datsun 710     22.8   4 108.0
## Hornet 4 Drive 21.4   6 258.0
## Merc 240D      24.4   4 146.7
## Merc 230       22.8   4 140.8
## Fiat 128       32.4   4  78.7
## Honda Civic    30.4   4  75.7
## Toyota Corolla 33.9   4  71.1
## Toyota Corona  21.5   4 120.1
## Fiat X1-9      27.3   4  79.0
## Porsche 914-2  26.0   4 120.3
## Lotus Europa   30.4   4  95.1
## Volvo 142E     21.4   4 121.0
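Because a logical vector is just a series of TRUE and FALSE values, you can inspect it before using it. The sketch below assumes nothing beyond base R: sum() counts the TRUE values, which tells you how many rows to expect, and which() converts the logical vector into row positions.

```r
RowsIWant = mtcars$mpg > 20

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of rows selected
sum(RowsIWant)

# The subset has exactly that many rows
SmallDF = mtcars[RowsIWant, c(1, 2, 3)]
nrow(SmallDF) == sum(RowsIWant)

# which() converts the logical vector to the positions of the TRUE values
which(RowsIWant)
```

Checking the count first is a cheap way to catch a condition that selects far more or far fewer rows than you intended.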

Very frequently, you will want all of the columns. In this case, you need to include the comma, but follow it with nothing.

SmallDF = mtcars[RowsIWant,]
SmallDF
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

If you leave out the comma, R treats the logical vector as a selection of columns rather than rows, and here you’ll get an error message. Try it for yourself and note the error message you get. You will forget that comma many times in the near future. Making note of the error message now will help you to recover quickly.
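If you would rather see the message without stopping a script, one option is to wrap the call in tryCatch(). This is a minimal sketch, not the only way to do it; Result is a name chosen here for illustration.

```r
RowsIWant = mtcars$mpg > 20

# Without the comma, the single index is used to select columns.
# tryCatch() captures the resulting error instead of halting the script.
Result = tryCatch(
  mtcars[RowsIWant],
  error = function(e) conditionMessage(e)
)

# Result now holds the text of the error message
Result
```

Note that a missing comma does not always produce an error. For example, mtcars[1:3] quietly returns the first three columns, so it is worth checking the dimensions of a subset whenever the result looks odd.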