A simple tutorial to remind myself of the intricacies of data.frame manipulation.

c <- data.frame(foo=c(1,NA,2,3,4,6,7,8,1,2),bar=c(T,F,T,F,T,F,T,NA,T,T),norf=c('a','b','c','d',NA,'f','g','h',NA,'i'))
c
##    foo   bar norf
## 1    1  TRUE    a
## 2   NA FALSE    b
## 3    2  TRUE    c
## 4    3 FALSE    d
## 5    4  TRUE <NA>
## 6    6 FALSE    f
## 7    7  TRUE    g
## 8    8    NA    h
## 9    1  TRUE <NA>
## 10   2  TRUE    i

The data.frame above has three columns: the first is numeric, the second is logical (TRUE/FALSE), and the third is character. A few missing values (NA) are scattered across the columns.
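To check the column types and count the missing values rather than reading them off the printout, something like the sketch below can help. It rebuilds the same data.frame so that it runs on its own. The stringsAsFactors = FALSE argument and the names col_classes and na_counts are additions for this illustration, not part of the original example. Before R 4.0.0, data.frame() converted strings to factors by default, so setting the argument explicitly keeps norf a character column on any version.

```r
# Same data as above, rebuilt so this snippet is self-contained.
# stringsAsFactors = FALSE is an added assumption to pin norf as character.
c <- data.frame(foo = c(1, NA, 2, 3, 4, 6, 7, 8, 1, 2),
                bar = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, NA, TRUE, TRUE),
                norf = c('a', 'b', 'c', 'd', NA, 'f', 'g', 'h', NA, 'i'),
                stringsAsFactors = FALSE)

# Class of each column: numeric, logical, character
col_classes <- sapply(c, class)

# Number of NA values per column: foo 1, bar 1, norf 2
na_counts <- colSums(is.na(c))
```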

cc <- complete.cases(c)
cc
##  [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE

The function complete.cases() checks every row and returns a logical vector with one entry per row: FALSE if any column in that row is NA, TRUE otherwise.
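As a rough way to see what complete.cases() computes, the same vector can be built by hand from is.na(). This is only a sketch of the idea, not how the function is implemented internally, and the name manual_cc is made up for this example.

```r
# Same data as above, rebuilt so this snippet is self-contained.
c <- data.frame(foo = c(1, NA, 2, 3, 4, 6, 7, 8, 1, 2),
                bar = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, NA, TRUE, TRUE),
                norf = c('a', 'b', 'c', 'd', NA, 'f', 'g', 'h', NA, 'i'),
                stringsAsFactors = FALSE)

# is.na(c) is a logical matrix; a row is complete when it contains zero NAs.
manual_cc <- rowSums(is.na(c)) == 0

# Compare with the built-in (unname() drops the row-name labels).
same <- all(unname(manual_cc) == complete.cases(c))
```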

c[cc,]
##    foo   bar norf
## 1    1  TRUE    a
## 3    2  TRUE    c
## 4    3 FALSE    d
## 6    6 FALSE    f
## 7    7  TRUE    g
## 10   2  TRUE    i

This subsets the data.frame by rows using the result of complete.cases(). Note the important ‘,’ after the logical vector: the index before the comma selects rows, and leaving the position after it empty keeps all columns.
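A small sketch of why the ‘,’ matters: with a single index and no comma, a data.frame is indexed like a list of columns, so the same number selects something quite different. The names row_two and col_two are made up for this illustration.

```r
# Same data as above, rebuilt so this snippet is self-contained.
c <- data.frame(foo = c(1, NA, 2, 3, 4, 6, 7, 8, 1, 2),
                bar = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, NA, TRUE, TRUE),
                norf = c('a', 'b', 'c', 'd', NA, 'f', 'g', 'h', NA, 'i'),
                stringsAsFactors = FALSE)

# With the comma: row 2, all columns (1 row x 3 columns)
row_two <- c[2, ]

# Without the comma: column 2 only, all rows (10 rows x 1 column)
col_two <- c[2]

dim(row_two)
dim(col_two)
```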

c[c$foo>3,]
##    foo   bar norf
## NA  NA    NA <NA>
## 5    4  TRUE <NA>
## 6    6 FALSE    f
## 7    7  TRUE    g
## 8    8    NA    h

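Another way to subset a data.frame is by a named column, using the $ operator followed by the column name. Note that the column must be prefixed with the name of the data.frame, as in c$foo. Because foo is NA in row 2, the comparison c$foo>3 is NA for that row, which is why the result starts with a row of NAs.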
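If the all-NA row is unwanted, two common ways to drop it are wrapping the condition in which(), or using subset(). Both are base R functions. The sketch below is one possible approach, and the names via_which and via_subset are made up for this example.

```r
# Same data as above, rebuilt so this snippet is self-contained.
c <- data.frame(foo = c(1, NA, 2, 3, 4, 6, 7, 8, 1, 2),
                bar = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, NA, TRUE, TRUE),
                norf = c('a', 'b', 'c', 'd', NA, 'f', 'g', 'h', NA, 'i'),
                stringsAsFactors = FALSE)

# which() returns only the positions where the condition is TRUE,
# so rows where the comparison is NA are left out.
via_which <- c[which(c$foo > 3), ]

# subset() evaluates foo inside the data.frame (no c$ prefix needed)
# and also treats NA in the condition as "do not keep".
via_subset <- subset(c, foo > 3)

rownames(via_which)
```

Both results keep rows 5, 6, 7 and 8 only.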

c[c$foo>3,][,1:2]
##    foo   bar
## NA  NA    NA
## 5    4  TRUE
## 6    6 FALSE
## 7    7  TRUE
## 8    8    NA

You can “chain” subset operations one after another. The example above first keeps the rows where foo is greater than 3, then applies a second subset to that result, taking only its first two columns. The range 1:2 after the ‘,’ specifies that we want the first two columns.
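For this particular case the two steps can also be written as one call, with the row condition before the ‘,’ and the column range after it. The sketch below compares the two forms. The names chained and single are made up for this illustration.

```r
# Same data as above, rebuilt so this snippet is self-contained.
c <- data.frame(foo = c(1, NA, 2, 3, 4, 6, 7, 8, 1, 2),
                bar = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, NA, TRUE, TRUE),
                norf = c('a', 'b', 'c', 'd', NA, 'f', 'g', 'h', NA, 'i'),
                stringsAsFactors = FALSE)

# Two chained subsets: rows first, then columns of the result
chained <- c[c$foo > 3, ][, 1:2]

# One subset with both a row and a column index
single <- c[c$foo > 3, 1:2]

# The two results hold the same rows and columns
same <- isTRUE(all.equal(chained, single))
```

Chaining is still useful when the second step depends on the result of the first, for example when selecting rows by position within the filtered result.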

c[c$foo>3,][1:2,]
##    foo  bar norf
## NA  NA   NA <NA>
## 5    4 TRUE <NA>

If the range 1:2 comes before the ‘,’ instead, you get the first two rows of the result.
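Taking the first few rows is common enough that base R also provides head() for it. As a sketch, the following should give the same two rows as the [1:2,] form. The names filtered, by_index and by_head are made up for this example.

```r
# Same data as above, rebuilt so this snippet is self-contained.
c <- data.frame(foo = c(1, NA, 2, 3, 4, 6, 7, 8, 1, 2),
                bar = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, NA, TRUE, TRUE),
                norf = c('a', 'b', 'c', 'd', NA, 'f', 'g', 'h', NA, 'i'),
                stringsAsFactors = FALSE)

filtered <- c[c$foo > 3, ]

# First two rows by position
by_index <- filtered[1:2, ]

# First two rows using head()
by_head <- head(filtered, 2)

same <- isTRUE(all.equal(by_index, by_head))
```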