Data Frames and Subsetting in R

Let's just put together a mini data set:

name = c("john", "sue", "terrence", "amy.Smith")
height = c(68, 62, 66, 70)
weight = c(170, 110, 155, NA)
sex = c("Male", "Female", "Male", "Female")
data = data.frame(nAme = name, height = height, Weight = weight, sex = sex)
summary(data)

##         nAme       height         Weight        sex   
##  amy.Smith:1   Min.   :62.0   Min.   :110   Female:2  
##  john     :1   1st Qu.:65.0   1st Qu.:132   Male  :2  
##  sue      :1   Median :67.0   Median :155             
##  terrence :1   Mean   :66.5   Mean   :145             
##                3rd Qu.:68.5   3rd Qu.:162             
##                Max.   :70.0   Max.   :170             
##                               NA's   :1

data is a data.frame containing each person's name, height, weight and sex.

data

##        nAme height Weight    sex
## 1      john     68    170   Male
## 2       sue     62    110 Female
## 3  terrence     66    155   Male
## 4 amy.Smith     70     NA Female

Oops, we've misnamed some of the columns; let's fix that.

names(data) <- c("name", "height", "weight", "sex")
names(data)

## [1] "name"   "height" "weight" "sex"

data

##        name height weight    sex
## 1      john     68    170   Male
## 2       sue     62    110 Female
## 3  terrence     66    155   Male
## 4 amy.Smith     70     NA Female

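If we attach the data frame to R's search path, we can access its columns directly, e.g. by calling name rather than data$name. Note the warning printed below, though: the vectors name, height, weight and sex that we created earlier in the global environment mask the attached columns, so a bare name still refers to the global vector rather than to data$name.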

attach(data)

## The following object(s) are masked _by_ '.GlobalEnv':
## 
##     height, name, sex, weight
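
Because of that masking, it is often less surprising to skip attach() and name the data frame explicitly. The following is only a minimal sketch of two such ways; df is a hypothetical copy of the data frame, rebuilt here so the example runs on its own.

```r
# Hypothetical copy of the tutorial's data frame (df is not a name from the text above)
df <- data.frame(
  name   = c("john", "sue", "terrence", "amy.Smith"),
  height = c(68, 62, 66, 70),
  weight = c(170, 110, 155, NA),
  sex    = c("Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Explicit column access always reads from the data frame itself
mean_height <- mean(df$height)

# with() evaluates an expression with the columns in scope, without attach();
# na.rm = TRUE drops the missing weight before averaging
mean_weight <- with(df, mean(weight, na.rm = TRUE))
```

Both forms keep working even when a global variable happens to share a column's name.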

Amy seems to have a last name. I'd prefer to remove that.

splitnames <- strsplit(name, "\\.")
splitnames

## [[1]]
## [1] "john"
## 
## [[2]]
## [1] "sue"
## 
## [[3]]
## [1] "terrence"
## 
## [[4]]
## [1] "amy"   "Smith"

# firstelt takes a vector x and returns its first element
firstelt <- function(x) {
    x[1]
}
# sapply(list, function) applies the function to each element of the list
# and simplifies the result to a vector where possible
newnames <- sapply(splitnames, firstelt)
newnames

## [1] "john"     "sue"      "terrence" "amy"

name <- newnames  # updates only the global vector name; the name column of data is unchanged
data

##        name height weight    sex
## 1      john     68    170   Male
## 2       sue     62    110 Female
## 3  terrence     66    155   Male
## 4 amy.Smith     70     NA Female
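
To change the column inside the data frame itself, the cleaned names have to be assigned to the column rather than to a free-standing vector. A minimal sketch, using a hypothetical copy df so the tutorial's data stays as printed above:

```r
# Hypothetical copy with just the columns needed for this illustration
df <- data.frame(
  name   = c("john", "sue", "terrence", "amy.Smith"),
  height = c(68, 62, 66, 70),
  stringsAsFactors = FALSE
)

# Split each name on a literal "." and keep the first piece,
# then write the result back into the column
pieces <- strsplit(df$name, ".", fixed = TRUE)
df$name <- sapply(pieces, function(x) x[1])

# An equivalent one-liner: delete everything from the first "." onwards
first_names <- sub("\\..*$", "", c("john", "sue", "terrence", "amy.Smith"))
```

After this, printing df would show amy rather than amy.Smith in the fourth row.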

Let's break the data into two sets depending on whether the person is male or female.

men = data[sex == "Male", ]  # == tests for equality; a single = would assign
# The condition before the comma selects rows; leaving the position after
# the comma empty keeps all columns.
men

##       name height weight  sex
## 1     john     68    170 Male
## 3 terrence     66    155 Male

women = data[!(sex == "Male"), ]
women

##        name height weight    sex
## 2       sue     62    110 Female
## 4 amy.Smith     70     NA Female
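
The same split can be written without relying on attached or global vectors. The sketch below is one possible alternative, not the only way: split() cuts a data frame into a list of pieces by a grouping column, and subset() filters rows using column names directly. The names df, by_sex, men_df, women_df and tall_men are hypothetical.

```r
# Hypothetical copy of the tutorial's data frame
df <- data.frame(
  name   = c("john", "sue", "terrence", "amy.Smith"),
  height = c(68, 62, 66, 70),
  weight = c(170, 110, 155, NA),
  sex    = c("Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per value of sex
by_sex   <- split(df, df$sex)
men_df   <- by_sex[["Male"]]
women_df <- by_sex[["Female"]]

# subset() evaluates the condition inside the data frame,
# so several conditions can be combined with &
tall_men <- subset(df, sex == "Male" & height > 67)
```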

Let's randomly sample part of our dataset and then split the data on that sample. You can use this method to partition your data set into training, testing and validation sets. First we sample 2 rows:

n = length(name)
n

## [1] 4

set.seed(1002)
samp = sample(1:n, 2, replace = FALSE)
samp

## [1] 2 3

Now we split:

data.1 <- data[samp, ]
data.1

##       name height weight    sex
## 2      sue     62    110 Female
## 3 terrence     66    155   Male

data.2 <- data[-samp, ]
data.2

##        name height weight    sex
## 1      john     68    170   Male
## 4 amy.Smith     70     NA Female
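
For the training, testing and validation case, the same idea extends to three groups. The sketch below is an illustration under assumptions that are not from the text above: a toy data frame of 10 rows and a 6/2/2 split. It shuffles a vector of group labels and then splits on it.

```r
set.seed(1002)

# Hypothetical toy data: 10 rows with an id and an arbitrary value
toy <- data.frame(id = 1:10, value = seq(0.5, 5, by = 0.5))

# One label per row: 6 training, 2 validation, 2 testing (assumed proportions)
labels <- rep(c("train", "validation", "test"), times = c(6, 2, 2))

# sample() with no size argument returns a random permutation of the labels
group <- sample(labels)

# split() gives a named list of three non-overlapping data frames
parts <- split(toy, group)
```

Every row lands in exactly one of parts$train, parts$validation and parts$test, so the three pieces together reproduce the original rows.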

Now I've decided that I actually don't want anyone in my data set with an NA (missing observation):

any(is.na(name))

## [1] FALSE

check.if.NA <- function(var) {
    any(is.na(var))
}
check.if.NA(name)

## [1] FALSE

check.if.NA(weight)

## [1] TRUE

# apply(data, 2, f) applies the function f to each column of the data frame data;
# replacing the 2 with a 1 applies f to each row instead.
apply(data, 2, check.if.NA)

##   name height weight    sex 
##  FALSE  FALSE   TRUE  FALSE
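
One caveat: apply() first converts the data frame to a matrix, which for mixed column types means a character matrix. For a per-column check it can be simpler to iterate over the columns directly. A minimal sketch, with df, has_na and na_counts as hypothetical names:

```r
# Hypothetical copy of the tutorial's data frame
df <- data.frame(
  name   = c("john", "sue", "terrence", "amy.Smith"),
  height = c(68, 62, 66, 70),
  weight = c(170, 110, 155, NA),
  sex    = c("Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so sapply() visits each column in turn
has_na <- sapply(df, function(col) any(is.na(col)))

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(df))
```

Here has_na is a named logical vector and na_counts gives the number of missing values in each column.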

So weight contains a missing value. Let's find out whose it is:

name[which(is.na(weight))]

## [1] "amy"

# or use:
data[which(is.na(weight)), ]

##        name height weight    sex
## 4 amy.Smith     70     NA Female

new.data = data[-which(is.na(weight)), ]
new.data

##       name height weight    sex
## 1     john     68    170   Male
## 2      sue     62    110 Female
## 3 terrence     66    155   Male
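
The -which(...) idiom works here, but it has a known trap: if no value is missing, which() returns an empty vector and the negative index then selects zero rows instead of all of them. The sketch below shows that behaviour next to two alternatives, complete.cases() and a plain logical index; df and the other variable names are hypothetical.

```r
# Hypothetical copy of the tutorial's data frame
df <- data.frame(
  name   = c("john", "sue", "terrence", "amy.Smith"),
  height = c(68, 62, 66, 70),
  weight = c(170, 110, 155, NA),
  sex    = c("Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Keep only rows with no NA in any column
complete_rows <- df[complete.cases(df), ]

# na.omit() drops the same rows in one call
no_na <- na.omit(df)

# The trap: complete_rows has no missing weights left,
# so -which(...) is an empty index and every row is dropped
dropped_all <- complete_rows[-which(is.na(complete_rows$weight)), ]

# A logical index behaves as intended in both situations
kept_all <- complete_rows[!is.na(complete_rows$weight), ]
```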