Let's just put together a mini data set:
name = c("john", "sue", "terrence", "amy.Smith")
height = c(68, 62, 66, 70)
weight = c(170, 110, 155, NA)
sex = c("Male", "Female", "Male", "Female")
data = data.frame(nAme = name, height = height, Weight = weight, sex = sex)
summary(data)
## nAme height Weight sex
## amy.Smith:1 Min. :62.0 Min. :110 Female:2
## john :1 1st Qu.:65.0 1st Qu.:132 Male :2
## sue :1 Median :67.0 Median :155
## terrence :1 Mean :66.5 Mean :145
## 3rd Qu.:68.5 3rd Qu.:162
## Max. :70.0 Max. :170
## NA's :1
data is a data.frame containing each person's name, height, weight and sex.
data
## nAme height Weight sex
## 1 john 68 170 Male
## 2 sue 62 110 Female
## 3 terrence 66 155 Male
## 4 amy.Smith 70 NA Female
Oops, we've misnamed some of the columns; let's fix that.
names(data) <- c("name", "height", "weight", "sex")
names(data)
## [1] "name" "height" "weight" "sex"
data
## name height weight sex
## 1 john 68 170 Male
## 2 sue 62 110 Female
## 3 terrence 66 155 Male
## 4 amy.Smith 70 NA Female
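Replacing the whole names vector works, but it depends on getting the column order right. As a small sketch of an alternative, the misnamed columns can be matched by their old names instead. The data frame is rebuilt here under the name df (chosen for this illustration) so that the block runs on its own:

```r
# Rebuild the misnamed data frame so this block is self-contained.
df <- data.frame(nAme = c("john", "sue", "terrence", "amy.Smith"),
                 height = c(68, 62, 66, 70),
                 Weight = c(170, 110, 155, NA),
                 sex = c("Male", "Female", "Male", "Female"))

# Rename only the columns that are wrong, matching on the old name.
names(df)[names(df) == "nAme"] <- "name"
names(df)[names(df) == "Weight"] <- "weight"

# In this particular case, lower-casing every name would do the same job:
# names(df) <- tolower(names(df))

names(df)
```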
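If we attach the data set to R's search path, we can access the variables directly, e.g. by calling name rather than data$name. Note the message R prints here, though: vectors called name, height, weight and sex already exist in the global environment, so those are the objects the bare names refer to, not the columns of data.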
attach(data)
## The following object(s) are masked _by_ '.GlobalEnv':
##
## height, name, sex, weight
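One way to avoid this kind of name clash is with(), which evaluates a single expression using the columns of a data frame without touching the search path. A minimal sketch, using a copy of the data called df (a name chosen for this illustration):

```r
df <- data.frame(name = c("john", "sue", "terrence", "amy.Smith"),
                 height = c(68, 62, 66, 70),
                 weight = c(170, 110, 155, NA),
                 sex = c("Male", "Female", "Male", "Female"))

# Inside with(), height and weight refer to the columns of df,
# regardless of what exists in the global environment.
mean_height <- with(df, mean(height))
mean_weight <- with(df, mean(weight, na.rm = TRUE))  # skip the missing value

mean_height
mean_weight
```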
Amy seems to have a last name. I'd prefer to remove that.
splitnames <- strsplit(name, "\\.")
splitnames
## [[1]]
## [1] "john"
##
## [[2]]
## [1] "sue"
##
## [[3]]
## [1] "terrence"
##
## [[4]]
## [1] "amy" "Smith"
firstelt <- function(x) {
x[1]
} # firstelt is now a function taking a vector x and returning its first element
newnames <- sapply(splitnames, firstelt) # sapply(list,function) applies the function to each element of the list
newnames
## [1] "john" "sue" "terrence" "amy"
name <- newnames # this updates only the stand-alone vector name; the name column inside data is unchanged
data
## name height weight sex
## 1 john 68 170 Male
## 2 sue 62 110 Female
## 3 terrence 66 155 Male
## 4 amy.Smith 70 NA Female
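To carry the cleaned names into the data frame itself, the column has to be assigned directly. A minimal sketch, again on a copy called df (a name chosen for this illustration); stringsAsFactors = FALSE is set explicitly because strsplit needs character input and older versions of R would create a factor:

```r
df <- data.frame(name = c("john", "sue", "terrence", "amy.Smith"),
                 height = c(68, 62, 66, 70),
                 weight = c(170, 110, 155, NA),
                 sex = c("Male", "Female", "Male", "Female"),
                 stringsAsFactors = FALSE)

# Split each name at the literal "." and keep the first piece,
# then write the result back into the column.
pieces <- strsplit(df$name, ".", fixed = TRUE)
df$name <- sapply(pieces, function(x) x[1])

df$name
```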
Let's break the data into two sets depending on whether the person is male or female.
men = data[sex == "Male", ] # == tests for equality and returns a logical vector
# We are conditioning on the rows and the columns don't have any condition
# (so no condition after the ,)
men
## name height weight sex
## 1 john 68 170 Male
## 3 terrence 66 155 Male
women = data[!(sex == "Male"), ]
women
## name height weight sex
## 2 sue 62 110 Female
## 4 amy.Smith 70 NA Female
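When there are more than two groups, writing one subset per group gets tedious. As a sketch of an alternative, base R's split() returns a list with one data frame per value of the grouping variable. The copy is called df here for illustration:

```r
df <- data.frame(name = c("john", "sue", "terrence", "amy.Smith"),
                 height = c(68, 62, 66, 70),
                 weight = c(170, 110, 155, NA),
                 sex = c("Male", "Female", "Male", "Female"))

# One list element per distinct value of df$sex.
by_sex <- split(df, df$sex)

names(by_sex)
by_sex$Male
by_sex$Female
```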
Let's randomly sample part of our data set and then split the data on that sample. You can use this method to partition your data set into training, validation and testing sets. First we sample 2 rows:
n = length(name)
n
## [1] 4
set.seed(1002)
samp = sample(1:n, 2, replace = FALSE)
samp
## [1] 2 3
Now we split:
data.1 <- data[samp, ]
data.1
## name height weight sex
## 2 sue 62 110 Female
## 3 terrence 66 155 Male
data.2 <- data[-samp, ]
data.2
## name height weight sex
## 1 john 68 170 Male
## 4 amy.Smith 70 NA Female
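The same idea extends to a three-way training/validation/test partition. The sketch below is one possible approach, not the only one: it shuffles a vector of set labels and splits on it. The data frame toy, the seed and the 6/2/2 set sizes are all made up for illustration, since four rows are too few to divide three ways.

```r
set.seed(1)  # arbitrary seed, only so the shuffle is repeatable

# Hypothetical ten-row data set.
toy <- data.frame(id = 1:10, value = rnorm(10))

# One label per row, in the desired proportions, then shuffled.
labels <- rep(c("train", "valid", "test"), times = c(6, 2, 2))
assignment <- sample(labels)

# split() returns a list with one data frame per label.
parts <- split(toy, assignment)

sapply(parts, nrow)
```

Because every row receives exactly one label, the three sets cannot overlap and together cover the whole data set.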
Now I've decided that I actually don't want anyone in my data set with an NA (missing observation):
any(is.na(name))
## [1] FALSE
check.if.NA <- function(var) {
any(is.na(var))
}
check.if.NA(name)
## [1] FALSE
check.if.NA(weight)
## [1] TRUE
apply(data, 2, check.if.NA) # apply applies the function check.if.NA to each column of the dataframe data. Note that replacing the 2 by a 1 applies check.if.NA to each row of the dataframe.
## name height weight sex
## FALSE FALSE TRUE FALSE
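Note that apply first converts the data frame to a matrix, which turns every column into character when the types are mixed. That is harmless for an NA check, but as a sketch of two alternatives that work column by column, again on a copy called df for illustration:

```r
df <- data.frame(name = c("john", "sue", "terrence", "amy.Smith"),
                 height = c(68, 62, 66, 70),
                 weight = c(170, 110, 155, NA),
                 sex = c("Male", "Female", "Male", "Female"))

# Does each column contain at least one NA?
has_na <- sapply(df, anyNA)

# How many NAs does each column contain?
na_counts <- colSums(is.na(df))

has_na
na_counts
```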
So the problem is in weight. Let's find out whose weight is missing:
name[which(is.na(weight))]
## [1] "amy"
# or use:
data[which(is.na(weight)), ]
## name height weight sex
## 4 amy.Smith 70 NA Female
new.data = data[!is.na(weight), ] # logical indexing; unlike -which(...), this also works when no value is missing
new.data
## name height weight sex
## 1 john 68 170 Male
## 2 sue 62 110 Female
## 3 terrence 66 155 Male
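The step above checks a single column. To drop rows with a missing value in any column, base R offers complete.cases() and na.omit(). The sketch below shows both on a copy called df (a name chosen for illustration), along with a caveat about negative which() indexing:

```r
df <- data.frame(name = c("john", "sue", "terrence", "amy.Smith"),
                 height = c(68, 62, 66, 70),
                 weight = c(170, 110, 155, NA),
                 sex = c("Male", "Female", "Male", "Female"))

# complete.cases() is TRUE for rows with no missing value in any column.
clean <- df[complete.cases(df), ]

# na.omit() drops the same rows in one call.
clean2 <- na.omit(df)

# Caveat: when nothing is missing, which() returns integer(0), and
# indexing with -integer(0) selects no rows, so every row is lost.
empty <- df[-which(is.na(df$height)), ]

nrow(clean)
nrow(empty)
```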