# Draw 100 points from a normal distribution with mean = 6, sd = 7
x <- rnorm(100, 6, 7)
# view distribution
hist(x)
# sample 5 without replacement
sample(x, 5)
## [1] 1.856 8.253 8.575 18.531 5.367
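Because rnorm() and sample() draw random numbers, the values printed above will differ on every run. A minimal sketch, assuming you want repeatable output, is to fix the seed first. The seed value 42 and the names s and s_rep are arbitrary choices for illustration, not part of the original example.

```r
# Fix the random seed so the draws are repeatable (42 is an arbitrary choice)
set.seed(42)
x <- rnorm(100, mean = 6, sd = 7)

# sample() draws without replacement by default,
# so no position in x is picked twice
s <- sample(x, 5)

# Pass replace = TRUE to allow the same element to be drawn more than once
s_rep <- sample(x, 5, replace = TRUE)
```

Re-running the whole block with the same seed should return the same five values each time.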
Ben Weinstein
The data frame is one of the most important data types in R. It is the de facto structure for tabular data and the one we use for most statistics. Data frames carry additional attributes, such as row and column names (accessed with rownames() and colnames()), which are useful for annotating data.
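As a small illustration of row and column names, the sketch below builds a toy data frame. The object df and its values are hypothetical and only serve to show how rownames() and colnames() behave.

```r
# Build a small data frame from two equal-length vectors (hypothetical toy data)
df <- data.frame(height = c(1.2, 3.4, 2.2), count = c(10L, 4L, 7L))

colnames(df)   # "height" "count"
rownames(df)   # "1" "2" "3" (the default row names)

# Row names can be replaced with more descriptive labels
rownames(df) <- c("plotA", "plotB", "plotC")

# Labels can then be used to look up a cell
df["plotB", "count"]   # 4
```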
Useful functions
Data frames are usually read in from a file, but R also comes with many practice datasets. We will use the iris dataset, famously used by Fisher in 1936 (http://en.wikipedia.org/wiki/Iris_flower_data_set).
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 1 | 5.10 | 3.50 | 1.40 | 0.20 | setosa |
| 2 | 4.90 | 3.00 | 1.40 | 0.20 | setosa |
| 3 | 4.70 | 3.20 | 1.30 | 0.20 | setosa |
| 4 | 4.60 | 3.10 | 1.50 | 0.20 | setosa |
| 5 | 5.00 | 3.60 | 1.40 | 0.20 | setosa |
| 6 | 5.40 | 3.90 | 1.70 | 0.40 | setosa |
Try the basic dataframe functions:
- 1. head() - see first 6 rows
- 2. tail() - see last 6 rows
- 3. dim() - see dimensions
- 4. nrow() - number of rows
- 5. ncol() - number of columns
- 6. str() - structure of the object
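The functions listed above can be tried directly on iris. This is a short sketch; the comments note the values these calls return for the built-in dataset.

```r
data(iris)      # iris ships with base R (datasets package)

head(iris)      # first 6 rows by default
tail(iris, 3)   # last 3 rows; the default is 6
dim(iris)       # 150 5
nrow(iris)      # 150
ncol(iris)      # 5
str(iris)       # column types: four numeric columns and one factor
```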
R has many powerful subset operators, and mastering them lets you perform complex operations on any kind of dataset very succinctly. A data frame is akin to a series of equal-length vectors combined into a tabular structure.
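One way to see the "series of vectors" idea is to inspect iris as a list of columns. This is only a sketch using base R functions.

```r
is.list(iris)         # TRUE: a data frame is a list of columns
length(iris)          # 5, one element per column
sapply(iris, class)   # the class of each column vector

# Every column holds exactly one value per row
length(iris$Sepal.Length) == nrow(iris)   # TRUE
```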
head(iris)
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 1 | 5.10 | 3.50 | 1.40 | 0.20 | setosa |
| 2 | 4.90 | 3.00 | 1.40 | 0.20 | setosa |
| 3 | 4.70 | 3.20 | 1.30 | 0.20 | setosa |
| 4 | 4.60 | 3.10 | 1.50 | 0.20 | setosa |
| 5 | 5.00 | 3.60 | 1.40 | 0.20 | setosa |
| 6 | 5.40 | 3.90 | 1.70 | 0.40 | setosa |
Use the $ sign to call a column
x <- iris$Sepal.Length
head(x)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
y <- iris[, "Sepal.Length"]
head(y)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
Indexing a dataframe is given by nameofdf[rows,columns]
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
# Identical to iris$Sepal.Length
SL <- iris[, "Sepal.Length"]
head(SL)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
SL2 <- iris[, 1]
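The [rows, columns] pattern can be sketched with a few more combinations. Leaving one side of the comma empty selects everything along that dimension.

```r
iris[1, 1]             # single cell: row 1, column 1 -> 5.1
iris[1:3, "Species"]   # rows 1 to 3 of the Species column
iris[2, ]              # all columns of row 2 (a one-row data frame)

# The three ways of extracting the first column give the same vector
identical(iris$Sepal.Length, iris[, "Sepal.Length"])   # TRUE
identical(iris[, "Sepal.Length"], iris[, 1])           # TRUE
```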
Get all rows and the first two columns
a <- iris[, 1:2]
| | Sepal.Length | Sepal.Width |
|---|---|---|
| 1 | 5.10 | 3.50 |
| 2 | 4.90 | 3.00 |
| 3 | 4.70 | 3.20 |
| 4 | 4.60 | 3.10 |
| 5 | 5.00 | 3.60 |
| 6 | 5.40 | 3.90 |
An extremely helpful technique is to subset your data with a logical condition rather than an index. Let's begin with continuous numeric data.
# Grab the Petal.Width
x <- iris$Petal.Width
# Histogram
hist(x)
# Which values are greater than 1?
x <- iris$Petal.Width
logi <- x > 1
# Return all columns for the rows where iris$Petal.Width is greater than 1
a <- iris[logi, ]
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 51 | 7.00 | 3.20 | 4.70 | 1.40 | versicolor |
| 52 | 6.40 | 3.20 | 4.50 | 1.50 | versicolor |
| 53 | 6.90 | 3.10 | 4.90 | 1.50 | versicolor |
| 54 | 5.50 | 2.30 | 4.00 | 1.30 | versicolor |
| 55 | 6.50 | 2.80 | 4.60 | 1.50 | versicolor |
| 56 | 5.70 | 2.80 | 4.50 | 1.30 | versicolor |
# The same subset in one step: all columns for rows where iris$Petal.Width is greater than 1
a <- iris[iris$Petal.Width > 1, ]
a <- iris[, 1:2]
# Return the values of the same column that was used to build the subset.
head(x[logi])
## [1] 1.4 1.5 1.5 1.3 1.5 1.3
Return a different column based on the subset
# Return all values in Petal.Length where Petal.Width is greater than 1.
head(iris[logi, "Petal.Length"])
## [1] 4.7 4.5 4.9 4.0 4.6 4.5
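It can help to look at the logical vector itself before using it as an index. The sketch below reuses the x and logi names from the code above and relies only on base R.

```r
x <- iris$Petal.Width
logi <- x > 1

class(logi)         # "logical"
length(logi)        # 150, one TRUE/FALSE per row
sum(logi)           # number of matching rows (TRUE counts as 1)
head(which(logi))   # positions of the first matches: 51 52 53 54 55 56

# Indexing with the logical vector keeps exactly the TRUE rows
a <- iris[logi, ]
nrow(a) == sum(logi)   # TRUE
```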
Many different types of logical statements can be used to subset data.
For all types of data, if we want the rows that match a specific value (a number, a character string, or a factor level), we use ==
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 54 | 5.50 | 2.30 | 4.00 | 1.30 | versicolor |
| 63 | 6.00 | 2.20 | 4.00 | 1.00 | versicolor |
| 72 | 6.10 | 2.80 | 4.00 | 1.30 | versicolor |
| 90 | 5.50 | 2.50 | 4.00 | 1.30 | versicolor |
| 93 | 5.80 | 2.60 | 4.00 | 1.20 | versicolor |
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 51 | 7.00 | 3.20 | 4.70 | 1.40 | versicolor |
| 52 | 6.40 | 3.20 | 4.50 | 1.50 | versicolor |
| 53 | 6.90 | 3.10 | 4.90 | 1.50 | versicolor |
| 54 | 5.50 | 2.30 | 4.00 | 1.30 | versicolor |
| 55 | 6.50 | 2.80 | 4.60 | 1.50 | versicolor |
| 56 | 5.70 | 2.80 | 4.50 | 1.30 | versicolor |
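The code that produced the two tables above is not shown. Judging from their contents, they are consistent with the filters below. Treat this as a reconstruction under that assumption, not as the original code. The names pl4 and versi are introduced here only for illustration.

```r
# Rows where Petal.Length is exactly 4
pl4 <- iris[iris$Petal.Length == 4, ]
pl4

# Rows where the Species factor equals "versicolor"
versi <- iris[iris$Species == "versicolor", ]
head(versi)
```

Note that == tests for equality, while a single = is an assignment, so the two are not interchangeable inside the brackets.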
Statements can be combined using & (and) if you want both statements to be true, or | (or) if you want either statement to be true.
subsetX <- iris[iris$Petal.Length > 4 & iris$Species == "versicolor", ]
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 51 | 7.00 | 3.20 | 4.70 | 1.40 | versicolor |
| 52 | 6.40 | 3.20 | 4.50 | 1.50 | versicolor |
| 53 | 6.90 | 3.10 | 4.90 | 1.50 | versicolor |
| 55 | 6.50 | 2.80 | 4.60 | 1.50 | versicolor |
| 56 | 5.70 | 2.80 | 4.50 | 1.30 | versicolor |
| 57 | 6.30 | 3.30 | 4.70 | 1.60 | versicolor |
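To contrast the two operators, the sketch below stores each condition in its own vector and then combines them both ways. The names long_petal, is_versi, both and either are hypothetical helpers, not part of the original code.

```r
long_petal <- iris$Petal.Length > 4
is_versi   <- iris$Species == "versicolor"

both   <- iris[long_petal & is_versi, ]   # rows meeting both conditions
either <- iris[long_petal | is_versi, ]   # rows meeting at least one condition

# "and" can never keep more rows than "or"
nrow(both) <= nrow(either)   # TRUE

# Only one species survives the "and" filter; "or" lets a second one in
unique(as.character(both$Species))     # "versicolor"
unique(as.character(either$Species))   # "versicolor" "virginica"
```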
Explain in words what each of the following statements returns:
iris[1:4,]
iris[c(1:15),c(1,3)]
iris[iris$Species == "setosa","Petal.Width"]
What happens when you add a ! before a logical statement?
Hint: compare iris[iris$Species == "setosa",] with iris[!iris$Species == "setosa",]