Mike McCann
22-23 January 2015
A data frame is a very important data type in R.
It's pretty much the de facto data structure for most tabular data and what we use for statistics.
Data frames are usually read in from a file, but R comes with many practice datasets.
We will use the iris dataset, famously used by Fisher in 1936 (http://en.wikipedia.org/wiki/Iris_flower_data_set).
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 1 | 5.10 | 3.50 | 1.40 | 0.20 | setosa |
| 2 | 4.90 | 3.00 | 1.40 | 0.20 | setosa |
| 3 | 4.70 | 3.20 | 1.30 | 0.20 | setosa |
| 4 | 4.60 | 3.10 | 1.50 | 0.20 | setosa |
| 5 | 5.00 | 3.60 | 1.40 | 0.20 | setosa |
| 6 | 5.40 | 3.90 | 1.70 | 0.40 | setosa |
- head() - see the first 6 rows
- tail() - see the last 6 rows
- dim() - dimensions (# rows, # columns)
- nrow() - number of rows
- ncol() - number of columns
- str() - structure of any object
- class() - class of any object
- rownames() - row names
- colnames() - column names
How many rows does the iris data have?
How many columns? What are the column names?
Using the str() function, how many species are in the data?
What is the class of each column?
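The inspection functions behave the same way on any data frame. A minimal sketch using mtcars, another dataset that ships with R (chosen here only for illustration, so the iris questions stay open):

```r
# mtcars is a built-in dataset, used here only as a stand-in for iris
first_rows <- head(mtcars)     # first 6 rows by default
last_three <- tail(mtcars, 3)  # the second argument sets how many rows to return
size <- dim(mtcars)            # c(number of rows, number of columns)
str(mtcars)                    # compact summary: one line per column with its type
class(mtcars)                  # "data.frame"
colnames(mtcars)               # column names as a character vector
```

Swapping mtcars for iris in these calls is one way to answer the questions above.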
R has many powerful subset operators, and mastering them will allow you to easily perform complex operations on any kind of dataset. Data frames are akin to a series of vectors combined into a tabular structure.
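The "series of vectors" description can be checked directly. A small sketch in base R; the `[[ ]]` operator is not covered in this material and is used here only as the list-style way to pull out one column:

```r
is.list(iris)                        # TRUE: a data frame is stored as a list of columns
n_cols <- length(iris)               # number of list elements = number of columns
col_lengths <- sapply(iris, length)  # every column has one entry per row
first_col <- iris[[1]]               # list-style extraction returns the column vector
```

Because every column is an ordinary vector, anything that works on a vector also works on a single column of a data frame.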
head(iris)
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 1 | 5.10 | 3.50 | 1.40 | 0.20 | setosa |
| 2 | 4.90 | 3.00 | 1.40 | 0.20 | setosa |
| 3 | 4.70 | 3.20 | 1.30 | 0.20 | setosa |
| 4 | 4.60 | 3.10 | 1.50 | 0.20 | setosa |
| 5 | 5.00 | 3.60 | 1.40 | 0.20 | setosa |
| 6 | 5.40 | 3.90 | 1.70 | 0.40 | setosa |
Index individual elements of a data frame with nameofdf[row#, column#]
iris[1,1] # first row, first column
[1] 5.1
iris[3,3] # third row, third column
[1] 1.3
Get entire rows or columns
iris[2,] # the second row
iris[,3] # the third column
Columns can also be called by their name
iris[,"Sepal.Length"]
Use the dollar sign $ to call a column
iris$Sepal.Length
Three ways to get the first column Sepal.Length.
iris[,1]
iris[,"Sepal.Length"]
iris$Sepal.Length
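A quick way to confirm that the three forms return the same thing is identical(). The variable names below are illustrative, and the `drop = FALSE` argument is an extra that is not part of the material above:

```r
by_position <- iris[, 1]
by_name     <- iris[, "Sepal.Length"]
by_dollar   <- iris$Sepal.Length

identical(by_position, by_name)   # TRUE
identical(by_name, by_dollar)     # TRUE
is.numeric(by_dollar)             # TRUE: a plain numeric vector, not a data frame

# drop = FALSE keeps the result as a one-column data frame instead of a vector
one_col_df <- iris[, 1, drop = FALSE]
```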
Get the 5th, 7th, and 9th rows and the first two columns.
iris[c(5,7,9), 1:2]
Sepal.Length Sepal.Width
5 5.0 3.6
7 4.6 3.4
9 4.4 2.9
What is the 9th entry of the Sepal.Width column? Call it x.
How would you get all entries of iris for the 17th row?
Return an object with the 1st, 4th and 7th rows of the iris data frame.
Use the seq() function to get all odd rows in the iris dataset.
What happens when you use negative numbers? Hint: Use dim() on the original and final objects.
An extremely helpful tool is to subset your data using logic rather than an index.
Let's begin with continuous numeric data.
x <- iris$Petal.Width # Grab the Petal.Width column
hist(x) # Make a histogram
x <- iris$Petal.Width
logi <- x > 1 # Which values are greater than 1?
a <- iris[logi,] # Records where Petal.Width >1
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 51 | 7.00 | 3.20 | 4.70 | 1.40 | versicolor |
| 52 | 6.40 | 3.20 | 4.50 | 1.50 | versicolor |
| 53 | 6.90 | 3.10 | 4.90 | 1.50 | versicolor |
| 54 | 5.50 | 2.30 | 4.00 | 1.30 | versicolor |
| 55 | 6.50 | 2.80 | 4.60 | 1.50 | versicolor |
| 56 | 5.70 | 2.80 | 4.50 | 1.30 | versicolor |
x <- iris$Petal.Width
logi <- x > 1 # Which values are greater than 1?
a <- iris[logi,] # Records where Petal.Width >1
This is the same as:
iris[iris$Petal.Width > 1, ]
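It can help to look at the intermediate logical vector before using it as an index. A minimal sketch; `n_true` and `b` are names introduced here for illustration:

```r
x <- iris$Petal.Width
logi <- x > 1                      # logical vector: TRUE or FALSE for every row
length(logi) == nrow(iris)         # TRUE: one value per row

n_true <- sum(logi)                # TRUE counts as 1, so this is the number of matching rows

a <- iris[logi, ]                  # two-step version: keep rows where logi is TRUE
b <- iris[iris$Petal.Width > 1, ]  # one-line version
identical(a, b)                    # TRUE
```

The number of rows in the subset always equals the number of TRUE values in the logical vector.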
Return a different column based on the subset
x <- iris$Petal.Width
logi <- x > 1 # Which values are greater than 1?
# Return all values in Petal.Length where Petal.Width is greater than 1.
iris[logi,"Petal.Length"]
Why is iris[iris>3,] a nonsensical command?
What about iris[iris$Sepal.Length>3]?
Create a histogram of petal lengths for the entire data.
Subset the data for petal lengths greater than two.
Create a histogram of your new data.
Create a data frame with the data.frame() function.
df <- data.frame(x=1:5, y=6:2)
df
x y
1 1 6
2 2 5
3 3 4
4 4 3
5 5 2
Add a column with the assignment operator <-
df$z <- 41:45
df
x y z
1 1 6 41
2 2 5 42
3 3 4 43
4 4 3 44
5 5 2 45
rm() does not work to remove a single column of a data frame.
df$z <- NULL
df
x y
1 1 6
2 2 5
3 3 4
4 4 3
5 5 2
Assign NULL to the column. This is not reversible.
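Because the deletion cannot be undone, it may be worth keeping a copy first. The sketch below shows two possible safeguards; `z_backup` and `df_no_z` are illustrative names, not part of the original example:

```r
df <- data.frame(x = 1:5, y = 6:2)
df$z <- 41:45

# Option 1: keep a copy of the column before deleting it
z_backup <- df$z

# Option 2: build a new data frame without z and leave df untouched
df_no_z <- df[, names(df) != "z"]

df$z <- NULL       # z is now gone from df
df$z <- z_backup   # restore it from the copy
```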
Many different types of logical statements can be used to subset data.
For all types of data, if we want a specific value/character/factor we use ==
Only4 <- iris[iris$Petal.Length == 4,]
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 54 | 5.50 | 2.30 | 4.00 | 1.30 | versicolor |
| 63 | 6.00 | 2.20 | 4.00 | 1.00 | versicolor |
| 72 | 6.10 | 2.80 | 4.00 | 1.30 | versicolor |
| 90 | 5.50 | 2.50 | 4.00 | 1.30 | versicolor |
| 93 | 5.80 | 2.60 | 4.00 | 1.20 | versicolor |
versicolor_only <- iris[iris$Species == "versicolor",]
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 51 | 7.00 | 3.20 | 4.70 | 1.40 | versicolor |
| 52 | 6.40 | 3.20 | 4.50 | 1.50 | versicolor |
| 53 | 6.90 | 3.10 | 4.90 | 1.50 | versicolor |
| 54 | 5.50 | 2.30 | 4.00 | 1.30 | versicolor |
| 55 | 6.50 | 2.80 | 4.60 | 1.50 | versicolor |
| 56 | 5.70 | 2.80 | 4.50 | 1.30 | versicolor |
Use & (AND) if you want both statements to be true.
Use | (OR) if you want either statement to be true.
subsetX <- iris[iris$Petal.Length > 4 & iris$Species == "versicolor",]
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 51 | 7.00 | 3.20 | 4.70 | 1.40 | versicolor |
| 52 | 6.40 | 3.20 | 4.50 | 1.50 | versicolor |
| 53 | 6.90 | 3.10 | 4.90 | 1.50 | versicolor |
| 55 | 6.50 | 2.80 | 4.60 | 1.50 | versicolor |
| 56 | 5.70 | 2.80 | 4.50 | 1.30 | versicolor |
| 57 | 6.30 | 3.30 | 4.70 | 1.60 | versicolor |
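The OR operator is used in the same way as AND. A short sketch with arbitrarily chosen thresholds (6 and 5.5 are illustrative values, not from the course material):

```r
# OR: all setosa flowers, plus any flower with a petal longer than 6
either <- iris[iris$Species == "setosa" | iris$Petal.Length > 6, ]

# AND and OR combined; parentheses make the grouping explicit
mixed <- iris[(iris$Species == "setosa" | iris$Species == "virginica") &
              iris$Sepal.Length > 5.5, ]

table(either$Species)  # count of rows per species in each subset
table(mixed$Species)
```

When mixing the two operators, parentheses are a good habit: without them R evaluates & before |, which may not be the grouping you intended.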
Explain in words what each of the following statements does
iris[1:4,]
iris[c(1:15),c(1,3)]
iris[iris$Species == "setosa", "Petal.Width"]
What happens when you add a ! before a logical statement?
Hint: Compare
iris[iris$Species == "setosa",]
iris[!iris$Species == "setosa",]
Matrices are similar to data frames.
A matrix is two-dimensional and all of its elements must be of the same type (usually numbers).
Some functions require a matrix as an input.
a <- matrix(1:9, ncol=3, nrow=3)
Similar indexing as a data frame
a <- matrix(1:9, ncol=3, nrow=3)
a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
a[2,2]
[1] 5
a[2,]
[1] 2 5 8
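When a function needs a matrix, a data frame can be converted with as.matrix(). A minimal sketch using iris; `num_mat` and `all_mat` are illustrative names:

```r
num_mat <- as.matrix(iris[, 1:4])  # four numeric columns become a numeric matrix
is.matrix(num_mat)                 # TRUE
dim(num_mat)                       # same number of rows as iris, 4 columns
typeof(num_mat)                    # "double"

all_mat <- as.matrix(iris)         # Species is not numeric...
typeof(all_mat)                    # ..."character": every value was converted to text

num_mat[2, 2]                      # indexing works just as for the matrix a
```

The second conversion shows why the non-numeric column is usually dropped first: a matrix can only hold one type, so a single text column turns all the numbers into text.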
R is not a spreadsheet program, so it is not great for direct data entry.
Therefore, for many analyses, you will want to import data.
.csv files are often the preferred format.
Before we do that, we need to look at the concept of the working directory.
Find out what your current working directory is
getwd()
[1] "C:/Users/Mike/Desktop/Dropbox/R_course/Mike"
This is the folder on your computer where R will look to open or write files.
Set your working directory
getwd()
setwd("C:/Users/Mike/Desktop/Dropbox/R_course/data")
# relative paths also work (they are resolved from the current working directory)
setwd("./data")
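A pattern that may be useful is to remember the starting folder so you can return to it. The sketch below uses tempdir(), the scratch folder R creates for each session, so it runs on any machine; `old_wd` is an illustrative name:

```r
old_wd <- getwd()   # remember where we started
setwd(tempdir())    # move to the session's scratch folder
getwd()             # now points at the scratch folder
setwd(old_wd)       # go back to the original folder
```

setwd() stops with an error if the folder does not exist, so a mistyped path is caught immediately.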
See everything currently in your working directory
list.files()
Manually move the climate data that you downloaded to your working directory.
data01 <- read.csv("climate01.csv")
head(data01)
read.table() is an alternative function for importing data.
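For a simple comma-separated file, the two functions can be made to give the same result. The sketch below writes a tiny made-up file to a temporary location (it is not the course's climate data) and reads it both ways:

```r
# Write a small example file; the contents are invented for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("site,temp", "a,12.5", "b,14.1"), tmp)

via_csv   <- read.csv(tmp)
via_table <- read.table(tmp, header = TRUE, sep = ",")

all.equal(via_csv, via_table)   # TRUE
```

read.table() assumes whitespace-separated values and no header row unless told otherwise, which is why `header` and `sep` are set explicitly here.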
One of the largest sources of frustration with R can be importing data. Variable names often cause problems.
- Do not have spaces in variable names
- Use lower case letters
- Abbreviate when possible
Average Height # BAD
Average.Height # OK
average.height # BETTER
avg.height # GOOD!
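If a file already has awkward column headers, they can be tidied after import. A sketch with two hypothetical headers; make.names() is the base R function that read.csv() applies to column names by default:

```r
raw_names <- c("Average Height", "Site ID")   # hypothetical column headers

make.names(raw_names)                         # "Average.Height" "Site.ID"
clean_names <- tolower(make.names(raw_names))
clean_names                                   # "average.height" "site.id"
```

The cleaned vector can then be assigned back with names(data01) <- clean_names, assuming data01 has the same number of columns.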
# write the file
write.csv(iris,"iris.csv", row.names=FALSE)
# check and see
list.files()
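The effect of row.names=FALSE is easiest to see by writing the file both ways and reading each back. This sketch writes to the session's temporary folder rather than the working directory; the file and variable names are illustrative:

```r
with_rn    <- file.path(tempdir(), "iris_with_rownames.csv")
without_rn <- file.path(tempdir(), "iris_no_rownames.csv")

write.csv(iris, with_rn)                        # default: row names are written too
write.csv(iris, without_rn, row.names = FALSE)

back_with    <- read.csv(with_rn)
back_without <- read.csv(without_rn)

names(back_with)      # an extra first column "X" holds the old row numbers
names(back_without)   # the five original column names
```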