Data Manipulation

Mike McCann
22-23 January 2015

Download files for this section

climate01.csv

climate02.csv

Beyond Vectors: Data frames

A data frame is a very important data type in R.

It's pretty much the de facto data structure for most tabular data and what we use for statistics.

Example data frame

Data frames are usually read in from a file, but R comes with many practice datasets.

We will use the iris dataset, famously used by Fisher in 1936 (http://en.wikipedia.org/wiki/Iris_flower_data_set)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.10	3.50	1.40	0.20	setosa
2	4.90	3.00	1.40	0.20	setosa
3	4.70	3.20	1.30	0.20	setosa
4	4.60	3.10	1.50	0.20	setosa
5	5.00	3.60	1.40	0.20	setosa
6	5.40	3.90	1.70	0.40	setosa

Useful functions for data frames

head() - see first 6 rows (by default)
tail() - see last 6 rows (by default)
dim() - dimensions (# rows, # columns)
nrow() - number of rows
ncol() - number of columns
str() - structure of any object
class() - class of any object
rownames() - row names
colnames() - column names
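
A minimal sketch of these functions in action. It uses mtcars, another dataset that ships with R, as a stand-in (an assumption for illustration; the slides themselves use iris), so the exercise below is left for you.

```r
# mtcars is a built-in practice dataset, used here only for illustration
head(mtcars, 3)         # first 3 rows
tail(mtcars, 3)         # last 3 rows
dim(mtcars)             # number of rows, then number of columns
nrow(mtcars)            # number of rows
ncol(mtcars)            # number of columns
str(mtcars)             # structure: one line per column
class(mtcars)           # class of the whole object
rownames(mtcars)[1:3]   # first three row names
colnames(mtcars)        # all column names
```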

Try It!

How many rows does the iris data have?
How many columns? What are the column names?
Using the str() function, how many species are in the data?
What classes are each of the columns?

Dataframe syntax and subsetting

R has many powerful subset operators, and mastering them will allow you to easily perform complex operations on any kind of dataset. Dataframes are akin to a series of vectors combined into a tabular structure.

head(iris)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.10	3.50	1.40	0.20	setosa
2	4.90	3.00	1.40	0.20	setosa
3	4.70	3.20	1.30	0.20	setosa
4	4.60	3.10	1.50	0.20	setosa
5	5.00	3.60	1.40	0.20	setosa
6	5.40	3.90	1.70	0.40	setosa

Indexing

Index individual elements of a data frame with nameofdf[row#, column#]

iris[1,1] # first row, first column

[1] 5.1

iris[3,3] # third row, third column

[1] 1.3

Indexing

Get entire rows or columns

iris[2,] # the second row 
iris[,3] # the third column

Indexing

Columns can also be called by their name

iris[,"Sepal.Length"]

Indexing

Use the dollar sign $ to call a column

iris$Sepal.Length

Indexing

Three ways to get the first column Sepal.Length.

iris[,1]

iris[,"Sepal.Length"]

iris$Sepal.Length
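
A short sketch to check that the three forms return the same thing. The object names (by_position, by_name, by_dollar, one_col) are made up for this example. Note that a single column comes back as a plain vector unless you ask otherwise with drop = FALSE.

```r
by_position <- iris[, 1]
by_name     <- iris[, "Sepal.Length"]
by_dollar   <- iris$Sepal.Length

identical(by_position, by_name)   # same values, same type?
identical(by_name, by_dollar)
class(by_dollar)                  # a numeric vector, not a data frame

# drop = FALSE keeps the result as a one-column data frame
one_col <- iris[, "Sepal.Length", drop = FALSE]
class(one_col)
dim(one_col)
```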

Dataframes can be indexed for both rows and columns

Get the 5th, 7th, and 9th rows and the first two columns.

iris[c(5,7,9), 1:2]

  Sepal.Length Sepal.Width
5          5.0         3.6
7          4.6         3.4
9          4.4         2.9

Try It!

What is the 9th entry of the Sepal.Width column? Call it x.
How would you get all entries of iris for the 17th row?
Return an object with the 1st, 4th and 7th rows of the iris dataframe.
Use the seq() function to get all odd rows in the iris dataset.
What happens when you use negative numbers? Hint: Use dim() on the original and final objects.

Using logical statements

An extremely helpful tool is to subset your data using logic rather than an index.

Let's begin with continuous numeric data.

x <- iris$Petal.Width # Grab the Petal.Width

hist(x) # Make a histogram

Subset Using Statements

x <- iris$Petal.Width

logi <- x > 1 # Which values are greater than 1?

a <- iris[logi,] # Records where Petal.Width >1

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7.00	3.20	4.70	1.40	versicolor
52	6.40	3.20	4.50	1.50	versicolor
53	6.90	3.10	4.90	1.50	versicolor
54	5.50	2.30	4.00	1.30	versicolor
55	6.50	2.80	4.60	1.50	versicolor
56	5.70	2.80	4.50	1.30	versicolor

Subset Using Statements

x <- iris$Petal.Width

logi <- x > 1 # Which values are greater than 1?

a <- iris[logi,] # Records where Petal.Width >1

This is the same as:

iris[iris$Petal.Width > 1, ]
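
To see what is going on under the hood, here is a small sketch that inspects the logical vector itself. The name one_step is made up for this example.

```r
x <- iris$Petal.Width
logi <- x > 1            # one TRUE/FALSE per row of iris

class(logi)              # "logical"
length(logi)             # same as nrow(iris)
sum(logi)                # TRUE counts as 1, so this is the number of matches
head(which(logi))        # row numbers of the first few matches

a <- iris[logi, ]        # keep only the rows where logi is TRUE
nrow(a)                  # equals sum(logi)

one_step <- iris[iris$Petal.Width > 1, ]
identical(a, one_step)   # the two forms give the same data frame
```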

More Subsets

Return a different column based on the subset

x <- iris$Petal.Width

logi <- x > 1 # Which values are greater than 1?

# Return all values in Petal.Length where Petal.Width is greater than 1.

iris[logi,"Petal.Length"]
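
One reason to do this is to summarise one column for a group defined by another. A minimal sketch, with made-up object names (wide, narrow, len_wide, len_narrow):

```r
wide   <- iris$Petal.Width > 1    # flowers with wide petals
narrow <- iris$Petal.Width <= 1   # everything else

len_wide   <- iris[wide, "Petal.Length"]
len_narrow <- iris[narrow, "Petal.Length"]

length(len_wide)     # how many flowers are in each group
length(len_narrow)

mean(len_wide)       # compare the average petal length of the two groups
mean(len_narrow)
```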

Try It!

Why is iris[iris>3,] a nonsensical command?
What about iris[iris$Sepal.Length>3]?
Create a histogram of petal lengths for the entire data.
Subset the data for petal lengths greater than two.
Create a histogram of your new data.

Build a data frame from scratch

Use the data.frame function.

df <- data.frame(x=1:5, y=6:2)
df

Add columns to data frames

Use the assignment operator <-

df$z <- 41:45
df

Remove columns from data frames

rm() does not work to remove a single column of a data frame.

df$z <- NULL
df

Assign NULL to the column. This is not reversible.
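
Putting the last three slides together in one sketch. The column name total and the copy df_backup are assumptions for illustration. Because removing a column cannot be undone, keeping a copy first is one way to stay safe.

```r
df <- data.frame(x = 1:5, y = 6:2)

df$total <- df$x + df$y   # a new column computed from existing ones
names(df)                 # now x, y and total

df_backup <- df           # keep a copy before removing anything

df$total <- NULL          # remove the column from df
names(df)                 # back to x and y
names(df_backup)          # the copy still has total
```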

Other types of logical statements

Many different types of logical statements can be used to subset data.

For all types of data, if we want a specific value/character/factor we use ==

Only4 <- iris[iris$Petal.Length == 4,]

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
54	5.50	2.30	4.00	1.30	versicolor
63	6.00	2.20	4.00	1.00	versicolor
72	6.10	2.80	4.00	1.30	versicolor
90	5.50	2.50	4.00	1.30	versicolor
93	5.80	2.60	4.00	1.20	versicolor

Get only records from the species versicolor

versicolor_only <- iris[iris$Species == "versicolor",]

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7.00	3.20	4.70	1.40	versicolor
52	6.40	3.20	4.50	1.50	versicolor
53	6.90	3.10	4.90	1.50	versicolor
54	5.50	2.30	4.00	1.30	versicolor
55	6.50	2.80	4.60	1.50	versicolor
56	5.70	2.80	4.50	1.30	versicolor

Combine logical statements

Use & (AND) if you want both statements to be true.
Use | (OR) if you want either statement to be true.

subsetX <- iris[iris$Petal.Length > 4 & iris$Species == "versicolor",]

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7.00	3.20	4.70	1.40	versicolor
52	6.40	3.20	4.50	1.50	versicolor
53	6.90	3.10	4.90	1.50	versicolor
55	6.50	2.80	4.60	1.50	versicolor
56	5.70	2.80	4.50	1.30	versicolor
57	6.30	3.30	4.70	1.60	versicolor
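
A sketch of the OR case, which the example above does not show. The object names (cond_a, cond_b, either, two_species) are made up, and %in% is added here as a shorthand for several == tests joined by |; it is not used elsewhere in these slides.

```r
cond_a <- iris$Species == "setosa"
cond_b <- iris$Petal.Length > 6

# | keeps a row if at least one condition is TRUE
either <- iris[cond_a | cond_b, ]
nrow(either)
table(either$Species)     # which species made it through

# %in% tests membership in a set of values
two_species <- iris[iris$Species %in% c("setosa", "virginica"), ]
table(two_species$Species)
```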

Try It!

Explain in words what each of the following commands returns

iris[1:4,]
iris[c(1:15),c(1,3)]
iris[iris$Species == "setosa", "Petal.Width"]
What happens when you add a ! before a logical statement?

Hint: Compare
iris[iris$Species == "setosa",]
iris[!iris$Species == "setosa",]

Matrices

Similar to data frames.

Two-dimensional, and every element must be of the same type (for example, all numbers).

Some functions require a matrix as an input.

a <- matrix(1:9, ncol=3, nrow=3)

Matrices

Indexed in the same way as a data frame

a <- matrix(1:9, ncol=3, nrow=3)
a

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

a[2,2]

[1] 5

a[2,]

[1] 2 5 8
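
When a function needs a matrix, the numeric columns of a data frame can usually be converted with as.matrix(). A minimal sketch, assuming we want the four measurement columns of iris (the name m is made up):

```r
m <- as.matrix(iris[, 1:4])   # leave out the Species column

is.matrix(m)                  # TRUE
dim(m)                        # rows, then columns

m[1, ]                        # first row, all columns
m[1:3, "Sepal.Length"]        # column names carry over from the data frame
colMeans(m)                   # mean of each column
```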

Importing your own data

R is not a spreadsheet program, so it is not great for direct data entry.

Therefore, for many analyses, you will want to import data.

.csv files are often the preferred format.

Before we do that, we will need to look at the concept of the working directory.

Working directory

Find out what your current working directory is

getwd()

[1] "C:/Users/Mike/Desktop/Dropbox/R_course/Mike"

This is the folder on your computer where R will look to open or write files.

Working directory

Set your working directory

getwd()

setwd("C:/Users/Mike/Desktop/Dropbox/R_course/data")

# relative paths also work (resolved from the current working directory)
setwd("./data")

Working directory

See everything currently in your working directory

list.files()
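
A safe way to try these functions without touching your own folders. This sketch uses R's temporary directory as a stand-in for a real project folder; the file name example.txt is hypothetical.

```r
old_wd <- getwd()                     # remember where we started

setwd(tempdir())                      # tempdir() stands in for a project folder
writeLines("hello", "example.txt")    # create a small file in the new location

found <- "example.txt" %in% list.files()
found                                 # the file shows up in the listing
file.exists("example.txt")            # another way to check for one file

unlink("example.txt")                 # clean up
setwd(old_wd)                         # go back to the original folder
```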

Importing your own data

Manually move the climate data that you downloaded to your working directory.

data01 <- read.csv("climate01.csv")
head(data01)

read.table() is an alternative function for importing data.
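
A sketch of how the two functions relate. The file and its columns (year, temp) are invented for this example and are not the contents of climate01.csv. read.csv() is essentially read.table() with defaults suited to comma-separated files that have a header row.

```r
# write a tiny, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,temp", "2013,14.2", "2014,14.6"), csv_path)

from_csv   <- read.csv(csv_path)
from_table <- read.table(csv_path, header = TRUE, sep = ",")

all.equal(from_csv, from_table)   # compare the two results
str(from_csv)                     # always check what was imported
```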

R Tip: Variable names

One of the largest sources of frustration with R can be importing data. Variable names often cause problems.

Do not have spaces in variable names

Use lower case letters

Abbreviate when possible

Average Height # BAD
Average.Height # OK   
average.height # BETTER  
avg.height     # GOOD!
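
A sketch of what happens when a file does have spaces in its header, and one way to tidy up afterwards. The column names and values are invented for illustration. By default read.csv() replaces characters that are not allowed in names, such as spaces, with a dot.

```r
# text = lets us read from a string instead of a file
raw <- read.csv(text = "Average Height,Site ID\n1.5,a\n1.7,b")
names(raw)      # the spaces have become dots

names(raw) <- tolower(names(raw))                            # lower case
names(raw)[names(raw) == "average.height"] <- "avg.height"   # abbreviate
names(raw)
```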

Try It!

Import climate01.csv & climate02.csv. Save them as separate data frames.
Compare these two datasets and identify any errors in climate02.csv

Exporting data frames

# write the file 
write.csv(iris,"iris.csv", row.names=FALSE)

# check and see
list.files()
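
A round-trip sketch showing why row.names=FALSE is used above. Files are written to temporary locations here (an assumption, so nothing lands in your working directory), and the names out_path, with_rn, back and back_rn are made up.

```r
out_path <- tempfile(fileext = ".csv")
write.csv(iris, out_path, row.names = FALSE)
back <- read.csv(out_path)
names(back)       # same columns as iris
dim(back)

# with the default row.names = TRUE the row names are written too
with_rn <- tempfile(fileext = ".csv")
write.csv(iris, with_rn)
back_rn <- read.csv(with_rn)
names(back_rn)    # an extra first column appears
dim(back_rn)
```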

Questions?

Worksheet

Answers