Data Manipulation

Beginning with Data Frames

Ben Weinstein

What did we learn last time?

# Draw 100 points from a normal distribution with mean = 6, sd = 7
x <- rnorm(100, 6, 7)
# view distribution
hist(x)

[Plot: histogram of x]

# sample 5 without replacement
sample(x, 5)

## [1]  1.856  8.253  8.575 18.531  5.367
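Because rnorm() and sample() draw random numbers, the values printed above will differ on every run. A minimal sketch of making the draws repeatable, assuming a seed of 42 chosen purely for illustration (the names x1, s1, x2 and s2 are also just for this example):

```r
# Fix the random number generator state (42 is an arbitrary choice)
set.seed(42)
x1 <- rnorm(100, mean = 6, sd = 7)
s1 <- sample(x1, 5)

# Resetting the same seed reproduces exactly the same numbers
set.seed(42)
x2 <- rnorm(100, mean = 6, sd = 7)
s2 <- sample(x2, 5)

identical(s1, s2)
```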

Beyond Vectors - Beginning with Data frames

A data frame is one of the most important data structures in R. It is the de facto structure for tabular data and the one we use for statistics. Data frames carry row and column names, which you can read or set with rownames() and colnames(). This can be useful for annotating data.
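To make this concrete, here is a small sketch that builds a data frame by hand and inspects its names. The plants data frame and its values are made up for illustration only.

```r
# A small, made-up data frame: each column is a vector of the same length
plants <- data.frame(
  height_cm = c(12.5, 30.1, 22.4),
  flowering = c(TRUE, FALSE, TRUE),
  site      = c("ridge", "valley", "ridge")
)

# Column names come from the argument names; row names default to 1, 2, 3
colnames(plants)
rownames(plants)

# Row names can be replaced to annotate each record
rownames(plants) <- c("plantA", "plantB", "plantC")
plants
```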

Useful functions

1. head() - see first 6 rows (the default)
2. tail() - see last 6 rows (the default)
3. dim() - see dimensions
4. nrow() - number of rows
5. ncol() - number of columns
6. str() - structure of the object
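head() and tail() also accept an n argument when the default number of rows is not what you want. A short sketch using the built-in iris dataset (the object names are chosen just for this example):

```r
# head() and tail() return 6 rows unless told otherwise
first_default <- head(iris)
first_three   <- head(iris, n = 3)
last_three    <- tail(iris, n = 3)

nrow(first_default)
nrow(first_three)
nrow(last_three)
```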

Data frames are usually read in from a file, but R comes with many practice datasets. We will use the iris dataset, famously used by Fisher in 1936 (http://en.wikipedia.org/wiki/Iris_flower_data_set).

head(iris)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.10	3.50	1.40	0.20	setosa
2	4.90	3.00	1.40	0.20	setosa
3	4.70	3.20	1.30	0.20	setosa
4	4.60	3.10	1.50	0.20	setosa
5	5.00	3.60	1.40	0.20	setosa
6	5.40	3.90	1.70	0.40	setosa

Try It!

Try the basic dataframe functions:

1. head() - see first 6 rows

2. tail() - see last 6 rows

3. dim() - see dimensions

4. nrow() - number of rows

5. ncol() - number of columns

6. str() - structure of the object

How many rows does the data have?
How many columns? What are the column names?
Using the str function, how many species are in the data?
What is the class of each column?

Dataframe syntax and subsetting

R has many powerful subset operators, and mastering them will allow you to perform complex operations on any kind of dataset very succinctly. A dataframe is akin to a series of equal-length vectors combined into a tabular structure.

head(iris)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.10	3.50	1.40	0.20	setosa
2	4.90	3.00	1.40	0.20	setosa
3	4.70	3.20	1.30	0.20	setosa
4	4.60	3.10	1.50	0.20	setosa
5	5.00	3.60	1.40	0.20	setosa
6	5.40	3.90	1.70	0.40	setosa

Dataframe syntax

Use the $ sign to call a column

x <- iris$Sepal.Length
head(x)

## [1] 5.1 4.9 4.7 4.6 5.0 5.4


y <- iris[, "Sepal.Length"]
head(y)

## [1] 5.1 4.9 4.7 4.6 5.0 5.4

Indexing

A dataframe is indexed with nameofdf[rows, columns]. Leaving rows or columns blank returns all of them.

colnames(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

# Identical to iris$Sepal.Length
SL <- iris[, "Sepal.Length"]
head(SL)

## [1] 5.1 4.9 4.7 4.6 5.0 5.4

SL2 <- iris[, 1]
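The three ways of pulling out a column shown so far return the same vector. A quick sketch that checks this, and also shows the drop = FALSE argument, which keeps a single column as a dataframe rather than a vector (the object names are chosen just for this example):

```r
by_dollar <- iris$Sepal.Length        # column by name with $
by_name   <- iris[, "Sepal.Length"]   # column by name with [rows, columns]
by_number <- iris[, 1]                # column by position

# All three return the same numeric vector
identical(by_dollar, by_name)
identical(by_name, by_number)

# drop = FALSE keeps the result as a one-column dataframe instead of a vector
as_df <- iris[, "Sepal.Length", drop = FALSE]
class(by_name)
class(as_df)
```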

Dataframes can be indexed for both rows and columns

Get all rows and the first two columns

a <- iris[, 1:2]

	Sepal.Length	Sepal.Width
1	5.10	3.50
2	4.90	3.00
3	4.70	3.20
4	4.60	3.10
5	5.00	3.60
6	5.40	3.90
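Rows and columns can be restricted in the same call, and columns can be picked by name as well as by position. A brief sketch (the object name b and the chosen rows are arbitrary):

```r
# Rows 2 and 5, columns chosen by name
b <- iris[c(2, 5), c("Petal.Length", "Species")]
b

# A single cell: row 2 of the Petal.Length column
iris[2, "Petal.Length"]
```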

Try It!

What is the 9th entry of the Sepal.Width column?
How would you get all entries for the 17th row?
Return an object with rows 1, 4, and 7 of the dataframe.
Use the seq command to get all odd rows in the dataset.
What happens when you use negative numbers?

Using logical statements

An extremely helpful tool is to subset your data using logic rather than an index. Let's begin with continuous numeric data.

# Grab the Petal.Width
x <- iris$Petal.Width
# Histogram
hist(x)

[Plot: histogram of Petal.Width]

Subset Using Statements

# Which values are greater than 1?
x <- iris$Petal.Width
logi <- x > 1

# Return all columns for the rows where Petal.Width is greater than 1
a <- iris[logi, ]

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7.00	3.20	4.70	1.40	versicolor
52	6.40	3.20	4.50	1.50	versicolor
53	6.90	3.10	4.90	1.50	versicolor
54	5.50	2.30	4.00	1.30	versicolor
55	6.50	2.80	4.60	1.50	versicolor
56	5.70	2.80	4.50	1.30	versicolor
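It can help to look at the logical vector itself before using it as an index. The sketch below repeats the steps above and adds a few checks. It assumes nothing beyond the built-in iris data.

```r
x <- iris$Petal.Width
logi <- x > 1

# logi holds one TRUE/FALSE per row of iris
length(logi)
head(logi)

# TRUE counts as 1, so sum() gives the number of rows that pass the test
sum(logi)

# The subset has exactly that many rows, and every one satisfies the condition
a <- iris[logi, ]
nrow(a) == sum(logi)
all(a$Petal.Width > 1)
```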

More Subsets

# Return all columns for the rows where iris$Petal.Width is greater than 1
a <- iris[iris$Petal.Width > 1, ]

a <- iris[, 1:2]
# Return the same column as was subsetted.
head(x[logi])

## [1] 1.4 1.5 1.5 1.3 1.5 1.3

Return a different column based on the subset

# Return all values in Petal.Length where Petal.Width is greater than 1.
head(iris[logi, "Petal.Length"])

## [1] 4.7 4.5 4.9 4.0 4.6 4.5

Try It!

Why is iris[iris>3,] a nonsensical command?
What about iris[iris$Sepal.Length >3]?
Create a histogram of petal lengths for the entire data
Subset the data for values greater than two
Create a histogram of your new data

Other types of logical statements

Many different types of logical statements can be used to subset data.

For all types of data, if we want a specific value/character/factor we use ==

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
54	5.50	2.30	4.00	1.30	versicolor
63	6.00	2.20	4.00	1.00	versicolor
72	6.10	2.80	4.00	1.30	versicolor
90	5.50	2.50	4.00	1.30	versicolor
93	5.80	2.60	4.00	1.20	versicolor
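The table above appears without the command that produced it. Since every row shown has a Petal.Length of 4.00, a plausible reconstruction (an assumption, not necessarily the original command) is:

```r
# Assumed reconstruction: keep rows whose Petal.Length is exactly 4
eq4 <- iris[iris$Petal.Length == 4, ]
eq4
```

Note the double equals sign: == tests for equality, while a single = assigns a value.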

Get only records from the species versicolor

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7.00	3.20	4.70	1.40	versicolor
52	6.40	3.20	4.50	1.50	versicolor
53	6.90	3.10	4.90	1.50	versicolor
54	5.50	2.30	4.00	1.30	versicolor
55	6.50	2.80	4.60	1.50	versicolor
56	5.70	2.80	4.50	1.30	versicolor
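The command for this table is also missing. One way it could have been produced (again an assumption, with versi as a made-up object name) is:

```r
# Assumed reconstruction: keep rows whose Species is "versicolor"
versi <- iris[iris$Species == "versicolor", ]
head(versi)

# Only one species remains among the rows
unique(as.character(versi$Species))
```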

Logical statements

Statements can be combined using & (and) if you want both statements to be true, or | (or) if you want either statement to be true.

subsetX <- iris[iris$Petal.Length > 4 & iris$Species == "versicolor", ]

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7.00	3.20	4.70	1.40	versicolor
52	6.40	3.20	4.50	1.50	versicolor
53	6.90	3.10	4.90	1.50	versicolor
55	6.50	2.80	4.60	1.50	versicolor
56	5.70	2.80	4.50	1.30	versicolor
57	6.30	3.30	4.70	1.60	versicolor
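For comparison, here is a sketch of the | operator next to the & example. The object names either and both, and the particular conditions in the "or" case, are chosen only for illustration.

```r
# | keeps a row when at least one condition is TRUE
either <- iris[iris$Species == "setosa" | iris$Petal.Length > 6, ]

# & keeps a row only when both conditions are TRUE
both <- iris[iris$Petal.Length > 4 & iris$Species == "versicolor", ]

# The "or" subset is at least as large as the subset from one condition alone
nrow(either) >= sum(iris$Species == "setosa")

# The "and" subset is never larger than the subset from one condition alone
nrow(both) <= sum(iris$Species == "versicolor")
```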

Try It!

Explain in words each of the following logical statements

iris[1:4,]
iris[c(1:15),c(1,3)]
iris[iris$Species == "setosa","Petal.Width"]
What happens when you add a ! before a logical statement?

Hint compare: iris[iris$Species == "setosa",] with iris[!iris$Species == "setosa",]

What did we learn today?

Today we covered basic data manipulation of dataframes in R

Properties of a Data frame
Indexing columns
Subsetting Data
Logical Statements

Next time we will cover importing data, more subsetting and sampling, and an introduction to bivariate plotting using ggplot.
