Data Analysis in R Part 2 with reshape2 and ggplot2 and possibly other 2s

In this R Markdown document we’re going to cover two topics: reshaping data with the reshape2 package and plotting data with the ggplot2 package.

I’ll show some examples and then we’ll write some R code together.

Generally speaking, data comes in two possible shapes: wide or tall. A common problem in data analysis is that the data you receive or generate is not always in the format you need for downstream analysis.

Let’s see how the reshape2 package can help us convert a data frame from wide to tall and from tall to wide.

To start, we’ll create a sample data frame that is in wide format.

FirstName <- c("Mary", "Mike", "Greg")
age <- c(44, 52, 46)
IQ <- c(160, 95, 110)
people <- data.frame(FirstName, age, IQ)
people

##   FirstName age  IQ
## 1      Mary  44 160
## 2      Mike  52  95
## 3      Greg  46 110

This data frame is in ‘wide’ format, which just means that each variable has its own column. ‘Tall’ (or ‘long’) format has a separate row for each measurement in the data frame.

The KEY to reshaping data is understanding which of your variables are identifier variables and which are measurement variables. In the people data frame, FirstName is an identifier while age and IQ are measurements.

Let’s use that knowledge to reshape this data frame from wide to tall format using the reshape2 package. If reshape2 is not already installed, install it once with install.packages("reshape2"). Then use the library function to make it available in the current session.

library(reshape2)

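The melt() function is how we take data from wide to tall format. The first argument is simply the data frame that you want to melt. The second argument, id.vars (written as id in the call below, which R matches to the full argument name), tells melt which variable(s) are identifier variables. Every remaining column is treated as a measurement variable.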

melted_people <- melt(people, id = "FirstName")

Let’s compare the original data frame to the melted data frame.

people

##   FirstName age  IQ
## 1      Mary  44 160
## 2      Mike  52  95
## 3      Greg  46 110

melted_people

##   FirstName variable value
## 1      Mary      age    44
## 2      Mike      age    52
## 3      Greg      age    46
## 4      Mary       IQ   160
## 5      Mike       IQ    95
## 6      Greg       IQ   110
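The identifier/measurement split can also be spelled out in full. The sketch below rebuilds the same sample data so it runs on its own, and uses the measure.vars, variable.name, and value.name arguments of melt(). The column names "measurement" and "score" are hypothetical choices for illustration, not names used elsewhere in this document.

```r
library(reshape2)

# Same sample data as above, rebuilt so this chunk is self-contained.
people <- data.frame(
  FirstName = c("Mary", "Mike", "Greg"),
  age = c(44, 52, 46),
  IQ = c(160, 95, 110)
)

# Name the identifier and measurement columns explicitly, and rename
# the two columns that melt() creates.
melted_named <- melt(
  people,
  id.vars = "FirstName",
  measure.vars = c("age", "IQ"),
  variable.name = "measurement",  # hypothetical name for the 'variable' column
  value.name = "score"            # hypothetical name for the 'value' column
)
melted_named
```

The result has the same six rows as melted_people; only the column headers differ.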

What happens if we have two id variables? Let’s see.

people$LastName <- c("Wilson", "Jones", "Smith")
people

##   FirstName age  IQ LastName
## 1      Mary  44 160   Wilson
## 2      Mike  52  95    Jones
## 3      Greg  46 110    Smith

people <- people[,c(1,4,2,3)]
people

##   FirstName LastName age  IQ
## 1      Mary   Wilson  44 160
## 2      Mike    Jones  52  95
## 3      Greg    Smith  46 110

melted_people2 <- melt(people, id = c("FirstName", "LastName"))

I’ll stop the demo now and melt with you

Load and melt the iris data set

data(iris)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

melted_iris <- melt(iris, id = "Species")
head(melted_iris,25)

##    Species     variable value
## 1   setosa Sepal.Length   5.1
## 2   setosa Sepal.Length   4.9
## 3   setosa Sepal.Length   4.7
## 4   setosa Sepal.Length   4.6
## 5   setosa Sepal.Length   5.0
## 6   setosa Sepal.Length   5.4
## 7   setosa Sepal.Length   4.6
## 8   setosa Sepal.Length   5.0
## 9   setosa Sepal.Length   4.4
## 10  setosa Sepal.Length   4.9
## 11  setosa Sepal.Length   5.4
## 12  setosa Sepal.Length   4.8
## 13  setosa Sepal.Length   4.8
## 14  setosa Sepal.Length   4.3
## 15  setosa Sepal.Length   5.8
## 16  setosa Sepal.Length   5.7
## 17  setosa Sepal.Length   5.4
## 18  setosa Sepal.Length   5.1
## 19  setosa Sepal.Length   5.7
## 20  setosa Sepal.Length   5.1
## 21  setosa Sepal.Length   5.4
## 22  setosa Sepal.Length   5.1
## 23  setosa Sepal.Length   4.6
## 24  setosa Sepal.Length   5.1
## 25  setosa Sepal.Length   4.8

Load and melt the mtcars data set

data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

mtcars$model <- row.names(mtcars)
melted_cars <- melt(mtcars, id = "model")
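Melting every column is not always what you want. As a minimal sketch, the measure.vars argument restricts the melt to chosen measurements. The choice of mpg and hp here is arbitrary, and the copy named cars is only there so the built-in mtcars object is left alone.

```r
library(reshape2)

cars <- mtcars                  # work on a copy of the built-in data
cars$model <- row.names(cars)   # the car names become the identifier

# Melt only two of the measurement columns.
melted_subset <- melt(cars, id.vars = "model", measure.vars = c("mpg", "hp"))

dim(melted_subset)      # 64 rows (32 cars x 2 measurements), 3 columns
head(melted_subset, 3)
```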

How does this look with the Alzheimer’s data set?

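# Choose the Alzheimer's CSV file interactively
Alz <- read.csv(file.choose(), header = TRUE)
# Use the row names as a patient identifier
Alz$patient_ID <- row.names(Alz)
# Move the new identifier (column 133) to the front
Alz <- Alz[, c(133, 1:132)]
Alz_melted <- melt(Alz, id = 'patient_ID')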

## Warning: attributes are not identical across measure variables; they will
## be dropped
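melt() gives this warning when the measurement columns do not all share the same attributes, for example when factor columns are melted together with numeric ones. Whether that is the cause for the Alzheimer’s data is an assumption, since the file itself is not shown here. The sketch below reproduces the situation with a small made-up data frame; toy and all of its values are invented for illustration.

```r
library(reshape2)

# Made-up data: one factor measurement and one numeric measurement.
toy <- data.frame(
  patient_ID = c("1", "2", "3"),
  gender = factor(c("female", "male", "female")),
  tau = c(5.2, 6.1, 5.8)
)

# Record whether melt() warns, and print the message instead of the warning.
warned <- FALSE
toy_melted <- withCallingHandlers(
  melt(toy, id.vars = "patient_ID"),
  warning = function(w) {
    message("melt() warned: ", conditionMessage(w))
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
str(toy_melted)

# Treating the factor as a second identifier leaves only numeric
# measurements, so the value column stays numeric.
toy_numeric <- melt(toy, id.vars = c("patient_ID", "gender"))
str(toy_numeric)
```

One way to sidestep the warning is therefore to list categorical columns as identifiers, so that the value column only ever holds one type of data.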

What about going the other way? How can we take tall data and convert it to wide data? For this we use the dcast() function from reshape2. Again, knowing which variables are identifiers and which are measurements is the key. The first argument of dcast() is the data set we want to cast. The second argument is a formula: the left side lists the identifier variables, and the right side names the column that holds the measured variable names (not the values), which dcast() spreads out into new columns.

melted_people2

##   FirstName LastName variable value
## 1      Mary   Wilson      age    44
## 2      Mike    Jones      age    52
## 3      Greg    Smith      age    46
## 4      Mary   Wilson       IQ   160
## 5      Mike    Jones       IQ    95
## 6      Greg    Smith       IQ   110

cast_melted_people2 <- dcast(melted_people2, FirstName + LastName ~ variable)
cast_melted_people2

##   FirstName LastName age  IQ
## 1      Greg    Smith  46 110
## 2      Mary   Wilson  44 160
## 3      Mike    Jones  52  95
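The cast above works cleanly because each FirstName/LastName pair identifies exactly one person. When the identifier does not pick out a single row, as with Species in the iris data, dcast() has to combine several values per cell. A minimal sketch, assuming the mean is the summary you want, uses the fun.aggregate argument:

```r
library(reshape2)

melted_iris <- melt(iris, id.vars = "Species")

# Each Species/variable combination holds 50 values, so tell dcast()
# how to collapse them into one number per cell.
iris_means <- dcast(
  melted_iris,
  Species ~ variable,
  value.var = "value",
  fun.aggregate = mean
)
iris_means
```

The result has one row per species and one column per measurement, each cell holding the mean of that measurement for that species.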

Using ggplot2 package to plot/visualize data

ggplot2 is an R package based on the ‘grammar of graphics’, a theory of how a graphic or data visualization can be described and built.

There are two key items in a ggplot: aesthetics and geoms. Aesthetics describe what you want to plot and how variables map to visual properties such as position, color, and size. Geoms describe what type of objects (points, lines, bars) you want to appear on your plot.

Let’s look at a quick example.

library(ggplot2)
ggplot(Alz, aes(x = tau, y = SOD)) + geom_point()

ggplot(Alz, aes(x = tau, y = SOD, color = response)) + geom_point()

ggplot(Alz, aes(x = tau, y = SOD, color = response, size = age)) + geom_point()

ggplot(Alz, aes(x = tau, y = SOD, color = response, size = age)) + geom_point() + facet_grid(. ~ response)
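Because the Alzheimer’s file is chosen interactively, the same ideas can be tried on the built-in iris data. The sketch below is one possible illustration rather than part of the original analysis. It builds a scatter plot from aesthetics, a geom, and a facet, and then plots a melted (tall) version of the data, where the variable column produced by melt() drives the facets.

```r
library(ggplot2)
library(reshape2)

# Aesthetics map columns to x, y, and color; geom_point() draws the points;
# facet_grid() splits the plot into one panel per species.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  facet_grid(. ~ Species)
p

# Tall data suits faceting: one panel per measured variable.
melted_iris <- melt(iris, id.vars = "Species")
p_tall <- ggplot(melted_iris, aes(x = Species, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")
p_tall
```

Storing a plot in an object such as p and then adding to it with + is a convenient way to build up a figure one layer at a time.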

levels(Alz$gender)[levels(Alz$gender)=='female'] <- 'Female'
levels(Alz$gender)[levels(Alz$gender)=='male'] <- 'Male'
levels(Alz$gender)[levels(Alz$gender)=='M'] <- 'Male'
ggplot(Alz, aes(x = tau, y = SOD, color = response, size = age)) + geom_point() + facet_grid(gender ~ response)
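The three levels() assignments before the last plot tidy up inconsistent gender labels so that the facets group correctly. Assigning a name that already exists merges the two levels. A small sketch on a made-up factor (the labels below are hypothetical, chosen to mirror the ones recoded above) shows the effect:

```r
# Made-up factor with inconsistent labels.
gender <- factor(c("female", "male", "M", "Female"))

levels(gender)[levels(gender) == "female"] <- "Female"
levels(gender)[levels(gender) == "male"] <- "Male"
levels(gender)[levels(gender) == "M"] <- "Male"

levels(gender)   # only "Female" and "Male" remain
table(gender)    # two observations in each group
```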

Stephen Guest

June 28, 2016