In this R markdown document we’re going to cover two topics: reshaping data with the reshape2 package and plotting data using the ggplot2 package.
I’ll show some examples and then we’ll write some R code together.
Generally speaking, data comes in two possible shapes, wide or tall. A common problem in data analysis is that the data you receive or generate is not always in the shape you need for downstream analysis.
Let’s create a sample data frame and see how the reshape2 package can help us convert a data frame from wide to tall and from tall to wide.
To start, we’ll create a sample data frame that is in wide format.
FirstName <- c("Mary", "Mike", "Greg")
age <- c(44, 52, 46)
IQ <- c(160, 95, 110)
people <- data.frame(FirstName, age, IQ)
people
## FirstName age IQ
## 1 Mary 44 160
## 2 Mike 52 95
## 3 Greg 46 110
This data frame is in ‘wide’ format, which just means that each variable has its own column. ‘Tall’ (or ‘long’) format has a separate row for each measurement in the data frame.
The KEY to reshaping data is understanding which of your variables are identifier variables and which are measurement variables. In the people data frame, FirstName is an identifier while age and IQ are measurements.
Let’s use that knowledge to reshape this data frame from wide to tall format using the reshape2 package. First we’ll install reshape2 (a one-time install.packages("reshape2")) and then use the library function to make it available in our current session.
library(reshape2)
The melt() function is how we take data from wide to tall format. The first argument for this function is simply the data frame that you want to melt. The second argument tells melt which variable(s) are identifier variables. Its full name is id.vars; the shorter id used here works because R allows partial matching of argument names.
melted_people <- melt(people, id = "FirstName")
Let’s compare the original data frame to the melted data frame.
people
## FirstName age IQ
## 1 Mary 44 160
## 2 Mike 52 95
## 3 Greg 46 110
melted_people
## FirstName variable value
## 1 Mary age 44
## 2 Mike age 52
## 3 Greg age 46
## 4 Mary IQ 160
## 5 Mike IQ 95
## 6 Greg IQ 110
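To make the roles of the arguments easier to see, here is a minimal sketch of the same melt with every argument spelled out. The names tall_people, measurement and score are made up for illustration; id.vars, measure.vars, variable.name and value.name are melt() arguments for data frames.

```r
library(reshape2)

# Same toy data as in the demo
people <- data.frame(
  FirstName = c("Mary", "Mike", "Greg"),
  age = c(44, 52, 46),
  IQ = c(160, 95, 110)
)

# id.vars names the identifier column(s), measure.vars the measurement columns.
# variable.name and value.name rename the two columns that melt() creates.
tall_people <- melt(
  people,
  id.vars = "FirstName",
  measure.vars = c("age", "IQ"),
  variable.name = "measurement",
  value.name = "score"
)
tall_people
```

If measure.vars is left out, melt() treats every column that is not an identifier as a measurement, which is what the shorter call in the demo relies on.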
What happens if we have two id variables? Let’s see.
people$LastName <- c("Wilson", "Jones", "Smith")
people
## FirstName age IQ LastName
## 1 Mary 44 160 Wilson
## 2 Mike 52 95 Jones
## 3 Greg 46 110 Smith
# Reorder the columns by name so both identifier columns come first
people <- people[, c("FirstName", "LastName", "age", "IQ")]
people
## FirstName LastName age IQ
## 1 Mary Wilson 44 160
## 2 Mike Jones 52 95
## 3 Greg Smith 46 110
melted_people2 <- melt(people, id = c("FirstName", "LastName"))
I’ll stop the demo now and melt with you
Load and melt the iris data set
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
melted_iris <- melt(iris, id = "Species")
head(melted_iris,25)
## Species variable value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
## 7 setosa Sepal.Length 4.6
## 8 setosa Sepal.Length 5.0
## 9 setosa Sepal.Length 4.4
## 10 setosa Sepal.Length 4.9
## 11 setosa Sepal.Length 5.4
## 12 setosa Sepal.Length 4.8
## 13 setosa Sepal.Length 4.8
## 14 setosa Sepal.Length 4.3
## 15 setosa Sepal.Length 5.8
## 16 setosa Sepal.Length 5.7
## 17 setosa Sepal.Length 5.4
## 18 setosa Sepal.Length 5.1
## 19 setosa Sepal.Length 5.7
## 20 setosa Sepal.Length 5.1
## 21 setosa Sepal.Length 5.4
## 22 setosa Sepal.Length 5.1
## 23 setosa Sepal.Length 4.6
## 24 setosa Sepal.Length 5.1
## 25 setosa Sepal.Length 4.8
Load and melt the mtcars data set
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mtcars$model <- row.names(mtcars)
melted_cars <- melt(mtcars, id = "model")
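A quick way to sanity-check a melt is to count rows: the tall data frame should have one row per original row per measurement column. The sketch below works on a copy of mtcars (the names cars and n_measures are just for this illustration) so the built-in data set is left untouched.

```r
library(reshape2)

cars <- mtcars                   # copy of the built-in data set
n_measures <- ncol(cars)         # 11 measurement columns before the id is added
cars$model <- row.names(cars)    # identifier column built from the row names
melted_cars <- melt(cars, id.vars = "model")

dim(melted_cars)                                # rows and columns of the tall data
nrow(melted_cars) == nrow(cars) * n_measures    # 32 cars x 11 measurements
head(melted_cars, 3)
```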
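How does this look with the Alzheimer’s data set?
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the levels() calls later in this document rely on
Alz <- read.csv(file.choose(), header = TRUE, stringsAsFactors = TRUE)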
Alz$patient_ID <- row.names(Alz)
Alz <- Alz[, c(133, 1:132)]
Alz_melted <- melt(Alz, id = 'patient_ID')
## Warning: attributes are not identical across measure variables; they will
## be dropped
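That warning usually means the measurement columns are not all the same kind of thing, for example numeric columns melted together with factor columns. The sketch below reproduces it on a small made-up data frame (toy, score and group are hypothetical stand-ins, not columns from the Alzheimer's file) and shows one way to avoid it by melting only the numeric column.

```r
library(reshape2)

# Hypothetical stand-in for a data set that mixes column types
toy <- data.frame(
  id = c("p1", "p2"),
  score = c(1.5, 2.5),
  group = factor(c("a", "b"))
)

# Capture the warning text instead of letting it print
msg <- tryCatch(
  {
    melt(toy, id.vars = "id")
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
msg

# Melting only the numeric column keeps the value column numeric
numeric_only <- melt(toy, id.vars = "id", measure.vars = "score")
str(numeric_only)
```

When numbers and factors are melted together, the value column has to hold both, so expect the numbers to be converted to text.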
What about going the other way? How can we take tall data and convert it to wide data? For this we use the dcast() function from reshape2. Again, knowing which variables are identifiers and which are measurements is the key. The first argument of dcast is the data set we want to cast. The second argument is a formula: the left side lists our identifier variables, and the right side names the column that holds the measured variable names (not the values) that we want to spread back out into columns.
melted_people2
## FirstName LastName variable value
## 1 Mary Wilson age 44
## 2 Mike Jones age 52
## 3 Greg Smith age 46
## 4 Mary Wilson IQ 160
## 5 Mike Jones IQ 95
## 6 Greg Smith IQ 110
cast_melted_people2 <- dcast(melted_people2, FirstName + LastName ~ variable)
cast_melted_people2
## FirstName LastName age IQ
## 1 Greg Smith 46 110
## 2 Mary Wilson 44 160
## 3 Mike Jones 52 95
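The people example casts cleanly because FirstName and LastName together identify exactly one row per variable. When the identifiers do not, dcast() needs to be told how to combine the values. Here is a minimal sketch with the melted iris data, where 50 flowers share each Species; species_means is just an illustrative name.

```r
library(reshape2)

melted_iris <- melt(iris, id.vars = "Species")

# 50 rows share each Species/variable pair, so give dcast() an
# aggregation function to collapse them into one cell
species_means <- dcast(melted_iris, Species ~ variable, fun.aggregate = mean)
species_means
```

Without fun.aggregate, dcast() falls back to counting the rows in each cell and prints a message saying so, which is rarely what you want.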
Using the ggplot2 package to plot/visualize data
ggplot2 is an R package that is based on something called the ‘grammar of graphics’, which is a theory of how a graphic or data visualization can be described and built.
There are two key items in a ggplot: aesthetics and geoms. Aesthetics describe what you want to plot and how to map variables to visual properties such as position, color and size. Geoms describe what type of objects you want to appear on your plot.
Let’s look at a quick example.
library(ggplot2)
ggplot(Alz, aes(x = tau, y = SOD)) + geom_point()
ggplot(Alz, aes(x = tau, y = SOD, color = response)) + geom_point()
ggplot(Alz, aes(x = tau, y = SOD, color = response, size = age)) + geom_point()
ggplot(Alz, aes(x = tau, y = SOD, color = response, size = age)) + geom_point() + facet_grid(. ~ response)
levels(Alz$gender)[levels(Alz$gender)=='female'] <- 'Female'
levels(Alz$gender)[levels(Alz$gender)=='male'] <- 'Male'
levels(Alz$gender)[levels(Alz$gender)=='M'] <- 'Male'
ggplot(Alz, aes(x = tau, y = SOD, color = response, size = age)) + geom_point() + facet_grid(gender ~ response)
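The plots above need the Alzheimer's file, so here is a sketch of the same ideas using the built-in iris data, which also ties the two halves of this document together. Tall data is often the shape ggplot2 wants: once iris is melted, the variable column can drive the facets. The choice of a box plot and the name p are just for this illustration.

```r
library(reshape2)
library(ggplot2)

melted_iris <- melt(iris, id.vars = "Species")

# aes() maps columns to visual properties, geom_boxplot() picks the object
# to draw, and facet_wrap() makes one panel per measurement
p <- ggplot(melted_iris, aes(x = Species, y = value, color = Species)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")
p
```

Swapping geom_boxplot() for another geom changes what is drawn without touching the aesthetic mapping, which is the main payoff of building plots this way.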