Introduction to Using R for Data Handling & Visualization

Amanda Mejia
BST 753 Spring 2014

overview

Reshaping Data
- reshape() function
- reshape2 package and melt() function
Summarizing Data
- dplyr package
Visualizing Data
- ggplot2 package

Part 1: Reshaping Data

"tidy data"

What is tidy data?
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table (won't talk much about this)
What is messy data?
- Column headers are values, not variables
- Multiple variables stored in one column (e.g. age and sex)
- Variables stored in both rows and columns
Source: Tidy Data (Wickham 2011)
To learn more about tidy data, check out Hadley Wickham's paper and presentation. Both have lots of examples.

messy vs. tidy (molten)

reshaping data

"Long" and "wide" data:
- In general, long = tidy and wide = messy
- Long data is generally good, but wide data is sometimes useful too
- "Long" and "wide" are relative terms
- Data can be reshaped from wide to long and vice-versa
Tools to reshape data:
- reshape2 package: melt() <- for wide to long
- reshape2 package: dcast() <- for long to wide
- stats package: reshape() <- works in either direction (used here for long to wide)
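As a minimal sketch of the wide/long round trip, the base reshape() function can be used in both directions. The toy data frame and its column names (id, score.1, score.2, visit) are made up for illustration and are not part of the original slides.

```r
# Toy wide data: one row per subject, one column per visit (hypothetical values)
wide <- data.frame(id = 1:3,
                   score.1 = c(5, 7, 9),
                   score.2 = c(6, 8, 10))

# Wide -> long: the two score columns are stacked into a single 'score'
# column, and 'visit' records which column each value came from
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2"),
                v.names = "score", timevar = "visit", times = 1:2,
                idvar = "id")

# Long -> wide: one 'score.<visit>' column per distinct value of 'visit'
wide2 <- reshape(long, direction = "wide",
                 idvar = "id", timevar = "visit", v.names = "score")

long
wide2
```

The long version has one row per subject and visit (6 rows), while the wide version has one row per subject (3 rows).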

reshape example

head(VADeaths, 10)  #wide format

##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0

library(reshape2)
help(melt.array)  # VADeaths is a matrix, so melt() dispatches to melt.array()

reshape example (continued)

# go from wide to long format
# (VADeaths is a matrix, so no id.vars are needed: the row and column
# names become Var1 and Var2)
VADeaths <- melt(VADeaths)
head(VADeaths)

##    Var1         Var2 value
## 1 50-54   Rural Male  11.7
## 2 55-59   Rural Male  18.1
## 3 60-64   Rural Male  26.9
## 4 65-69   Rural Male  41.0
## 5 70-74   Rural Male  66.0
## 6 50-54 Rural Female   8.7

names(VADeaths)[c(1, 3)] <- c("Age", "Deaths")

reshape example (continued)

# Var2 represents 2 variables: location and sex
is.rural <- grepl("Rural", VADeaths$Var2)
VADeaths$Location <- ifelse(is.rural, "Rural", "Urban")
# grepl() is case-sensitive, so "Female" does not match "Male"
is.male <- grepl("Male", VADeaths$Var2)
VADeaths$Sex <- ifelse(is.male, "Male", "Female")

head(VADeaths)

##     Age         Var2 Deaths Location    Sex
## 1 50-54   Rural Male   11.7    Rural   Male
## 2 55-59   Rural Male   18.1    Rural   Male
## 3 60-64   Rural Male   26.9    Rural   Male
## 4 65-69   Rural Male   41.0    Rural   Male
## 5 70-74   Rural Male   66.0    Rural   Male
## 6 50-54 Rural Female    8.7    Rural Female

VADeaths$Var2 <- NULL

reshape example (continued)

# Another way: Logical Variables
VADeaths$Rural <- is.rural
VADeaths$Male <- is.male
head(VADeaths)

##     Age Deaths Location    Sex Rural  Male
## 1 50-54   11.7    Rural   Male  TRUE  TRUE
## 2 55-59   18.1    Rural   Male  TRUE  TRUE
## 3 60-64   26.9    Rural   Male  TRUE  TRUE
## 4 65-69   41.0    Rural   Male  TRUE  TRUE
## 5 70-74   66.0    Rural   Male  TRUE  TRUE
## 6 50-54    8.7    Rural Female  TRUE FALSE

reshape exercise

"Widen" the format of the VA data by creating separate columns for "Male Deaths" and "Female Deaths"
Can use reshape() or dcast()
Keep the long data frame, since we'll use it again
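One possible approach to this exercise, sketched with dcast() from reshape2. To keep the sketch self-contained it rebuilds the long data frame from the built-in matrix under new names (va_long, va_wide and the temporary Group column are illustrative, not from the slides), so the long data frame used elsewhere is left untouched.

```r
library(reshape2)

# Rebuild the long data frame from the built-in matrix
va_long <- melt(datasets::VADeaths,
                varnames = c("Age", "Group"), value.name = "Deaths")
va_long$Location <- ifelse(grepl("Rural", va_long$Group), "Rural", "Urban")
va_long$Sex <- ifelse(grepl("Male", va_long$Group), "Male", "Female")
va_long$Group <- NULL

# Long -> wide: one row per Age x Location, one column per value of Sex
va_wide <- dcast(va_long, Age + Location ~ Sex, value.var = "Deaths")

# Give the new columns more descriptive names
names(va_wide)[names(va_wide) == "Male"] <- "Male.Deaths"
names(va_wide)[names(va_wide) == "Female"] <- "Female.Deaths"

va_wide
```

The formula Age + Location ~ Sex reads as "rows identified by Age and Location, columns spread over Sex".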

Part 2: Summarizing Data

dplyr package

Here's a nice guide
Fast (versus base, plyr, cast), designed for data frames (versus plyr)
Contains several functions for common data manipulations
- filter() - select a subset of rows (faster than subset())
- select() - select a subset of columns
- arrange() - orders rows by variable(s)
- mutate() - add new columns
- summarise() - compute mean, median, count, and other summaries
- group_by():
  - specifies groups of rows defined by some variable(s)
  - mostly affects arrange() and summarise()
  - grouped data frames are nicely displayed (no extra effort required)
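The verbs listed above can be sketched on the built-in CO2 dataset as follows. The object names (co2, high, small, sorted, with_ratio, by_type) are illustrative choices, and the data are first converted to a plain data frame as a precaution.

```r
library(dplyr)

# CO2 ships with extra classes; work on a plain data frame copy
co2 <- as.data.frame(datasets::CO2)

# filter(): keep only rows with a concentration of at least 500
high <- filter(co2, conc >= 500)

# select(): keep only some of the columns
small <- select(high, Type, Treatment, uptake)

# arrange(): order rows by decreasing uptake
sorted <- arrange(small, desc(uptake))

# mutate(): add a new column computed from existing ones
with_ratio <- mutate(co2, uptake_per_conc = uptake / conc)

# group_by() + summarise(): one summary row per group
by_type <- summarise(group_by(co2, Type),
                     mean_uptake = mean(uptake), n = n())

head(sorted)
by_type
```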

dplyr example

library(dplyr)
head(VADeaths)

##     Age Deaths Location    Sex Rural  Male
## 1 50-54   11.7    Rural   Male  TRUE  TRUE
## 2 55-59   18.1    Rural   Male  TRUE  TRUE
## 3 60-64   26.9    Rural   Male  TRUE  TRUE
## 4 65-69   41.0    Rural   Male  TRUE  TRUE
## 5 70-74   66.0    Rural   Male  TRUE  TRUE
## 6 50-54    8.7    Rural Female  TRUE FALSE

VADeaths$Rural <- NULL
VADeaths$Male <- NULL

dplyr example (continued)

VADeaths <- group_by(VADeaths, Age, Location)
VADeaths <- select(VADeaths, Age:Location)          #specify range of columns
VADeaths <- select(VADeaths, Age, Location, Deaths) #specify certain columns
VADeaths

## Source: local data frame [20 x 3]
## Groups: Age, Location
## 
##      Age Location Deaths
## 1  50-54    Rural   11.7
## 2  55-59    Rural   18.1
## 3  60-64    Rural   26.9
## 4  65-69    Rural   41.0
## 5  70-74    Rural   66.0
## 6  50-54    Rural    8.7
## 7  55-59    Rural   11.7
## 8  60-64    Rural   20.3
## 9  65-69    Rural   30.9
## 10 70-74    Rural   54.3
## 11 50-54    Urban   15.4
## 12 55-59    Urban   24.3
## 13 60-64    Urban   37.0
## 14 65-69    Urban   54.6
## 15 70-74    Urban   71.1
## 16 50-54    Urban    8.4
## 17 55-59    Urban   13.6
## 18 60-64    Urban   19.3
## 19 65-69    Urban   35.1
## 20 70-74    Urban   50.0

dplyr example (continued)

# calculate mean deaths and number of observations in each group
summarise(VADeaths, MeanDeaths = mean(Deaths), Count = n())

## Source: local data frame [10 x 4]
## Groups: Age
## 
##      Age Location MeanDeaths Count
## 1  70-74    Urban      60.55     2
## 2  65-69    Urban      44.85     2
## 3  60-64    Urban      28.15     2
## 4  55-59    Urban      18.95     2
## 5  50-54    Urban      11.90     2
## 6  70-74    Rural      60.15     2
## 7  65-69    Rural      35.95     2
## 8  60-64    Rural      23.60     2
## 9  55-59    Rural      14.90     2
## 10 50-54    Rural      10.20     2
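The "Groups: Age" line in the output above reflects that summarise() removes one level of grouping from its result. A small sketch of this behaviour follows, assuming a dplyr version that provides group_vars(); the names va, grouped and means are illustrative, and recent dplyr versions may also print a message about the resulting grouping.

```r
library(dplyr)

# Rebuild a small long-format table from the built-in matrix
va <- as.data.frame(as.table(datasets::VADeaths), responseName = "Deaths")
names(va)[1:2] <- c("Age", "Group")
va$Location <- ifelse(grepl("Rural", va$Group), "Rural", "Urban")

# Group by two variables, then summarise
grouped <- group_by(va, Age, Location)
means <- summarise(grouped, MeanDeaths = mean(Deaths), Count = n())

group_vars(grouped)  # both grouping variables
group_vars(means)    # the last grouping variable (Location) has been dropped
```

Summarising the result a second time would therefore give one row per Age.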

dplyr example (continued)

filter(VADeaths, Deaths >= 20, Location == "Urban")

## Source: local data frame [6 x 3]
## Groups: Age, Location
## 
##     Age Location Deaths
## 1 55-59    Urban   24.3
## 2 60-64    Urban   37.0
## 3 65-69    Urban   54.6
## 4 70-74    Urban   71.1
## 5 65-69    Urban   35.1
## 6 70-74    Urban   50.0

dplyr exercise!

Use CO2 dataset

Compute average uptake and standard deviation by plant type and treatment
Compute a 95% confidence interval for each plant type and treatment
Compute a 95% confidence interval for a new "Uptake per Concentration" column

dplyr exercise solution

head(CO2)

# define groups by Type and Treatment
CO2 <- group_by(CO2, Type, Treatment)

# compute mean and standard deviation of uptake by Type and Treatment
summarise(CO2, uptake.mean = mean(uptake), uptake.sd = sd(uptake))

# compute values needed for confidence interval by Type and Treatment
CO2.CI <- summarise(CO2, uptake.mean = mean(uptake), uptake.sd = sd(uptake), 
    count = n())
CO2.CI <- mutate(CO2.CI, CI.U = uptake.mean + 1.96 * uptake.sd/sqrt(count), 
    CI.L = uptake.mean - 1.96 * uptake.sd/sqrt(count))

# create new 'uptake per concentration' column
CO2 <- mutate(CO2, uptake_per_conc = uptake/conc)
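The last part of the exercise (a 95% confidence interval for the new column) is not shown above. One way it could be completed, reusing the same normal-approximation formula as for uptake, is sketched below. The names co2, ratio.mean, ratio.sd and ratio_ci are illustrative, and the data are copied to a plain data frame so the sketch runs on its own.

```r
library(dplyr)

# Work on a plain data frame copy of the built-in dataset
co2 <- as.data.frame(datasets::CO2)

# create the 'uptake per concentration' column, then define the groups
co2 <- mutate(co2, uptake_per_conc = uptake / conc)
co2 <- group_by(co2, Type, Treatment)

# compute the values needed for the confidence interval in each group
ratio_ci <- summarise(co2,
                      ratio.mean = mean(uptake_per_conc),
                      ratio.sd = sd(uptake_per_conc),
                      count = n())

# 95% interval using the normal quantile 1.96, as for uptake above
ratio_ci <- mutate(ratio_ci,
                   CI.L = ratio.mean - 1.96 * ratio.sd / sqrt(count),
                   CI.U = ratio.mean + 1.96 * ratio.sd / sqrt(count))

ratio_ci
```

With 21 observations per group, replacing 1.96 by qt(0.975, count - 1) would give a slightly wider t-based interval.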

Part 3: Visualizing Data

ggplot2 package

two main functions
- qplot() - "quick" plot, similar syntax to base graphics
- ggplot() - full functionality of ggplot2
tips:
- start with qplot() and build up to ggplot() (note: qplot() is deprecated as of ggplot2 3.4.0)
- use the documentation at http://docs.ggplot2.org/current/
- find answers to many questions on StackOverflow (via Google)

ggplot example

library(ggplot2)
head(CO2)

##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2)  #default geom = point

[Figure: scatterplot of uptake against conc]

# ggplot syntax
ggplot(data = CO2, aes(x = conc, y = uptake)) + geom_point()

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2, colour = Type)

[Figure: scatterplot of uptake against conc, coloured by Type]

# ggplot syntax
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type)) + geom_point()

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2, colour = Type) + facet_grid(. ~ Treatment)

[Figure: scatterplot of uptake against conc, coloured by Type, faceted by Treatment]

# ggplot syntax
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type)) + geom_point() + 
    facet_grid(. ~ Treatment)

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2, colour = Type, group = Plant, geom = "line") + 
    facet_grid(. ~ Treatment, labeller = "label_both")

[Figure: line plot of uptake against conc for each Plant, coloured by Type, faceted by Treatment]

# ggplot syntax
# group must be mapped inside aes() so that Plant is looked up in the data
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type)) + geom_point() + 
    facet_grid(. ~ Treatment, labeller = "label_both") + geom_line(aes(group = Plant))
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type, group = Plant)) + 
    geom_point() + facet_grid(. ~ Treatment, labeller = "label_both") + geom_line()
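Because a ggplot object can be stored and extended with +, the plots above can also be built up one layer at a time. A brief sketch, where the object names (co2, p, p_points, p_lines, p_facets) are illustrative:

```r
library(ggplot2)

# Work on a plain data frame copy of the built-in dataset
co2 <- as.data.frame(datasets::CO2)

# Start from the data and aesthetic mappings only; no layers yet
p <- ggplot(data = co2,
            aes(x = conc, y = uptake, colour = Type, group = Plant))

# Add layers and facets one at a time; each step returns a new plot object
p_points <- p + geom_point()
p_lines <- p_points + geom_line()
p_facets <- p_lines + facet_grid(. ~ Treatment, labeller = "label_both")

# Printing the object draws the plot
print(p_facets)
```

Storing intermediate objects makes it easy to compare versions of a plot or to reuse the same base with different geoms.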

ggplot exercise!