Introduction to Using R for Data Handling & Visualization

Amanda Mejia
BST 753 Spring 2014

overview

  1. Reshaping Data
    • reshape() function
    • reshape2 package and melt() function
  2. Summarizing Data
    • dplyr package
  3. Visualizing Data
    • ggplot2 package

Part 1: Reshaping Data

"tidy data"

  • What is tidy data?

    • Each variable is a column
    • Each observation is a row
    • Each type of observational unit is a table (won't talk much about this)
  • What is messy data?

    • Column headers are values, not variables
    • Multiple variables stored in one column (e.g. age and sex)
    • Variables stored in both rows and columns
  • Source: Tidy Data (Wickham 2011)

  • To learn more about tidy data, check out Hadley Wickham's paper and presentation. Both have lots of examples.
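
To make the "column headers are values" case concrete, here is a minimal sketch. The data frame `messy`, its column names, and its values are made up for illustration; melt() comes from the reshape2 package covered in the following slides.

```r
library(reshape2)

# Hypothetical messy table: the headers "treatment_a" and "treatment_b"
# are values of a treatment variable, not variables themselves.
messy <- data.frame(
  person      = c("P1", "P2", "P3"),
  treatment_a = c(NA, 16, 3),
  treatment_b = c(2, 11, 1)
)

# Tidy version: one row per (person, treatment) observation.
tidy <- melt(messy, id.vars = "person",
             variable.name = "treatment", value.name = "result")

# Strip the "treatment_" prefix so the column holds only the value.
tidy$treatment <- sub("^treatment_", "", as.character(tidy$treatment))

print(tidy)
```

Each variable (person, treatment, result) now has its own column, and each row is a single observation.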

messy vs. tidy (molten)

reshaping data

  • "Long" and "wide" data:
    • In general, long = tidy and wide = messy
    • Long data is generally good, but wide data is sometimes useful too
    • "Long" and "wide" are relative terms
    • Data can be reshaped from wide to long and vice-versa
  • Tools to reshape data:
    • reshape2 package: melt() <- for wide to long
    • reshape2 package: dcast() <- for long to wide (cast() belongs to the older reshape package)
    • stats package: reshape() <- for either direction (direction = "long" or "wide")
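
As a rough sketch of how these tools fit together, the round trip below reshapes a small made-up data frame in both directions. The names `wide`, `id`, `visit1`, `visit2`, and `score` are assumptions chosen for illustration only.

```r
library(reshape2)

# Hypothetical wide data: one row per subject, one column per visit.
wide <- data.frame(id     = 1:3,
                   visit1 = c(5.1, 4.8, 6.0),
                   visit2 = c(5.4, 4.9, 6.3))

# wide -> long with reshape2::melt()
long <- melt(wide, id.vars = "id",
             variable.name = "visit", value.name = "score")

# long -> wide with reshape2::dcast()
wide_again <- dcast(long, id ~ visit, value.var = "score")

# stats::reshape() handles both directions
long_base <- reshape(wide, direction = "long",
                     varying = c("visit1", "visit2"), v.names = "score",
                     timevar = "visit", times = 1:2, idvar = "id")
wide_base <- reshape(long_base, direction = "wide",
                     idvar = "id", timevar = "visit", v.names = "score")

print(long)
print(wide_again)
```

melt() and dcast() tend to need less bookkeeping than reshape(), which is why the examples that follow use reshape2.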

reshape example

head(VADeaths, 10)  #wide format
##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0
library(reshape2)
help(melt.array)  # VADeaths is a matrix, not a data frame

reshape example (continued)

# go from wide to long format
# VADeaths is a matrix, so melt() uses its array method and id.vars does not apply
VADeaths <- melt(VADeaths)
head(VADeaths)
##    Var1         Var2 value
## 1 50-54   Rural Male  11.7
## 2 55-59   Rural Male  18.1
## 3 60-64   Rural Male  26.9
## 4 65-69   Rural Male  41.0
## 5 70-74   Rural Male  66.0
## 6 50-54 Rural Female   8.7
names(VADeaths)[c(1, 3)] <- c("Age", "Deaths")

reshape example (continued)

# Var2 represents 2 variables
is.rural <- grepl("Rural", VADeaths$Var2)
VADeaths$Location <- ifelse(is.rural, "Rural", "Urban")
is.male <- grepl("Male", VADeaths$Var2)
VADeaths$Sex <- ifelse(is.male, "Male", "Female")

head(VADeaths)
##     Age         Var2 Deaths Location    Sex
## 1 50-54   Rural Male   11.7    Rural   Male
## 2 55-59   Rural Male   18.1    Rural   Male
## 3 60-64   Rural Male   26.9    Rural   Male
## 4 65-69   Rural Male   41.0    Rural   Male
## 5 70-74   Rural Male   66.0    Rural   Male
## 6 50-54 Rural Female    8.7    Rural Female
VADeaths$Var2 <- NULL
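
The grepl() approach above can also be sketched with reshape2's colsplit(), which splits a string column on a pattern. This is an alternative, not the method used in these slides; the object names `long` and `parts` and the column name `Group` are assumptions, and the example starts again from the original built-in matrix.

```r
library(reshape2)

# Start from the original built-in matrix (melt() uses its array method).
long <- melt(datasets::VADeaths)
names(long) <- c("Age", "Group", "Deaths")

# Split "Rural Male" etc. on the space into two new columns.
parts <- colsplit(as.character(long$Group), pattern = " ",
                  names = c("Location", "Sex"))

# Keep the tidy columns and drop the combined Group column.
long <- cbind(long[, c("Age", "Deaths")], parts)

print(head(long))
```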

reshape example (continued)

# Another way: Logical Variables
VADeaths$Rural <- is.rural
VADeaths$Male <- is.male
head(VADeaths)
##     Age Deaths Location    Sex Rural  Male
## 1 50-54   11.7    Rural   Male  TRUE  TRUE
## 2 55-59   18.1    Rural   Male  TRUE  TRUE
## 3 60-64   26.9    Rural   Male  TRUE  TRUE
## 4 65-69   41.0    Rural   Male  TRUE  TRUE
## 5 70-74   66.0    Rural   Male  TRUE  TRUE
## 6 50-54    8.7    Rural Female  TRUE FALSE

reshape exercise

  • "Widen" the format of the VA data by creating a column for "Male Deaths" and "Female Deaths"
  • Can use reshape() or dcast()
  • Keep the long data frame, since we'll use it again
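
One possible approach to this exercise, sketched with dcast(). The example rebuilds the long data from the original built-in matrix so that it runs on its own; the object names `long` and `wide` and the column name `Group` are assumptions.

```r
library(reshape2)

# Rebuild the long data frame from the original matrix.
long <- melt(datasets::VADeaths)
names(long) <- c("Age", "Group", "Deaths")
long$Location <- ifelse(grepl("Rural", long$Group), "Rural", "Urban")
long$Sex <- ifelse(grepl("Male", long$Group), "Male", "Female")

# One row per Age x Location, one column per Sex.
wide <- dcast(long, Age + Location ~ Sex, value.var = "Deaths")

# Rename the new columns as the exercise asks.
names(wide)[names(wide) == "Male"] <- "Male Deaths"
names(wide)[names(wide) == "Female"] <- "Female Deaths"

print(wide)
```

The long data frame is left unchanged, so it can still be used in the later parts.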

Part 2: Summarizing Data

dplyr package

  • Here's a nice guide
  • Fast (compared with base R, plyr, and cast) and designed specifically for data frames (plyr also handles lists and arrays)
  • Contains several functions for common data manipulations
    • filter() - select a subset of rows (faster than subset())
    • select() - select a subset of columns
    • arrange() - orders rows by variable(s)
    • mutate() - add new columns
    • summarise() - compute mean, median, count, and other summaries
    • group_by():
      • specifies groups of rows defined by some variable(s)
      • mostly affects arrange() and summarise()
      • grouped data frames are nicely displayed (no extra effort required)
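
A quick sketch of each verb on the built-in mtcars data. The object names (`cars`, `four_cyl`, and so on) and the derived `wt_per_hp` column are chosen for illustration only.

```r
library(dplyr)

# Copy mtcars and keep the car names as an ordinary column.
cars <- mtcars
cars$model <- rownames(mtcars)

four_cyl   <- filter(cars, cyl == 4)             # subset of rows
narrow     <- select(cars, model, mpg, cyl)      # subset of columns
by_mpg     <- arrange(cars, desc(mpg))           # reorder rows, highest mpg first
with_ratio <- mutate(cars, wt_per_hp = wt / hp)  # add a new column

# group_by() defines the groups; summarise() then returns one row per group.
by_cyl      <- group_by(cars, cyl)
cyl_summary <- summarise(by_cyl, mean_mpg = mean(mpg), count = n())

print(cyl_summary)
```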

dplyr example

library(dplyr)
head(VADeaths)
##     Age Deaths Location    Sex Rural  Male
## 1 50-54   11.7    Rural   Male  TRUE  TRUE
## 2 55-59   18.1    Rural   Male  TRUE  TRUE
## 3 60-64   26.9    Rural   Male  TRUE  TRUE
## 4 65-69   41.0    Rural   Male  TRUE  TRUE
## 5 70-74   66.0    Rural   Male  TRUE  TRUE
## 6 50-54    8.7    Rural Female  TRUE FALSE
VADeaths$Rural <- NULL
VADeaths$Male <- NULL

dplyr example (continued)

VADeaths <- group_by(VADeaths, Age, Location)
VADeaths <- select(VADeaths, Age:Location)          #specify range of columns
VADeaths <- select(VADeaths, Age, Location, Deaths) #specify certain columns
VADeaths
## Source: local data frame [20 x 3]
## Groups: Age, Location
## 
##      Age Location Deaths
## 1  50-54    Rural   11.7
## 2  55-59    Rural   18.1
## 3  60-64    Rural   26.9
## 4  65-69    Rural   41.0
## 5  70-74    Rural   66.0
## 6  50-54    Rural    8.7
## 7  55-59    Rural   11.7
## 8  60-64    Rural   20.3
## 9  65-69    Rural   30.9
## 10 70-74    Rural   54.3
## 11 50-54    Urban   15.4
## 12 55-59    Urban   24.3
## 13 60-64    Urban   37.0
## 14 65-69    Urban   54.6
## 15 70-74    Urban   71.1
## 16 50-54    Urban    8.4
## 17 55-59    Urban   13.6
## 18 60-64    Urban   19.3
## 19 65-69    Urban   35.1
## 20 70-74    Urban   50.0

dplyr example (continued)

# calculate mean deaths and number of observations in each group
summarise(VADeaths, MeanDeaths = mean(Deaths), Count = n())
## Source: local data frame [10 x 4]
## Groups: Age
## 
##      Age Location MeanDeaths Count
## 1  70-74    Urban      60.55     2
## 2  65-69    Urban      44.85     2
## 3  60-64    Urban      28.15     2
## 4  55-59    Urban      18.95     2
## 5  50-54    Urban      11.90     2
## 6  70-74    Rural      60.15     2
## 7  65-69    Rural      35.95     2
## 8  60-64    Rural      23.60     2
## 9  55-59    Rural      14.90     2
## 10 50-54    Rural      10.20     2

dplyr example (continued)

filter(VADeaths, Deaths >= 20, Location == "Urban")
## Source: local data frame [6 x 3]
## Groups: Age, Location
## 
##     Age Location Deaths
## 1 55-59    Urban   24.3
## 2 60-64    Urban   37.0
## 3 65-69    Urban   54.6
## 4 70-74    Urban   71.1
## 5 65-69    Urban   35.1
## 6 70-74    Urban   50.0

dplyr exercise!

Use the built-in CO2 dataset

  1. Compute the mean and standard deviation of uptake by plant type and treatment
  2. Compute a 95% confidence interval for mean uptake by plant type and treatment
  3. Add a new "Uptake per Concentration" column and compute a 95% confidence interval for its mean

dplyr exercise solution

head(CO2)

# define groups by Type and Treatment
CO2 <- group_by(CO2, Type, Treatment)

# compute mean and standard deviation of uptake by Type and Treatment
summarise(CO2, uptake.mean = mean(uptake), uptake.sd = sd(uptake))

# compute values needed for confidence interval by Type and Treatment
CO2.CI <- summarise(CO2, uptake.mean = mean(uptake), uptake.sd = sd(uptake), 
    count = n())
# 95% CI by normal approximation: mean +/- 1.96 * standard error
CO2.CI <- mutate(CO2.CI, CI.U = uptake.mean + 1.96 * uptake.sd/sqrt(count),
    CI.L = uptake.mean - 1.96 * uptake.sd/sqrt(count))

# create new 'uptake per concentration' column
CO2 <- mutate(CO2, uptake_per_conc = uptake/conc)
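
The third part of the exercise can be finished along the same lines. The sketch below is one possible completion, not the original solution: the names `co2`, `upc.CI`, `upc.mean`, and `upc.sd` are assumptions, and it starts from a fresh copy of the built-in data so that it runs on its own.

```r
library(dplyr)

# Fresh copy of the built-in data as a plain data frame.
co2 <- as.data.frame(datasets::CO2)

# New column, then groups by Type and Treatment.
co2 <- mutate(co2, uptake_per_conc = uptake / conc)
co2 <- group_by(co2, Type, Treatment)

# Per-group mean, standard deviation, and count.
upc.CI <- summarise(co2,
                    upc.mean = mean(uptake_per_conc),
                    upc.sd   = sd(uptake_per_conc),
                    count    = n())

# 95% CI by normal approximation, as in the uptake example above.
upc.CI <- mutate(upc.CI,
                 CI.L = upc.mean - 1.96 * upc.sd / sqrt(count),
                 CI.U = upc.mean + 1.96 * upc.sd / sqrt(count))

print(upc.CI)
```

With 21 observations per group, replacing 1.96 with qt(0.975, count - 1) would give a slightly wider t-based interval.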

Part 3: Visualizing Data

ggplot2 package

  • two main functions
    • qplot() - "quick" plot, similar syntax to base graphics
    • ggplot() - full functionality of ggplot2
  • tips:
    • start with qplot() and build up to ggplot()
    • use the documentation at http://docs.ggplot2.org/current/
    • find answers to many questions on StackOverflow (via Google)

ggplot example

library(ggplot2)
head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2)  #default geom = point

# ggplot syntax
ggplot(data = CO2, aes(x = conc, y = uptake)) + geom_point()

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2, colour = Type)

# ggplot syntax
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type)) + geom_point()

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2, colour = Type) + facet_grid(. ~ Treatment)

# ggplot syntax
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type)) + geom_point() + 
    facet_grid(. ~ Treatment)

ggplot example (continued)

qplot(x = conc, y = uptake, data = CO2, colour = Type, group = Plant, geom = "line") + 
    facet_grid(. ~ Treatment, labeller = "label_both")

# ggplot syntax
# option 1: set the grouping for the line layer only (it must go inside aes())
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type)) + geom_point() +
    facet_grid(. ~ Treatment, labeller = "label_both") + geom_line(aes(group = Plant))
# option 2: set the grouping once for all layers
ggplot(data = CO2, aes(x = conc, y = uptake, colour = Type, group = Plant)) +
    geom_point() + facet_grid(. ~ Treatment, labeller = "label_both") + geom_line()

ggplot exercise!