Data

Import, exploration, and plotting

Ben Weinstein


As scientists, our major currency is data. R provides a common ground for data analysis. Using R for data visualization, exploration, and analysis opens up a massive set of tools.

You will find that nothing, absolutely nothing, that you want to do has not at least been tried before. There are packages covering every imaginable type of ecological, evolutionary, and statistical approach. Today we will discuss how to read in data, perform basic functions, and produce figures.

Tips for Importing Data

R is not a spreadsheet application. Enter data in Excel or Access, then export it to a file and read that file into R for analysis.

Data are easiest to read in .csv format. In Excel, choose Save As and select the comma-separated values (.csv) type.

Avoid spaces, special characters, and hanging lines of data.

traits <- read.csv("C:/Users/Jorge/Documents/IntroR/05-DataExploration/Traits.csv", 
    row.names = 1)
head(traits)
##       Clade        Genus         Species                   double                      English  Bill Mass WingChord
## 1 Coquettes    Adelomyia     melanogenys    Adelomyia melanogenys         Speckled Hummingbird 15.04 4.25     55.87
## 2 Brilliant   Aglaeactis     cupripennis   Aglaeactis cupripennis              Shining Sunbeam 18.71 8.44     85.62
## 3 Coquettes Aglaiocercus       coelestis   Aglaiocercus coelestis          Violet-tailed Sylph 16.25 6.07     68.68
## 4 Coquettes Aglaiocercus           kingi       Aglaiocercus kingi            Long-tailed Sylph 15.77 5.53     67.12
## 5   Emerald     Amazilia        amazilia        Amazilia amazilia         Amazilia Hummingbird 18.54 4.07     53.33
## 6   Emerald     Amazilia castaneiventris Amazilia castaneiventris Chestnut-bellied Hummingbird 18.70 4.75     52.70
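
After a file is read in, it helps to check that R interpreted each column as intended. The sketch below is only an illustration: instead of Traits.csv it reads a three-row CSV from a string through the `text` argument of read.csv(). The `ID` column name and the object names `csv_text` and `toy` are made up for the example, and the few values are copied from the output above.

```r
# Toy stand-in for Traits.csv (hypothetical ID column, three rows only).
csv_text <- "ID,Clade,Genus,Bill,Mass
1,Coquettes,Adelomyia,15.04,4.25
2,Brilliant,Aglaeactis,18.71,8.44
3,Emerald,Amazilia,18.54,4.07"

# row.names = 1 uses the first column (ID) as row names, as in the call above.
# stringsAsFactors = TRUE stores text columns as factors.
toy <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

str(toy)      # type of each column
dim(toy)      # number of rows and columns
head(toy, 2)  # first two rows
summary(toy)  # quick per-column summary
```

If a numeric column shows up as a factor or as text in str(), a stray character in the spreadsheet is a likely cause.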

Data Exploration

It is critical to consider your data carefully. Are they categorical or numeric? How much variance is there? Are they complete? For categorical data, the best place to start is a contingency table: how many species are there per clade? For continuous data, try range(), sd(), and mean().

table(traits$Clade)
## 
##       Bee Brilliant Coquettes   Emerald    Hermit    Mangoe     MtGem 
##         5        37        27        34        22        13         1 
##  Patagona   Topazes 
##         1         1
mean(traits$Bill)
## [1] 21.41
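
The summaries suggested above for continuous data, together with a check for completeness, can be sketched as follows. The vector `bill` is invented for illustration and is not taken from the traits data; one value is deliberately missing.

```r
# Toy bill lengths; invented values with one missing entry.
bill <- c(15.0, 18.7, 16.3, NA, 18.5)

range(bill, na.rm = TRUE)  # smallest and largest value
mean(bill, na.rm = TRUE)   # average
sd(bill, na.rm = TRUE)     # standard deviation
sum(is.na(bill))           # number of missing values

# Without na.rm = TRUE, a single NA makes the result NA.
mean(bill)
```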

Try It!

  1. What is the range of body sizes?
  2. Which genus has the most species?
  3. Look up the which.max function; read the help screen; which species has the longest bill?
  4. Create a two way table of genus and clade, what does this show?

Ensifera ensifera

ggplot2

The ggplot2 package is the gold standard for plotting. It allows basic, intuitive plots that can be endlessly customized. The help pages are full of clear examples, and there is a massive online community to search for answers to basic plotting questions. Let's explore our first plot.

library(ggplot2)
ggplot(traits, aes(x = WingChord, y = Mass)) + geom_point()


  • Parsed: Create a plot from the data frame traits, with the column WingChord on the x axis and the column Mass on the y axis; ggplot picks axis scales that match each column's data type. Add points.

For now, we will always set global aesthetics inside ggplot() and not inside geom_point().
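
To make the distinction concrete, here is a small sketch of the two places an aesthetic mapping can go. The data frame `toy` and its values are invented for illustration; with a single layer both versions draw the same points.

```r
library(ggplot2)

# Toy data; values are invented for illustration.
toy <- data.frame(WingChord = c(55, 85, 68, 53), Mass = c(4.2, 8.4, 6.1, 4.1))

# Aesthetics set globally in ggplot() are inherited by every layer.
p_global <- ggplot(toy, aes(x = WingChord, y = Mass)) + geom_point()

# The same mapping set inside the layer applies to that layer only.
p_layer <- ggplot(toy) + geom_point(aes(x = WingChord, y = Mass))
```

The global form saves retyping the mapping once more layers are added.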

ggplot(traits, aes(x = Clade, y = Mass)) + geom_point()


What if we want something besides points?

There are many geom styles. Type geom_ and hit Tab to see the available types, then get help using ?geom_nameofgeom, replacing nameofgeom with the geom you want (for example, ?geom_boxplot).
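
If tab completion is not available, for instance when running a script, the same list can be produced in code. This is a sketch using base R's ls() on the attached package; the object name `geoms` is arbitrary.

```r
library(ggplot2)

# List every exported ggplot2 object whose name starts with "geom_".
geoms <- ls("package:ggplot2", pattern = "^geom_")
head(geoms)

# In an interactive session, open the help page for one of them:
# ?geom_boxplot
```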

ggplot(traits, aes(x = Clade, y = Mass)) + geom_boxplot()


Building more complex plots

  • Continuous colors can be added (and edited) to convey more information
  • Parsed: Create a plot from the data frame traits. The x axis is the column Mass, the y axis is the column WingChord, and the data are colored by the continuous variable column Bill. Add points.

p <- ggplot(traits, aes(x = Mass, y = WingChord, color = Bill))
p + geom_point()


Building even more complex plots

Shapes and sizes can be added as well. Note how ggplot automatically groups by both variables. In this case we have a few too many groups for the plot to be helpful, but it depends on your data.

  • Parsed: Create a plot from the data frame traits, with Mass on the x axis and WingChord on the y axis. Color the data by the continuous variable Bill, and add shapes to the data based on the categorical variable Clade. Add points.
ggplot(traits, aes(x = Mass, y = WingChord, color = Bill, shape = Clade)) + 
    geom_point()
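
One practical limit is worth knowing here: ggplot2's default shape palette handles at most six groups, so a factor with more levels (Clade has nine in the table above) produces a warning and leaves some groups undrawn. One possible workaround, sketched below on invented data, is to supply the shape codes yourself with scale_shape_manual(). The data frame `toy` and the column `Group` are hypothetical.

```r
library(ggplot2)

# Toy data with nine groups; values are invented for illustration.
toy <- data.frame(
  Mass      = c(3, 4, 5, 6, 7, 8, 9, 10, 11),
  WingChord = c(40, 45, 50, 55, 60, 65, 70, 75, 80),
  Group     = factor(letters[1:9])
)

# Supply one shape code per level so that all nine groups are drawn.
p <- ggplot(toy, aes(x = Mass, y = WingChord, shape = Group)) +
  geom_point() +
  scale_shape_manual(values = 1:9)
```

Even with enough shapes, nine symbols are hard to tell apart, which is the point made above about having too many groups.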


ggplot is very smart. Trust it.

Size works the same way. Here color shows the categorical variable Clade and point size shows the continuous variable Mass.

ggplot(traits, aes(x = Bill, y = WingChord, color = Clade, size = Mass)) + geom_point()



Try it!

  1. Plot Bill as a function of WingChord, and save it as object p.
  2. Plot Bill against Clade. Which clade has the lowest median bill size?
  3. Look up geom_histogram. What does it do? Make a histogram of Bill sizes.
  4. Color your histogram by Clade. Which clade does the outlier belong to?

Adding multiple geometries to a plot

Often we want to express more information than a single geometric object can show. ggplot gives us immense flexibility by allowing us to build on our initial plot.

ggplot(traits, aes(x = Mass, y = Bill)) + geom_point() + geom_smooth()
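
Called with no arguments, geom_smooth() chooses a smoother for you and prints a message saying which one it used. Naming the method makes the choice explicit. The sketch below fits a straight line with method = "lm"; the data frame `toy` and its values are invented for illustration.

```r
library(ggplot2)

# Toy data; values are invented for illustration.
toy <- data.frame(Mass = c(3, 4, 5, 6, 7, 8), Bill = c(14, 16, 15, 19, 20, 22))

# Points plus a straight-line fit; se = FALSE hides the confidence band.
p <- ggplot(toy, aes(x = Mass, y = Bill)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Layers are drawn in the order they are added, so the line here sits on top of the points.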


Practice plotting

To show some more features, let's make a smaller dataset.

With a smaller dataset we can explore more options, adding both color and shape. Drawing on what we've done already, how would we subset our data to get just the Coquettes clade?

Bearded HelmetCrest

coq <- droplevels(traits[traits$Clade == "Coquettes", ])
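
The droplevels() call matters when Clade is stored as a factor: a row subset keeps every original level, even those with no rows left, and the empty levels would still show up in tables and plot legends. Note that since R 4.0.0 read.csv() reads text columns as character vectors unless stringsAsFactors = TRUE is set, in which case there are no levels to drop. The sketch below shows the effect on invented data; `toy`, `kept`, `trimmed`, and `same` are hypothetical names.

```r
# Toy data; values are invented for illustration.
toy <- data.frame(
  Clade = factor(c("Coquettes", "Emerald", "Coquettes", "Hermit")),
  Bill  = c(15.0, 18.5, 16.3, 30.1)
)

# Logical indexing keeps the matching rows but remembers all factor levels.
kept <- toy[toy$Clade == "Coquettes", ]
levels(kept$Clade)     # "Coquettes" "Emerald" "Hermit"

# droplevels() removes the levels that no longer occur.
trimmed <- droplevels(kept)
levels(trimmed$Clade)  # "Coquettes"

# subset() is an equivalent way to select the rows.
same <- droplevels(subset(toy, Clade == "Coquettes"))
```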

Text can be added, and manipulated directly

ggplot(coq, aes(x = Bill, y = WingChord, size = Mass, label = Genus)) + geom_point() + 
    geom_text(size = 3)


Data can be faceted into panels

ggplot(coq, aes(x = Species, y = WingChord, col = Bill, size = Mass)) + geom_point() + 
    facet_grid(~Genus, scales = "free") + theme(axis.text.x = element_text(angle = -90)) + 
    scale_color_continuous(low = "blue", high = "red") + ylab("Wing Length")


Parsed: Plot the data frame coq, with Species on the x axis and WingChord on the y axis. Color the data by Bill size, and scale the size of the points by Mass. Add points. Create a panel for each Genus, and let the axis scales differ between panels. Rotate the x axis labels by -90 degrees. Color Bill on a gradient from blue (low) to red (high). Label the y axis "Wing Length".

10 min Group Assignment

Come up with a simple question and represent it graphically

Exporting Dataframes

If we want to export the data that we created, we can save it to a file as a .csv

write.csv(coq, "Coquettes.csv")
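
By default write.csv() also writes the row names as an unnamed first column, which is the kind of column that row.names = 1 absorbs when the file is read back in. If that column is not wanted, row.names = FALSE leaves it out. The round trip below is a sketch on invented data; `toy`, `path`, and `back` are hypothetical names, and the file goes to a temporary location rather than the working directory.

```r
# Toy data; values are invented for illustration.
toy <- data.frame(Genus = c("Adelomyia", "Aglaiocercus"), Bill = c(15.0, 16.3))

# tempfile() gives a throwaway path; use a real file name in practice.
path <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the extra first column of row names.
write.csv(toy, path, row.names = FALSE)

# Reading the file back returns the same columns.
back <- read.csv(path)
names(back)  # "Genus" "Bill"
```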