Lab 2: Displaying & Describing Data

Lab 2 Outline

1) Run code used to generate the different kinds of graphs discussed in class

2) Re-run eagles code from Lab 1

3) Add new data to a state you’ve been assigned

4) Calculate summary stats for that state & fill out in-class worksheet

5) Create polished eagle graph

Based on Chapters 2 & 3, Whitlock and Schluter, 2nd Ed.











Example data: Fisher’s Irises

Loading data into R the easy way:

pre-made data in a “Package”

Getting data into R (or SAS, or ArcGIS…) can be a pain

R comes with many datasets that are pre-loaded into it

There are also many stat. techniques that can easily be added to R

These are contained in “packages”

Load data that is already in the “base” distribution of R

Fisher’s iris data comes automatically with R. You can load it into memory using the command “data()”.

#Load the iris data
data(iris)

Look at the iris data

You can check that it was loaded using the ls() command (“list”).

ls()
## [1] "iris"

You can get info about the nature of the data frame using commands like dim()

dim(iris)
## [1] 150   5

This tells us that the iris data is essentially a spreadsheet that has 150 rows and 5 columns.

We can get the column names with names()

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Note that the first letter of each word is capitalized. What are the implications of this?

The top of the data and the bottom of the data can be checked with head() and tail()

#top of data
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#bottom of data
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Another common R command is is(), which tells you what something is in R land

is(iris)
## [1] "data.frame" "list"       "oldClass"   "vector"

R might spew a lot of things out at you; usually the first item is most important. Here, it tells us that the “object” called iris in your workspace is first and foremost a “data.frame”, which is essentially a spreadsheet of data loaded into R.
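As a small sketch of other ways to ask the same question, the base R functions class(), is.data.frame() and str() can be used alongside is(). This is only an illustration, not a required part of the lab.

```r
#Load the iris data
data(iris)

#class() reports just the main type of the object
class(iris)

#is.data.frame() answers the yes/no question directly
is.data.frame(iris)

#str() gives a compact overview: size, column names, and column types
str(iris)
```

str() is handy because it combines much of what dim(), names() and head() show into a single printout.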

You can get basic info about the data themselves using commands like summary()

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

If you wanted information on just a single column, you would tell R to isolate that column like this

summary(iris$Sepal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400

That is, the name of the data frame, a dollar sign ($), and the name of the column.
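A minimal sketch of the same idea: once a column is isolated with the dollar sign, it can be handed to other functions such as mean() and sd(). The double-bracket form shown here is an alternative base R way to pull out a column. The object names by.dollar and by.name are made up for this illustration.

```r
#Load the iris data
data(iris)

#isolate one column with the dollar sign
by.dollar <- iris$Sepal.Width

#the same column, pulled out by name with double brackets
by.name <- iris[["Sepal.Width"]]

#both approaches return exactly the same numbers
identical(by.dollar, by.name)

#the isolated column can be passed to other functions
mean(by.dollar)
sd(by.dollar)
```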

What happens when you don’t capitalize something?

#all lower case
# summary(iris$sepal.width) # this won't work

#just "s" in "sepal" lower case
# summary(iris$sepal.Width)  #this won't work either

#or what if you capitalize "i" in "Iris"?
# summary(Iris$Sepal.Width) #won't work either

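The 1st two lines do not actually produce an error message: R quietly returns a summary of an empty (NULL) object, which is not very informative. The 3rd one gives an error saying that the object “Iris” was not found, which does make a little sense.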
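The sketch below shows one way to see what R does in each case without stopping your script. The object names wrong.col and result are made up for this illustration, and try() is used only so the error in the last case is caught rather than halting the code.

```r
#Load the iris data
data(iris)

#wrong capitalization of a column name: no error, R just returns NULL
wrong.col <- iris$sepal.width
is.null(wrong.col)

#correct capitalization returns the actual measurements
length(iris$Sepal.Width)

#wrong capitalization of the data frame name: this one is a real error
#try() catches it so the rest of the script keeps running
result <- try(summary(Iris$Sepal.Width), silent = TRUE)
class(result)
```

The take-home message is that R is case sensitive everywhere, but it only complains loudly when the object itself cannot be found.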

Load data that is in another R package part 1

Packages that come w/ R

Many scientists develop software for R, and they often include datasets to demonstrate how the software works. This software is distributed in “packages”. Some packages come with R already and just need to be loaded. This is done with the library() command.

The MASS package comes with R when you download it.

#Load the MASS package
library(MASS)

MASS contains a dataset called “mammals”

data(mammals)

You can confirm that the mammals data is in your workspace using ls()

ls()
## [1] "iris"    "mammals"

You should now have the iris and the mammals data.

What is in the mammals dataset? Datasets usually have useful help files. Access help using the “?” function.

?mammals
## starting httpd help server ...
##  done

The help screen will pop up. It tells us that mammals is “A data frame with average brain and body weights for 62 species of land mammals.” At the bottom we can see that these data come from the paper

“Selected from: Allison, T. and Cicchetti, D. V. (1976) Sleep in mammals: ecological and constitutional correlates. Science 194, 732-734.”

We can learn about the mammals data using the usual commands

dim(mammals)
## [1] 62  2
names(mammals)
## [1] "body"  "brain"
head(mammals)
##                    body brain
## Arctic fox        3.385  44.5
## Owl monkey        0.480  15.5
## Mountain beaver   1.350   8.1
## Cow             465.000 423.0
## Grey wolf        36.330 119.5
## Goat             27.660 115.0
tail(mammals)
##                    body brain
## Echidna           3.000  25.0
## Brazilian tapir 160.000 169.0
## Tenrec            0.900   2.6
## Phalanger         1.620  11.4
## Tree shrew        0.104   2.5
## Red fox           4.235  50.4
summary(mammals)
##       body              brain        
##  Min.   :   0.005   Min.   :   0.14  
##  1st Qu.:   0.600   1st Qu.:   4.25  
##  Median :   3.342   Median :  17.25  
##  Mean   : 198.790   Mean   : 283.13  
##  3rd Qu.:  48.203   3rd Qu.: 166.00  
##  Max.   :6654.000   Max.   :5712.00

Load data that is in another R package part 2

Packages from CRAN

Most packages don’t come with R when you download it but are stored in a central site called CRAN. We’ll load the “doBy” package.

Loading packages using RStudio

RStudio makes it easy to find and load packages.
* In the panel of RStudio that has the tabs “Plots”, “Packages”, “Help”, “Viewer”, click on “Packages”.
* On the next line it says “Install” and “Update”. Click on “Install”.
* A window will pop up. In the white field in the middle of the window under “Packages”, type the name of the package you want.
* RStudio will automatically bring up potential packages as you type.
* Finish typing “doBy” or click on the name.
* Click on the “Install” button.
* In the source viewer some misc. text should show up.

Most of the time this works. If it doesn’t, talk to the professor!

Loading packages using code

You can also use the install.packages() command to download and install the package. I already have “doBy” downloaded, so I have “commented out” the code with a “#”. To run the code, remove the “#”. If you already followed the instructions above you don’t need to run the code.

#install.packages("doBy")

What if you tell R to install a package you already have downloaded?

If you already have the package downloaded to your computer, a window may pop up asking whether you want to restart R. Normally this isn’t necessary; just click “no”. You might see a “warning” message pop up in the console such as “Warning in install.packages: package ‘doBy’ is in use and will not be installed”. This isn’t a problem.
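One way to avoid re-installing a package you already have is to check for it first. The sketch below is only a suggestion: the helper name pkg.installed is made up for this illustration, and the install.packages() line is left commented out, as above, so nothing is downloaded unless you remove the “#”.

```r
#hypothetical helper: TRUE if a package is already installed, FALSE if not
pkg.installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

#only install doBy if it is missing
if (!pkg.installed("doBy")) {
  message("doBy is not installed; remove the # on the next line to install it")
  #install.packages("doBy")
} else {
  message("doBy is already installed")
}
```

Either way, you still need library(doBy) to load the package into your current R session before using it.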










What if I can’t get a package I need loaded?

  • Talk to someone who is good w/R (eg, your professor)
  • Google something like “how to install R package” for general info
  • Google something like “problem loading R package”
  • Copy and paste any error message you might be getting into Google and see if anyone has written about this problem










Boxplots

Now we’re going to finally plot some data










R function: boxplot(…)

Very very easy to make boxplots in R

Very very very very hard to make in Excel

You should be making these all the time to explore your data

Also good way to present data in a final paper

Making a basic plot is fairly easy. This basic structure is used frequently in R so we will discuss it in some detail.

#Load the iris data using the data() command
data(iris) #load iris data in case you haven't

#make a very basic boxplot
boxplot(Sepal.Length ~ Species,    #the plotting formula of y ~ x
        data = iris)               #the data to plot










Anatomy of an R command

Key features of the command are

* the “~”
* the “comma”
* the “data = iris”
* capitalization
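The pieces of the command can be looked at one at a time, as in the sketch below. The object names iris.formula and bp are made up for this illustration, and plot = FALSE is used only so that boxplot() hands back its numbers instead of drawing the figure.

```r
#Load the iris data
data(iris)

#the "~" makes a formula: y variable on the left, grouping (x) variable on the right
iris.formula <- Sepal.Length ~ Species
class(iris.formula)

#commas separate the arguments; "data = iris" tells R where to find
#the columns named in the formula (capitalization must match exactly)
bp <- boxplot(iris.formula,
              data = iris,
              plot = FALSE)   #return the summary numbers without plotting

bp$names   #one box per group
bp$n       #number of observations in each box
bp$stats   #the five numbers drawn for each box (whiskers, hinges, median)
```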

Anatomy of the boxplot command










Box plot with labels

R will usually generate labels for the x and y axes based on the command. These need to be changed.


“xlab =” sets the label for the x-axis, “ylab =” for the y-axis. Note that these both occur inside the parentheses, and that the text for the labels goes in quotes. Forgetting the quotes will cause the code to fail.

boxplot(Sepal.Length ~ Species,    #the plotting formula of y ~ x
        data = iris,           #the data to plot
        xlab = "Iris Species", #Label for x axis
        ylab = "Sepal Length (cm)" ) #Label for y axis, w/ units










Changing colors in R plots part 1

If we wanted we could change the color of the boxplots using the argument “col =”. This argument can be used to change the color of most types of plots in R. This doesn’t increase the information content of the figure but maybe makes it nicer to look at.

boxplot(Sepal.Length ~ Species,    #the plotting formula of y ~ x
        data = iris,           #the data to plot
        xlab = "Iris Species", #Label for x axis
        ylab = "Sepal Length",
        col = 3) 










Changing colors in R plots part 2

If we want we could set the color of each box to be different. This accents that each box is a different Iris species. The code “col = 2:4” tells R to use the sequence of colors “2, 3, 4”. I skip “1” b/c it’s black and will obscure the bar that indicates the median.

#boxplot w/ each color different
boxplot(Sepal.Length ~ Species,    #the plotting formula of y ~ x
        data = iris,           #the data to plot
        xlab = "Iris Species", #Label for x axis
        ylab = "Sepal Length",
        col = 2:4) 










This next code uses a slight variant. I’ve reversed the order of the colors so you can see the difference. I’ve used “col = c(4,3,2)” which is longhand for “col = 4:2”

boxplot(Sepal.Length ~ Species,    #the plotting formula of y ~ x
        data = iris,           #the data to plot
        xlab = "Iris Species", #Label for x axis
        ylab = "Sepal Length",
        col = c(4,3,2)) 












We could also do this

boxplot(Sepal.Length ~ Species,    #the plotting formula of y ~ x
        data = iris,           #the data to plot
        xlab = "Iris Species", #Label for x axis
        ylab = "Sepal Length",
        col = c("red","green","blue")) 












The “c(…)” Function

This shows up a lot in R code.

can have numbers inside it, separated by a colon

c(2:4)

can have numbers in it, sep. by commas

c(4,3,2)

can have text in it, sep. by commas AND w/ " "

c("red","green","blue")
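A short sketch of what each of these produces. The object names x, y and z are made up for this illustration.

```r
#a sequence made with a colon; the c() around it is optional
x <- c(2:4)
x

#numbers separated by commas, in whatever order you like
y <- c(4, 3, 2)
y

#the two ways of writing "4, 3, 2" hold the same values
all(y == 4:2)

#text separated by commas, each piece in quotes
z <- c("red", "green", "blue")
z

#length() tells you how many things are inside
length(z)
```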












Plotting Means w/ Error Bars

library(MASS)
data(mammals)

#The errbar() function used below is in the Hmisc package
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
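#sample size (number of species)
n <- nrow(mammals)

#means (note: in this dataset body mass is in kg and brain mass is in g)
u.body <- mean(mammals$body)
u.brain <- mean(mammals$brain)

#standard errors: SE = SD/sqrt(n)
se.body <- sd(mammals$body)/sqrt(n)
se.brain <- sd(mammals$brain)/sqrt(n)

#put the means and SEs into a small dataframe
mam.df <- data.frame(body.part  = c("body","brain"),
                     mean = c(u.body,u.brain),
                     SE = c(se.body,se.brain))

#plot the means with error bars of +/- 1 SE
errbar(x = 1:2,
       y = mam.df$mean,
       yplus  = mam.df$mean + mam.df$SE,
       yminus = mam.df$mean - mam.df$SE,
       xlab = "Body Part", ylab = "Mass",
       xlim = c(0.5,2.5),
       xaxt = "n",     #turn off the default x axis
       cex = 3)

#add a custom x axis and annotations
axis(side=1,at=1:2,labels=mam.df$body.part)
legend("topleft", legend = "Error bars = SE",bty = "n") 
legend("bottomleft", legend = "n = 62 spp",bty = "n") 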
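The standard error calculation above is written out twice, once for each column. As a minimal sketch, it could also be wrapped in a small function. The name std.error is made up for this illustration (base R does not have a built-in standard error function), and dropping missing values first is an assumption that has no effect on the mammals data.

```r
library(MASS)
data(mammals)

#hypothetical helper: standard error of the mean, SE = SD/sqrt(n)
std.error <- function(x) {
  x <- x[!is.na(x)]          #drop missing values, if any
  sd(x) / sqrt(length(x))
}

#same values as the longhand calculations
se.body  <- std.error(mammals$body)
se.brain <- std.error(mammals$brain)
se.body
se.brain
```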











Bar graphs of DISCRETE/CATEGORICAL data

Number of extinct birds from each Hawaiian island

-Shows the frequency of each category
-Error bars are not possible w/ these data
-This type of plot is useful for general descriptions of data

#Data from Science paper by biogeographer...not Gaston...
Island <- c("Hawaii","Kauai","Lanai","Maui","Molokai","Oahu")
Extinctions <- c(11,1,7,4,7,7)

#sort the islands from most to fewest extinctions
i <- order(Extinctions,decreasing = TRUE)
barplot(height = Extinctions[i], names.arg = Island[i],
        col = 1:length(Island))
Histograms of CONTINUOUS data

Histograms are very useful when you

-explore datasets you are seeing for the 1st time

-display data that is skewed or oddly shaped

-Convey similar idea as boxplot

-vertical lines often added to show mean, median, etc

-made with “hist()” in R

A basic histogram

This plot shows the distribution of birthweights from a set of babies born in a hospital.

library(MASS)
data(birthwt)
hist(birthwt$bwt)

Add labels

library(MASS)
hist(birthwt$bwt, 
     xlab = "Birthweight (grams)",
     ylab = "Frequency")

Add reference lines using the abline() command

library(MASS)
hist(birthwt$bwt, 
     xlab = "Birthweight (grams)",
     ylab = "Frequency")
abline(v = mean(birthwt$bwt), col = 2, lwd = 3,lty =3)

Add reference line for the size of my son

hist(birthwt$bwt, 
     xlab = "Birthweight (grams)",
     ylab = "Frequency")
abline(v = mean(birthwt$bwt), col = 2, lwd = 3,lty =3) #first reference line for the mean
abline(v = 4422.5, col = 3, lwd = 3,lty =1) #Dr. Brouwer's son Jude
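Reference lines for several summary statistics can be added to the same histogram. The sketch below adds both the mean and the median and labels them with a legend. The object names bwt.mean and bwt.median, and the particular colors and line types, are choices made for this illustration.

```r
library(MASS)
data(birthwt)

#calculate the summary statistics first
bwt.mean   <- mean(birthwt$bwt)
bwt.median <- median(birthwt$bwt)

#histogram with labels
hist(birthwt$bwt, 
     xlab = "Birthweight (grams)",
     ylab = "Frequency",
     main = "")

#one vertical line per statistic
abline(v = bwt.mean,   col = 2, lwd = 3, lty = 3)
abline(v = bwt.median, col = 4, lwd = 3, lty = 2)

#legend so the reader knows which line is which
legend("topright",
       legend = c("Mean", "Median"),
       col = c(2, 4), lwd = 3, lty = c(3, 2),
       bty = "n")
```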











Two Continuous variables: Scatter Plots

-Scatter plots are the standard way of plotting two continuous variables against each other
-Frequently used to visualize data for “regression” analysis
-We frequently take the log of numerical data - more on this later

A basic plot

#"Mammal body-brain allometry"
library(MASS)
data(mammals)
plot(log(brain) ~ log(body), 
     data = mammals)
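As a rough sketch of why the log is used here, the code below compares the mean and median of body mass on the raw scale and on the log scale, and draws a histogram of each. The object name log.body is made up for this illustration.

```r
library(MASS)
data(mammals)

#raw scale: a few very large species pull the mean far above the median
mean(mammals$body)
median(mammals$body)

#log scale: compare the same two numbers
log.body <- log(mammals$body)
mean(log.body)
median(log.body)

#side-by-side histograms of the raw and logged data
par(mfrow = c(1, 2))
hist(mammals$body, main = "Raw",    xlab = "Body mass (kg)")
hist(log.body,     main = "Logged", xlab = "log(body mass)")
par(mfrow = c(1, 1))   #reset to one plot per page
```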

Add some labels
#"Mammal body-brain allometry"
library(MASS)
data(mammals)
plot(log(brain) ~ log(body), 
     data = mammals, 
     xlab = "Log of animal body mass",
     ylab = "log(brain mass)")

Increase the size of the points with “cex = 2”

plot(log(brain) ~ log(body), 
     data = mammals, 
     cex = 2,
     xlab = "Log of animal body mass",
     ylab = "log(brain mass)")

Change the shape of the point being plotted w/ “pch = 2”

plot(log(brain) ~ log(body), 
     data = mammals, 
     cex = 2,
     pch =2,
     xlab = "Log of animal body mass",
     ylab = "log(brain mass)")

Increase the thickness of the lines used to draw the points with “lwd=2”

plot(log(brain) ~ log(body), 
     data = mammals, 
     cex = 2,
     pch =2,
     lwd = 2, 
     xlab = "Log of animal body mass",
     ylab = "log(brain mass)")











Measures of variation











Misc. References

Websites
www.biostat.wisc.edu/~kbroman/topten_worstgraphs/
www.americanscientist.org/issues/pub/population-growth-technology-and-tricky-graphs

Papers
Wainer. 1984. How to Display Data Badly. Am. Statistician.