(Slide 1)

Lecture 4: Displaying & Describing Data

Chapters 2 & 3, Whitlock and Schluter, 2nd Ed

Key questions:

What makes a good graph?

What makes a bad graph?

Types of graphs for different types of data

Describing “central tendency”: mean & median

Describing variation: variance & stdev

(Slide 2)

Basic rules of plotting data

1) Show the raw data if possible

2) Show distributional info if possible

3) ALWAYS Include error bars around means

4) ALWAYS Include error bars around means

5) Make patterns in the data easy to see

6) Represent magnitude honestly

7) Draw graphical elements clearly

8) Include a legend and label things clearly

(Slide 3.5)

Plotting Rule #1: Show the raw data

For the next several slides we will consider this classic dataset made popular by R.A. Fisher

Fisher’s irises

(Slide 3.75)

Plotting Rule #1: Show the raw data

Compare the information content of these two graphs

(Slide 3.85)

Plotting Rule #1: Show the Data

Multiple datasets can result in the exact same bar plot and error bars

All of these datasets result in the exact same mean, SD, and SE.

Bar plots - even with error bars - therefore can reveal very little about the data

Additional distributional information is therefore needed
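
A minimal sketch of this point, written in Python using only the standard library (R's built-in mean() and sd() would give the same numbers). The three datasets are made up for illustration and are not the ones shown on the slide. They share the same sample size, mean, SD, and SE, yet one is bimodal and the other two are skewed in opposite directions.

```python
import math
import statistics

# Hypothetical datasets, invented for illustration (not from the lecture).
datasets = {
    "bimodal":      [8, 8, 8, 12, 12, 12],
    "left-skewed":  [6, 10, 10, 10, 12, 12],
    "right-skewed": [8, 8, 10, 10, 10, 14],
}

def summarize(values):
    """Return (mean, sample SD, standard error of the mean)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD, n - 1 in the denominator
    se = sd / math.sqrt(n)          # standard error of the mean
    return mean, sd, se

summaries = {name: summarize(values) for name, values in datasets.items()}
for name, (mean, sd, se) in summaries.items():
    print(f"{name:13s} mean={mean:.2f}  SD={sd:.3f}  SE={se:.3f}")
```

A bar plot with error bars would look identical for all three, which is why the raw points or a boxplot are needed to tell them apart.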

(Figure: bar plots)

(slide 4)

Plotting Rule #2: Show distributional information

Distributional info can be displayed by

-plotting the raw data

-using boxplots

(slide 4.5)

Plotting Rule #2: Show distributional information

“Jittering” when plotting raw data: add a small random offset to each point so that overlapping points can be seen
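
A rough sketch of what jittering does, in Python for illustration (R has a built-in jitter() function for the same purpose). The function name, the noise width of 0.2, and the example positions are all assumptions chosen for this sketch.

```python
import random

def jitter(positions, width=0.2, seed=None):
    """Shift each position by uniform random noise in [-width, +width]."""
    rng = random.Random(seed)
    return [x + rng.uniform(-width, width) for x in positions]

# Hypothetical example: six points that all sit at x = 1 (one group).
group_positions = [1, 1, 1, 1, 1, 1]
jittered = jitter(group_positions, width=0.2, seed=42)
print(jittered)
```

Jitter is normally applied only along the grouping axis, so the measured values themselves are not changed.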

(slide 5)

Plotting Rule #3 & 4: ALWAYS use error bars for means

-Means MUST always have an estimate of uncertainty around them

-The range doesn’t count!

-Typically use “standard error” OR “confidence interval”

-Rarely use “standard deviation”
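
The three quantities can be compared with a short worked example. This is a sketch in Python with a made-up sample. The 95% confidence interval uses the normal quantile (about 1.96) as a simplifying assumption; for a sample this small a t quantile would be more appropriate.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
sample = [4.0, 5.0, 6.0, 7.0, 8.0]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # spread of the individual observations
se = sd / math.sqrt(n)          # uncertainty in the estimate of the mean

# Approximate 95% CI from the normal quantile (an assumption, see above).
z = statistics.NormalDist().inv_cdf(0.975)
ci_low, ci_high = mean - z * se, mean + z * se

print(f"mean={mean:.2f}  SD={sd:.3f}  SE={se:.3f}  "
      f"95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

The SD describes how variable the data are, while the SE and the confidence interval describe how well the mean is estimated, which is why the latter two are the usual choice for error bars around means.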

(Slide 6)

Plotting Rule #5: Make Patterns Easy to See

Keep it as simple as possible

Add labels, annotations etc. to the plot

Use both color AND pattern/shape to distinguish groups

Avoid 3D

Use color-blind friendly palettes

Don’t use most of the fancy stuff in Excel!

Plot from Susan Kalisz et al 2014 PNAS

It is very difficult to read the actual values from this plot

(Figure: 3D bar plots)

(Slide 6.5)

Plotting Rule #5: Make Patterns Easy to See example

In this plot

-What the error bars represent is clearly indicated in the plot

-Sample size is also indicated

-This information could be in a caption, but is even easier to find in the plot

-All of this can be done in R, but it is often easier to annotate plots in PowerPoint

(Slide 7)

Plotting Rule #6: Represent Magnitudes Honestly

This plot emphasizes a certain aspect of data

Some critics think this is deceptive

(Figure: misleading bar plot, magnitude)

What else is missing?
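
One common way a bar plot can overstate a difference is a value axis that does not start at zero. The sketch below is generic and uses made-up numbers. It is not a reconstruction of the plot on this slide, and the function name is hypothetical.

```python
def drawn_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of bar heights as drawn when the axis begins at axis_start."""
    return (value_b - axis_start) / (value_a - axis_start)

# Hypothetical group means, invented for illustration.
group_a, group_b = 50.0, 55.0

true_ratio = drawn_ratio(group_a, group_b, axis_start=0.0)
truncated_ratio = drawn_ratio(group_a, group_b, axis_start=45.0)

print(f"axis from 0:  bar B is {true_ratio:.2f}x the height of bar A")
print(f"axis from 45: bar B is {truncated_ratio:.2f}x the height of bar A")
```

With the axis starting at 45, a 10% difference is drawn as a bar twice as tall.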

Original paper: Oreopoulos & Salvanes. 2011. Priceless: The Nonpecuniary Benefits of Schooling. Journal of Economic Perspectives. aeaweb.org/articles?id=10.1257/jep.25.1.159

Critiques:

econlog.econlib.org/archives/2011/07/job_satisfactio.html

scienceblogs.com/principles/2011/07/10/great-moments-in-deceptive-gra/

See statisticshowto.com/misleading-graphs/ for some more interesting examples

(Slide 8)

Histograms of discrete data

Number of extinct birds from each Hawaiian island

-This shows the frequency of each category

-Error bars are not possible w/these data

-This type of plot is useful for general descriptions of data
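
The frequencies behind a plot like this are just counts per category. A minimal sketch in Python, with made-up records and placeholder island labels (not the data on the slide):

```python
from collections import Counter

# Hypothetical records: one entry per extinct species, giving its island.
# Labels and counts are invented for illustration.
records = ["A", "B", "A", "C", "B", "A", "D"]

frequencies = Counter(records)
for island, count in frequencies.most_common():
    print(f"island {island}: {'#' * count} ({count})")
```

Each bar height is a single count rather than a mean of several measurements, so there is no within-group spread from which to draw an error bar.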

(Slide 9)

Histograms of CONTINUOUS data

Histograms are very useful when you

-explore datasets you are seeing for the 1st time

-display data that is skewed or oddly shaped

-want to convey a similar idea to a boxplot

-vertical lines are often added to show the mean, median, etc.

-made with “hist()” in R

This plot shows the distribution of birthweights from a set of babies born in a hospital. Jude Hendrik Brouwer is shown with the red line. How could this plot be improved?
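
Under the hood a histogram is a count of values in equal-width bins. The sketch below shows that step in Python with made-up birthweights in grams. The function name, bin width, and data are assumptions for illustration, and R's hist() chooses its own breaks by default.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = (v - start) // width
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical birthweights in grams, invented for illustration.
weights = [2400, 2900, 3100, 3200, 3300, 3400, 3500, 3600, 3800, 4300]
counts = bin_counts(weights, start=2000, width=500, n_bins=5)

for i, count in enumerate(counts):
    low = 2000 + i * 500
    print(f"{low}-{low + 499} g: {'#' * count}")
```

Changing the bin width can change the apparent shape of the distribution, so it is worth trying more than one.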

(Slide 10)

Graphing association between “categorical” variables

aka, Turning a contingency table into a graph (Example 2.3A in textbook)
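
A contingency table is a count of observations for each combination of two categorical variables. A minimal sketch in Python with made-up observations (not the data from the textbook example or from the plot on this slide):

```python
from collections import Counter

# Hypothetical (year, fate) observations, invented for illustration.
observations = [
    ("year 1", "survived"), ("year 1", "died"), ("year 1", "died"),
    ("year 2", "survived"), ("year 2", "survived"), ("year 2", "died"),
]

table = Counter(observations)
years = sorted({year for year, _ in observations})
fates = sorted({fate for _, fate in observations})

# Print the table with one row per year and one column per fate.
print(f"{'':8s}" + "".join(f"{fate:>10s}" for fate in fates))
for year in years:
    row = "".join(f"{table[(year, fate)]:>10d}" for fate in fates)
    print(f"{year:8s}" + row)
```

Each cell of the table becomes one bar, or one segment of a bar, in a grouped or stacked bar plot.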

How could this plot be improved?

What would we do if we had 3+ years of data?

From: Chitwood et al. 2015. Do Biological and Bedsite Characteristics Influence Survival of Neonatal White-Tailed Deer? PLoS ONE. doi:10.1371/journal.pone.0119070

(Slide 11)

2 Continuous variables: Scatter Plots

Scatter plots are the standard way of plotting two continuous variables against each other

Frequently used to visualize data for “regression” analysis

We frequently take the log of numerical data - more on this later

(Slide 12)

Numerical responses vs categorical variables

Frequently used w/ t-tests, “ANOVA”

X-axis is some kind of category that groups the data (treatments, years, species)

Plotting raw data useful when there is a small to moderate amount of data (say <50)

Box plots are better when there is lots of data

Both mean and median can be used

Sometimes plot raw data along w/error bars

(Slide 12.5)

Boxplots display distributional information

Median: measure of “central tendency”

-the exact middle of the data

Box = “Interquartile range”

“Whiskers” extend to the largest and smallest non-extreme values

“Outliers” marked as dots

Very difficult to make in Excel

Very easy in R

Very useful for exploring data
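
The pieces of a boxplot can be computed directly. The sketch below is in Python with made-up data. It assumes the common convention that a value is "extreme" when it lies more than 1.5 × IQR beyond the box; the slide does not state which rule is used, and plotting software may compute quartiles slightly differently.

```python
import statistics

def boxplot_stats(values):
    """Median, quartiles, whisker ends, and outliers (1.5 * IQR rule assumed)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical data with one extreme value, invented for illustration.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the box spans the interquartile range, the whiskers stop at 1 and 9, and the value 30 would be drawn as a separate dot.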

The mean & median are often similar

But not always!

The mean is the “center of mass”

Median is “midpoint”

-see section 3.3 of the book for a good discussion of this
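
A small example of when the two agree and when they do not, in Python with made-up numbers:

```python
import statistics

# Hypothetical datasets, invented for illustration.
symmetric = [2, 3, 4, 5, 6]
skewed = [2, 3, 4, 5, 46]   # same as above, but with one very large value

for name, values in [("symmetric", symmetric), ("skewed", skewed)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name:9s} mean={mean}  median={median}")
```

The single large value pulls the mean (the center of mass) upward, while the median (the midpoint) does not move.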

Most statistical techniques, no matter how complicated, are calculating 2+ means of some kind

Most statistical tests, no matter how complicated, are comparing 2+ means somehow

(Slide 13)

Trends over time

Song Sparrow (Melospiza melodia) counts in Darrtown, OH, USA. From USGS Breeding Bird Survey (BBS)

(Slide 13.5)

Trends over time

The slope of the “best fit” line through these points is an estimate of the mean rate of change in the number of birds over time.

This relates to the topic known as “Regression”
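
The slope of a least-squares line can be computed by hand as the sum of cross-products divided by the sum of squares of x. A sketch in Python follows. The years and counts are made up for illustration and are not the Breeding Bird Survey data, and the function name is hypothetical.

```python
def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical yearly counts, invented for illustration.
years = [2000, 2001, 2002, 2003, 2004]
counts = [10, 13, 12, 16, 17]

slope = least_squares_slope(years, counts)
print(f"estimated change: {slope:.2f} birds per year")
```

In R the same estimate would come from fitting a linear model with lm().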

Song Sparrow (Melospiza melodia) counts in Darrtown, OH, USA. From USGS Breeding Bird Survey (BBS)

(Slide 14)

Maps can display data

Maps ’r pretty

R can make maps

Might try to make lab on maps and GIS…

See: http://www.molecularecologist.com/2012/09/making-maps-with-r/

(Slide 15)

Tables vs. Graphs?

Graphs are generally better than tables

Tables should follow the same principles as graphs

Tables are best for highly detailed info (eg p values, t statistics)

Many papers now include tables of raw data in an appendix

See Gelman et al. 2002. Let’s Practice What We Preach: Turning Tables into Graphs Am. Stat.

(Slide 15)

Measures of variation
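
As a worked example of the two measures named in the key questions, the sketch below computes the sample variance and standard deviation by hand in Python and checks them against the standard library. The data are made up for illustration.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n
sum_of_squares = sum((v - mean) ** 2 for v in values)  # squared deviations

variance = sum_of_squares / (n - 1)   # sample variance
sd = math.sqrt(variance)              # standard deviation, same units as data

print(f"mean={mean:.2f}  variance={variance:.3f}  SD={sd:.3f}")
print("matches statistics module:",
      math.isclose(variance, statistics.variance(values)))
```

The variance is in squared units, which is why the standard deviation is usually the one reported alongside a mean.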

(Slide 16)

Misc. References

Websites

www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

www.americanscientist.org/issues/pub/population-growth-technology-and-tricky-graphs

Papers Wainer. 1984. How to Display Data Badly. Am. Statistician.

Lecture 4: Displaying Data

brouwern@gmail.com

August 23, 2016
