Statistical Graphics and Inference

The Good, The Bad, and The Misleading


January Academy 2016


Chester Ismay (Office: ETC 223)

cismay@reed.edu

http://blogs.reed.edu/datablog

http://blogs.reed.edu/ed-tech

The Importance of Data Visualization

Describe what you see when analyzing the summary information

datasource x-mean y-mean x-stdev y-stdev correlation-xy
1 9 7.500909 3.316625 2.031568 0.8164205
2 9 7.500909 3.316625 2.031657 0.8162365
3 9 7.500000 3.316625 2.030424 0.8162867
4 9 7.500909 3.316625 2.030578 0.8165214

  • What do you think the plots of x versus y look like for the four datasets?
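These summaries can be checked directly in R, which ships Anscombe's quartet as the built-in `anscombe` data frame (columns x1–x4 and y1–y4). A minimal base-R sketch:

```r
# Reproduce the summary table from the built-in `anscombe` data frame
summaries <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(datasource     = i,
             x_mean         = mean(x),
             y_mean         = mean(y),
             x_stdev        = sd(x),
             y_stdev        = sd(y),
             correlation_xy = cor(x, y))
}))
summaries
```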

Visualizing this data

  • What have you learned from visualizing the data?

The raw Anscombe’s Quartet data

Data is in pairs (x1 goes with y1, x2 goes with y2, etc.)

x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
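A base-R sketch of the four scatterplots, again using the built-in `anscombe` data frame, with the least-squares line overlaid in each panel:

```r
# Plot x versus y for each of the four Anscombe datasets
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, main = paste("Dataset", i))
  abline(lm(y ~ x), col = "blue")   # fitted least-squares line
}
par(op)
```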

Visualizations: The Bad

What’s wrong?

[Figure: bad]

What’s wrong?

[Figure: bad2]

2015 State of the Union Twitter heat map for Lower 48 US

[Figure: sotu]

Source: Glamour News

2010 Population Distribution in the US and Puerto Rico

[Figure: popmap]

Source: Census.gov

The Misleading

[Figure: obama_fox]

Common plots and their uses

Hopefully, The Good

REVIEW: Variable types

  • Quantitative (Numeric)
    • It is sensible to add, subtract, or take averages of the data values
    • Continuous - numerical values in a range (fractions and decimals included)
    • Discrete - numerical values with jumps (usually counts)
  • Qualitative (Categorical)
    • Values designated by different groupings/categories (usually not numeric)

How plots can assist with INFERENTIAL STATISTICS

[Figure: student]

Setting up the hypotheses

  • What’s the null hypothesis (\(H_0\))?
    • The population mean leniency score is the same for smile versus neutral faces
    • \(H_0: \mu_s = \mu_n\) or \(H_0: \mu_s - \mu_n = 0\)
  • What’s the alternative hypothesis (\(H_a\))?
    • The average leniency score is higher for smiling students than it is for students with a neutral facial expression.
    • \(H_a: \mu_s > \mu_n\) or \(H_a: \mu_s - \mu_n > 0\)

Side-by-side Histograms

  • One continuous variable versus one categorical variable
  • Shows how many values are in different bins
  • Questions to ask:
    • What is the shape? (Symmetric, skewed, etc.)
    • How many peaks are there? (Unimodal, bimodal, etc.)
    • How do the values vary?
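A minimal ggplot2 sketch of such side-by-side histograms for the smile/leniency example, assuming a hypothetical data frame `smiles` with a numeric `leniency` column and an `expression` column taking the values "smile" and "neutral":

```r
library(ggplot2)
# `smiles` is a hypothetical data frame with columns `leniency` and `expression`
ggplot(smiles, aes(x = leniency)) +
  geom_histogram(binwidth = 0.5, color = "white") +
  facet_wrap(~ expression, ncol = 1)   # one panel per expression group
```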

Side-by-side Boxplots

  • One continuous variable versus one categorical variable
  • The box spans the first and third quartiles with a line at the median; the whiskers typically extend to the most extreme points within 1.5 × IQR of the box
  • Dots represent outliers
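The corresponding side-by-side boxplots, under the same assumed `smiles` data frame:

```r
library(ggplot2)
# Side-by-side boxplots of leniency by expression group
ggplot(smiles, aes(x = expression, y = leniency)) +
  geom_boxplot()
```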

Do we have reason to believe, based on the distributions of leniency scores over these two expression groups, that there is a significant increase in the mean leniency score for faces that smile compared to neutral faces?

[Figure: sample_popn]

Source: Statistics4u.info

For the Smile Leniency problem

expression count mean sd
neutral 34 4.117647 1.522850
smile 34 4.911765 1.680866
  • What’s the sample?
    • The 68 student pictures showing either smiling or neutral expressions
  • What’s the population?
    • Ideally, ALL students with either smiling or neutral expressions assigned punishment after an infraction
  • We see that the sample mean leniency for those smiling, \(\bar{x}_s\), is greater than the similar measure for those with neutral faces, \(\bar{x}_n\).

  • But is it statistically significantly greater?

Assuming the null hypothesis is true…

  • Recall that we have \(H_0: \mu_s - \mu_n = 0\). What does this mean in layman’s terms?
    • We are assuming that there is no relationship between leniency score and type of facial expression.
  • We want to see if the difference in sample means could be explained by chance or if it is highly unlikely that chance is a good explanation for seeing a difference of that magnitude or larger.

The Chance Process

  • We can use a tactile point of view to explain what “chance” means here:
    • Use \(n_n = 34\) blue index cards corresponding to neutral faces and \(n_s = 34\) red index cards corresponding to faces that smile. What’s next?
    • Write the values of the corresponding leniency score on each of the index cards. What’s next?
    • Put the two stacks of index cards together, creating a new set of 68 cards.

The Chance Process (continued)

  • We can use the index cards to create two new stacks for those smiling and those with neutral faces.
    • First, we must shuffle all the cards thoroughly.
    • After doing so, in this case with equal values of sample sizes, we split the deck in half.
    • We then calculate the new sample mean leniency score of the smiling deck, and also the new sample mean leniency score of the neutral deck.
    • This creates one simulation of the samples.
    • We next want to calculate a statistic from these two samples; a sketch of one such shuffle in code follows below.
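One pass of this shuffling process might look like the following, a minimal sketch assuming the hypothetical `smiles` data frame from earlier:

```r
# Shuffle the 68 "index cards" (expression labels) and recompute the
# difference in sample mean leniency scores
shuffled <- sample(smiles$expression)
sim_diff <- mean(smiles$leniency[shuffled == "smile"]) -
  mean(smiles$leniency[shuffled == "neutral"])
sim_diff
```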

The Chance Process (continued again)

  • We could do this actual shuffling and calculating, but it’s simpler to let the computer do mundane tasks.

  • Recall that the original sample mean difference is \(4.9117647 - 4.1176471 = 0.7941176\).

  • From our simulation we obtain a difference of \(-0.4117647\).

  • What do we do next?
    • More simulations! Repeat this shuffling process, say, 10,000 times and then look at the distribution of the resulting sample mean differences.

  • What’s next?

The p-value!

  • Identify what proportion of the simulated mean differences are as extreme as, or more extreme than, the difference we observed in our original sample.
    • Here this proportion is 0.0244.
  • So only 2.44% of the simulated values are as large as or larger than what we saw with our original sample. Do we have evidence to reject \(H_0: \mu_s - \mu_n = 0\) in favor of \(H_a: \mu_s > \mu_n\)?
    • Yes. The \(p\)-value of 0.0244 is small, meaning a difference this large would rarely arise by chance alone, so the data provide evidence against the null hypothesis and in favor of the alternative.
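Putting the whole shuffling process and the p-value calculation together, a minimal sketch (same hypothetical `smiles` data frame; the exact p-value will vary slightly from run to run):

```r
set.seed(2016)   # arbitrary seed, for reproducibility
obs_diff <- mean(smiles$leniency[smiles$expression == "smile"]) -
  mean(smiles$leniency[smiles$expression == "neutral"])

# Repeat the shuffle 10,000 times, recording the simulated mean difference each time
sim_diffs <- replicate(10000, {
  shuffled <- sample(smiles$expression)
  mean(smiles$leniency[shuffled == "smile"]) -
    mean(smiles$leniency[shuffled == "neutral"])
})

# Simulation-based p-value: proportion of simulated differences at least as
# large as the observed difference
mean(sim_diffs >= obs_diff)
```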

The p-value Visualized
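The original figure is not reproduced here, but a sketch of such a picture, building on `sim_diffs` and `obs_diff` from the simulation above:

```r
# Histogram of the simulated differences with the observed difference marked
hist(sim_diffs, breaks = 50,
     main = "Simulated mean differences under the null",
     xlab = "Difference in sample means (smile - neutral)")
abline(v = obs_diff, col = "red", lwd = 2)
```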

But I thought this was a \(t\)-test problem?

  • For the \(t\)-test, the test statistic is \(t = \dfrac{\bar{x}_s - \bar{x}_n}{\sqrt{\dfrac{{s_s}^2}{n_s} + \dfrac{{s_n}^2}{n_n}}}\).

  • So if we divided all of our simulated statistics by \(\sqrt{\frac{{s_s}^2}{n_s} + \frac{{s_n}^2}{n_n}}\), we would be on the same scale as the \(t\) statistic.

  • What would the histogram of transformed values and corresponding \(t\) curve look like plotted together?
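A sketch of that comparison, reusing `sim_diffs` from the simulation above (the degrees of freedom are taken from the Welch t-test output on the next slide):

```r
# Rescale the simulated differences to the t scale and overlay the t curve
n_s <- sum(smiles$expression == "smile")
n_n <- sum(smiles$expression == "neutral")
s_s <- sd(smiles$leniency[smiles$expression == "smile"])
s_n <- sd(smiles$leniency[smiles$expression == "neutral"])
se  <- sqrt(s_s^2 / n_s + s_n^2 / n_n)

hist(sim_diffs / se, freq = FALSE, breaks = 50,
     main = "Simulated statistics on the t scale")
curve(dt(x, df = 65.367), add = TRUE, col = "blue", lwd = 2)
```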

Results of t.test

## 
##  Welch Two Sample t-test
## 
## data:  leniency by expression
## t = 2.0415, df = 65.367, p-value = 0.02262
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1451043       Inf
## sample estimates:
##   mean in group smile mean in group neutral 
##              4.911765              4.117647
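Output of this form could come from a call along the following lines; note that the `smiles` data frame is hypothetical, and the `expression` factor is assumed to be releveled so that the smile group comes first:

```r
# Hypothetical call; the data frame name and the releveling are assumptions
smiles$expression <- factor(smiles$expression, levels = c("smile", "neutral"))
t.test(leniency ~ expression, data = smiles, alternative = "greater")
```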

Tying it together

  • The \(t\) distribution and also the normal distribution were developed to approximate the “by chance” distribution.

  • When they were developed, scientists/statisticians couldn’t replicate the simulation 10,000 times like I have here.

  • That’s why the t.test comes with lots of assumptions:
    • Sample sizes must be larger than 30
    • If sample sizes are not large enough, we need to assume the population distributions are normal.
      • That’s a BIG assumption!

Another example of INFERENTIAL STATISTICS and Plotting


Does spending extra time on non-academic activities hinder academic performance?

Scatterplots

  • Two continuous variables
  • Questions to ask:
    • Pattern - Positive/No/Negative relation
    • Form - Linear/Non-linear
    • Strength - How closely do the points follow the fitted line?
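A sketch of such a scatterplot with a fitted line, assuming a hypothetical data frame `survey` with a numeric `hours` column (non-academic hours) and a `gpa` column:

```r
library(ggplot2)
# `survey` is a hypothetical data frame with columns `hours` and `gpa`
ggplot(survey, aes(x = hours, y = gpa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)   # least-squares line
```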

How was this “line of best fit” obtained?

Least squares estimation

  • Given paired data \((x_1, y_1)\), \((x_2, y_2)\), \(\cdots\), \((x_n, y_n)\), we model via the equation \(y_j = \beta_0 + \beta_1 x_j + \varepsilon_j\).
  • We want to minimize the sum of squared error terms (\(\varepsilon_j\)). Since \(\varepsilon_j = y_j - \beta_0 - \beta_1 x_j\), we minimize (using partial derivatives from calculus) \[ \sum_{j=1}^n (y_j - \beta_0 - \beta_1 x_j)^2. \]
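Setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) equal to zero yields the familiar closed-form estimates

\[ \hat{\beta}_1 = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]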

Statistical software to the rescue!

The equation of the fitted line is

\[ \widehat{gpa} = 3.598 - 0.006 \times hours \]
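A fit along these lines (again with the hypothetical `survey` data frame) would produce an equation of that form:

```r
# Fit the least-squares line in R
fit <- lm(gpa ~ hours, data = survey)
coef(fit)       # intercept and slope, as in the equation above
summary(fit)    # also reports a t-test for the slope
```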

So… is hours spent doing non-academic things a SIGNIFICANT NEGATIVE predictor of gpa?

  • \(H_0 : \beta_1 = 0\) versus \(H_a: \beta_1 < 0\)

  • Our sample slope is -0.006. Is this far enough from zero to conclude that this observed slope is significant?

  • We can do the same sort of resampling here by shuffling all of the response variable values and assigning them to each of the values of the explanatory variable.

More shuffling!

  • We assume that hours and gpa are not related.
  • Any \(x\) value could match up with any \(y\) value.
  • We shuffle the values of gpa and assign them to values of hours.
  • We calculate the simulated slope statistic.
  • We then repeat this process, say, 20,000 times and see where our observed slope statistic \(\hat{\beta}_1\) falls in the resulting distribution (a sketch in code follows below).
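A minimal sketch of that shuffling, again assuming the hypothetical `survey` data frame (the exact p-value will vary from run to run):

```r
set.seed(2016)   # arbitrary seed, for reproducibility
obs_slope <- coef(lm(gpa ~ hours, data = survey))["hours"]

# Shuffle the gpa values, refit the line, and collect the simulated slopes
sim_slopes <- replicate(20000, {
  shuffled_gpa <- sample(survey$gpa)
  coef(lm(shuffled_gpa ~ survey$hours))[2]
})

# One-sided p-value for H_a: beta_1 < 0
mean(sim_slopes <= obs_slope)
```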

  • The \(p\)-value is 0.0323, so we reject the null hypothesis at the conventional 0.05 significance level.
  • Based on this sample of university students, we have evidence to conclude that hours spent doing non-academic things is a significant negative predictor of gpa.
  • A similar transformation shows that these results can be well-approximated by a \(t\) distribution.

Data Visualization Examples on the Web

What can I help you with?

  • Data analysis
  • Data wrangling/cleaning
  • Data visualization
  • Data tidying/manipulating
  • Reproducible research

When am I available?

  • Email me at cismay@reed.edu or chester.ismay@reed.edu
  • Tentative Spring 2016 office (ETC 223) hours
    • Mondays (10 AM to 11 AM)
    • Tuesdays (2 PM to 3 PM)
    • Wednesdays (1:30 PM to 2:30 PM)
    • Fridays (1:30 PM to 2:30 PM)
  • Sometimes available for virtual office hours via Google Hangouts (email me for details)

Thanks!

cismay@reed.edu



Slides available at http://rpubs.com/cismay/ja_2016

Code for slide creation on my GitHub page