Statistical Graphics and Inference

The Good, The Bad, and The Misleading


January Academy 2016


Chester Ismay (Office: ETC 223)

cismay@reed.edu

http://blogs.reed.edu/datablog

http://blogs.reed.edu/ed-tech

The Importance of Data Visualization

Describe what you see when analyzing the summary information

datasource x-mean y-mean x-stdev y-stdev correlation-xy
1 9 7.500909 3.316625 2.031568 0.8164205
2 9 7.500909 3.316625 2.031657 0.8162365
3 9 7.500000 3.316625 2.030424 0.8162867
4 9 7.500909 3.316625 2.030578 0.8165214

  • What do you think the plots of x versus y look like for the four datasets?
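These summaries can be checked directly in R, which ships Anscombe's quartet as the built-in `anscombe` data frame (columns x1–x4 and y1–y4). A minimal base-R sketch:

```r
# Reproduce the summary table from the built-in `anscombe` data frame
summaries <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(datasource     = i,
             x_mean         = mean(x),
             y_mean         = mean(y),
             x_stdev        = sd(x),
             y_stdev        = sd(y),
             correlation_xy = cor(x, y))
}))
summaries
```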

Visualizing this data

  • What have you learned from visualizing the data?

The raw Anscombe’s Quartet data

Data is in pairs (x1 goes with y1, x2 goes with y2, etc.)

x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
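A base-R sketch of the four scatterplots, again using the built-in `anscombe` data frame, with the least-squares line overlaid in each panel:

```r
# Plot x versus y for each of the four Anscombe datasets
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, main = paste("Dataset", i))
  abline(lm(y ~ x), col = "blue")   # fitted least-squares line
}
par(op)
```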

Visualizations: The Bad

What’s wrong?

[Figure: bad]

What’s wrong?

[Figure: bad2]

2015 State of the Union Twitter heat map for Lower 48 US

[Figure: sotu]

Source: Glamour News

2010 Population Distribution in the US and Puerto Rico

[Figure: popmap]

Source: Census.gov

The Misleading

[Figure: obama_fox]

Common plots and their uses

Hopefully, The Good

REVIEW: Variable types

  • Quantitative (Numeric)
    • It is sensible to add, subtract, or take averages of the data values
    • Continuous - numerical values in a range (fractions and decimals included)
    • Discrete - numerical values with jumps (usually counts)
  • Qualitative (Categorical)
    • Values designated by different groupings/categories (usually not numeric)

How plots can assist with INFERENTIAL STATISTICS

[Figure: student]

Setting up the hypotheses

  • What’s the null hypothesis (\(H_0\))?
    • The population mean leniency score is the same for smile versus neutral faces
    • \(H_0: \mu_s = \mu_n\) or \(H_0: \mu_s - \mu_n = 0\)
  • What’s the alternative hypothesis (\(H_a\))?
    • The average leniency score is higher for smiling students than it is for students with a neutral facial expression.
    • \(H_a: \mu_s > \mu_n\) or \(H_a: \mu_s - \mu_n > 0\)

Side-by-side Histograms

  • One continuous variable versus one categorical variable
  • Shows how many values are in different bins
  • Questions to ask:
    • What is the shape? (Symmetric, skewed, etc.)
    • How many peaks are there? (Unimodal, bimodal, etc.)
    • How do the values vary?
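A minimal ggplot2 sketch of such side-by-side histograms for the smile/leniency example, assuming a hypothetical data frame `smiles` with a numeric `leniency` column and an `expression` column taking the values "smile" and "neutral":

```r
library(ggplot2)
# `smiles` is a hypothetical data frame with columns `leniency` and `expression`
ggplot(smiles, aes(x = leniency)) +
  geom_histogram(binwidth = 0.5, color = "white") +
  facet_wrap(~ expression, ncol = 1)   # one panel per expression group
```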

Side-by-side Boxplots

  • One continuous variable versus one categorical variable
  • The box spans the first and third quartiles with a line at the median; the whiskers typically extend to the most extreme points within 1.5 × IQR of the box
  • Dots represent outliers
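The corresponding side-by-side boxplots, under the same assumed `smiles` data frame:

```r
library(ggplot2)
# Side-by-side boxplots of leniency by expression group
ggplot(smiles, aes(x = expression, y = leniency)) +
  geom_boxplot()
```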

Do we have reason to believe, based on the distributions of leniency scores over these two expression groups, that there is a significant increase in the mean leniency score for faces that smile compared to neutral faces?

[Figure: sample_popn]

Source: Statistics4u.info

For the Smile Leniency problem

expression count mean sd
neutral 34 4.117647 1.522850
smile 34 4.911765 1.680866
  • What’s the sample?
    • The 68 student pictures showing either smiling or neutral expressions
  • What’s the population?
    • Ideally, ALL students with either smiling or neutral expressions assigned punishment after an infraction
  • We see that the sample mean leniency for those smiling, \(\bar{x}_s\), is greater than the similar measure for those with neutral faces, \(\bar{x}_n\).

  • But is it statistically significantly greater?

Assuming the null hypothesis is true…

  • Recall that we have \(H_0: \mu_s - \mu_n = 0\). What does this mean in layman’s terms?
    • We are assuming that there is no relationship between leniency score and type of facial expression.
  • We want to see if the difference in sample means could be explained by chance or if it is highly unlikely that chance is a good explanation for seeing a difference of that magnitude or larger.

The Chance Process

  • We can use a tactile point of view to explain what “chance” means here:
    • Use \(n_n = 34\) blue index cards corresponding to neutral faces and \(n_s = 34\) red index cards corresponding to faces that smile. What’s next?
    • Write the values of the corresponding leniency score on each of the index cards. What’s next?
    • Put the two stacks of index cards together, creating a new set of 68 cards.

The Chance Process (continued)

  • We can use the index cards to create two new stacks for those smiling and those with neutral faces.
    • First, we must shuffle all the cards thoroughly.
    • After doing so, in this case with equal values of sample sizes, we split the deck in half.
    • We then calculate the new sample mean leniency score of the smiling deck, and also the new sample mean leniency score of the neutral deck.
    • This creates one simulation of the samples.
    • We next want to calculate a statistic from these two samples; a sketch of one such shuffle in code follows below.
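One pass of this shuffling process might look like the following, a minimal sketch assuming the hypothetical `smiles` data frame from earlier:

```r
# Shuffle the 68 "index cards" (expression labels) and recompute the
# difference in sample mean leniency scores
shuffled <- sample(smiles$expression)
sim_diff <- mean(smiles$leniency[shuffled == "smile"]) -
  mean(smiles$leniency[shuffled == "neutral"])
sim_diff
```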

The Chance Process (continued again)

  • We could do this actual shuffling and calculating, but it’s simpler to let the computer do mundane tasks.

  • Recall that the original sample mean difference is \(4.9117647 - 4.1176471 = 0.7941176\).

  • From our simulation we obtain a difference of \(-0.4117647\).

  • What do we do next?
    • More simulations! Repeat this shuffling process, say, 10,000 times and then look at the distribution of the resulting sample mean differences.

  • What’s next?

The p-value!

  • Identify what proportion of the simulated mean differences are as extreme as, or more extreme than, the difference we observed in our original sample.
    • Here this proportion is 0.0244.
  • So only 2.44% of the simulated values are as large as or larger than what we saw with our original sample. Do we have evidence to reject \(H_0: \mu_s - \mu_n = 0\) in favor of \(H_a: \mu_s > \mu_n\)?
    • Yes. The \(p\)-value of 0.0244 is small, meaning a difference this large would rarely arise by chance alone, so the data provide evidence against the null hypothesis and in favor of the alternative.
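Putting the whole shuffling process and the p-value calculation together, a minimal sketch (same hypothetical `smiles` data frame; the exact p-value will vary slightly from run to run):

```r
set.seed(2016)   # arbitrary seed, for reproducibility
obs_diff <- mean(smiles$leniency[smiles$expression == "smile"]) -
  mean(smiles$leniency[smiles$expression == "neutral"])

# Repeat the shuffle 10,000 times, recording the simulated mean difference each time
sim_diffs <- replicate(10000, {
  shuffled <- sample(smiles$expression)
  mean(smiles$leniency[shuffled == "smile"]) -
    mean(smiles$leniency[shuffled == "neutral"])
})

# Simulation-based p-value: proportion of simulated differences at least as
# large as the observed difference
mean(sim_diffs >= obs_diff)
```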

The p-value Visualized
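The original figure is not reproduced here, but a sketch of such a picture, building on `sim_diffs` and `obs_diff` from the simulation above:

```r
# Histogram of the simulated differences with the observed difference marked
hist(sim_diffs, breaks = 50,
     main = "Simulated mean differences under the null",
     xlab = "Difference in sample means (smile - neutral)")
abline(v = obs_diff, col = "red", lwd = 2)
```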

But I thought this was a \(t\)-test problem?

  • For the \(t\)-test, the test statistic is \(t = \dfrac{\bar{x}_s - \bar{x}_n}{\sqrt{\dfrac{{s_s}^2}{n_s} + \dfrac{{s_n}^2}{n_n}}}\).

  • So if we divided all of our simulated statistics by \(\sqrt{\frac{{s_s}^2}{n_s} + \frac{{s_n}^2}{n_n}}\), we would be on the same scale as the \(t\) statistic.

  • What would the histogram of transformed values and corresponding \(t\) curve look like plotted together?
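A sketch of that comparison, reusing `sim_diffs` from the simulation above (the degrees of freedom are taken from the Welch t-test output on the next slide):

```r
# Rescale the simulated differences to the t scale and overlay the t curve
n_s <- sum(smiles$expression == "smile")
n_n <- sum(smiles$expression == "neutral")
s_s <- sd(smiles$leniency[smiles$expression == "smile"])
s_n <- sd(smiles$leniency[smiles$expression == "neutral"])
se  <- sqrt(s_s^2 / n_s + s_n^2 / n_n)

hist(sim_diffs / se, freq = FALSE, breaks = 50,
     main = "Simulated statistics on the t scale")
curve(dt(x, df = 65.367), add = TRUE, col = "blue", lwd = 2)
```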

Results of t.test

## 
##  Welch Two Sample t-test
## 
## data:  leniency by expression
## t = 2.0415, df = 65.367, p-value = 0.02262
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1451043       Inf
## sample estimates:
##   mean in group smile mean in group neutral 
##              4.911765              4.117647
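Output of this form could come from a call along the following lines; note that the `smiles` data frame is hypothetical, and the `expression` factor is assumed to be releveled so that the smile group comes first:

```r
# Hypothetical call; the data frame name and the releveling are assumptions
smiles$expression <- factor(smiles$expression, levels = c("smile", "neutral"))
t.test(leniency ~ expression, data = smiles, alternative = "greater")
```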

Tying it together

  • The \(t\) distribution and also the normal distribution were developed to approximate the “by chance” distribution.

  • When they were developed, scientists/statisticians couldn’t replicate the simulation 10,000 times like I have here.

  • That’s why the t.test comes with lots of assumptions:
    • Sample sizes must be larger than 30
    • If sample sizes are not large enough, we need to assume the population distributions are normal.
      • That’s a BIG assumption!

Another example of INFERENTIAL STATISTICS and Plotting


Does spending extra time on non-academic activities hinder academic performance?

Scatterplots

  • Two continuous variables
  • Questions to ask:
    • Pattern - Positive/No/Negative relation
    • Form - Linear/Non-linear
    • Strength - How closely do the points follow the fitted line?
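A sketch of such a scatterplot with a fitted line, assuming a hypothetical data frame `survey` with a numeric `hours` column (non-academic hours) and a `gpa` column:

```r
library(ggplot2)
# `survey` is a hypothetical data frame with columns `hours` and `gpa`
ggplot(survey, aes(x = hours, y = gpa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)   # least-squares line
```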

How was this “line of best fit” obtained?

Least squares estimation

  • Given paired data \((x_1, y_1)\), \((x_2, y_2)\), \(\cdots\), \((x_n, y_n)\), we model via the equation \(y_j = \beta_0 + \beta_1 x_j + \varepsilon_j\).
  • We want to minimize the sum of squared error terms (\(\varepsilon_j\)). Since \(\varepsilon_j = y_j - \beta_0 - \beta_1 x_j\), we minimize (using partial derivatives from calculus) \[ \sum_{j=1}^n (y_j - \beta_0 - \beta_1 x_j)^2. \]
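Setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) equal to zero yields the familiar closed-form estimates

\[ \hat{\beta}_1 = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]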

Statistical software to the rescue!

The equation of the fitted line is

\[ \widehat{gpa} = 3.598 - 0.006 \times hours \]
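A fit along these lines (again with the hypothetical `survey` data frame) would produce an equation of that form:

```r
# Fit the least-squares line in R
fit <- lm(gpa ~ hours, data = survey)
coef(fit)       # intercept and slope, as in the equation above
summary(fit)    # also reports a t-test for the slope
```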

So… is hours spent doing non-academic things a SIGNIFICANT NEGATIVE predictor of gpa?

  • \(H_0 : \beta_1 = 0\) versus \(H_a: \beta_1 < 0\)

  • Our sample slope is -0.006. Is this far enough from zero to conclude that this observed slope is significant?

  • We can do the same sort of resampling here by shuffling all of the response variable values and assigning them to each of the values of the explanatory variable.

More shuffling!

  • We assume that hours and gpa are not related.
  • Any \(x\) value could match up with any \(y\) value.
  • We shuffle the values of gpa and assign them to values of hours.
  • We calculate the simulated slope statistic.
  • We then repeat this process, say, 20,000 times and see where our observed slope statistic \(\hat{\beta}_1\) falls in the resulting distribution (a sketch in code follows below).
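A minimal sketch of that shuffling, again assuming the hypothetical `survey` data frame (the exact p-value will vary from run to run):

```r
set.seed(2016)   # arbitrary seed, for reproducibility
obs_slope <- coef(lm(gpa ~ hours, data = survey))["hours"]

# Shuffle the gpa values, refit the line, and collect the simulated slopes
sim_slopes <- replicate(20000, {
  shuffled_gpa <- sample(survey$gpa)
  coef(lm(shuffled_gpa ~ survey$hours))[2]
})

# One-sided p-value for H_a: beta_1 < 0
mean(sim_slopes <= obs_slope)
```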

  • The \(p\)-value is 0.0323, so we reject the null hypothesis at the conventional 0.05 significance level.
  • Based on this sample of university students, we have evidence to conclude that hours spent doing non-academic things is a significant negative predictor of gpa.
  • A similar transformation shows that these results can be well-approximated by a \(t\) distribution.

Data Visualization Examples on the Web

What can I help you with?

  • Data analysis
  • Data wrangling/cleaning
  • Data visualization
  • Data tidying/manipulating
  • Reproducible research

When am I available?

  • Email me at cismay@reed.edu or chester.ismay@reed.edu
  • Tentative Spring 2016 office (ETC 223) hours
    • Mondays (10 AM to 11 AM)
    • Tuesdays (2 PM to 3 PM)
    • Wednesdays (1:30 PM to 2:30 PM)
    • Fridays (1:30 PM to 2:30 PM)
  • Sometimes available for virtual office hours via Google Hangouts (email me for details)

Thanks!

cismay@reed.edu



Slides available at http://rpubs.com/cismay/ja_2016

Code for slide creation on my GitHub page