The Importance of Data Visualization
| datasource | x-mean | y-mean   | x-stdev  | y-stdev  | correlation-xy |
|-----------:|-------:|---------:|---------:|---------:|---------------:|
| 1          | 9      | 7.500909 | 3.316625 | 2.031568 | 0.8164205      |
| 2          | 9      | 7.500909 | 3.316625 | 2.031657 | 0.8162365      |
| 3          | 9      | 7.500000 | 3.316625 | 2.030424 | 0.8162867      |
| 4          | 9      | 7.500909 | 3.316625 | 2.030578 | 0.8165214      |
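These summary statistics can be recomputed directly, since the quartet ships with base R as the built-in `anscombe` data frame; a minimal sketch:

```r
# Recompute the table above from R's built-in `anscombe` data frame,
# which stores the quartet in columns x1-x4 and y1-y4.
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(x_mean = mean(x), y_mean = mean(y),
    x_stdev = sd(x), y_stdev = sd(y),
    correlation_xy = cor(x, y))
})
```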
- What do you think the plots of x versus y look like for the four datasets?
Visualizing this data
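A minimal base R sketch that draws the four x-versus-y panels from the built-in `anscombe` data frame:

```r
# Plot each of the four datasets in a 2 x 2 grid of panels.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Dataset", i))
}
par(op)  # restore the original plotting layout
```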

- What have you learned from visualizing the data?
The raw Anscombe’s Quartet data
Data is in pairs (x1 goes with y1, x2 goes with y2, etc.)
| x1 | y1    | x2 | y2   | x3 | y3    | x4 | y4    |
|---:|------:|---:|-----:|---:|------:|---:|------:|
| 10 | 8.04  | 10 | 9.14 | 10 | 7.46  | 8  | 6.58  |
| 8  | 6.95  | 8  | 8.14 | 8  | 6.77  | 8  | 5.76  |
| 13 | 7.58  | 13 | 8.74 | 13 | 12.74 | 8  | 7.71  |
| 9  | 8.81  | 9  | 8.77 | 9  | 7.11  | 8  | 8.84  |
| 11 | 8.33  | 11 | 9.26 | 11 | 7.81  | 8  | 8.47  |
| 14 | 9.96  | 14 | 8.10 | 14 | 8.84  | 8  | 7.04  |
| 6  | 7.24  | 6  | 6.13 | 6  | 6.08  | 8  | 5.25  |
| 4  | 4.26  | 4  | 3.10 | 4  | 5.39  | 19 | 12.50 |
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15  | 8  | 5.56  |
| 7  | 4.82  | 7  | 7.26 | 7  | 6.42  | 8  | 7.91  |
| 5  | 5.68  | 5  | 4.74 | 5  | 5.73  | 8  | 6.89  |
What’s wrong?

What’s wrong?

2015 State of the Union Twitter heat map for Lower 48 US

Source: Glamour News
2010 Population Distribution in the US and Puerto Rico

Source: Census.gov
Common plots and their uses
Hopefully, The Good
REVIEW: Variable types
- Quantitative (Numeric)
  - It is sensible to add, subtract, or take averages of the data values
  - Continuous: numerical values in a range (fractions and decimals included)
  - Discrete: numerical values with jumps (usually counts)
- Qualitative (Categorical)
  - Values designated by different groupings/categories (usually not numeric)
How plots can assist with INFERENTIAL STATISTICS
Setting up the hypotheses
- What’s the null hypothesis (\(H_0\))?
  - The population mean leniency score is the same for smile versus neutral faces
  - \(H_0: \mu_s = \mu_n\) or \(H_0: \mu_s - \mu_n = 0\)
- What’s the alternative hypothesis (\(H_a\))?
  - The average leniency score is higher for smiling students than it is for students with a neutral facial expression.
  - \(H_a: \mu_s > \mu_n\) or \(H_a: \mu_s - \mu_n > 0\)
Side-by-side Histograms

- One continuous variable versus one categorical variable
- Shows how many values are in different bins
- Questions to ask:
- What is the shape? (Symmetric, skewed, etc.)
- How many peaks are there? (Unimodal, bimodal, etc.)
- How do the values vary?
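A minimal ggplot2 sketch of faceted (side-by-side) histograms, assuming the leniency data lives in a data frame named `survey` (a hypothetical name) with columns `leniency` and `expression`, matching the variables in the t.test output later in these slides:

```r
# Hypothetical data frame `survey` with columns `leniency` (numeric)
# and `expression` ("smile" or "neutral").
library(ggplot2)

ggplot(survey, aes(x = leniency)) +
  geom_histogram(binwidth = 0.5, color = "white") +
  facet_wrap(~ expression)  # one histogram panel per facial expression
```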
Side-by-side Boxplots

- One continuous variable versus one categorical variable
- The lines of the box mark the quartiles (25th, 50th, and 75th percentiles)
- Dots represent outliers
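The corresponding side-by-side boxplots, using the same hypothetical `survey` data frame:

```r
library(ggplot2)

# Boxes mark the quartiles; points beyond the whiskers are plotted as outliers.
ggplot(survey, aes(x = expression, y = leniency)) +
  geom_boxplot()
```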
Do we have reason to believe, based on the distributions of leniency scores over these two expression groups, that there is a significant increase in the mean leniency score for faces that smile compared to neutral faces?
For the Smile Leniency problem
| expression | n  | mean     | sd       |
|------------|---:|---------:|---------:|
| neutral    | 34 | 4.117647 | 1.522850 |
| smile      | 34 | 4.911765 | 1.680866 |
- What’s the sample?
  - The 68 student pictures showing either smiling or neutral expressions
- What’s the population?
  - Ideally, ALL students with either smiling or neutral expressions who are assigned punishment after an infraction
We see that the sample mean leniency for those smiling, \(\bar{x}_s\), is greater than the similar measure for those with neutral faces, \(\bar{x}_n\).
But is it statistically significantly greater?
Assuming the null hypothesis is true…
- Recall that we have \(H_0: \mu_s - \mu_n = 0\). What does this mean in layman’s terms?
- We are assuming that there is no relationship between leniency score and type of facial expression.
- We want to see if the difference in sample means could be explained by chance or if it is highly unlikely that chance is a good explanation for seeing a difference of that magnitude or larger.
The Chance Process
- We can use a tactile point of view to explain what “chance” means here:
- Use \(n_n = 34\) blue index cards corresponding to neutral faces and \(n_s = 34\) red index cards corresponding to faces that smile. What’s next?
- Write the corresponding leniency score on each of the index cards. What’s next?
- Put the two stacks of index cards together, creating a new set of 68 cards.
The Chance Process (continued)
- We can use the index cards to create two new stacks for those smiling and those with neutral faces.
- First, we must shuffle all the cards thoroughly.
- After doing so, since the two sample sizes are equal in this case, we simply split the deck in half.
- We then calculate the new sample mean leniency score of the smiling deck, and also the new sample mean leniency score of the neutral deck.
- This creates one simulation of the samples.
- We next want to calculate a statistic from these two samples.
The Chance Process (continued again)
We could do this actual shuffling and calculating, but it’s simpler to let the computer do mundane tasks.
Recall that the original sample mean difference is \(4.9117647 - 4.1176471 = 0.7941176\).
From our simulation we obtain a difference of \(-0.4117647\).
- What do we do next?
- More simulations! Repeat this shuffling process, say, 10,000 times and then look at the distribution of the resulting sample mean differences.
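A minimal sketch of this shuffling simulation in R, again assuming the hypothetical `survey` data frame with columns `leniency` and `expression`:

```r
set.seed(2016)  # for reproducibility

# Observed difference in sample means: smile minus neutral
observed_diff <- mean(survey$leniency[survey$expression == "smile"]) -
  mean(survey$leniency[survey$expression == "neutral"])

# Shuffle the leniency scores 10,000 times, each time recomputing the
# difference in group means under the "no relationship" assumption
sim_diffs <- replicate(10000, {
  shuffled <- sample(survey$leniency)
  mean(shuffled[survey$expression == "smile"]) -
    mean(shuffled[survey$expression == "neutral"])
})

# Proportion of simulated differences at or above the observed difference
mean(sim_diffs >= observed_diff)
```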
The p-value!
- Identify how many simulated mean differences are as extreme as or more extreme than what we witnessed with our original sample mean difference.
- Here this value is 0.0244.
- So only 2.44% of the simulated values are at or greater than what we saw with our original sample. Do we have evidence to reject \(H_0: \mu_s - \mu_n = 0\) in favor of \(H_a: \mu_s > \mu_n\)?
- Yes. Our \(p\)-value of 0.0244 is a small proportion, so the data provide little support for the null hypothesis and we reject it in favor of \(H_a\).
The p-value Visualized

But I thought this was a \(t\)-test problem?
For the \(t\)-test, the test statistic is \(t = \dfrac{\bar{x}_s - \bar{x}_n}{\sqrt{\dfrac{{s_s}^2}{n_s} + \dfrac{{s_n}^2}{n_n}}}\).
So if we divided all of our simulated statistics by \({\sqrt{\frac{{s_s}^2}{n_s} + \frac{{s_n}^2}{n_n}}}\), we would be on the same scale as the \(t\).
What would the histogram of transformed values and corresponding \(t\) curve look like plotted together?
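One way to see this, continuing from the simulation sketch above (and using the degrees of freedom reported by the Welch t-test output below):

```r
# Standard error of the difference in means, from the two sample sds and sizes
s_s <- sd(survey$leniency[survey$expression == "smile"])
s_n <- sd(survey$leniency[survey$expression == "neutral"])
n_s <- sum(survey$expression == "smile")
n_n <- sum(survey$expression == "neutral")
se <- sqrt(s_s^2 / n_s + s_n^2 / n_n)

t_sims <- sim_diffs / se  # simulated statistics rescaled to the t scale

hist(t_sims, freq = FALSE, breaks = 50,
     main = "Simulated statistics and the t curve", xlab = "t")
curve(dt(x, df = 65.367), add = TRUE, lwd = 2)  # df from the Welch t-test
```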
Results of t.test
##
## Welch Two Sample t-test
##
## data: leniency by expression
## t = 2.0415, df = 65.367, p-value = 0.02262
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.1451043 Inf
## sample estimates:
## mean in group smile mean in group neutral
## 4.911765 4.117647
Tying it together
The \(t\) distribution and also the normal distribution were developed to approximate the “by chance” distribution.
When they were developed, scientists/statisticians couldn’t replicate the simulation 10,000 times like I have here.
- That’s why the t.test comes with lots of assumptions:
  - Sample sizes must be larger than 30
  - If sample sizes are not large enough, we need to assume the population distributions are normal.
  - That’s a BIG assumption!
Another example of INFERENTIAL STATISTICS and Plotting
Scatterplots

- Two continuous variables
- Questions to ask:
- Pattern - Positive/No/Negative relation
- Form - Linear/Non-linear
- Strength - Closely matches fitted line?
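A minimal ggplot2 sketch of such a scatterplot with a fitted line, assuming a hypothetical data frame `students` with columns `hours` and `gpa`:

```r
library(ggplot2)

# Scatterplot of gpa versus hours with the least squares line overlaid
ggplot(students, aes(x = hours, y = gpa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```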
How was this “line of best fit” obtained?

Least squares estimation
- Given paired data \((x_1, y_1)\), \((x_2, y_2)\), \(\cdots\), \((x_n, y_n)\), we model via the equation \(y_j = \beta_0 + \beta_1 x_j + \varepsilon_j\).
- We want to minimize the sum of squared error terms (\(\varepsilon_j\)). Substituting \(\varepsilon_j = y_j - \beta_0 - \beta_1 x_j\), we minimize (using partial derivatives from calculus) \[
\sum_{j=1}^n (y_j - \beta_0 - \beta_1 x_j)^2.
\]
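Setting those partial derivatives to zero gives the familiar closed-form least squares estimates:
\[
\hat{\beta}_1 = \frac{\sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^n (x_j - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
\]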
Statistical software to the rescue!

The equation of the fitted line is
\[
\hat{gpa} = 3.598 - 0.006 \times hours
\]
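In R, this fit can be obtained with `lm()`; a sketch using the hypothetical `students` data frame from above:

```r
gpa_fit <- lm(gpa ~ hours, data = students)
coef(gpa_fit)     # intercept and slope (about 3.598 and -0.006 here)
summary(gpa_fit)  # adds standard errors, t statistics, and p-values
```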
So…Is hours doing non-academic things a SIGNIFICANT NEGATIVE predictor of gpa?
\(H_0 : \beta_1 = 0\) versus \(H_a: \beta_1 < 0\)
Our sample slope is -0.006. Is this far enough from zero to conclude that this observed slope is significant?
We can do the same sort of resampling here by shuffling all of the response variable values and assigning them to each of the values of the explanatory variable.
More shuffling!
- We assume that hours and gpa are not related.
- Any \(x\) value could match up with any \(y\) value.
- We shuffle the values of gpa and assign them to values of hours.
- We calculate the simulated slope statistic.
- We then repeat this process, say, 20,000 times and then see where our observed slope statistic \(\hat{\beta}_1\) falls on that distribution (a code sketch follows below).
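A minimal sketch of this slope-shuffling simulation, again with the hypothetical `students` data frame:

```r
set.seed(2016)  # for reproducibility

observed_slope <- coef(lm(gpa ~ hours, data = students))["hours"]

# Shuffle the gpa values 20,000 times, refitting the line each time to get
# the distribution of slopes when hours and gpa are unrelated
sim_slopes <- replicate(20000, {
  shuffled_gpa <- sample(students$gpa)
  coef(lm(shuffled_gpa ~ students$hours))[2]
})

# One-sided p-value: proportion of simulated slopes at or below the observed slope
mean(sim_slopes <= observed_slope)
```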

- The \(p\)-value is 0.0323 so we reject the null hypothesis.
- Based on this sample of university students, we have evidence to conclude that hours doing non-academic things is a significant negative predictor of gpa.
- A similar transformation shows that these results can be well-approximated by a \(t\) distribution.
Data Visualization Examples on the Web
What can I help you with?
- Data analysis
- Data wrangling/cleaning
- Data visualization
- Data tidying/manipulating
- Reproducible research
When am I available?
- Email me at cismay@reed.edu or chester.ismay@reed.edu
- Tentative Spring 2016 office (ETC 223) hours
- Mondays (10 AM to 11 AM)
- Tuesdays (2 PM to 3 PM)
- Wednesdays (1:30 PM to 2:30 PM)
- Fridays (1:30 PM to 2:30 PM)
- Sometimes available for virtual office hours via Google Hangouts (email me for details)