February 4, 2013 Class Notes

Announcements

Questions about height

It's not silly to ask, “How tall are you?”

But it is silly to ask, “Why are you the height that you are?”

Or, when comparing two people, “Why are your heights different in the way they are?”

Why are these questions silly?

The statistical question is

What accounts for the variation of height from person to person in a group of people?

Some possible factors:

There are still lots of potential theories that might account for height in terms of these factors, but with a large group the possible theories will be severely constrained by the data.

Task for today

Ways to describe variation and, particularly, ways to quantify it.

g = fetchData("Galton")
## Data Galton found in package.

On Friday, we looked at various graphical displays

Explain the density plot by reference to the dotplot of points at the bottom of the figure. The height of the graph shows how dense the points are.

More technically, imagine a graph, the cumulative distribution function, showing the fraction of cases that fall at or below a given value versus that given value. This will be an upward-stepping graph, like this:

plot(ecdf(g$height))  # not a command the students need to know

[Plot: empirical cumulative distribution function of g$height]

The density plot is the derivative of this graph. Note that if you integrate the derivative from \( -\infty \) to \( \infty \), you'll get 1, the total increase in the original function. In other words, the area under the entire density curve is 1. That's what determines the units on the vertical axis of the density curve: fraction of cases per unit of the variable (here, per inch of height).
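As a rough check on the claim that the area under a density curve is 1, here is a minimal sketch. It uses a small made-up vector of heights (hypothetical values, not the Galton data) so that it runs without any package. `ecdf()` and `density()` are base R; the trapezoid sum is just one simple way to approximate the integral.

```r
# Hypothetical heights in inches (made up for illustration; not the Galton data)
heights <- c(60, 62.5, 63, 64, 64.5, 65, 66, 66, 67, 68, 68.5, 69, 70, 71, 73)

# Empirical cumulative distribution: fraction of cases at or below a value
Fn <- ecdf(heights)
frac_at_or_below_66 <- Fn(66)

# Kernel density estimate: a smoothed curve showing how dense the points are
d <- density(heights)

# Approximate the area under the density curve with the trapezoid rule
n_pts <- length(d$y)
area <- sum(diff(d$x) * (d$y[-1] + d$y[-n_pts]) / 2)

print(frac_at_or_below_66)
print(area)  # should be close to 1
```

The area comes out close to 1 rather than exactly 1 because the curve is evaluated on a finite grid and its far tails are cut off.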

When to use each:

bwplot(height ~ sex, data = g)

[Plot: box-and-whisker plots of height by sex]

What's an outlier?

The 1.5 IQR rule of thumb. Demo this by showing a box-and-whisker plot with some outliers and showing how the whiskers extend to the most extreme cases that lie within 1.5 IQR of the first and third quartiles. Cases beyond that are drawn as individual points.
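The rule of thumb can be sketched numerically as follows. The data are hypothetical, and the quartiles come from base R's `quantile()` with its default definition; plotting functions may use a slightly different quartile definition, so the fences in a plot can differ a little.

```r
# Hypothetical heights in inches, with one unusually large value
x <- c(60, 62, 63, 64, 65, 66, 67, 68, 69, 70, 80)

# First and third quartiles (R's default quantile definition)
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Cases outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers stop at the most extreme cases that are still inside the fences
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])

print(outliers)      # 80
print(whisker_ends)  # 60 70
```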

Numerical summaries of distributions

A number describes one thing. You need to know which one thing about a distribution each of these numbers describes:

Some numerical descriptions involve two numbers:

Computing these

The operators are mean, median, sd, var, IQR, qdata, max, min, range, confint

Examples:

mean(height, data = g)
## [1] 66.76
sd(height, data = g)
## [1] 3.583

Most of the operators we will use throughout the course will work this way.

Some of the operators don't recognize the data= syntax. We'd like to fix this, but it isn't always possible without a re-organization of base software, which is too widely used to change. This is the downside of using professional-level software — it's not oriented toward beginners. On the other hand, once you learn to use it, you'll be able to do professional things.

These other operators want to work on a variable that's already been extracted from a data frame; they don't do the extraction themselves.

IQR(g$height)
## [1] 5.7

I'll tend to use this syntax:

with(data = g, IQR(height))
## [1] 5.7
with(data = g, confint(father))
## [1] 64.38 74.08

Interpreting these Measures

It can be helpful to think of arranging these various quantities along two conceptual dimensions:

Location and Scatter

Inclusion and robustness

Computational Exercise: Testing Robustness

Technique: Creating an Outlier

Let's change one height value in the Galton data to be an outrageous outlier:

bogus = g
bogus$height[25] = 800  # You don't need to use this command

Play with the various measures and see which ones are very different between g and bogus.
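One way the comparison might be organized is sketched below. To keep it self-contained it uses a short made-up vector in place of `g$height` (an assumption for illustration) and only base R operators.

```r
# Hypothetical heights in inches (made up; stands in for g$height)
x <- c(60, 62, 63, 64, 65, 66, 67, 68, 69, 70)

# Copy the data and overwrite one case with an outrageous value
bogus <- x
bogus[3] <- 800

# Compare each measure before and after the change
measures <- list(mean = mean, median = median, sd = sd, IQR = IQR)
comparison <- sapply(measures, function(f) c(original = f(x), bogus = f(bogus)))
print(round(comparison, 2))
```

Measures that move a lot between the two rows are sensitive to a single outlier; measures that barely move are robust.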

Dealing with Outliers

Order statistics

Take everyone out into the hall and line them up from shortest to tallest. Assign each person a rank, which is just their order in the line. When there is a tie, the order of the people involved in the tie is arbitrary, so average the naive ranks that would be assigned to the people involved in the tie.

Point out the min, max, median, and first and third quartiles.
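The line-up can be mimicked in a few lines of base R. The heights are hypothetical; `rank()` averages the ranks of tied cases by default, which matches the tie rule described above.

```r
# Hypothetical heights in inches; two people tie at 66
heights <- c(70, 64, 66, 66, 61)

# rank() averages the naive ranks of tied cases by default
ranks <- rank(heights)
print(ranks)  # 5.0 2.0 3.5 3.5 1.0

# Order statistics read off the sorted line-up
print(sort(heights))
print(quantile(heights, c(0, 0.25, 0.5, 0.75, 1)))  # min, Q1, median, Q3, max
```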

The Variance and Standard Deviation

Sometimes it's nice to be able to summarize a distribution with just a small set of numbers. Some possibilities:

We'll be making extensive use of the mean and standard deviation in this course. The reason to prefer these won't become apparent until a few weeks into the semester.

“Standard Deviation” -> “Typical Spread” in more modern terminology.

In French, it's literally “typical spread”: écart type

Give the formulas for the mean, variance, and standard deviation:
\[ m = \frac{1}{n}\sum_{k=1}^{n} x_k \]
\[ v = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - m)^2 \]
\[ s = \sqrt{v} \]
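The formulas can be checked by hand on a tiny example. This is a minimal sketch with made-up heights; it compares the hand computation against base R's `mean()`, `var()`, and `sd()`.

```r
# Hypothetical heights in inches (made up for illustration)
x <- c(61, 64, 66, 66, 70)
n <- length(x)

m <- sum(x) / n                  # mean
v <- sum((x - m)^2) / (n - 1)    # variance, with the n - 1 divisor
s <- sqrt(v)                     # standard deviation

# The hand computations agree with R's built-in operators
print(c(m = m, v = v, s = s))
print(c(mean(x), var(x), sd(x)))
```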

Eyeballing: Standard deviation on a bell-shaped distribution: more or less the half-width at half-height.
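To see how rough the eyeballing rule is, the sketch below works it out for an exactly normal (bell-shaped) curve using base R's `dnorm()`. Real data are only approximately bell-shaped, so treat the numbers as a guide.

```r
# Height of the standard normal curve one sd from the mean, relative to the peak
ratio_at_one_sd <- dnorm(1) / dnorm(0)          # equals exp(-1/2), about 0.61

# Distance from the mean (in sd units) where the curve falls to half its peak
half_width_at_half_height <- sqrt(2 * log(2))   # about 1.18

print(ratio_at_one_sd)
print(half_width_at_half_height)
```

So on an ideal bell curve the half-width at half-height is about 1.18 standard deviations, which is why the rule is only "more or less" right.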

Units of m and s.
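A quick way to see the units is to change them. In this sketch (hypothetical heights, base R only), converting inches to centimeters rescales the mean and standard deviation by the same factor, while the variance is rescaled by the square of that factor, since its units are squared.

```r
# Hypothetical heights in inches, and the same heights in centimeters
x_in <- c(61, 64, 66, 66, 70)
x_cm <- 2.54 * x_in

# Mean and sd carry the units of the variable: both scale by 2.54
mean_ratio <- mean(x_cm) / mean(x_in)
sd_ratio <- sd(x_cm) / sd(x_in)

# Variance carries squared units: it scales by 2.54^2
var_ratio <- var(x_cm) / var(x_in)

print(c(mean_ratio, sd_ratio, var_ratio))
```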

Examples of estimation of s.

In-class activity on the properties of the various measures.

Measurement and measurement bias

Sampling and sampling bias

Random sampling

In-Class Activity

fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE

Instructor's write up

What do your fellow students know about statistics?

ks = fetchData("/Users/kaplan/Dropbox/Stat155Fall2012/knowledge-survey-2012-09-11.csv")
## Complete file name given.  No searching necessary.
for (item in levels(ks$ProblemItem)) {
    print(item)
    print(tally(~as.character(AnswerContents), data = subset(ks, ProblemItem == 
        item)))
    print("")
}
## [1] "algebra"
## 
##         10-12 AP\n   course       College         Never         Total 
##             7             1            35            21            64 
## [1] ""
## [1] "ANCOVA"
## 
##  None  Some Total 
##    48    16    64 
## [1] ""
## [1] "ANOVA"
## 
## Complete     None     Some    Total 
##        1       32       31       64 
## [1] ""
## [1] "bayes"
## 
## Complete     None     Some    Total 
##        4       47       12       63 
## [1] ""
## [1] "boot"
## 
##  None  Some Total 
##    61     2    63 
## [1] ""
## [1] "bwplot"
## 
##         10-12           4-6           6-9 AP\n   course       College 
##             9            15            23             4             2 
##         Never         Total 
##            11            64 
## [1] ""
## [1] "calc"
## 
##         10-12           4-6           6-9 AP\n   course       College 
##            20             3            31             4             1 
##         Never         Total 
##             4            63 
## [1] ""
## [1] "coll"
## 
## Complete     None     Some    Total 
##        1       46       15       62 
## [1] ""
## [1] "cond"
## 
## Complete     None     Some    Total 
##        1       50       12       63 
## [1] ""
## [1] "confI"
## 
## Complete     None     Some    Total 
##        7       35       20       62 
## [1] ""
## [1] "cprob"
## 
##         10-12           4-6           6-9 AP\n   course       College 
##            19             3             9             4             1 
##         Never         Total 
##            28            64 
## [1] ""
## [1] "cumu"
## 
## Complete     None     Some    Total 
##        3       44       16       63 
## [1] ""
## [1] "DF"
## 
## Complete     None     Some    Total 
##        3       33       27       63 
## [1] ""
## [1] "dummy"
## 
## Complete     None     Some    Total 
##        1       46       16       63 
## [1] ""
## [1] "excel"
## 
##         10-12           4-6           6-9 AP\n   course       College 
##            20             1            19             2            12 
##         Never         Total 
##            10            64 
## [1] ""
## [1] "exp"
## 
## Complete     None     Some    Total 
##       20       12       30       62 
## [1] ""
## [1] "Ftest"
## 
##  None  Some Total 
##    60     3    63 
## [1] ""
## [1] "hist"
## 
##           1-3         10-12           4-6           6-9 AP\n   course 
##             3            12            15            20             3 
##       College         Never         Total 
##             1            10            64 
## [1] ""
## [1] "Hnull"
## 
## Complete     None     Some    Total 
##       22       20       20       62 
## [1] ""
## [1] "intrct"
## 
## Complete     None     Some    Total 
##        1       48       14       63 
## [1] ""
## [1] "LA"
## 
##         10-12           4-6           6-9 AP\n   course       College 
##            14             5            14             1            11 
##         Never         Total 
##            19            64 
## [1] ""
## [1] "lincomb"
## 
## Complete     None     Some    Total 
##       11       35       17       63 
## [1] ""
## [1] "logist"
## 
##  None  Some Total 
##    50    13    63 
## [1] ""
## [1] "lreg"
## 
## Complete     None     Some    Total 
##       18       20       26       64 
## [1] ""
## [1] "LSQ"
## 
## Complete     None     Some    Total 
##        9       35       19       63 
## [1] ""
## [1] "main"
## 
## Complete     None     Some    Total 
##        1       48       13       62 
## [1] ""
## [1] "mmm"
## 
##     1-3   10-12     4-6     6-9 College   Never   Total 
##       7       1      33      14       1       2      58 
## [1] ""
## [1] "mTerm"
## 
## Complete     None     Some    Total 
##        1       50       12       63 
## [1] ""
## [1] "normal"
## 
##         10-12           6-9 AP\n   course       College         Never 
##            29            14             6             7             8 
##         Total 
##            64 
## [1] ""
## [1] "ortho"
## 
## Complete     None     Some    Total 
##        9       40       13       62 
## [1] ""
## [1] "out"
## 
## Complete     None     Some    Total 
##       45        5       14       64 
## [1] ""
## [1] "perc"
## 
## Complete     None     Some    Total 
##       39        1       24       64 
## [1] ""
## [1] "prob"
## 
##           1-3         10-12           4-6           6-9 AP\n   course 
##             5            26            13            17             1 
##       College         Never         Total 
##             1             1            64 
## [1] ""
## [1] "program"
## 
##           6-9 AP\n   course       College         Never         Total 
##             1             1            24            38            64 
## [1] ""
## [1] "pVal"
## 
## Complete     None     Some    Total 
##       12       26       26       64 
## [1] ""
## [1] "rand"
## 
## Complete     None     Some    Total 
##       15       24       24       63 
## [1] ""
## [1] "rank"
## 
##  None  Some Total 
##    57     6    63 
## [1] ""
## [1] "rcoef"
## 
## Complete     None     Some    Total 
##       11       17       36       64 
## [1] ""
## [1] "regres"
## 
##         10-12           6-9 AP\n   course       College         Never 
##            21             6             7            13            17 
##         Total 
##            64 
## [1] ""
## [1] "resid1"
## 
## Complete     None     Some    Total 
##        9       24       30       63 
## [1] ""
## [1] "sampD"
## 
## Complete     None     Some    Total 
##        5       25       33       63 
## [1] ""
## [1] "scatter"
## 
##         10-12           4-6           6-9 AP\n   course       College 
##            11            18            26             3             3 
##         Never         Total 
##             3            64 
## [1] ""
## [1] "sd"
## 
##    10-12      6-9  College Complete     None     Some    Total 
##        2        1        1       30        6       24       64 
## [1] ""
## [1] "sig"
## 
## Complete     None     Some    Total 
##        8       36       19       63 
## [1] ""
## [1] "skew"
## 
## Complete     None     Some    Total 
##        7       34       22       63 
## [1] ""
## [1] "sResid"
## 
## Complete     None     Some    Total 
##        6       39       18       63 
## [1] ""
## [1] "subsp"
## 
## Complete     None     Some    Total 
##        8       31       24       63 
## [1] ""
## [1] "t"
## 
##           1-3         10-12           6-9 AP\n   course       College 
##             1            11             1             9            13 
##         Never         Total 
##            29            64 
## [1] ""
## [1] "tTest"
## 
## Complete     None     Some    Total 
##        4       32       28       64 
## [1] ""
## [1] "TypeI"
## 
## Complete     None     Some    Total 
##        2       44       17       63 
## [1] ""