Two main topics for today's class.
1. Review of organization of data
2. Simple descriptive statistics
Before class, pull out a few spreadsheets for the elevator data on Moodle.
Show some that are very disorganized:
Examples:
Then show one or two that are well organized. Try a quick statistical analysis of one of them to show where we'll be heading:
summary(lm(time ~ action, data = elevator))
Hand this PDF handout out to the students or direct them to the link on the syllabus.
Statistics is the explanation of variation in the context of what remains unexplained. Our purpose today, with these simple descriptive statistics, is to be able to describe variation in meaningful ways in a single, quantitative variable. Soon, we'll move on to relating two or more variables, but for today it's just a single variable.
Start with the idea that it's really the distribution of values that we're interested in.
Explain the density plot by reference to the dotplot of points at the bottom of the figure. The height of the graph shows how dense the points are.
More technically, imagine a graph (the cumulative distribution function) showing the fraction of cases that fall at or below a given value, plotted against that value. This will be an upward-stepping graph, like this:
plot(ecdf(CPS85$wage)) # not a command the students need to know
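For the instructor's own reference, here is a minimal sketch of what the cumulative distribution function computes. The wage values are made up for illustration and are not taken from CPS85; the names wage, frac_at_or_below and cdf are likewise hypothetical.

```r
# Hypothetical wages; NOT the CPS85 data.
wage <- c(3.5, 5, 5, 7.25, 9, 12, 20)

# Fraction of cases at or below a given value, computed directly.
frac_at_or_below <- function(v) mean(wage <= v)

# ecdf() returns a function that performs the same calculation.
cdf <- ecdf(wage)

frac_at_or_below(5)  # 3 of the 7 cases are at or below 5
cdf(5)               # same fraction
```

Each step in the plot is a jump of 1/n at a data value, or k/n where k cases share that value.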
The density plot is the derivative of this graph. Note that if you integrate the derivative from \( -\infty \) to \( \infty \), you'll get 1, the total increase in the original function. In other words, the area under the entire density curve is 1. That's what determines the units on the vertical axis of the density curve: one over the units of the variable itself.
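The claim that the area under the density curve is 1 can be checked numerically. This is a rough sketch using simulated data (an assumption, not the CPS85 wages) and a simple trapezoid rule on the grid that density() returns.

```r
set.seed(1)
# Simulated, right-skewed values standing in for wages (hypothetical data).
x <- rexp(200, rate = 0.1)

# Kernel density estimate evaluated on an evenly spaced grid.
d <- density(x)

# Trapezoid rule: sum of (grid width) * (average height of adjacent points).
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # very close to 1
```

Because the area (height times width) is a pure number, the height must carry units of one over the units of x.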
When to use each:
bwplot(wage ~ sector, data = CPS85)
The 1.5 IQR rule of thumb. Demo this by showing a box-and-whisker plot with some outliers and showing how each whisker extends to the most extreme point lying within 1.5 IQR of the first or third quartile. Points beyond that are drawn individually as outliers.
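A small sketch of the rule by hand, on made-up numbers. It uses quantile() with its default method; box-plot functions may compute the quartiles slightly differently, so treat the exact fences as approximate for any given plot.

```r
# Hypothetical data with one obvious outlier.
x <- c(2, 3, 4, 5, 5, 6, 7, 8, 9, 30)

q <- unname(quantile(x, c(0.25, 0.75)))  # first and third quartiles
iqr <- q[2] - q[1]

# Fences sit 1.5 IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)

# Anything outside the fences is flagged as an outlier.
outliers <- x[x < lower_fence | x > upper_fence]
```

Here the quartiles are 4.25 and 7.75, so the fences are at -1 and 13: the whiskers end at 2 and 9, and 30 is the only outlier.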
Take everyone out into the hall and line them up from shortest to tallest. Assign each person a rank, which is just their order in the line. When there is a tie, the order of the people involved in the tie is arbitrary, so average the naive ranks that would be assigned to the people involved in the tie.
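The tie-averaging rule from the hallway exercise can be sketched in R with rank(), whose default tie handling is to average. The heights are invented for illustration.

```r
# Hypothetical heights in cm; two people are tied at 172.
heights <- c(160, 172, 165, 172, 180)

# The tied people would naively get ranks 3 and 4,
# so each is assigned the average, 3.5.
ranks <- rank(heights, ties.method = "average")
ranks
```

The result lists each person's rank in the original order of the vector, not in sorted order.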
Point out the min, max, median, and first and third quartiles.
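The same five landmarks can be read off in R. This is a sketch on made-up heights, using quantile() with its default interpolation method; other methods can give slightly different quartiles on small samples.

```r
# Hypothetical heights in cm.
heights <- c(160, 172, 165, 172, 180)

# Min, first quartile, median, third quartile, max.
five <- quantile(heights, c(0, 0.25, 0.5, 0.75, 1))
five
```

With five people in line, these are simply the heights of the 1st, 2nd, 3rd, 4th and 5th person.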
Sometimes it's nice to be able to summarize a distribution with just a small set of numbers. Some possibilities:
We'll be making extensive use of the mean and standard deviation in this course. The reason to prefer these won't become apparent until a few weeks into the semester.
“Standard Deviation” -> “Typical Spread” in a more modern terminology.
In French, it's literally “typical spread”: écart type
Give the formulas for the mean m, the variance v, and the standard deviation s:
\[ m = \frac{1}{n}\sum_{k=1}^{n} x_k \]
\[ v = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - m)^2 \]
\[ s = \sqrt{v} \]
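The formulas can be checked against R's built-in functions. A minimal sketch on invented numbers; the variable names follow the symbols m, v and s used above.

```r
# Hypothetical data.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

m <- sum(x) / n                  # mean
v <- sum((x - m)^2) / (n - 1)    # variance, with the n - 1 divisor
s <- sqrt(v)                     # standard deviation

# The built-ins use the same definitions.
all.equal(c(m, v, s), c(mean(x), var(x), sd(x)))
```

For these numbers m is 5 and the squared deviations sum to 32, so v is 32/7.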
Eyeballing: on a bell-shaped distribution, the standard deviation is more or less the half-width of the curve at half its maximum height.
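How rough is "more or less"? For an exactly normal curve the half-width at half-height works out to \( s\sqrt{2\ln 2} \approx 1.18\,s \), so the eyeball estimate runs a bit high. A quick check, assuming a normal density with s = 1:

```r
s <- 1

# Distance from the center at which a normal curve falls to half its peak.
half_width <- s * sqrt(2 * log(2))

# Ratio of the curve's height at that distance to its height at the center.
height_ratio <- dnorm(half_width, mean = 0, sd = s) / dnorm(0, mean = 0, sd = s)

half_width    # about 1.18
height_ratio  # 0.5
```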
Units of m and s: both are in the same units as the variable itself, while the variance v is in squared units.
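One way to make the point about units concrete is to change the units and watch what happens. A sketch with invented heights:

```r
# Hypothetical heights, first in centimeters, then in meters.
height_cm <- c(160, 165, 172, 172, 180)
height_m  <- height_cm / 100

# The mean and standard deviation rescale by the same factor as the data.
c(mean(height_cm), mean(height_m))
c(sd(height_cm), sd(height_m))

# The variance rescales by the square of that factor.
c(var(height_cm), var(height_m))
```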
Examples of estimation of s.