Describing samples

The content of this part isn’t difficult, but it’s a necessary foundation for estimation and inference about data. For numerical data, we want to be able to describe the central tendency and dispersion of a sample. Categorical data are described by proportions of outcomes.

The arithmetic mean is the familiar “average” or “center of gravity” of a sample, or a typical value that we might expect from repeated sampling.

In R, the command to calculate the mean is simple: it’s

mean(x)

where the argument x is a numeric vector, that is, a set of numerical observations that you’ve stored as an object.
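Here is a minimal sketch of what that looks like in practice. The object name heights and its values are made up for illustration; they are not data from the class.

```r
# Hypothetical sample of heights in inches (illustrative values only)
heights <- c(60, 62, 64, 66, 68)

# mean() adds up the observations and divides by the sample size
m <- mean(heights)

# The same calculation written out by hand
m_by_hand <- sum(heights) / length(heights)

print(m)          # 64
print(m_by_hand)  # 64
```

If your vector contains missing values (NA), mean(x) returns NA unless you add the argument na.rm = TRUE.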

The standard deviation describes the spread of a distribution: how different, in general, are the observations from the mean? The SD is the square root of another measure of dispersion called the variance. We prefer the SD because the variance is in squared units, while the SD keeps the same units as our observations.

The SD from raw data is equally easy in R:

sd(x)
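To see how the SD relates to the variance, here is a small sketch using a made-up vector (the name heights and its values are hypothetical). Note that R's sd() and var() compute the sample versions, which divide by n - 1 rather than n.

```r
# Hypothetical sample of heights in inches (illustrative values only)
heights <- c(60, 62, 64, 66, 68)

v <- var(heights)  # sample variance, in inches squared: 10
s <- sd(heights)   # sample SD, in inches: sqrt(10), about 3.16

# The same SD written out by hand: squared deviations from the mean,
# summed, divided by n - 1, then square-rooted
n <- length(heights)
s_by_hand <- sqrt(sum((heights - mean(heights))^2) / (n - 1))

print(v)
print(s)
print(s_by_hand)
```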

The median and interquartile range are other descriptors of numerical data. The median is also a measure of central tendency: it’s the “50% line,” or middle value, when your data are ranked from low to high. Half of your observations fall at or below it and half fall at or above it. (When you have an even sample size, take the mean of the two middle numbers.) The interquartile range is the spread of the “middle 50%” of your data, from the 25% value to the 75% value.

As you might predict, in R the median is simply

median(x)

If you need the 25%, 50% (median), and 75% values, use the function

quantile(x)

and R will report all of them, along with the minimum (0%) and maximum (100%).
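A short sketch with a made-up, even-sized sample (the name x matches the text, but the values are hypothetical) shows the median, the quartiles, and the interquartile range together. IQR() is a built-in R function that returns the 75% value minus the 25% value.

```r
# Hypothetical sample with an even number of observations
x <- c(60, 62, 64, 66, 68, 70)

med <- median(x)   # mean of the two middle values, 64 and 66: 65
q   <- quantile(x) # named vector with 0%, 25%, 50%, 75%, 100%
iqr <- IQR(x)      # 75% value minus 25% value

print(med)
print(q)
print(iqr)
```

R's quantile() interpolates between observations, and it offers several methods through its type argument. Quartiles you work out by hand with a textbook rule may therefore differ slightly from what R reports.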

Boxplots visually demonstrate these values. Below, \(Y_i\) means “any given observation.” \(Y_1\) is 58 in., \(Y_2\) is 59 in., etc.

Last, the sample proportion is the fraction of observations that fall into a given category. In my fall 2016 biostats class there were 25 women out of 32, which is a sample proportion of 25/32 = 0.78. The proportion goes from 0 to 1.0, meaning “never” to “always.” Why don’t we use percents? I think that it was to make the math simpler when we apply proportions to other sample sizes.

You could just divide the count by the total yourself, but if you were working with a huge table of counts, the function

prop.table(x)

works well to report proportions. Here x needs to be a table of counts, such as the output of table(), not the raw observations. Let the software count for you.
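As a sketch of the whole workflow, here is the class example rebuilt as raw data. Only the 25 out of 32 comes from the text; the object names and the label "other" for the remaining 7 students are placeholders for illustration.

```r
# One entry per student. 25 of 32 is from the text;
# the label "other" is a placeholder for the remaining 7.
students <- c(rep("woman", 25), rep("other", 7))

counts <- table(students)     # count the observations in each category
props  <- prop.table(counts)  # divide each count by the total

print(counts)
print(round(props, 2))        # woman is 25/32, about 0.78
```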

But wait!

These approaches work like this when you are working from raw data. If your data are summarized in a frequency distribution table, you need to count each value as many times as it occurs.

For example, if you had this frequency distribution of body heights:

Height	Freq.
61	1
62	4
63	5
65	8
66	4
68	2
70	1

The mean is not 65.0. That’s what you’d get if you added the seven distinct height values and divided by 7.

The problem is that you have 25 observations. Your raw data would be this: 61, 62, 62, 62, 62, 63, 63, 63, 63, 63, 65…skipping a few…70. The mean is 1614/25 = 64.56.
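One way to handle this in R, sketched with the table above, is to rebuild the raw data with rep() or to weight each height by its frequency with weighted.mean(). The object names here are my own choices for illustration.

```r
# The frequency distribution from the table above
height <- c(61, 62, 63, 65, 66, 68, 70)
freq   <- c(1, 4, 5, 8, 4, 2, 1)

# The rookie mistake: averaging the seven distinct heights
wrong <- mean(height)                      # 65

# Option 1: rebuild the 25 raw observations, then take the mean
raw   <- rep(height, times = freq)
right <- mean(raw)                         # 1614 / 25 = 64.56

# Option 2: weight each height by its frequency
right_weighted <- weighted.mean(height, w = freq)

# Option 3: the same arithmetic by hand
right_by_hand <- sum(height * freq) / sum(freq)

print(wrong)
print(right)
print(right_weighted)
print(right_by_hand)
```

Once the raw data are rebuilt, sd(raw) and median(raw) also work the usual way.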

Don’t mess that one up. That’s a rookie mistake.