samples and populations

Brian Holt

11/10/2020

Histogram example

Suppose we are doing a study on patience, and everybody on the planet is surveyed about how long they are willing to wait at the Department of Motor Vehicles. We can plot the results as a histogram: the y-axis is the number of people who leave, while the x-axis holds the ‘bins’ or ‘containers’ of people leaving within a given period of time (10 min, 20 min, etc.):

Notice the parts of the graph. The x-axis represents the number of minutes people are willing to wait while the y-axis is simply a count of the number of people.

The data is fake

I made the data up so that most people walk before an hour is up, which isn’t unreasonable.
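For anyone who wants to follow along, here is a minimal sketch of how a made-up population like this could be generated and plotted in R. The code that actually built `pop` is not shown, so the Poisson distribution, the rate of 10, the population size of 100,000, and the seed are all assumptions, chosen only to land near the summary numbers reported in this presentation.

```r
# Hypothetical stand-in for the made-up population; the real generating
# code is not shown, so the distribution and its parameters are assumptions.
set.seed(42)
pop <- rpois(100000, lambda = 10)  # minutes each person waits before leaving

# x-axis: minutes waited (one bin per minute); y-axis: count of people
hist(pop,
     breaks = seq(-0.5, max(pop) + 0.5, by = 1),
     main = "Minutes waited at the DMV (simulated)",
     xlab = "Minutes waited before leaving",
     ylab = "Number of people")
```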

And, since we can play ultimate deity at this point, I can show you the average and standard deviation:

summary(pop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   10.00   10.01   12.00   27.00
mean(pop)
## [1] 10.00802
population_sd(pop)
## [1] 3.16079
sd(pop)
## [1] 3.160806
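
`population_sd()` is not a base R function, and its definition is not shown here. Base R’s `sd()` divides by n - 1 (the sample formula), while a population standard deviation divides by n. A minimal sketch of what such a helper might look like, assuming it uses the divide-by-n formula:

```r
# Hypothetical definition; the author's actual population_sd() is not shown.
population_sd <- function(x) {
  # Divide by n, not n - 1, because x is treated as the whole population
  sqrt(mean((x - mean(x))^2))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # small illustrative data set
population_sd(x)  # divides by n
sd(x)             # divides by n - 1, so it is slightly larger
```

With a very large population the two versions are nearly identical, which is consistent with the 3.16079 versus 3.160806 shown above.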

basic descriptive stats

Statistic            Value
Mean                 10.008
Standard Deviation   3.161

Most of you are familiar with what an average is; in this case it’s being referred to as the mean. It’s a measure of central tendency. There are other measures of central tendency, but for now I think you can follow.

Interpreting this average is basically saying that, on average, a person leaves the DMV after 10.008 minutes.

Standard deviation

What you might not be familiar with is the other statistic, called the standard deviation. That’s the 3.161.

It’s basically a value that captures how much all of the data ‘revolve’, ‘hang-around’, ‘orbit’, or ‘vary’ around the average.

If the average is like a center of gravity, the data are like little satellites that orbit around that center of gravity.

center of gravity

But they won’t all line up exactly over that center.

So the standard deviation is an effort to try to capture whether or not the orbiting data is really tight and close to the average or if it’s really spread out.

Small numbers mean that they’re very tight; big numbers mean they are very spread out.

Drawing a sample

Now, this data above is the full population of people having to go to the Department of Motor Vehicles. So technically we would not have this data; this is just theoretical.

In practice, though, we may be able to sample small groups and study them. Let’s say we do an experiment where we survey ten people.

sample 1 histogram

Because we’re only sampling 10 people, it’s not impossible that parts of the graph you see here have no data. On average I would expect at least some.
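
Drawing one sample of ten can be sketched as follows. The population here is a hypothetical stand-in (a Poisson distribution with mean 10 is an assumption, not the author’s confirmed method), and `sample_1` is just an illustrative name.

```r
# Hypothetical stand-in population (assumption: Poisson with mean 10)
set.seed(1)
pop <- rpois(100000, lambda = 10)

# Survey ten people at random, without replacement
sample_1 <- sample(pop, size = 10)

sample_1        # the ten individual waiting times
mean(sample_1)  # the sample average, which will rarely equal mean(pop) exactly
hist(sample_1, main = "Sample 1", xlab = "Minutes waited before leaving")
```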

Another sample

The problem with these small samples is that they vary a lot, which means that our average may be off.

Let me do this 8 times and I will plot them all next to each other.

8 samples
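
A rough sketch of how eight samples might be drawn and plotted side by side. As before, the population is an assumed stand-in, and the 2 x 4 grid layout is only one possible choice.

```r
# Hypothetical stand-in population (assumption: Poisson with mean 10)
set.seed(2)
pop <- rpois(100000, lambda = 10)

# Draw 8 independent samples of 10 people; each column is one sample
samples <- replicate(8, sample(pop, size = 10))

# Plot the 8 histograms in a 2 x 4 grid, then restore the plot settings
old_par <- par(mfrow = c(2, 4))
for (i in 1:8) {
  hist(samples[, i], main = paste("Sample", i), xlab = "Minutes waited")
}
par(old_par)

# The 8 sample averages differ from one another and from mean(pop)
colMeans(samples)
```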

Interpret several graphs

Basically, what you’re seeing is the same pattern where most of these small samples of 10 people are leaving within just a few minutes, if not immediately.

But here’s the big idea. If we can’t measure the whole population, how do we get any idea about that population through samples when our samples are so small?

inferences to large populations

In practice we might try to get larger samples. But there are some mathematical shortcuts that let us make some estimates. We won’t need to talk about that in this class.

It turns out that for certain populations, about 30 subjects are all that is needed to get a reasonable estimate of the population. Using the sample statistics, you can create what is called a confidence interval.

But as a mental exercise, what we can do is play a different type of deity and do what’s called a simulation. We get to magically sample from the population hundreds or thousands of times, and from that we can get a pretty good estimate for this class.
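
As a rough illustration of a confidence interval built from sample statistics, the sketch below draws 30 subjects and computes an interval for the mean with the usual t-based formula. The stand-in population and the 95% level are assumptions for illustration, not details from this study.

```r
# Hypothetical stand-in population (assumption: Poisson with mean 10)
set.seed(3)
pop <- rpois(100000, lambda = 10)

n <- 30
s <- sample(pop, size = n)

# 95% confidence interval: sample mean +/- t critical value * standard error
std_error <- sd(s) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci <- mean(s) + c(-1, 1) * t_crit * std_error
ci

# t.test() reports the same interval at its default 95% level
t.test(s)$conf.int
```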

Repetitive samples

To do this, I’m going to make a shift in the presentation. Instead of showing you histograms of single data points, I’m going to show you histograms of averages.

In other words I’m going to take a sample, I’m going to calculate its average, and then that average is going to be a data point.

We will do this many times, and then plot that. Here is the list we get from doing that 20 times:

##  [1]  8.4 10.8  8.8  9.3 10.6 10.9 10.2 11.1  8.8  9.4  7.6  9.6  9.5  9.3  9.5
## [16] 10.1 12.0  9.2 12.2 12.5

And if we were to take the average of this new data set of averages (yes, average of averages) we’d see the average is: 9.99

Recall that the average of the full population was: 10.00802
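
The repeated sampling described above can be sketched in a few lines. The sample size of 10 is carried over from the earlier survey example and is an assumption here, as is the stand-in population, so the numbers will not match the list above exactly.

```r
# Hypothetical stand-in population (assumption: Poisson with mean 10)
set.seed(4)
pop <- rpois(100000, lambda = 10)

# Take a sample of 10, compute its average, and repeat 20 times
sample_means <- replicate(20, mean(sample(pop, size = 10)))

sample_means        # 20 averages, each one now a data point
mean(sample_means)  # the average of the averages
mean(pop)           # the population average, for comparison
```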

Histogram of the simulation

It kind of resembles the original distribution, but that is only because we have 20 samples.

If we do 1000 samples, we get something else:

probability distribution

What you should see here is a little bit of a different type of graph; it’s actually much more normally distributed than the actual population data.

Why ‘more’ normal

The reason has to do with the fact that we’re sampling, taking the average, and putting that average into this last graph. You should also notice that the standard deviation is going to be smaller than the standard deviation of the full population.

Statistic                Value
mean(pop)                10.008
mean(samples)            10.0061
standard dev (pop)       3.161
standard dev (samples)   1.023
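
A sketch of the 1000-sample version, again with an assumed stand-in population and an assumed sample size of 10. It also shows a common rule of thumb: the spread of averages of n values is roughly the population standard deviation divided by the square root of n. With the numbers in the table, 3.161 / sqrt(10) is about 1.00, close to the 1.023 observed.

```r
# Hypothetical stand-in population (assumption: Poisson with mean 10)
set.seed(5)
pop <- rpois(100000, lambda = 10)
n <- 10  # assumed sample size

# Take 1000 samples and keep only each sample's average
sample_means <- replicate(1000, mean(sample(pop, size = n)))

hist(sample_means, main = "1000 sample averages",
     xlab = "Average minutes waited")

sd(pop)            # spread of individual waiting times
sd(sample_means)   # spread of the averages, noticeably smaller
sd(pop) / sqrt(n)  # rule of thumb for the spread of averages of n values
```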

What is normal?

There is actually a formula that will calculate the classic normal, bell-curve-looking distribution. But what you might not realize is that you can kind of visualize it by reducing the size of the bins on the x-axis.
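
For the curious, the formula can be written out by hand and checked against R’s built-in `dnorm()`. The helper name `normal_density` is hypothetical, and plugging in the mean and standard deviation of the sample averages from the table above is just one illustrative choice.

```r
# The normal density formula written out by hand (hypothetical helper name)
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

mu <- 10.0061   # mean of the sample averages reported above
sigma <- 1.023  # standard deviation of the sample averages reported above

x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)

# Check that the hand-written formula agrees with R's built-in dnorm()
all.equal(normal_density(x, mu, sigma), dnorm(x, mean = mu, sd = sigma))

# Draw the bell curve
plot(x, normal_density(x, mu, sigma), type = "l",
     xlab = "Average minutes waited", ylab = "Density")
```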

So?

The reason this matters is that it shows that you can actually judge the probability of something happening by measuring the area under a particular part of the curve. Everything underneath the curve is equal to 100%. Any area underneath that curve can be taken as a share of that 100% and then interpreted as the probability of that outcome occurring.

What might be lost in all of this is that a lot of what statistics in psychology is about is making inferences about populations that we can’t measure directly but can sample.

I hope that you take away from this quick video that we can use smallish samples to make inferences about large populations.
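
Measuring an area under the curve can be sketched with `pnorm()`, which gives the area to the left of a value. The sketch assumes the sample averages follow a normal curve with the mean and standard deviation reported in the table above; the 9-to-11-minute window is a made-up example.

```r
mu <- 10.0061   # mean of the sample averages reported above
sigma <- 1.023  # standard deviation of the sample averages reported above

# Area under the whole curve: 1, that is, 100%
total_area <- pnorm(Inf, mean = mu, sd = sigma) - pnorm(-Inf, mean = mu, sd = sigma)

# Area between 9 and 11: the probability that a sample average
# falls between 9 and 11 minutes (hypothetical window)
p_between <- pnorm(11, mean = mu, sd = sigma) - pnorm(9, mean = mu, sd = sigma)

total_area
p_between
```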