Brian Holt
11/10/2020
What if we were doing a study on patience, and we could survey everybody on the planet about how long they are willing to wait at the Department of Motor Vehicles? In other words, the y-axis is the number of people who leave, while the x-axis holds the ‘bins’ or ‘containers’ of people leaving within a given period of time (10 min, 20 min, etc.):
Notice the parts of the graph. The x-axis represents the number of minutes people are willing to wait, while the y-axis is simply a count of the number of people.
I made the data up so that most people walk out before an hour is up, which isn’t unreasonable.
And, since we can play ultimate deity at this point, I can show you the average and standard deviation:
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    8.00   10.00   10.01   12.00   27.00
## [1] 10.00802
## [1] 3.16079
## [1] 3.160806
```
| Statistic | Value |
|---|---|
| Mean | 10.008 |
| Standard deviation | 3.161 |
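As an aside, a mean of about 10 with a standard deviation of about √10 ≈ 3.16 is exactly what a Poisson distribution would give, so if you want to try this yourself, a minimal R sketch might look like the following. The `rpois()` call, the population size, and the seed are my assumptions about how data like this could be generated, not necessarily how the original figures were produced:

```r
set.seed(42)  # for reproducibility

# Simulate a made-up "population" of DMV waiting times (in minutes).
# rpois(n, lambda = 10) gives values with mean 10 and sd sqrt(10) ~ 3.16,
# which matches the statistics above; the true generating code may differ.
population <- rpois(1e6, lambda = 10)

summary(population)  # five-number summary plus the mean
mean(population)     # the population average
sd(population)       # the population standard deviation
```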
Most of you are familiar with what an average is; in this case it’s being referred to as the mean. It’s a measure of central tendency. There are more types of central tendency, but for now this one is all you need to follow along.
Interpreting this average basically amounts to saying that a person leaves the DMV after an average of 10.008 minutes.
What you might not be familiar with is the other statistic, called the standard deviation. That’s the 3.161.
It’s basically a value that captures how much all of the data ‘revolve’, ‘hang around’, ‘orbit’, or ‘vary’ around the average.
If the average is like a center of gravity, the data are like little satellites that orbit around that center of gravity.
But they won’t all line up exactly over that center.
So the standard deviation is an effort to capture whether the orbiting data are really tight and close to the average or really spread out.
Small numbers mean that they’re very tight; big numbers mean they are very spread out.
Now, this data above is the full population of people having to go to the Department of Motor Vehicles. So technically we would not have this data; this is just theoretical.
In practice, though, we may be able to sample small groups and study them. Let’s say we do an experiment where we survey ten people.
Because we’re only sampling 10 people, it’s not impossible for parts of the graph you see here to have no data at all. On average, though, I would expect at least some.
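For the curious, here is a minimal R sketch of drawing one such sample, assuming the `population` vector from the earlier sketch:

```r
# Draw a single sample of 10 people and inspect it.
s <- sample(population, size = 10)
hist(s, main = "One sample of 10 people", xlab = "Minutes willing to wait")
mean(s)  # this will bounce around from sample to sample
```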
The problem with these small samples is that they vary a lot, which means that our average may be off.
Let me do this 8 times, and I will plot them all next to each other.
Basically, what you’re seeing is the same pattern: in most of these small samples of 10, people are leaving within just a few minutes, if not immediately.
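Something like the following would produce the 8 side-by-side histograms (the 2-by-4 layout is my choice, not necessarily the original’s):

```r
# Draw 8 independent samples of 10 and plot their histograms in a grid.
par(mfrow = c(2, 4))
for (i in 1:8) {
  hist(sample(population, size = 10),
       main = paste("Sample", i), xlab = "Minutes willing to wait")
}
par(mfrow = c(1, 1))  # reset the plotting layout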
But here’s the big idea: if we can’t measure the whole population, how do we get any idea about that population through samples when our samples are so small?
In practice we might try to get larger samples. But there are some mathematical shortcuts that let us make some estimates. We won’t need to talk about that in this class.
It turns out that for certain populations, about 30 subjects are all that is needed to get a reasonable estimate of the population. Using the sample statistics, you can create what is called a confidence interval.
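We won’t pursue confidence intervals here, but for the curious, R’s built-in `t.test()` will compute one from a single sample of 30 in one line (again assuming the `population` vector from the earlier sketch):

```r
# A 95% confidence interval for the population mean from one sample of 30.
t.test(sample(population, size = 30))$conf.int
```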
But as a mental exercise, what we can do is play a different type of deity and run what’s called a simulation. We get to magically sample from the population hundreds of thousands of times, and from that we can get a pretty good estimate for this class.
To do this, I’m going to make a shift in the presentation. Instead of showing you a histogram of single data points, I’m going to show you a histogram of averages.
In other words I’m going to take a sample, I’m going to calculate its average, and then that average is going to be a data point.
We will do this many times, and then plot the result. Here is a list of the averages from doing that 20 times:
```
##  [1]  8.4 10.8  8.8  9.3 10.6 10.9 10.2 11.1  8.8  9.4  7.6  9.6  9.5  9.3  9.5
## [16] 10.1 12.0  9.2 12.2 12.5
```
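A sketch of how 20 such averages could be produced in R, again assuming the `population` vector from above:

```r
# Take 20 samples of 10 people each, keeping only each sample's average.
sample_means <- replicate(20, mean(sample(population, size = 10)))
round(sample_means, 1)  # 20 averages, each now a single data point
mean(sample_means)      # the average of the averages
```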
And if we were to take the average of this new data set of averages (yes, average of averages) we’d see the average is: 9.99
Recall that the average of the full population was: 10.00802
It kind of resembles the original distribution, but that is only because we have just 20 samples.
If we do 1000 samples, we get something else:
What you should see here is a little bit of a different type of graph; it’s actually much more normally distributed than the actual population data.
The reason has to do with the fact that we’re sampling, taking the average, and putting that average into this last graph. You should also notice that the standard deviation is going to be smaller than the standard deviation of the full population.
| Statistic | Value |
|---|---|
| Mean (population) | 10.008 |
| Mean (sample averages) | 10.0061 |
| Standard deviation (population) | 3.161 |
| Standard deviation (sample averages) | 1.023 |
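That smaller spread is no accident. For samples of size n, the standard deviation of the sample averages (often called the standard error) is approximately the population standard deviation divided by √n; here that is 3.161 / √10 ≈ 1.00, very close to the 1.023 in the table. A quick simulation check in R:

```r
# The sd of 1000 sample averages should be close to sd(population)/sqrt(10).
many_means <- replicate(1000, mean(sample(population, size = 10)))
sd(many_means)             # simulated spread of the averages, ~1.0
sd(population) / sqrt(10)  # theoretical standard error, ~1.0
```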
There is actually a formula that will draw the classic Normal bell-curve-looking distribution. But what you might not realize is that you can kind of visualize it by reducing the size of the bins on the x-axis.
The reason this matters is that it shows you can actually judge the probability of something happening by measuring the area under a particular part of the curve. Everything underneath the curve is equal to 100%. Any area underneath that curve can be taken as a portion of that 100% and interpreted as the probability of the corresponding outcome occurring.
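For instance, treating the distribution of sample averages above as Normal with a mean of about 10 and a standard deviation of about 1, R’s `pnorm()` returns the area under the curve to the left of any point, and areas like these read directly as probabilities (the cutoffs here are just for illustration):

```r
# Area to the left of 9, under a Normal curve with mean 10 and sd 1:
pnorm(9, mean = 10, sd = 1)   # ~0.16, i.e., about a 16% chance

# Area between 9 and 11 (about 68% of the total area):
pnorm(11, mean = 10, sd = 1) - pnorm(9, mean = 10, sd = 1)
```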
What might be lost in all of this is that a lot of what statistics in psychology is about is attempting to make inferences about populations that we can’t measure directly, but that we can sample.
I hope that you take away from this quick video that we can use small-ish samples to make inferences about large populations.