Introduction

In this notebook we’ll look at the population of all Massey students from last year and see how various statistics computed on a random sample change from sample to sample. The idea is that normally we only have one sample, a subset of the population. Our goal is to say something about the population though, such as to estimate a population proportion or population mean. By knowing how these quantities vary from sample to sample, we can get an idea of how to work back from a single sample to the population.

Our data set contains the following variables for every student in the population at Massey University.

Variable   Description
GPA        Grade point average (between 0 and 9 inclusive)
Age        Age (years)
Sex        Sex (female/male)

We start by loading some needed packages/functions and then reading the population data from the web and looking at the first few rows:

library(ggplot2)
source("http://www.massey.ac.nz/~jcmarsha/227212/sampling.R")
massey = read.csv("http://www.massey.ac.nz/~jcmarsha/227212/data/massey.csv")
head(massey)

Distribution of a sample proportion

We start by assessing what the true proportion of females is in the population. In most situations we won’t know this - instead we’ll be trying to estimate it from a sample. We’re kind of working backwards today so we can get a feel for how what we see in a sample relates to the truth in the population.

ggplot(massey) + geom_bar(aes(x=Sex))

proportion(massey$Sex, "female")
## [1] 0.5287654
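The proportion() helper comes from sampling.R, which isn’t shown here. As a rough idea of what such a helper might do, here is a minimal base R sketch. The name prop_of and the five values are made up for illustration, and the course’s own implementation may differ.

```r
# Hypothetical stand-in for the course's proportion() helper; the real
# implementation in sampling.R may differ.
prop_of <- function(x, level) {
  mean(x == level)  # TRUE/FALSE values average to the share of matches
}

sex <- c("female", "male", "female", "female", "male")  # made-up values
prop_of(sex, "female")  # 3 of the 5 values match
```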

Next, let’s take a sample and work out the proportion of females there to see how it compares to the population. We’ll start with a sample of size 20. Try running this code block a few times - you’ll get different samples (so different plots and proportions each time).

samp = take_samples(massey, 20)
ggplot(samp) + geom_bar(aes(x=Sex))

proportion(samp$Sex, "female")
## [1] 0.7
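If you’re curious how a single random sample could be drawn without the helper functions, the sketch below uses base R’s sample(). It assumes simple random sampling without replacement and uses a made-up population (pop) rather than the Massey data; take_samples() itself may work differently.

```r
set.seed(1)  # only so the example is reproducible

# Made-up population of 10,000 students, roughly 53% female (an assumption
# chosen to resemble the population proportion reported above).
pop <- data.frame(
  Sex = sample(c("female", "male"), 10000, replace = TRUE, prob = c(0.53, 0.47))
)

# Pick 20 distinct row numbers, then keep just those rows.
rows <- sample(nrow(pop), 20)
samp_sketch <- pop[rows, , drop = FALSE]

mean(samp_sketch$Sex == "female")  # proportion of females in this one sample
```

Removing the set.seed() line gives a different sample, and so a different proportion, on each run.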

Let’s automate the process of repetition. We’ll take 20 samples, each of size 20, and see what we get for the sample proportions across those 20 samples. You should notice that the 20 samples differ a little, and that the proportions of females in the samples are, on average, close to the population proportion.

samps = take_samples(massey, 20, 20)
ggplot(samps) + geom_bar(aes(x=Sex)) + facet_wrap(~Sample)

props = sample_summary(samps, Sex, proportion, "female")
summary(props)
##      Sample        Size      proportion    
##  1      : 1   Min.   :20   Min.   :0.3000  
##  2      : 1   1st Qu.:20   1st Qu.:0.4500  
##  3      : 1   Median :20   Median :0.5000  
##  4      : 1   Mean   :20   Mean   :0.4775  
##  5      : 1   3rd Qu.:20   3rd Qu.:0.5125  
##  6      : 1   Max.   :20   Max.   :0.7000  
##  (Other):14

Going a step further, let’s take 1000 samples to get a better feel for the distribution that we get in the sample proportions (our ‘estimates’) and how this compares to the population proportion (the ‘truth’).

samp20 = take_samples(massey, 20, 1000)
props20 = sample_summary(samp20, Sex, proportion, "female")
summary(props20)
##      Sample         Size      proportion    
##  1      :  1   Min.   :20   Min.   :0.2000  
##  2      :  1   1st Qu.:20   1st Qu.:0.4500  
##  3      :  1   Median :20   Median :0.5500  
##  4      :  1   Mean   :20   Mean   :0.5276  
##  5      :  1   3rd Qu.:20   3rd Qu.:0.6000  
##  6      :  1   Max.   :20   Max.   :0.9000  
##  (Other):994
ggplot(props20) + geom_histogram(aes(x=proportion), binwidth=0.05)

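What do you notice about the distribution of the proportion of females you get across the 1000 samples? The sample proportions are centred near the population proportion of 0.529 (their mean is 0.5276), but individual samples vary a lot: they range from 0.20 to 0.90, with the middle half falling between 0.45 and 0.60. A single sample of size 20 can therefore be well away from the truth.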
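The spread seen across the 1000 samples can be compared with the usual formula for the standard error of a sample proportion, sqrt(p(1 - p)/n). The sketch below is an illustration rather than part of the lab: it simulates sample proportions with rbinom(), which treats draws as independent, a reasonable approximation only if the population is much larger than the sample.

```r
set.seed(42)  # only so the example is reproducible

p <- 0.5287654  # population proportion of females reported above
n <- 20         # sample size

# Simulate 1000 sample proportions (independent draws; an approximation
# to sampling without replacement from a large population).
sim_props <- rbinom(1000, size = n, prob = p) / n

mean(sim_props)        # should sit near p
sd(sim_props)          # spread of the simulated proportions
sqrt(p * (1 - p) / n)  # theoretical standard error, about 0.11
```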

Finally, we’ll repeat the process with some larger sample sizes, to help figure out how sample size influences things.

samp20 = take_samples(massey, 20, 1000)
samp80 = take_samples(massey, 80, 1000)
samp320 = take_samples(massey, 320, 1000)
props20 = sample_summary(samp20, Sex, proportion, "female")
props80 = sample_summary(samp80, Sex, proportion, "female")
props320 = sample_summary(samp320, Sex, proportion, "female")
props_all = rbind(props20, props80, props320)
ggplot(props_all) + geom_histogram(aes(x=proportion), binwidth=0.05) + facet_wrap(~Size)

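What do you notice about the distribution of the sample proportion? How does that distribution change when the sample size increases? The distribution stays centred on the population proportion for every sample size, but its spread decreases as the sample size increases: the sample proportions cluster more tightly around the truth.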
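To put numbers on how the spread depends on sample size, the sketch below evaluates the standard error formula sqrt(p(1 - p)/n) for the three sample sizes used above. It relies on the formula alone, not on the simulated samples, and the variable names are chosen for illustration.

```r
p <- 0.5287654            # population proportion of females reported above
sizes <- c(20, 80, 320)   # the three sample sizes used in this section

se <- sqrt(p * (1 - p) / sizes)  # theoretical standard error for each size

# ratio compares each standard error with the one for samples of size 20
data.frame(Size = sizes, se = se, ratio = se / se[1])
```

Because the sample size sits under a square root, each fourfold increase in sample size halves the standard error.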

Distribution of a sample mean

We start by looking at the distribution of ages in the population, and compute the mean age in the population.

mean(massey$Age)
## [1] 28.46204
ggplot(massey) + geom_histogram(aes(x=Age), bins=30)

We next take 20 samples each of size 20, and see how they are distributed. We also compute their sample means and assess how the sample means vary.

samps = take_samples(massey, 20, 20)
ggplot(samps) + geom_histogram(aes(x=Age), bins=8) + facet_wrap(~Sample)

means = sample_summary(samps, Age, mean)
summary(means)
##      Sample        Size         mean      
##  1      : 1   Min.   :20   Min.   :23.81  
##  2      : 1   1st Qu.:20   1st Qu.:27.28  
##  3      : 1   Median :20   Median :27.85  
##  4      : 1   Mean   :20   Mean   :28.17  
##  5      : 1   3rd Qu.:20   3rd Qu.:29.03  
##  6      : 1   Max.   :20   Max.   :35.48  
##  (Other):14

You should notice that the mean age in each sample is, on average, close to the population mean age.

Lastly, we take 1000 samples each of size 20, 80 and 320. We then compute the sample means and assess how the sample means vary for each of these sample sizes.

samp20 = take_samples(massey, 20, 1000)
samp80 = take_samples(massey, 80, 1000)
samp320 = take_samples(massey, 320, 1000)
means20 = sample_summary(samp20, Age, mean)
means80 = sample_summary(samp80, Age, mean)
means320 = sample_summary(samp320, Age, mean)
means_all = rbind(means20, means80, means320)
ggplot(means_all) + geom_histogram(aes(x=mean), bins=20) + facet_wrap(~Size)

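What do you notice about the distribution of the sample mean? How does that distribution change when the sample size increases? The sample means are centred on the population mean age of about 28.5 years for every sample size. As the sample size increases, the sample means cluster more tightly around that value, so a larger sample gives a more precise estimate of the population mean.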
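The same pattern can be checked against the formula for the standard error of a sample mean, sigma/sqrt(n). The sketch below uses a made-up, right-skewed set of ages (pop_age) rather than the Massey data, and a hypothetical helper sd_of_means(), so treat it as an illustration of the idea only.

```r
set.seed(7)  # only so the example is reproducible

# Made-up, right-skewed "ages" with a mean near 28.5; not the Massey data.
pop_age <- 18 + rgamma(20000, shape = 1.5, scale = 7)

# Standard deviation of the means of `reps` random samples of size n.
sd_of_means <- function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(pop_age, n))))
}

sizes <- c(20, 80, 320)
simulated <- sapply(sizes, sd_of_means)
theory <- sd(pop_age) / sqrt(sizes)  # sigma / sqrt(n)

data.frame(Size = sizes, simulated = simulated, theory = theory)
```

The simulated and theoretical columns should be close to each other, and both should shrink as the sample size grows.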

What is your conclusion from all this? What can you say about how you’d expect sample means, or sample proportions to be distributed when repeatedly sampling from a population? How might this help you if you have just one sample mean or proportion, say of size 80? Could it be used to give a measure of precision maybe? Discuss this with those around you, and write some notes below.

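Sample means and sample proportions are distributed around the corresponding population value, so on average a sample gives the right answer. The larger the sample size, the less the estimates vary from sample to sample, and the smaller the margin of error. With just one sample, say of size 80, we can use this known sample-to-sample variation to attach a measure of precision to our estimate.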
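One common way to turn this into a measure of precision is a normal-approximation 95% interval for a proportion: the estimate plus or minus 1.96 standard errors. The sketch below uses a made-up sample of 44 females out of 80 students. It is one standard approach under that approximation, not necessarily the method this course goes on to use.

```r
# Hypothetical single sample: 44 females out of 80 students (made-up counts).
n <- 80
p_hat <- 44 / n

se_hat <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
margin <- 1.96 * se_hat                  # approximate 95% margin of error

c(estimate = p_hat, lower = p_hat - margin, upper = p_hat + margin)
```

Here the margin of error is roughly 0.11, so this single sample would suggest a population proportion somewhere between about 0.44 and 0.66.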