Sampling

20180210102800

Sampling distribution

Represents a small portion of the population, from which it is possible to make inferences about the population.

A sample of the population is a portion of it, say:

n = 10 (where n is the sample size and the observations are chosen at random)

In this case:

Sample mean = estimate of the population mean.
(This mean may not match the population mean due to sampling error.)

However, if we take multiple samples of the same size, those sample means will bounce around the population mean, and eventually they take on the shape of a distribution.

That distribution of sample means is a normal distribution.

It is called a sampling distribution.

Sampling distribution: a distribution of mean values.
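
A quick sketch of this in R (the population here is made up, and deliberately not normal):

    set.seed(42)
    population <- runif(50000, min = 0, max = 100)

    # Means of 2000 samples of size 30: their histogram is roughly bell-shaped
    sample.means <- replicate(2000, mean(sample(population, 30)))
    hist(sample.means)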

20180204093400

What is a Confidence Interval?

Basically, the confidence interval has to do with the sample and its relationship to the population from which the sample was drawn.

We know there is such a thing as sampling error, since the mean of each sample we draw from a population will not necessarily equal the mean of the population as a whole. But it may be close.

The idea is to define a range of values, based on our sample, that helps us locate the population mean.
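
As a quick illustration with a made-up sample, t.test() reports such a range (a 95% confidence interval by default):

    # Hypothetical sample of 15 measurements
    x <- c(18, 22, 19, 21, 20, 23, 17, 19, 22, 20, 18, 21, 24, 19, 20)

    # A range of values, built from the sample, within which we expect
    # the population mean to lie
    t.test(x)$conf.int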

20180204093401

Basic Principle

From the figure below we know that:

  • For any population (POPULATION)
  • We can draw several samples of different sizes (S1, S2, S3)
  • Each of them will have sampling error
  • We also know that the means of these samples will not be equal (X1, X2, X3).
  • We also know that the standard deviations will differ as well (Sd1, Sd2, Sd3), as the sketch after this list illustrates.
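
A minimal sketch of this in R (the population here is just a made-up normal distribution):

    set.seed(1)
    population <- rnorm(10000, mean = 20, sd = 4)

    # Three samples of different sizes drawn from the same population
    s1 <- sample(population, 10)
    s2 <- sample(population, 25)
    s3 <- sample(population, 50)

    c(mean(s1), mean(s2), mean(s3))   # X1, X2, X3 are not identical
    c(sd(s1), sd(s2), sd(s3))         # Sd1, Sd2, Sd3 are not identical either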

20180204093402

Understanding the 95% Confidence Interval

Let's look at an example to understand what a confidence interval is.

We will use 95% confidence intervals.

Say we have a population from which we draw 20 samples, each with 15 records.

The chart below shows the distribution of these 20 samples with a 95% confidence interval.

The green line represents the true mean of this population, which is close to 20.

Note that only one of the 20 samples missed that mean.

Note that:

19/20 = 0.95 = 95%

So we can say that 95% of the samples from the population capture the true population mean. That is no coincidence.
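
A small simulation sketch of the same idea (a made-up normal population with a true mean of 20; t.test() is just a convenient way to get each sample's 95% interval):

    set.seed(2)
    true.mean <- 20

    # Draw 20 samples of 15 records each, build a 95% CI for each one,
    # and check whether the interval contains the true population mean
    captured <- replicate(20, {
      s  <- rnorm(15, mean = true.mean, sd = 4)
      ci <- t.test(s, conf.level = 0.95)$conf.int
      ci[1] <= true.mean & true.mean <= ci[2]
    })
    sum(captured)   # on average about 19 of the 20 intervals capture the true mean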

20180204093403

Understanding the 85% Confidence Interval

Let's look at another example, now with an 85% confidence interval.

Note that in the chart below, 3 of the 20 samples (with an 85% confidence interval) missed the population mean line.

Again we can do the calculation.

17/20 = 0.85 = 85%

So we can say that 85% of the samples from the population capture the true population mean. Again, this is no coincidence.

20180210103200

How to calculate the confidence interval?

Given:

    sample.mean        = x   # mean of the sample
    standard.deviation = y   # standard deviation of the population
    sample.size        = z   # number of observations in the sample

    # Bounds of the 95% confidence interval (1.96 is the z value for 95%)
    lower.ci <- sample.mean - (1.96 * standard.deviation) / sqrt(sample.size)
    upper.ci <- sample.mean + (1.96 * standard.deviation) / sqrt(sample.size)

    round(lower.ci, 2)
    round(upper.ci, 2)
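
For instance, plugging in some made-up numbers (a hypothetical sample mean of 20.4, a known population standard deviation of 4, and a sample of 25 observations):

    sample.mean        <- 20.4
    standard.deviation <- 4
    sample.size        <- 25

    lower.ci <- sample.mean - (1.96 * standard.deviation) / sqrt(sample.size)
    upper.ci <- sample.mean + (1.96 * standard.deviation) / sqrt(sample.size)

    round(c(lower.ci, upper.ci), 2)   # roughly 18.83 and 21.97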

20180204093404

Summing Up

So the key to understanding confidence intervals is knowing that if we drew an infinite number of samples of the same size from a population, and for each of those samples computed its mean and confidence interval, those confidence intervals would indeed capture the true population mean, in the stated percentage of cases.

If we use a 95% confidence interval, then the intervals of 95% of the samples will contain the population mean.

If we use an 85% confidence interval, then the intervals of 85% of the samples will contain the population mean.

This is where the usual way of explaining a confidence interval comes from.

“When we use a 95% confidence interval, we are 95% confident that the true population mean lies within the range of values built from our sample. We are 95% confident because we know that 5% of the samples will in fact miss the population mean.”

In the end, it is a matter of balance between confidence and precision.

The more confident I am (95%), the less precise my range will be, because the plotted interval will be wider (the larger the spread around the sample mean).

The more precise I am (85%, a smaller confidence interval), the less confident I become, since now I can only guarantee that 85% of the sample intervals actually contain the true population mean.

In other words:

Confidence: the more confident I am, the less precise my range is; the larger the spread around my sample mean.

Precision: the more precise I am (a smaller confidence interval), the less confident I am that my particular interval captures the actual population mean.
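
A small numeric sketch of this trade-off (hypothetical population standard deviation of 4 and n = 25; qnorm() gives the z value for each confidence level):

    sd.pop <- 4
    n      <- 25

    z95 <- qnorm(0.975)   # ~1.96, z value for a 95% interval
    z85 <- qnorm(0.925)   # ~1.44, z value for an 85% interval

    # Full width of each interval around the sample mean: 2 * z * sd / sqrt(n)
    2 * z95 * sd.pop / sqrt(n)   # wider    -> more confident, less precise
    2 * z85 * sd.pop / sqrt(n)   # narrower -> more precise, less confident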

20180205102000

What is the Central Limit Theorem?

Regardless of the population shape (even if skewed), and regardless of the sample shape, the sampling distribution (the distribution of sample means) will approach a normal distribution. The greater the size of each sample, the more normal the sampling distribution will be.

So we can assume that a distribution of means (a sampling distribution) can use the normal distribution and its properties. It will have a stable, symmetrical shape, with more instances in the middle, trailing off equally to either side.

A z-distribution is also known as the normal distribution.

When examining the distribution of thousands of sample means of the same size from the same population, normality ensues, regardless of what the population shape or the sample looks like. The spread of those mean values around their central core, the population mean, gets smaller and smaller as the size within each of the samples gets larger, because the means of larger samples will more often fall closer to the actual population value.

When we take a sample from a population, we know that this sample mean may not match the mean of the population, due to sampling error.

But we also know that, if we draw multiple samples, each of the same size, from the same population, the mean values we get will in fact bounce around the true population mean.

Eventually, they will take the shape of a normal distribution of sample means.

A question we can answer with a z-test when we have a z-distribution: how likely is it that this particular sample, with this particular mean value, came from a population with some proposed mean value?
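
A minimal sketch of such a z-test (all numbers are hypothetical; the population standard deviation is assumed known):

    xbar  <- 21.1   # mean of our particular sample
    mu0   <- 20     # proposed population mean
    sigma <- 4      # population standard deviation (assumed known)
    n     <- 25

    z <- (xbar - mu0) / (sigma / sqrt(n))   # how many standard errors xbar is from mu0
    2 * pnorm(-abs(z))                      # two-sided p-value under the z-distribution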

A histogram of many sample means is a picture of the sampling distribution.

The standard deviation of the sampling distribution is also defined by the Central Limit Theorem.

The standard deviation of the sampling distribution is also known as the Standard Error. The larger the sample (n), the smaller the standard error.

Therefore it makes sense that a sample with a small amount of data in it, a low n, may have a hard time estimating the population mean, while a large sample does a better job. In other words, the amount of error around the central value of the sampling distribution will be smaller, as reflected by a smaller standard error. That is the Central Limit Theorem.

For Example…

If the population sd = 30251

A small sample, n = 10 in our case, will have:

n = 10
standard error = 9578
Internal spread = 19156 = 2 * standard error = 2 x 9578

So, 68% of the sample means of size 10 will have an internal spread of around 19156. That is the distance between one standard error below the mean and one standard error above the mean.

For a sample of n = 30:

n = 30
standard error = 5530
Internal spread = 11060 = 2 * standard error = 2 x 5530

So, 68% of the sample means of size 30 will have an internal spread of around 11060. That is the distance between one standard error below the mean and one standard error above the mean.
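
A simulation sketch of the same effect (a made-up, deliberately skewed population; the spread of the sample means, i.e. the standard error, shrinks as n grows):

    set.seed(3)
    population <- rexp(100000, rate = 1/20)   # skewed population with mean around 20

    # Standard deviation of many sample means = the standard error for that n
    se.for <- function(n) sd(replicate(2000, mean(sample(population, n))))
    se.for(10)
    se.for(30)
    se.for(100)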

20180210100801

Definition

Standard error is the standard deviation of a sampling distribution.

The standard error of the mean, also called the standard deviation of the mean, is used to estimate the standard deviation of the sampling distribution of the mean.

To understand this, first we need to understand why a sampling distribution is required.

As an example, consider an experiment that measures the speed of sound in a material along the three directions (along x, y and z coordinates).

By taking the mean of these values, we can get the average speed of sound in this medium.

Mathematically, the standard error of the mean formula is given by:

  Standard Error = Standard Deviation / sqrt(sample size)

It can be seen from the formula that the standard error of the mean decreases as the sample size N increases.

This is expected, because if the mean at each step is calculated using a lot of data points, then a small deviation in one value will have less effect on the final mean.

The standard error of the mean tells us how the mean varies across different experiments measuring the same quantity.

Thus, if the effect of random changes is significant, the standard error of the mean will be higher. If there is no change in the data points as experiments are repeated, then the standard error of the mean is zero.
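
For example, for a hypothetical vector of three speed measurements (one per direction), the standard error of the mean is simply:

    speeds <- c(340.1, 339.8, 340.5)    # made-up speed-of-sound readings along x, y, z
    sd(speeds) / sqrt(length(speeds))   # standard error of the mean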

20180210101401

Load a dataset

    library(SDSFoundations)
    survey = StudentSurvey
    str(survey)
## 'data.frame':    379 obs. of  17 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender        : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 1 1 1 1 ...
##  $ age           : int  17 19 18 19 19 19 18 18 19 19 ...
##  $ classification: Factor w/ 5 levels "Freshman","Junior",..: 1 5 1 5 5 5 1 1 5 5 ...
##  $ name_letters  : int  4 8 6 4 7 5 5 8 8 5 ...
##  $ happy         : int  80 76 50 75 89 90 57 60 75 100 ...
##  $ concerts      : int  2 15 3 0 1 1 0 0 1 4 ...
##  $ hair_color    : Factor w/ 4 levels "black","blond",..: 1 1 3 1 3 1 1 3 1 1 ...
##  $ own_shoes     : int  108 42 6 10 13 12 12 6 15 25 ...
##  $ greek         : Factor w/ 2 levels "no","yes": 2 1 1 2 1 2 1 1 1 1 ...
##  $ live_campus   : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 2 2 1 ...
##  $ roomates      : int  1 1 1 1 2 3 1 1 1 3 ...
##  $ austin        : num  10 10 10 8 9 10 7 8 8 8 ...
##  $ birth_month   : int  11 12 11 5 5 12 8 9 6 3 ...
##  $ commute       : Factor w/ 6 levels "bicycle","bus",..: 6 3 6 2 2 6 6 2 6 6 ...
##  $ car           : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 2 2 ...
##  $ sport         : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 2 ...

20180210101402

Calculate the population parameters

    hist(survey$name_letters)

    fivenum(survey$name_letters)
## [1]  2  5  6  7 10
    mean(survey$name_letters)
## [1] 5.970976
    sd(survey$name_letters)
## [1] 1.49486

20180210101403

Draw 1000 samples of n = 5 and find the mean of each sample

    number.of.samples <- 1000
    number.of.items.in.each.sample <- 5

    # Vector with 1000 places, one for the mean of each sample
    xbar5 <- rep(NA, number.of.samples)

    # Loop to fill each position of the vector with the mean of a different sample
    set.seed(221)
    for (i in 1:number.of.samples) {
      x <- sample(survey$name_letters, size = number.of.items.in.each.sample)
      xbar5[i] <- mean(x)
    }

20180210101404

Graph the histogram of the sample means

    hist(xbar5, xlim = c(2, 10))

    # Plot the mean of the real population
    abline(v = mean(survey$name_letters), lwd = 2, col = 'green')

    # Plot the mean of the sampling distribution (the distribution of sample means)
    abline(v = mean(xbar5), lwd = 2, col = 'red')

    # Observe that they almost overlap

20180210101405

Calculate the mean and sd of the sampling distribution

These are the actual mean and standard deviation of the sampling distribution you created from the population.

    mean(xbar5)
## [1] 5.978
    sd(xbar5)
## [1] 0.6646787

20180210101406

Calculate the Standard Error

See the formula for the standard error above. Use it to calculate the standard error.

    sd(survey$name_letters) / sqrt(5)
## [1] 0.6685215

Observe that the standard error CALCULATED VIA THE FORMULA is very close to the standard deviation of the sample means MEASURED above (sd(xbar5)).
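
As a follow-up sketch, repeating the same experiment with a larger sample size (n = 30) should show a noticeably smaller standard error, both measured and via the formula:

    xbar30 <- rep(NA, 1000)
    set.seed(221)
    for (i in 1:1000) {
      x <- sample(survey$name_letters, size = 30)
      xbar30[i] <- mean(x)
    }

    sd(xbar30)                           # measured standard error of the sample means
    sd(survey$name_letters) / sqrt(30)   # standard error from the formula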