Sampling distribution
A sample represents a small portion of the population, from which it is possible to make inferences about the whole population.
A sample of the population would be a portion of it, say:
n = 10 (where the n observations are randomly chosen)
In this case:
Sample mean = estimate of the population mean.
(This mean may not match the population mean due to sampling error.)
However, if we take multiple samples of the same size, those sample means will bounce around the population mean and eventually take on the shape of a distribution.
That distribution of sample means is a normal distribution.
It is called a sampling distribution.
Sampling distribution: A distribution of mean values.
What is a Confidence Interval?
Basically, a confidence interval has to do with a sample and its relationship to the population from which the sample was drawn.
We know there is such a thing as sampling error, since the mean of each sample we draw from a population will not necessarily be equal to the mean of the population as a whole, although it may be close.
The idea is to define a range of values, based on our sample, that helps us find where the population mean is.
Basic Principle
From the figure below we know that:
Understanding the 95% Confidence Interval
Let's look at an example to understand what a confidence interval is.
We will use 95% confidence intervals.
Let's say we have a population from which we draw 20 samples, each with 15 records.
The chart below shows the distribution of these 20 samples with a 95% confidence interval.
The green line represents the true mean of this population, which is close to 20.
Note that only one of the 20 samples missed the population mean.
Note that:
19/20 = 0.95 = 95%
So we can say that the confidence intervals of 95% of the samples capture the true population mean. This is no coincidence.
Understanding the 85% Confidence Interval
Let's look at another example, this time with an 85% confidence interval.
Note that in the chart below, 3 of the 20 samples (with an 85% confidence interval) missed the population mean line.
Again we can do the calculation:
17/20 = 0.85 = 85%
So we can say that the confidence intervals of 85% of the samples capture the true population mean. Again, this is no coincidence.
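To make this concrete, here is a minimal R sketch (the simulated population, its mean of 20 and standard deviation of 5 are assumptions for illustration, not the data behind the charts above) that draws 20 samples of size 15, builds a confidence interval around each sample mean, and counts how many of those intervals capture the true population mean. With only 20 samples the exact counts vary from run to run, but they hover around 19/20 for 95% intervals and 17/20 for 85% intervals.

set.seed(42)
population.mean <- 20     # assumed true population mean (close to the green line above)
population.sd <- 5        # assumed population standard deviation
sample.size <- 15
number.of.samples <- 20

count.hits <- function(z) {
  hits <- 0
  for (i in 1:number.of.samples) {
    s <- rnorm(sample.size, mean = population.mean, sd = population.sd)
    margin <- z * population.sd / sqrt(sample.size)
    # Does this sample's interval capture the true population mean?
    if (mean(s) - margin <= population.mean & population.mean <= mean(s) + margin) {
      hits <- hits + 1
    }
  }
  hits
}

count.hits(1.96)   # 95% confidence intervals: around 19 of the 20 capture the mean
count.hits(1.44)   # 85% confidence intervals: around 17 of the 20 capture the mean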
How to calculate the confidence interval?
Given:
sample.mean = x          # mean of the sample
standard.deviation = y   # standard deviation of the population
sample.size = z
The bounds of a 95% confidence interval (1.96 is the z value for 95% confidence) are:
lowerCI = sample.mean - ((1.96 * standard.deviation) / sqrt(sample.size))
upperCI = sample.mean + ((1.96 * standard.deviation) / sqrt(sample.size))
round(lowerCI, 2)
round(upperCI, 2)
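As a quick worked example, with made-up numbers (a sample mean of 5.97, a population standard deviation of 1.49 and a sample size of 25, none of which come from the notes above):

sample.mean <- 5.97          # assumed sample mean
standard.deviation <- 1.49   # assumed population standard deviation
sample.size <- 25            # assumed sample size

lowerCI <- sample.mean - ((1.96 * standard.deviation) / sqrt(sample.size))
upperCI <- sample.mean + ((1.96 * standard.deviation) / sqrt(sample.size))

round(lowerCI, 2)   # 5.39
round(upperCI, 2)   # 6.55

So, with these numbers, we would be 95% confident that the population mean lies somewhere between 5.39 and 6.55.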
Summing up
Therefore, the key to understanding confidence intervals is knowing that, if we drew an infinite number of same-sized samples from a population and calculated the mean and confidence interval for each of those samples, a predictable proportion of those confidence intervals would actually capture the true population mean.
If we use a 95% confidence interval, then 95% of the samples' intervals will contain the population mean.
If we use an 85% confidence interval, then 85% of the samples' intervals will contain the population mean.
This is where the usual way of explaining a confidence interval comes from.
“When we use a 95% confidence interval, we are 95% confident that the true population mean lies within the range of values built from our sample. We are 95% confident because we know that 5% of samples will in fact miss the population mean.”
In the end, it is a matter of balancing confidence and precision.
The more confident I am (95%), the less precise my range will be, because the interval will be wider (a larger spread around the sample mean).
The more precise I am (85%, a narrower confidence interval), the less confident I am, since now only 85% of the sample intervals are guaranteed to contain the true population mean.
In other words:
Confidence: the more confident I am, the less precise my range is, and the larger the spread around my sample mean.
Precision: the more precise I am (the narrower the confidence interval), the less confident I am that my particular interval captures the actual population mean.
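A minimal sketch of this trade-off, reusing the same made-up numbers as above (sample mean 5.97, population standard deviation 1.49, sample size 25): the 95% interval uses a larger z value (about 1.96) than the 85% interval (about 1.44), so it is wider but more likely to contain the population mean.

sample.mean <- 5.97
standard.deviation <- 1.49
sample.size <- 25

margin.95 <- qnorm(0.975) * standard.deviation / sqrt(sample.size)  # z is about 1.96
margin.85 <- qnorm(0.925) * standard.deviation / sqrt(sample.size)  # z is about 1.44

c(sample.mean - margin.95, sample.mean + margin.95)  # about 5.39 to 6.55: wider, more confident
c(sample.mean - margin.85, sample.mean + margin.85)  # about 5.54 to 6.40: narrower, more precise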
What is the Central Limit Theorem?
Regardless of the population shape (skewed), regardless of the sample shape, the sampling distribution (distribution of sample means) will approach normal. The greater the size of each sample, the more normal the sampling distribution will be.
So a distribution of means (a sampling distribution) can make use of the normal distribution and its properties: it will have a stable, symmetrical shape, with more instances in the middle, trailing off equally to either side.
The z-distribution is another name for the standard normal distribution.
When examining the distribution of thousands of sample means of the same size from the same population, normality ensues, regardless of what the population or the individual samples look like. The spread of those mean values around their central core, the population mean, gets smaller and smaller as the size of each sample gets larger, because the means of larger samples fall closer to the actual population value more often.
When we take a sample from a population we know that this sample mean may not match the mean of the population due to sampling error.
But we also know that, if we draw multiple samples of the same size from the same population, the mean values we get will in fact bounce around the true population mean.
Eventually, they will take the shape of a normal distribution of sample means.
A question we can answer with a z-test, once we have a z-distribution: how likely is it that this particular sample, with this particular mean value, comes from a population with some proposed mean value?
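A minimal R sketch of the Central Limit Theorem described above (the exponential population and the sample sizes are assumptions chosen only for illustration): even though the population is strongly skewed, the histogram of the sample means looks more and more normal as the size of each sample grows.

set.seed(123)
population <- rexp(100000, rate = 1)   # a strongly right-skewed population

# Draw many samples of a given size and return the vector of their means
sampling.distribution <- function(sample.size, number.of.samples = 1000) {
  replicate(number.of.samples, mean(sample(population, size = sample.size)))
}

par(mfrow = c(1, 3))
hist(population, main = "Skewed population")
hist(sampling.distribution(5), main = "Means of samples, n = 5")
hist(sampling.distribution(30), main = "Means of samples, n = 30")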
A histogram of the sample means is how we visualize a sampling distribution.
The standard deviation of the sampling distribution is also described by the Central Limit Theorem.
The standard deviation of the sampling distribution is also known as the standard error. The larger the sample size (n), the smaller the standard error.
Therefore, it makes sense that a sample with little data in it (a low n) may have a hard time estimating the population mean, while a large sample does a better job. In other words, the amount of error around the central value of the sampling distribution will be smaller, as reflected by a smaller standard error. That is the Central Limit Theorem.
For Example…
If the population sd = 30251
A small sample, n = 10 in our case, will have:
n = 10
standard error = 9578
Internal spread = 19156 = 2 * standard error = 2 x 9578
So, 68% of the sample means of size 10 will fall within an internal spread of around 19156. That is the distance between one standard error below the mean and one standard error above the mean.
For a sample n=30
n = 30
standard error = 5530
Internal spread = 11060 = 2 * standard error = 2 x 5530
So, 68% of the sample means of size 30 will fall within an internal spread of around 11060. That is the distance between one standard error below the mean and one standard error above the mean.
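These figures can be checked with the standard error formula (standard error = population sd / sqrt(n)); recomputing gives values close to, though not exactly, the rounded numbers quoted above, presumably because of rounding in the original example.

population.sd <- 30251

population.sd / sqrt(10)       # about 9566, close to the 9578 quoted for n = 10
population.sd / sqrt(30)       # about 5523, close to the 5530 quoted for n = 30

2 * population.sd / sqrt(10)   # internal spread for n = 10, about 19132
2 * population.sd / sqrt(30)   # internal spread for n = 30, about 11046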
Definition
Standard error is the standard deviation of a sampling distribution.
The standard error of the mean, also called the standard deviation of the mean, is used to estimate the standard deviation of a sampling distribution.
To understand this, first we need to understand why a sampling distribution is required.
As an example, consider an experiment that measures the speed of sound in a material along the three directions (along x, y and z coordinates).
By taking the mean of these values, we can get the average speed of sound in this medium.
Mathematically, the standard error of the mean formula is given by:
Standard Error = Standard Deviation / sqrt(sample size)
It can be seen from the formula that the standard error of the mean decreases as N increases.
This is expected because if the mean at each step is calculated using a lot of data points, then a small deviation in one value will cause less effect on the final mean.
The standard error of the mean tells us how the mean varies with different experiments measuring the same quantity.
Thus, if the effect of random changes is significant, the standard error of the mean will be higher. If there is no change in the data points as experiments are repeated, the standard error of the mean is zero.
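A tiny sketch of the speed-of-sound example above (the three measured values are invented for illustration): the standard error of their mean is simply the standard deviation of the measurements divided by the square root of the number of measurements.

speeds <- c(5960, 5975, 5940)         # hypothetical speeds measured along x, y and z (m/s)
mean(speeds)                          # average speed of sound in the medium
sd(speeds) / sqrt(length(speeds))     # standard error of that mean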
Step by Step Calculation of the Standard Error
Load a dataset
library(SDSFoundations)
survey = StudentSurvey
str(survey)
## 'data.frame': 379 obs. of 17 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 1 1 1 1 ...
## $ age : int 17 19 18 19 19 19 18 18 19 19 ...
## $ classification: Factor w/ 5 levels "Freshman","Junior",..: 1 5 1 5 5 5 1 1 5 5 ...
## $ name_letters : int 4 8 6 4 7 5 5 8 8 5 ...
## $ happy : int 80 76 50 75 89 90 57 60 75 100 ...
## $ concerts : int 2 15 3 0 1 1 0 0 1 4 ...
## $ hair_color : Factor w/ 4 levels "black","blond",..: 1 1 3 1 3 1 1 3 1 1 ...
## $ own_shoes : int 108 42 6 10 13 12 12 6 15 25 ...
## $ greek : Factor w/ 2 levels "no","yes": 2 1 1 2 1 2 1 1 1 1 ...
## $ live_campus : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 2 2 1 ...
## $ roomates : int 1 1 1 1 2 3 1 1 1 3 ...
## $ austin : num 10 10 10 8 9 10 7 8 8 8 ...
## $ birth_month : int 11 12 11 5 5 12 8 9 6 3 ...
## $ commute : Factor w/ 6 levels "bicycle","bus",..: 6 3 6 2 2 6 6 2 6 6 ...
## $ car : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 2 2 ...
## $ sport : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 2 ...
Calculate the population parameters
hist(survey$name_letters)
fivenum(survey$name_letters)
## [1] 2 5 6 7 10
mean(survey$name_letters)
## [1] 5.970976
sd(survey$name_letters)
## [1] 1.49486
Draw many samples of the same size and find the mean of each sample (here, 1000 samples of n = 5)
number.of.samples = 1000
number.of.items.in.each.sample = 5
# Vector with 1000 slots, one for each sample mean
xbar5 <- rep(NA, number.of.samples)
set.seed(221)
# Loop that fills each position of the vector with the mean of a different sample
for (i in 1:number.of.samples) {
  x <- sample(survey$name_letters, size = number.of.items.in.each.sample)
  xbar5[i] <- mean(x)
}
Graph the histogram of the sample means
hist(xbar5, xlim = c(2, 10))
# Plot the mean of the real population
abline(v = mean(survey$name_letters), lwd = 2, col = 'green')
# Plot the mean of the sampling distribution (the distribution of sample means)
abline(v = mean(xbar5), lwd = 2, col = 'red')
# Observe that they almost overlap
Calculate the mean and sd of the sampling distribution
This value is the actual value of the mean and standard deviation of the distribution you created based on the population.
mean(xbar5)
## [1] 5.978
sd(xbar5)
## [1] 0.6646787
Calculate the Standard Error
See the formula for the standard error above. Use it to calculate the standard error:
sd(survey$name_letters)/sqrt(5)
## [1] 0.6685215
Observe that the standard error calculated via the formula is very close to the measured standard deviation of the sampling distribution above (sd(xbar5)).