Introduction to Statistics

Basic Concept

All statistical models are of the form \[ outcome_{i} = (Model)+error_{i} \]
The mean is the simplest model (a default model) when no other information is available
A deviation is the difference between the mean and the actual data point
A deviation is also called an error
More generally, an error is the difference between the value predicted by the model and the actual data point
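
As a minimal sketch of the mean acting as a model, the R snippet below uses a small made-up vector `scores` (hypothetical, not taken from these notes). It computes each error as outcome minus model value and checks that model plus error reproduces every observation.

```r
# Hypothetical data, chosen only for illustration
scores <- c(2, 4, 4, 5, 10)

# The simplest model: predict every observation with the mean
model_value <- mean(scores)

# Error (deviation) of each observation from the model
errors <- scores - model_value

# outcome_i = model + error_i holds for every observation
reconstructed <- model_value + errors

print(model_value)   # 5
print(errors)        # -3 -1 -1 0 5
print(sum(errors))   # 0: deviations from the mean cancel out
```

Because the raw deviations always sum to zero, they cannot measure overall fit on their own, which is why they are squared before being summed.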

The sum of squares, variance, and standard deviation represent the same thing:
- The 'fit' of the mean to the data
- The variability in the data
- How well the mean represents the observed data
- Error
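
The three quantities are built from one another, which a short R sketch can make concrete. The vector `scores` is again made-up illustrative data, and the N - 1 denominator is the usual sample-variance convention that R's `var()` also uses.

```r
# Hypothetical data, chosen only for illustration
scores <- c(2, 4, 4, 5, 10)

# Errors around the mean
errors <- scores - mean(scores)

# Sum of squares: total squared error of the mean as a model
ss <- sum(errors^2)

# Variance: average squared error, using N - 1 in the denominator
variance <- ss / (length(scores) - 1)

# Standard deviation: square root of the variance, in the units of the data
std_dev <- sqrt(variance)

print(c(ss = ss, variance = variance, std_dev = std_dev))  # 36, 9, 3
```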

A test statistic is a statistic for which the frequency of particular values is known
Observed values of a test statistic can be used to test hypotheses
\[ \text{test statistic}=\frac{\text{variance explained by the model}}{\text{variance NOT explained by the model}} \]
\[ \text{test statistic}=\frac{\text{effect}}{\text{error}} \]
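
One common instance of the effect/error ratio is the one-sample t statistic. The sketch below is only an illustration: the sample `x` and the hypothesised mean `mu0` are made-up values. It takes the "error" to be the standard error of the mean, which these notes introduce next.

```r
# Hypothetical sample and hypothesised population mean (both are assumptions)
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1)
mu0 <- 5

# Effect: how far the sample mean is from the hypothesised mean
effect <- mean(x) - mu0

# Error: standard error of the mean
error <- sd(x) / sqrt(length(x))

# Test statistic = effect / error
t_stat <- effect / error

# Cross-check against R's built-in one-sample t-test
builtin <- unname(t.test(x, mu = mu0)$statistic)
print(c(manual = t_stat, builtin = builtin))
```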

Standard error is the standard deviation of the sampling distribution of a statistic
Suppose we are trying to make an inference about the average weight of people in a city
Suppose we take a random sample of 10 people and compute the mean
If we repeat this process 500 times (say), then we can create a histogram of the 500 sample means
The grand mean of all 500 sample means will be close to the actual (unobserved) population mean
The standard deviation of the 500 sample means is called the Standard Error of the Mean (SME) denoted by \( \sigma_{\bar{X}} \)
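
The repeated-sampling idea can be simulated directly. The sketch below assumes a made-up population of weights (Normal with mean 70 kg and standard deviation 12 kg; both numbers are assumptions, not data from these notes). It compares the standard deviation of the 500 sample means with \( \sigma_{population}/\sqrt{N} \).

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of weights in kg (mean and sd are assumptions)
population <- rnorm(100000, mean = 70, sd = 12)

n_samples <- 500
sample_size <- 10

# Draw 500 random samples of size 10 and keep the mean of each
sample_means <- replicate(n_samples, mean(sample(population, sample_size)))

# Grand mean of the sample means vs. the population mean
grand_mean <- mean(sample_means)

# Empirical SME vs. the theoretical value sigma / sqrt(N)
sme_empirical <- sd(sample_means)
sme_theoretical <- sd(population) / sqrt(sample_size)

print(c(grand_mean = grand_mean, population_mean = mean(population)))
print(c(sme_empirical = sme_empirical, sme_theoretical = sme_theoretical))
```

Calling `hist(sample_means)` afterwards draws the histogram of the 500 sample means.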

In practice, we are not going to draw 500 different samples (each of size 10)
Instead, just one sample is drawn (of size 10 in our example) and the SME is estimated from it
As a rule of thumb, when the sample size is bigger than 30, the CLT says the distribution of sample means is approximately Normal
The SME indicates the “unreliability” of a sample mean, i.e. how much sample means spread around the actual population mean
A small SME indicates that most sample means are similar to the population mean, so our sample is likely to be representative of the population

\( SME=\sigma_{\bar{X}} = \sigma_{population}/\sqrt{N} \) where N is the sample size
However, \( \sigma_{population} \) is usually not known
But it can be approximated with the standard deviation \( s \) of the single sample (the 10 numbers in our example); the approximation is good when the sample size is big (>30)
\( SME=\sigma_{\bar{X}} \approx s/\sqrt{N} \)

When the sample size is small (<30), the test statistic follows a t-distribution
For larger samples, it approaches the Normal distribution
The t-distribution incorporates more uncertainty (and hence gives a wider confidence interval) than the Normal distribution
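
A quick way to see the extra uncertainty is to compare two-sided 95% critical values. The sketch below uses R's `qnorm()` and `qt()`; the particular degrees of freedom listed are arbitrary choices for illustration.

```r
# Two-sided 95% critical value for the Normal distribution (about 1.96)
z_crit <- qnorm(0.975)

# Matching critical values for the t-distribution at several degrees of freedom
dfs <- c(5, 10, 20, 30, 100)
t_crit <- qt(0.975, df = dfs)

print(round(z_crit, 3))
print(data.frame(df = dfs, t_crit = round(t_crit, 3)))
```

Every t critical value is larger than the Normal one, and the gap shrinks as the degrees of freedom grow, so intervals based on the t-distribution are wider for small samples.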

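data1 <- c(18,16,18,24,23,22,22,23,26,29,32,34,34,36,36,43,42,49,46,46,57)
sampleSize <- length(data1)  # 21 observations
# Sample variance (named sampleVar so the name does not shadow the var() function)
sampleVar <- var(data1); sampleVar

[1] 134.2619

# Sample standard deviation
s <- sqrt(sampleVar); s

[1] 11.58714

# Sample mean (named sampleMean so the name does not shadow the mean() function)
sampleMean <- mean(data1); sampleMean

[1] 32.19048

# Standard error of the mean (SME)
SME <- s/sqrt(sampleSize); SME

[1] 2.528522

# 95% CI using the t-distribution because of the small sample size
# 2.09 is approximately qt(0.975, df = sampleSize - 1), the critical value for 20 degrees of freedom
lower.boundary <- sampleMean - 2.09*SME
upper.boundary <- sampleMean + 2.09*SME
lower.boundary; upper.boundary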

[1] 26.90586

[1] 37.47509
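
The hard-coded multiplier 2.09 is a rounded table value. As a sketch of an alternative, the snippet below recomputes the interval with the exact critical value from `qt()` and cross-checks it against R's built-in `t.test()`. The variable names are illustrative choices, not part of the original transcript.

```r
data1 <- c(18,16,18,24,23,22,22,23,26,29,32,34,34,36,36,43,42,49,46,46,57)

n <- length(data1)
se <- sd(data1) / sqrt(n)          # standard error of the mean

# Exact two-sided 95% critical value for n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Confidence interval: mean plus/minus critical value times standard error
ci_manual <- mean(data1) + c(-1, 1) * t_crit * se

# Same interval from the built-in one-sample t-test
ci_builtin <- as.numeric(t.test(data1, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_builtin)
```

Since the exact critical value is slightly below 2.09, this interval is marginally narrower than the one computed with the rounded multiplier.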