Introduction to Statistics

Topics

Basic Concept

Basic Concept

  • All statistical models are of the form \[ outcome_{i} = (Model)+error_{i} \]
  • The mean is the simplest model (the default model) when no other information about the outcome is available
  • A deviation is the difference between the actual data point and the mean
  • Deviation is also called Error
  • In a more general sense, error is the difference between the model value and the actual data point

Sum of Squared Errors

  • Sum of squared errors (SSE) assesses the magnitude of total error
  • Its value depends on the number of points
  • It increases as the number of points increases
  • Mean Squared Error (MSE) averages the SSE, so it does not grow with the number of points
  • \( MSE = SSE/(N-1) \) (in case of sample data)
  • When the model is the mean, MSE is the same as variance
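
The relationship between SSE, MSE and variance can be checked numerically. The following is a minimal sketch in R; the vector `x` is hypothetical data chosen only for illustration and does not come from the slides.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 8)
n <- length(x)

# The mean acts as the model; each deviation from it is an error
errors <- x - mean(x)

# Sum of squared errors: total magnitude of error, grows with more points
sse <- sum(errors^2)

# Dividing by N - 1 (sample data) gives the mean squared error
mse <- sse / (n - 1)

# TRUE when the MSE about the mean matches R's built-in sample variance
all.equal(mse, var(x))
```

Here `var()` also divides by N - 1, which is why the two values agree.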

Important Points

  • The sum of squares, variance, and standard deviation represent the same thing:
    • The 'fit' of the mean to the data
    • The variability in the data
    • How well the mean represents the observed data
    • Error

Test Statistics

  • A test statistic is a statistic whose sampling distribution is known, that is, we know how frequently particular values occur.
  • Observed values can therefore be used to test hypotheses.
\[ \text{test statistic}=\frac {\text {variance explained by the model}}{\text{variance NOT explained by the model}} \]
\[ \text{test statistic}=\frac {\text {effect}}{\text{error}} \]

Standard Error (SE)

  • Standard error is the standard deviation of the sampling distribution of a statistic (here, the sample mean)
  • Suppose we are trying to make an inference about the average weight of people in a city
  • Suppose we take a random sample of 10 people and compute the mean
  • If we repeat this process 500 times (say), then we can create a histogram of the 500 sample means
  • The grand mean of all 500 sample means will be close to the actual population mean, which is not observed
  • The standard deviation of the 500 sample means is called the Standard Error of the Mean (SME) denoted by \( \sigma_{\bar{X}} \)
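
The repeated-sampling idea can be simulated. The sketch below is illustrative only: the population (Normal, mean 70, standard deviation 12), its size and the seed are all assumptions made up for the example, since the real population is never observed.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Hypothetical population of weights; mean 70 and sd 12 are assumptions
population <- rnorm(100000, mean = 70, sd = 12)

# Draw 500 random samples of size 10 and record each sample mean
sample_means <- replicate(500, mean(sample(population, size = 10)))

# Grand mean of the 500 sample means; should be close to mean(population)
grand_mean <- mean(sample_means)

# Standard deviation of the sample means = simulated standard error
se_simulated <- sd(sample_means)

# Value predicted by sigma / sqrt(N), for comparison
se_formula <- sd(population) / sqrt(10)

c(grand_mean = grand_mean, se_simulated = se_simulated, se_formula = se_formula)
```

Calling `hist(sample_means)` afterwards draws the histogram of the 500 sample means.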

Standard Error (SE)

  • In practice, we are not going to draw 500 different samples (each of size 10)
  • The Central Limit Theorem (CLT) says that the distribution of sample means approaches a Normal distribution as the sample size grows; a sample size bigger than 30 is a common rule of thumb
  • So just one sample (of size 10 in our example) is drawn, and the SE is estimated from it
  • SME indicates the “unreliability” of a sample mean, that is, how widely sample means spread around the actual population mean
  • A small SE indicates that most sample means are similar to the population mean and so our sample is likely to be representative of the population

Standard Error (SE)

  • \( SME=\sigma_{\bar{X}} = \sigma_{population}/\sqrt{N} \) where N is the sample size
  • However, \( \sigma_{population} \) is not available
  • But it can be approximated with the standard deviation \( s \) of the sample (10 numbers in our example); the approximation is good when the sample size is big (>30)
  • \( SME=\sigma_{\bar{X}} \approx s/\sqrt{N} \)
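
The \( \sqrt{N} \) in the denominator follows from a short derivation, sketched here under the assumption that the observations \( X_{1}, \dots, X_{N} \) are independent and each has variance \( \sigma_{population}^{2} \):

\[ \operatorname{Var}(\bar{X}) = \operatorname{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_{i}\right) = \frac{1}{N^{2}}\sum_{i=1}^{N} \operatorname{Var}(X_{i}) = \frac{\sigma_{population}^{2}}{N} \]

\[ \sigma_{\bar{X}} = \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma_{population}}{\sqrt{N}} \]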

Small sample size (t-distribution)

  • When the sample size is small (<30) and \( \sigma_{population} \) is estimated by \( s \), the test statistic follows a t-distribution with \( N-1 \) degrees of freedom
  • For larger samples, the t-distribution approaches the Normal distribution
  • The t-distribution incorporates more uncertainty (and hence a wider Confidence Interval) than the Normal distribution
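
One way to see the extra uncertainty is to compare two-sided 95% critical values from the two distributions. This is a small sketch using R's `qt()` and `qnorm()`; the degrees of freedom listed are arbitrary choices for illustration.

```r
# Two-sided 95% critical value from the Normal distribution
z_crit <- qnorm(0.975)

# Degrees of freedom to compare (arbitrary illustrative values, df = N - 1)
dfs <- c(5, 10, 20, 30, 100)

# Matching critical values from the t-distribution
t_crit <- qt(0.975, df = dfs)

# t critical values are larger and shrink toward z as df grows
data.frame(df = dfs, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

A larger critical value multiplies the same standard error, which is what widens the interval for small samples.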

Numerical Example

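# Sample data (21 observations)
sampleSize=21
data1=c(18,16,18,24,23,22,22,23,26,29,32,34,34,36,36,43,42,49,46,46,57)
# Sample variance (divides by N-1)
variance=var(data1); variance
[1] 134.2619
# Sample standard deviation
s=sqrt(variance); s
[1] 11.58714
# Sample mean
mean=mean(data1); mean
[1] 32.19048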

Numerical Example

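# Standard error of the mean: s/sqrt(N)
SE=s/sqrt(sampleSize); SE
[1] 2.528522
# 95% CI using the t-distribution because of the small sample size;
# 2.09 is the rounded two-sided critical value for 20 degrees of freedom
lower.boundary=mean-2.09*SE
upper.boundary=mean+2.09*SE
lower.boundary;upper.boundary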
[1] 26.90586
[1] 37.47509
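
As a cross-check, the same interval can be computed without a hand-typed critical value. This is a sketch, not part of the original slides: it uses `qt()` for the exact critical value and compares the result with the interval reported by R's built-in `t.test()`. Because the exact critical value is slightly below the rounded 2.09, the bounds differ from those above in the second decimal place.

```r
data1 <- c(18,16,18,24,23,22,22,23,26,29,32,34,34,36,36,43,42,49,46,46,57)
n <- length(data1)

# Standard error of the mean estimated from the single sample
se <- sd(data1) / sqrt(n)

# Exact two-sided 95% critical value for n - 1 = 20 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Confidence interval built by hand
ci_manual <- mean(data1) + c(-1, 1) * t_crit * se

# Confidence interval from the built-in one-sample t-test
ci_builtin <- t.test(data1, conf.level = 0.95)$conf.int

ci_manual
ci_builtin
```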