Standard Error:

In statistics, the standard error (SE) is a measure of how much the sample mean is expected to vary from the true population mean. It is the standard deviation of the sampling distribution of the sample mean. The standard error quantifies the precision of your sample mean estimate.

The formula for standard error when estimating the population mean (\(\mu\)) is:

\(SE=\dfrac{s}{\sqrt n}\)

Where:

  • s is the sample standard deviation,

  • n is the sample size.

A larger sample size (n) or a smaller standard deviation (s) results in a smaller standard error, indicating a more precise estimate of the population mean.
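
A quick numeric check in R (the values of s and n here are arbitrary, chosen only to illustrate the formula):

s <- 9                  # sample standard deviation
n <- c(10, 40, 160)     # increasing sample sizes
s / sqrt(n)             # SE shrinks as n grows
## [1] 2.8460499 1.4230249 0.7115125

Note that quadrupling the sample size only halves the standard error, since n enters the formula under a square root.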

Sampling Distribution

A sampling distribution is a theoretical distribution that describes the likelihood of different possible values of a statistic (such as the mean, variance, standard deviation, etc.) based on samples of a particular size drawn from a population. In simpler terms, it provides information about how the values of a statistic might vary if we were to take many samples from the same population.

Key points about sampling distributions:

  1. Central Limit Theorem (CLT): The sampling distribution is often discussed in the context of the Central Limit Theorem. According to the CLT, as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution. This is a fundamental concept in statistics and is crucial for making inferences about population parameters.

  2. Parameters vs. Statistics: A parameter is a numerical summary of a population, such as the population mean or standard deviation. A statistic is a numerical summary of a sample, such as the sample mean or standard deviation. The sampling distribution provides information about the distribution of statistics.

  3. Standard Error: The standard error is a measure of the variability of a statistic in the sampling distribution. It is a crucial concept when making inferences about population parameters based on sample statistics. The standard error is often used in the calculation of confidence intervals and hypothesis tests.

  4. Role in Inference: The sampling distribution is fundamental to statistical inference. When we conduct hypothesis tests or construct confidence intervals, we often rely on information about the sampling distribution to make conclusions about population parameters.

In summary, a sampling distribution helps us understand the behavior of sample statistics and provides a foundation for making statistical inferences about population parameters based on sample data.
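
A minimal simulation sketch makes this concrete (the exponential population and sample size below are arbitrary choices): we repeatedly draw samples from a skewed population and look at the distribution of their means.

# Build the sampling distribution of the mean by simulation
set.seed(123)
sample_means <- replicate(n = 5000, expr = mean(rexp(n = 40, rate = 1)))  # 5000 samples of size 40 from a skewed population with mean 1
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the sample mean (n = 40)",
     xlab = "Sample mean")

Even though the exponential population is strongly skewed, the histogram of sample means is roughly bell-shaped and centered at the population mean of 1, as the CLT predicts.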

Q1 Tire Store Example: Hypothesis Testing

Suppose, for example, that the mean expenditure per customer at a tire store is $85.00, with a standard deviation of $9.00. If a random sample of 40 customers is taken, what is the probability that the sample average expenditure per customer for this sample will be $87.00 or more?

Since the sample size is greater than 30, the central limit theorem can be applied (assuming certain regularity conditions), and the sample means are approximately normally distributed.

Hypotheses:

\(H_0\): The average expenditure per customer is $85 or less.

\(H_1\): The average expenditure per customer is more than $85.

Standard Error Calculation:

The formula for standard error when estimating the population mean (\(\mu\)) is:

\(SE=\dfrac{s}{\sqrt n}\)

Where:

  • s is the sample standard deviation,

  • n is the sample size.
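
Plugging in the values from this problem (s = 9, n = 40):

9 / sqrt(40)   # standard error of the sample mean
## [1] 1.423025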

Probability Calculation:

\(P(\bar X \geq 87.00)\)

The code below calculates the probability that the sample average expenditure per customer will be $87.00 or more, using the central limit theorem and the standard error.

pnorm(q = 87, mean = 85, sd = 9/sqrt(40), lower.tail = FALSE)   # upper-tail area directly
## [1] 0.07994275
1 - pnorm(q = 87, mean = 85, sd = 9/sqrt(40), lower.tail = TRUE)   # complement of the lower-tail area
## [1] 0.07994275

We can plot this as well -

# Install and load ggplot2 if not already installed
# install.packages("ggplot2")
library(ggplot2)
# Given values
population_mean <- 85.00
sample_sd <- 9.00
sample_size <- 40
sample_mean <- 87.00

# Calculate standard error
se <- sample_sd / sqrt(sample_size)

# Calculate Z-score
z_score <- (sample_mean - population_mean) / se
z_score
## [1] 1.405457
# Calculate the probability using Z-score
probability <- pnorm(q = sample_mean, mean = population_mean, sd = se, lower.tail = FALSE)
probability
## [1] 0.07994275
# Create a data frame for visualization
df <- data.frame(x = seq(80, 90, length.out = 1000))

# Plot the normal distribution curve
ggplot(data = df, 
       mapping = aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = population_mean, sd = se), linewidth = 1, color = "black") +
  
  # Shade the area to the right of $87.00
  geom_ribbon(data = subset(df, x >= 87), aes(ymax = dnorm(x, mean = population_mean, sd = se), ymin = 0), fill = "skyblue", alpha = 0.5) +
  
  # Add vertical line at $87.00
  geom_vline(xintercept = 87, linetype = "dashed", color = "red", linewidth = 1) +
  
  # Annotate the shaded area with the probability
  annotate("text", x = 88, y = 0.02, label = paste("Probability =", round(probability, 3)), color = "red") +
  
  # Labels and title
  labs(title = "Probability Distribution",
       x = "Expenditure per Customer ($)",
       y = "Density") +
  
  theme_minimal()

Q2 Hypothesis Testing for Human Temperature (Population Mean)

Samples of patient temperatures reveal that the historical (sample) mean body temperature is 98.6°F with a (sample) standard deviation of 1°F. A sample of 100 patients is taken, and the mean body temperature is found to be 98.4°F. Does this sample reflect a reduction in the population mean body temperature?

We will apply the Z formula for sample means here.

Z-Score Calculation: \[Z = \dfrac{\bar X - \mu}{SE}\]

Where:

  • \(\bar X\) is the sample mean,

  • \(\mu\) is the population mean,

  • SE is the standard error.

Given a sample mean and sample standard deviation, we can utilize the Central Limit Theorem (CLT). According to the CLT, the distribution of sample means will be centered at the population mean and approximately follow a normal distribution. To calculate the standard error, we divide the sample standard deviation by the square root of the sample size (\(SE=\dfrac{s}{\sqrt n}\)).

We are interested in finding the probability that the sample mean (\(\bar X\)) is less than or equal to 98.4, given a population mean \(\mu\) of 98.6 and a standard error of \(\dfrac{1}{\sqrt{100}} = 0.1\).

Probability Calculation:

\[P(\bar X \leq 98.4)\]

We know this \(\bar X\) should be normally distributed at \(N(\mu = 98.6 , \sigma = \dfrac{1}{\sqrt 100} )\).

pnorm(q = 98.4, mean = 98.6, sd = 1/sqrt(100), lower.tail = TRUE)   # area to the left of 98.4 given that we know the distribution
## [1] 0.02275013
1 - pnorm(q = 98.4, mean = 98.6, sd = 1/sqrt(100), lower.tail = FALSE)  # another way to get the same area
## [1] 0.02275013

Alternatively, we could have used the Z distribution (standard normal), which gives the same answer. We want to calculate \[P(Z \leq z_{score})\]

where Z follows the standard normal distribution

\[N(\mu = 0, \sigma = 1)\]

# Plugging into the z-statistic formula: take the sample mean, subtract the
# hypothesized value (population mean), and divide by the standard error, since
# we are looking at the distribution of sample means (CLT kicks in), not the
# distribution of the population (which we do not know)
z_statistic <- (98.4 - 98.6) / (1/sqrt(100))

pnorm(q = z_statistic)  
## [1] 0.02275013
pnorm(q = z_statistic, mean = 0, sd = 1)  # same answer as without specifying mean and sd as the default value of mean and sd is 0 and 1 respectively in pnorm function
## [1] 0.02275013

Summary of the question -

# Given values
population_mean <- 98.6
sample_mean <- 98.4
sample_sd <- 1.0
sample_size <- 100

# Calculate standard error
se <- sample_sd / sqrt(sample_size)

# Calculate Z-score
z_score <- (sample_mean - population_mean) / se

# Calculate the probability using Z-score
probability <- pnorm(z_score)

# Print the probability
print(probability)
## [1] 0.02275013
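
To turn this probability into a decision we need a significance level; assuming the conventional \(\alpha = 0.05\) (the problem does not specify one):

alpha <- 0.05        # assumed significance level (hypothetical choice)
probability < alpha  # TRUE, so we reject H0: evidence that the mean temperature has dropped
## [1] TRUE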

EMPIRICAL RULE

?pnorm

pnorm(1)  - pnorm(-1)  # 68.26% observations within 1 sd of the mean
## [1] 0.6826895
pnorm(2)  - pnorm(-2)  # 95.44% observations within 2 sd of the mean
## [1] 0.9544997
pnorm(3)  - pnorm(-3)  # 99.73% observations within 3 sd of the mean
## [1] 0.9973002
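
The same percentages can be recovered by simulation; a quick sketch with randomly generated draws (the proportions will vary slightly with the seed):

set.seed(1)
x <- rnorm(n = 100000)   # large sample from the standard normal
mean(abs(x) <= 1)        # should be close to 0.6827
mean(abs(x) <= 2)        # should be close to 0.9545
mean(abs(x) <= 3)        # should be close to 0.9973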

T distribution

The t-distribution is similar to the standard normal distribution but takes into account the degrees of freedom. In situations where the sample size (\(n\)) is small, the t-distribution has fatter tails compared to the standard normal distribution.

However, as the sample size increases, the t-distribution approaches the standard normal distribution. It becomes nearly normal when \(n\) is greater than 30 and practically normal when \(n\) exceeds 120 observations.

The adjustment for degrees of freedom in the t-distribution accounts for the variability introduced by smaller sample sizes.
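
A quick tail comparison makes the "fatter tails" concrete: with 2 degrees of freedom, the t-distribution puts roughly 35 times as much probability below -3 as the standard normal does.

pt(q = -3, df = 2)   # t with 2 df: about 4.8% of the mass lies below -3
## [1] 0.04773298
pnorm(q = -3)        # standard normal: about 0.13% lies below -3
## [1] 0.001349898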

?rt
myt <-      rt(n = 1000, df = 2)    
plot(x = density(myt))

?rnorm
mynorm <-   rnorm(n = 1000)        # draw 1000 values from the standard normal; plotted next
plot(x = density(mynorm))

Let's plot the t and normal distributions side by side.

?par 
par(mfrow = c(2,1))      

# lets try to standardize the axis for easier eyeballing

plot(x = density(myt),    xlim=c(-10,10), ylim=c(0,.4))
plot(x = density(mynorm), xlim=c(-10,10), ylim=c(0,.4))

The t-distribution has fatter tails when the degrees of freedom are low, i.e. \(df = n - 1 = 2\) here, producing more extreme values (see below).

plot(x = density(myt))      

plot(x = density(mynorm))

Now change the df of the Student t to a higher number.

# (try df=30,df=120,... ) 
myt <- rt(n = 1000, df = 150)

plot(density(myt),    xlim=c(-5,5))

plot(density(mynorm), xlim=c(-5,5))

Almost the same. Even the critical values are nearly identical when df is large, as expected (see below).

qnorm(p = .99)  # the point on the z distribution (z critical value) above which only 1% of the values lie 
## [1] 2.326348
qt(p = .99, df = 10000)  # when df is very high, the t cutoff is essentially identical to z
## [1] 2.326721
qt(p = .99, df = 150)    # still very close to the z critical value
## [1] 2.351465
qt(p = .99, df = 120)    # still very close to the z critical value
## [1] 2.357825
qt(p = .99, df = 80)     # some deviation
## [1] 2.373868
qt(p = .99, df = 50)     # some deviation
## [1] 2.403272
qt(p = .99, df = 10)     # the deviation grows as df shrinks
## [1] 2.763769
qt(p = .99, df = 5)      # when df is low (less than 30), t distribution has very different cutoff points !
## [1] 3.36493