Here I will present a few statistical methods for dealing with large & small datasets, specifically the: - Central Limit Theorem (CLT) - Z statistic - Student’s or Gosset’s t distribution - t confidence intervals
Necessary Background: CTL & Z 1. Confidence intervals using the Central Limit Theorem (CLT) and normal distributions have large sample sizes, and the formula for computing the confidence interval was Est +/- qnorm *std error(Est), where Est was some estimated value (such as a sample mean) with a standard error. Here qnorm represents a specified quantile from a normal distribution.
Z statistic Z=(X'-mu)/(sigma/sqrt(n)) follows a standard normal distribution. This normalized statistic Z is especially nice because we know its mean and variance are 0 and 1 respectively.The formula for computing a t confidence interval does not use quantile for a normal distribution.
t distribution uses the formula is Est +/- t-quantile*std error(Est). The other distinction, is it uses the the sample standard deviation when we estimating the standard error of Est and t=(X'-mu)/(s/sqrt(n))were s is the standard deviation.Compare t to z
As the number of degrees of freedom (df) df increases, the t distribution gets more like a standard normal, so it’s centered around 0. Also, the t assumes that the underlying data are iid Gaussian so the statistic (X’ - mu)/(s/sqrt(n)) has n-1 degrees of freedom.Looking at their quantiles,
The distance between the two thick lines represents the difference in sizes between the quantiles and hence the two sets of intervals. Note the thin horizontal and vertical lines. These represent the .975 quantiles for the t and normal distributions respectively, a placement of the vertical at 1.96 from 0, the mean. As n increased the distributions are overlapping, in higher degrees of freedom.
compare t & z
5.the t interval is always wider than the normal. This is because estimating the standard deviation introduces more uncertainty so a wider interval results.
So the t-interval is defined as X' +/- t_(n-1)*s/sqrt(n) where t_(n-1) is the relevant quantile. The t interval assumes that the data are iid normal, though it is robust to this assumption and works well whenever the distribution of the data is roughly symmetric and mound shaped.
For skewed distributions, the spirit of the t interval assumptions (being centered around 0) are violated. There are ways of working around this problem (such as taking logs or using a different summary like the median).
paired observations are often analyzed using the t interval by taking differences between the observations.
##D sleep data: This was the data originally analyzed in Gosset’s Biometrika paper, which shows the increase in hours for 10 patients on two soporific drugs.
data(sleep)
x1 <- sleep$extra[sleep$group == 1]
x2 <- sleep$extra[sleep$group == 2]
n1 <- length(x1)
n2 <- length(x2)
sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2))
md <- mean(x1) - mean(x2)
semd <- sp * sqrt(1 / n1 + 1/n2)
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd
## [1] -3.363874 0.203874
t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf
## [1] -3.363874 0.203874
## attr(,"conf.level")
## [1] 0.95
t.test(x1, x2, paired = TRUE)$conf
## [1] -2.4598858 -0.7001142
## attr(,"conf.level")
## [1] 0.95
Here we’ve plotted the data in a paired way, connecting each patient’s two results with a line, group 1 results on the left and group 2 on the right. See that purple line with the steep slope?
If we just looked at the 20 data points we’d be comparing group 1 variations with group 2 variations. Both groups have quite large ranges. However, when we look at the data paired for each patient, we see that the variations in results are usually much smaller and depend on the particular subject.
range(x1)
## [1] -1.6 3.7
#[1] -1.6 3.7
range(x2)
## [1] -0.1 5.5
#[1] -0.1 5.5
The pairwise difference can be calculated using R’s componentwise subtraction of vectors and creating a vector of difference by subtracting x1 from x2.
difference <- x2-x1
mean(difference)
## [1] 1.58
#[1] 1.58
mn <- mean(difference)
s <- sd(difference)
mn + c(-1,1)*qt(.975,9)*s/sqrt(10)
## [1] 0.7001142 2.4598858
#[1] 0.7001142 2.4598858
The mean difference of this paired data is is much smaller than the group variations. With probability .95 the average difference of effects (between the two drugs) for an individual patient is between .7 and 2.46 additional hours of sleep.
#Other ways to do the t test in R:
as.vector(t.test(difference)$conf.int)
## [1] 0.7001142 2.4598858
#[1] 0.7001142 2.4598858
#attr(,"conf.level")
#[1] 0.95
#as.vector(t.test(x2, x1, paired = TRUE)$conf.int),
#as.vector(t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)$conf.int)
Suppose that we want to compare the mean blood pressure between two groups in a randomized trial. We’ll compare those who received the treatment to those who received a placebo. - the groups are independent and may have different sample sizes.
If the difference is mu_y - mu_x, and we tknow that t= X’ +/- t_(n-1)*s/sqrt(n), and we have the means, X’ and Y’, from each group. We take their difference (Y’-X’), we are looking for a confidence interval that contains the difference mu_y-mu_x. Now we need to specify a t quantile. Suppose the groups have different sizes n_x and n_y with: - t_(.975,n_x+n_y-2)
We call the variance estimator we use the pooled variance. The formula for it requires two variance estimators (in the form of the standard deviation), S_x and S_y, one for each group. We multiply each by its respective degrees of freedom and divide the sum by the total number of degrees of freedom. This weights the respective variances; those coming from bigger samples get more weight as follows: -(n_x-1)(S_x)2+(n_y-1)(S_y)2
With (n_x-1)+(n_y-1) degrees of freedom.To get the the 1/sqrt(n) portion, we can simply add 1/n_x and 1/n_y and take the square root of the sum. Then we MULTIPLY this by the sample variance to complete the estimate of the standard error.
-compare blood pressure from two independent groups. - \(\bar X_{OC} = 132.86\) mmHg with \(s_{OC} = 15.34\) mmHg - \(\bar X_{C} = 127.44\) mmHg with \(s_{C} = 18.23\) mmHg
compute the numerator of the pooled sample variance by weighting the sum of the two by their respective sample size with the formula (n_x-1)(S_x)2+(n_y-1)(S_y)2. The first is a group of 8 oral contraceptive users and the second is a group of 21 controls.
#scratch
sp <- 7 * 15.34^2 + 20 * 18.23^2
ns <-(8 + 21 - 2)*sp
sp <- sqrt(sp/ns)
sp
## [1] 0.1924501
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2))
132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5
## [1] -9.521097 20.361097
The standard error SE and the quantile t_df are calculated differently from previous methods. - Y’-X’ +/- t_df * SE
where as before Y’-X’ represents the difference of the sample means. Here SE is the square root of the sum of the squared standard errors of the two means, (s_1)^2/n_1 + (s_2)^2/n_2.
When the underlying X and Y data are iid normal and the variances are different, the normalized statistic we started this lesson with, (X’-mu)/(s/sqrt(n)), doesn’t follow a t distribution.
Formula for Groups with Unequal Variances
However, it can be approximated by a t distribution if we set the degrees of freedom appropriately, by the above formula. The numerator is the SQUARE of the sum of the squared standard errors of the two sample means. Each term looks like (s4/n2)/(n-1), we use this df to find the t quantile.
#Our numbers were 15.34 with size 8 and 18.23 with size 21.
num <- (15.34^2/8 + 18.23^2/21)^2
den <- 15.34^4/8^2/7 + 18.23^4/21^2/20
mydf <- num/den
132.86-127.44 +c(-1,1)*qt(.975,mydf)*sqrt(15.34^2/8 + 18.23^2/21)
## [1] -8.913327 19.753327
#[1] -8.913327 19.753327