The standard normal distribution has a fixed, known mean (0) and variance (1). So the formula \[z_c = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\] is appropriate for finding a confidence interval for a population mean if two conditions are met:
1. The population standard deviation \(\sigma\) is known.
2. The data are normally distributed.
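As a minimal sketch of the z-based interval, the snippet below builds a 95% confidence interval for the mean. The sample values and the "known" \(\sigma = 0.2\) are made up for illustration and do not come from any real data set.

```r
# Hypothetical sample; sigma is ASSUMED known purely for illustration
x <- c(4.9, 5.1, 5.3, 4.8, 5.0, 5.2, 5.4, 4.7, 5.1, 5.0)
sigma <- 0.2                 # assumed population standard deviation
n <- length(x)
x_bar <- mean(x)

z_crit <- qnorm(0.975)       # two-sided 95% critical value of the standard normal
z_ci <- x_bar + c(-1, 1) * z_crit * sigma / sqrt(n)
z_ci
```

The interval is simply the sample mean plus or minus the critical value times \(\sigma/\sqrt{n}\).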
In a realistic situation the population mean and population standard deviation are not known. So what should be done?
In such a scenario, we use the t statistic, which is similar to z and is defined as \[t= \frac{\bar{X}-\mu}{S/\sqrt{n}}\]
It is also centered around 0. The only difference between the z and t statistics is that the population standard deviation \(\sigma\) is replaced by the sample standard deviation \(S\). The distribution of the t statistic does not depend on the population mean or variance; it depends only on the sample size n, through the degrees of freedom \(n-1\).
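The t statistic can be computed by hand and checked against R's built-in `t.test`. This is only a sketch: the data are simulated, and the hypothesised mean `mu0` is an assumption chosen for the example.

```r
set.seed(1)                         # reproducible simulated data
x <- rnorm(10, mean = 5, sd = 2)    # hypothetical sample of size n = 10
mu0 <- 5                            # hypothesised population mean (assumption)
n <- length(x)

# t statistic: sigma is replaced by the sample standard deviation sd(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# 95% t confidence interval with n - 1 degrees of freedom
t_ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)

# Built-in one-sample t test for comparison
res <- t.test(x, mu = mu0)
c(manual = t_stat, builtin = unname(res$statistic))
```

Both the statistic and the interval computed by hand should agree with the values reported by `t.test`.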
The following R code compares the Student t curve with the normal distribution curve. With \(v = 10\) degrees of freedom, the t curve already lies very close to the normal curve, although the two do not coincide exactly.
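# Grid of points at which the densities are evaluated
x = seq(-4, 4, 0.01)
y = dt(x, df = 9)    # t density, 9 degrees of freedom
y1 = dnorm(x)        # standard normal density
y2 = dt(x, df = 10)  # t density, 10 degrees of freedom
y3 = dt(x, df = 1)   # t density, 1 degree of freedom
# ylim is taken from the normal curve, the tallest of the four, so its peak is not clipped
plot(x, y, type = "l", col = "black", lty = 3, ylim = c(0, max(y1)), ylab = "density", main = "Normal Dist. vs T-Dist. with degrees of freedom 1, 9, 10")
lines(x, y1, col = "blue")
lines(x, y2, col = "green")
lines(x, y3, col = "red")
legend("topright", legend = c("t, df = 9", "normal", "t, df = 10", "t, df = 1"), col = c("black", "blue", "green", "red"), lty = c(3, 1, 1, 1))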
The t distribution was introduced by William Gosset in 1908. Its only parameter is the number of degrees of freedom, df. As df increases, the t distribution approaches the normal distribution. When \(\sigma\) is unknown, prefer the t distribution over the normal; when the sample size is large enough the two distributions practically merge, so little is lost. But the t interval isn't always applicable. For skewed distributions, the symmetry around the center that the t interval relies on is violated. Workarounds for this problem include taking logs of the data or using a different summary such as the median.
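One way to see the convergence numerically is to compare the 97.5% quantiles of the t distribution with that of the standard normal. The particular degrees of freedom below are arbitrary choices for illustration.

```r
dfs <- c(1, 5, 10, 30, 100, 1000)   # illustrative degrees of freedom
t_crit <- qt(0.975, df = dfs)       # t critical values for a 95% interval
z_crit <- qnorm(0.975)              # normal critical value for a 95% interval

# The t critical value shrinks toward the normal one as df grows
data.frame(df = dfs, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

Because the t critical value is always larger than the normal one, t intervals are wider; the gap becomes negligible for large df.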
For two groups with sample sizes \(n_x\) and \(n_y\) and sample standard deviations \(S_x\) and \(S_y\), the standard error (SE) of the difference in means, assuming both groups share a common variance, is given by
\[SE=\sqrt{\frac{(n_x-1)S_x^2+(n_y-1)S_y^2}{n_x+n_y-2}} \cdot \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}\]
Degrees of freedom: \[df =(n_x-1)+(n_y-1)\]
Difference in means = \(\bar{X}-\bar{Y}\)
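The two-group quantities can be sketched in R as follows. The two samples are simulated and purely hypothetical; the result is checked against `t.test` with `var.equal = TRUE`, which uses the same pooled-variance approach.

```r
set.seed(2)                          # reproducible simulated data
x <- rnorm(12, mean = 10, sd = 2)    # hypothetical group X
y <- rnorm(15, mean = 9,  sd = 2)    # hypothetical group Y
n_x <- length(x)
n_y <- length(y)

# Pooled standard deviation, then the standard error of the difference in means
s_p <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2))
se  <- s_p * sqrt(1 / n_x + 1 / n_y)

dof <- (n_x - 1) + (n_y - 1)         # degrees of freedom
diff_mean <- mean(x) - mean(y)       # difference in sample means

# 95% confidence interval for the difference in means
ci <- diff_mean + c(-1, 1) * qt(0.975, df = dof) * se

# Built-in equal-variance two-sample t test for comparison
res <- t.test(x, y, var.equal = TRUE)
rbind(manual = ci, builtin = as.numeric(res$conf.int))
```

If the equal-variance assumption is doubtful, `t.test(x, y)` with its default `var.equal = FALSE` uses the Welch approximation instead, which has a different standard error and degrees of freedom.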