Comparing 2 Population Means

Examples:

Matched samples vs Independent samples

Matched Samples

Observations come in pairs: each observation in one sample is matched to an observation in the other. For example, compare one flight from my home town to LA with another flight from my home town to LA. The two flights are linked by the destination city, so you can compare them directly. The two samples always have the same number of observations.

Independent Samples

There is no relationship between observations in one sample and observations in the other sample.
For example, a new golf ball is designed to provide a longer driving distance, and we compare driving distances of the new ball with those of the old ball. There is no link between an individual drive with the new ball and any individual drive with the old ball, so you cannot pair observations. If the two samples have different numbers of observations, they must be independent.
However, if they have the same number of observations, they could be either matched or independent.

Calculations with Matched Samples

It is possible to calculate the difference within each pair and create a sample with only one column. Once we have only one column, we can apply the one-sample methods from our previous studies.

Our point estimate is d bar (replacing x bar): the average of the differences between the paired values.

The standard deviation is sd: the standard deviation of the differences between the paired values.
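
In symbols, the quantities just described can be written as follows. The notation \(x_{1i}\), \(x_{2i}\) for the two values in pair \(i\) is an assumption introduced here for illustration; it does not appear in the notes themselves.

```latex
\[
d_i = x_{1i} - x_{2i}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n-1}}
\]
```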

Example:

Comparing current quarter’s earnings with the previous quarter’s earnings.

Problem:

Provide a 95% confidence interval estimate of the mean difference between the two quarters’ earnings.

First, import the data:

library(readxl)
Earnings2005 <- read_excel("Earnings2005.xlsx", 
    skip = 1)
View(Earnings2005)

Looking at the data, I can see that the observations are given in pairs. Therefore, I can compute the difference between each pair and treat these differences as a single sample.

Compute the differences:

Earnings2005$Difference = Earnings2005$Current - Earnings2005$Previous

Earnings2005

Now, how do we calculate the confidence interval?

We will use d bar, the mean of the differences, instead of x bar.

And we will use sd, the standard deviation of the differences.

Because we don’t know the population standard deviation, we need the t distribution with n-1 degrees of freedom.
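
The interval being computed is the usual one-sample t interval applied to the differences. Here \(\alpha\) denotes the significance level (0.05 for a 95% interval); the symbol is assumed here and corresponds to the `significance` variable in the code.

```latex
\[
\bar{d} \;\pm\; t_{\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}}
\]
```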

dbar = mean(Earnings2005$Difference)
sd = sd(Earnings2005$Difference)
n = length(Earnings2005$Difference)
significance = 0.05

t = qt(significance/2, n-1, lower.tail=F)


margin_of_error = t*sd/sqrt(n)

lowerlimit = round(dbar - margin_of_error, digits=4)
upperlimit = round(dbar + margin_of_error, digits=4)

print(paste("95% confidence interval of the difference between the population mean of the current quarter vs the previous quarter is between", lowerlimit, "and", upperlimit ))
## [1] "95% confidence interval of the difference between the population mean of the current quarter vs the previous quarter is between 0.0969 and 0.3159"

Now let’s run a hypothesis test: have earnings increased between the previous quarter and this quarter?

Let \(\mu_1\) be the mean current per-share earnings and \(\mu_2\) the mean previous per-share earnings. Then \(\mu_d = \mu_1 - \mu_2\) (current - previous).

\(H_0: \mu_d \le 0\)
\(H_A: \mu_d > 0\)

We are checking whether the observed difference is significant or could be explained by random variation.

We don’t know the population standard deviation, so we have to use the t distribution.

mu0 = 0
t = (dbar - mu0) / (sd/sqrt(n))
t
## [1] 3.889064

Use the critical value approach:

# upper tail test
t_critical = qt(significance, n-1, lower.tail=F)
t_critical
## [1] 1.710882

Since the t statistic > t critical value, we are in the rejection region and reject H0 at a 0.05 level of significance. The data support HA: earnings have increased.

p value approach:

p = pt(t, n-1, lower.tail=F)
p
## [1] 0.0003485371

Since the p value < 0.05, reject H0 at a 5% level of significance: earnings have increased.

The smaller the p value, the stronger the evidence. Most people would agree this is strong evidence.

Alternative: use t.test

# for the confidence interval, use the two-sided alternative ('t')
# for the hypothesis test, use alternative='g' (greater than)
# so this p value applies to whether there is any difference, not to our question of whether earnings have increased
t.test(Earnings2005$Current, Earnings2005$Previous, alternative='t', mu=0, paired=TRUE)
## 
##  Paired t-test
## 
## data:  Earnings2005$Current and Earnings2005$Previous
## t = 3.8891, df = 24, p-value = 0.0006971
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.09686498 0.31593502
## sample estimates:
## mean of the differences 
##                  0.2064
# to test for 'greater than', use alternative='g'
# note: the confidence interval in this result is one-sided (upper limit Inf); the p value is the one that answers our question
t.test(Earnings2005$Current, Earnings2005$Previous, alternative='g', mu=0, paired=TRUE)
## 
##  Paired t-test
## 
## data:  Earnings2005$Current and Earnings2005$Previous
## t = 3.8891, df = 24, p-value = 0.0003485
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1156002       Inf
## sample estimates:
## mean of the differences 
##                  0.2064
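
As a cross-check on the reasoning above, the sketch below illustrates that a paired t test is the same as a one-sample t test on the single column of differences, and that both agree with the manual calculation. The `current` and `previous` vectors are hypothetical toy values made up for illustration; they are not the Earnings2005 data.

```r
# Hypothetical toy data (NOT the Earnings2005 values), used only for illustration
current  <- c(1.10, 0.85, 1.42, 0.93, 1.27, 0.76)
previous <- c(0.95, 0.80, 1.20, 0.97, 1.05, 0.70)

# Paired upper-tail test on the two columns
paired_result <- t.test(current, previous, alternative = "greater", mu = 0, paired = TRUE)

# One-sample upper-tail test on the single column of differences
d <- current - previous
one_sample_result <- t.test(d, alternative = "greater", mu = 0)

# Manual computation, following the same steps as the notes
t_manual <- (mean(d) - 0) / (sd(d) / sqrt(length(d)))
p_manual <- pt(t_manual, length(d) - 1, lower.tail = FALSE)

# All three routes should agree (each comparison is expected to be TRUE)
all.equal(unname(paired_result$statistic), t_manual)
all.equal(unname(one_sample_result$statistic), t_manual)
all.equal(paired_result$p.value, p_manual)
```

Spelling out `"greater"` instead of `'g'` is a readability choice; `t.test` accepts either because it partially matches the `alternative` argument.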

Calculations with Independent Samples

Use the sample mean x1bar to estimate the population mean \(\mu_1\), and the sample mean x2bar to estimate the population mean \(\mu_2\).

To build confidence intervals and test hypotheses for \(\mu_1 - \mu_2\), we need to know the sampling distribution of x1bar - x2bar.

How do we handle the standard deviation? We cannot add standard deviations, but we can add variances. Variance measures variability, and for two independent samples the variability adds up: the variance of x1bar - x2bar is the sum of the variances of x1bar and x2bar.

Also, when the population standard deviations are unknown, the degrees of freedom have a much more complicated equation.
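
Written out, adding the variances gives the standard error of the difference, and from it the interval estimate for the known-standard-deviation case. The symbols \(\sigma_1\), \(\sigma_2\) for the population standard deviations and \(\alpha\) for the significance level are assumed notation, not taken from the notes.

```latex
\[
\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},
\qquad
(\bar{x}_1 - \bar{x}_2) \;\pm\; z_{\alpha/2}\,\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
\]
```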

Example:

sample 1: n1 = 120 balls, x1bar = 275 yards, sd1 = 15 yards
sample 2: n2 = 80 balls, x2bar = 258 yards, sd2 = 20 yards

Construct a 95% confidence interval for the difference between the mean driving distances of the 2 products. Here the standard deviations are treated as known, so we use the normal distribution.

point_estimate = 275 - 258
point_estimate
## [1] 17
z = qnorm(0.025, lower.tail=F)


margin_of_error = z * sqrt(15^2/120+20^2/80)


lowerlimit = point_estimate - margin_of_error
upperlimit = point_estimate + margin_of_error



print(paste("95% confidence interval for the difference between the driving distance of the two products is between", lowerlimit, "and", upperlimit))
## [1] "95% confidence interval for the difference between the driving distance of the two products is between 11.8609310772989 and 22.1390689227011"
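
To see where these limits come from, the arithmetic can be traced by hand. The intermediate values below are rounded, and 1.96 is the usual rounded value of `qnorm(0.025, lower.tail=F)`.

```latex
\[
\sqrt{\frac{15^2}{120} + \frac{20^2}{80}} = \sqrt{1.875 + 5} = \sqrt{6.875} \approx 2.622,
\qquad
1.96 \times 2.622 \approx 5.139
\]
\[
17 \pm 5.139 \;\Rightarrow\; (11.861,\ 22.139)
\]
```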

Now do a hypothesis test:

\(H_0: \mu_1 - \mu_2 \le 0\)
\(H_A: \mu_1 - \mu_2 > 0\)

z = ((275 - 258) - 0)/sqrt(15^2/120+20^2/80)
z
## [1] 6.483546

Critical value approach:

# upper tail test
z_critical = qnorm(0.01, lower.tail=F)
z_critical
## [1] 2.326348

The z value is bigger than the critical value, so it falls in the rejection region. Therefore reject H0 at the 1% level of significance: the mean driving distance is greater than that of the competitor.

p value approach:

# again, an upper-tail test: look at the upper tail

p = pnorm(z, lower.tail=F)
p
## [1] 4.479589e-11

The p value is less than 1% (it’s a very, very small number), so we reject H0 at the 1% level of significance. The mean driving distance is better.
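
The golf ball calculations can be collected into one small function, which makes it easier to rerun them with other summary statistics. This is only a sketch: the name `two_sample_z` and its arguments are hypothetical, and it assumes independent samples with known standard deviations, as in the example.

```r
# Hypothetical helper (name and signature are illustrative, not from the notes).
# Assumes the standard deviations are known and the samples are independent.
two_sample_z <- function(x1bar, x2bar, sd1, sd2, n1, n2, d0 = 0, conf_level = 0.95) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)       # standard error of x1bar - x2bar
  z <- ((x1bar - x2bar) - d0) / se          # test statistic
  p_upper <- pnorm(z, lower.tail = FALSE)   # upper-tail p value
  # margin of error for the two-sided confidence interval
  moe <- qnorm((1 - conf_level) / 2, lower.tail = FALSE) * se
  list(z = z, p_upper = p_upper,
       lower = (x1bar - x2bar) - moe,
       upper = (x1bar - x2bar) + moe)
}

golf <- two_sample_z(275, 258, 15, 20, 120, 80)
golf$z        # same z statistic as computed step by step (6.483546)
golf$p_upper  # same upper-tail p value (4.479589e-11)
golf$lower    # lower limit of the 95% interval
golf$upper    # upper limit of the 95% interval
```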

What if we don’t know the population standard deviations?

We have to use the t distribution.

The calculation for the degrees of freedom is horrible!
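
For reference, the test statistic and degrees of freedom in this case are commonly written as below (the Welch-Satterthwaite approximation). The degrees-of-freedom expression is the one that the `df_f` function further down evaluates and then rounds down; \(D_0\) is assumed notation for the hypothesized difference, which is 0 in the example.

```latex
\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
          {\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}
\]
```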

Example: wage distribution case

Over five years, do male employees have a mean hourly wage exceeding that of the female employees?

H0: male wage <= female wage
HA: male wage > female wage

For males: n1 = 44, xbar1 = 9.25, s1 = 1.00

For females: n2 = 32, xbar2 = 8.70, s2 = 0.80

\(\mu_1\) = mean hourly wage of male employees
\(\mu_2\) = mean hourly wage of female employees

\(H_0: \mu_1 - \mu_2 \le 0\)
\(H_A: \mu_1 - \mu_2 > 0\)

Calculate the test statistic:

t = ((9.25 - 8.70)-0)/ sqrt(1^2/44+0.80^2/32)
t
## [1] 2.660787

Calculate the degrees of freedom

df calculator function:

# Welch-Satterthwaite degrees of freedom, rounded down
df_f = function(s1, s2, n1, n2) {
  v1 = s1^2/n1
  v2 = s2^2/n2
  floor((v1 + v2)^2 / (v1^2/(n1-1) + v2^2/(n2-1)))
}
dof = df_f(1.00, 0.80, 44, 32)
dof
## [1] 73
t_critical = qt(0.01, dof, lower.tail = F)
t_critical
## [1] 2.378522

Since the t value > t critical value, reject H0 at a 0.01 level of significance. Male employees have a mean hourly wage exceeding that of female employees. Discrimination appears to be present in this case at a 0.01 level of significance.
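
The wage example uses only the critical value approach; the p value approach from the earlier examples carries over in the same way. The sketch below wraps the steps in a hypothetical helper, `welch_from_summary` (the name and signature are illustrative, not from the notes), and rounds the degrees of freedom down as the notes do.

```r
# Hypothetical helper (illustrative name and signature, not from the notes)
welch_from_summary <- function(xbar1, xbar2, s1, s2, n1, n2, d0 = 0) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  t_stat <- ((xbar1 - xbar2) - d0) / sqrt(v1 + v2)
  # Degrees of freedom, rounded down as in the notes
  dof <- floor((v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1)))
  # Upper-tail p value for HA: mu1 - mu2 > d0
  p_upper <- pt(t_stat, dof, lower.tail = FALSE)
  list(t = t_stat, dof = dof, p_upper = p_upper)
}

wage <- welch_from_summary(9.25, 8.70, 1.00, 0.80, 44, 32)
wage$t               # same t statistic as in the notes (2.660787)
wage$dof             # 73
wage$p_upper < 0.01  # TRUE: same conclusion as the critical value approach
```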