Examples:
Income gap between Men/Women
Difference in starting salary between College/High school graduates
Difference in performance between two products
Customer perception before and after a promotion campaign
Matched Samples
Observations come in pairs: each observation in one sample is matched to exactly one observation in the other. For example, compare one flight from my home town to LA with another flight from my home town to LA. The two flights are linked by the destination city, so you can compare them directly. Because every observation has a partner, the two samples always have the same number of observations.
Independent Samples
There is no relationship between observations in one sample and in the other sample.
For example, take a new golf ball designed to provide a longer driving distance: we compare the distances of the new ball with the distances of the old ball. No individual drive with the new ball corresponds to a particular drive with the old ball, so the observations cannot be paired. If the two samples have different numbers of observations, they must be independent.
However, if the two samples have the same number of observations, they could be either matched or independent.
Calculations with Matched Samples
It is possible to calculate the difference between the two points and create a sample with only one column. Once we have only one column, we can then apply the things we have learned from our previous studies.
Our point estimate is d bar (replacing x bar): the average of the differences between the two values
The standard deviation is sd: the standard deviation of the differences between the two values
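To make the "one column" idea concrete, here is a minimal sketch in R. The vectors `before` and `after` are made-up numbers for illustration only (they are not the earnings data used later), and the names `d_bar`, `s_d`, `t_manual` and `t_paired` are hypothetical. It collapses two matched columns into a single column of differences and checks that the one-sample t statistic on the differences agrees with R's built-in paired test.

```r
# Hypothetical paired measurements (made-up numbers, for illustration only)
before <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
after  <- c(10.9, 11.4, 10.5, 12.8, 11.0, 12.0)

# Collapse the two matched columns into one column of differences
d <- after - before

d_bar <- mean(d)     # d bar: mean of the differences
s_d   <- sd(d)       # s_d: standard deviation of the differences
n     <- length(d)   # number of pairs

# One-sample t statistic on the differences (H0: mu_d = 0)
t_manual <- d_bar / (s_d / sqrt(n))

# The built-in paired test should report the same statistic
t_paired <- unname(t.test(after, before, paired = TRUE)$statistic)

print(all.equal(t_manual, t_paired))
```

Once the differences are in a single vector, everything from the one-sample case (confidence intervals, t tests) applies unchanged.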
Example:
Comparing current quarter’s earnings with the previous quarter’s earnings.
Problem:
Provide a 95% confidence interval estimate of the mean difference between the two quarters.
First, import the data:
library(readxl)
Earnings2005 <- read_excel("Earnings2005.xlsx",
skip = 1)
View(Earnings2005)
Looking at the data, I can see that the observations are given in pairs. Therefore, I can compute the difference between each pair and treat these differences as a single population.
Compute the differences:
Earnings2005$Difference = Earnings2005$Current - Earnings2005$Previous
Earnings2005
Now, how to calculate the confidence interval?
We will be using d bar – the mean of the differences – instead of x bar.
And we will be using sd, the standard deviation of the differences.
Because we don’t have the standard deviation of the population, we need the t distribution, with n-1 degrees of freedom.
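Written out, the interval for the mean difference has the usual one-sample form, with the differences playing the role of the data. The symbols \(\bar{d}\), \(s_d\) and \(t_{\alpha/2,\,n-1}\) are notation assumed here for d bar, sd and the t value with n-1 degrees of freedom:

```latex
\[
\bar{d} \;\pm\; t_{\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}},
\qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i,
\qquad
s_d = \sqrt{\frac{\sum_{i=1}^{n}\left(d_i-\bar{d}\right)^2}{n-1}}
\]
```

For a 95% interval, \(\alpha = 0.05\), so the t value cuts off an area of 0.025 in the upper tail.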
dbar = mean(Earnings2005$Difference)
sd = sd(Earnings2005$Difference)
n = length(Earnings2005$Difference)
significance = 0.05
t = qt(significance/2, n-1, lower.tail=F)
margin_of_error = t*sd/sqrt(n)
lowerlimit = round(dbar - margin_of_error, digits=4)
upperlimit = round(dbar + margin_of_error, digits=4)
print(paste("95% confidence interval of the difference between the population mean of the current quarter vs the previous quarter is between", lowerlimit, "and", upperlimit ))
## [1] "95% confidence interval of the difference between the population mean of the current quarter vs the previous quarter is between 0.0969 and 0.3159"
Now let’s run a hypothesis test: have earnings increased between the previous quarter and this quarter?
\(\mu_1\) is the current per-share earnings and \(\mu_2\) is the previous per-share earnings; then \(\mu_d = \mu_1 - \mu_2\) = current - previous.
H0: \(\mu_d \le 0\) HA: \(\mu_d > 0\)
We are checking: is the difference significant, or could it be explained by random variation?
We don’t know the population standard deviation, so we have to use the t distribution.
mu0 = 0
t = (dbar - mu0) / (sd/sqrt(n))
t
## [1] 3.889064
use critical value approach
# upper tail test
t_critical = qt(significance, n-1, lower.tail=F)
t_critical
## [1] 1.710882
Since the t statistic > the t critical value, we are in the rejection region and reject H0 at a 0.05 level of significance. The evidence supports HA: earnings have increased.
p = pt(t, n-1, lower.tail=F)
p
## [1] 0.0003485371
Since the p value < 0.05, reject H0 at a 5% level of significance: earnings have increased.
The smaller the p value, the stronger the evidence. Most people would agree this is strong evidence.
# for a confidence interval, use the two-sided alternative
# for our hypothesis test, use alternative='greater' (greater than)
# so the p value below tests whether there is any difference, not our question of whether earnings have increased
t.test(Earnings2005$Current, Earnings2005$Previous, alternative='two.sided', mu=0, paired=TRUE)
##
## Paired t-test
##
## data: Earnings2005$Current and Earnings2005$Previous
## t = 3.8891, df = 24, p-value = 0.0006971
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09686498 0.31593502
## sample estimates:
## mean of the differences
## 0.2064
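# to test "greater than", we must use alternative='greater'
# note: the p value here matches the one computed by hand; the confidence interval in this result is one-sided (upper limit Inf), so use the two-sided call for the interval
t.test(Earnings2005$Current, Earnings2005$Previous, alternative='greater', mu=0, paired=TRUE)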
##
## Paired t-test
##
## data: Earnings2005$Current and Earnings2005$Previous
## t = 3.8891, df = 24, p-value = 0.0003485
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.1156002 Inf
## sample estimates:
## mean of the differences
## 0.2064
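The two `t.test` calls report different p values for the same t statistic. A short sketch of why, using the t statistic (3.889064) and degrees of freedom (24) copied from the output above; the variable names are hypothetical. For a positive t statistic, the two-sided p value is twice the upper-tail p value.

```r
# Values copied from the output above
t_stat <- 3.889064
df <- 24

# Upper-tail p value (HA: mu_d > 0)
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

# Two-sided p value (HA: mu_d != 0): both tails beyond |t|
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

print(c(one_sided = p_one_sided, two_sided = p_two_sided))
```

This should reproduce, up to rounding, the 0.0003485 and 0.0006971 reported by the two `t.test` calls.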
With independent samples, use the sample mean x1bar to estimate the population mean \(\mu_1\), and the sample mean x2bar to estimate the population mean \(\mu_2\).
To build confidence intervals and test hypotheses for mu1 - mu2, we need to know the sampling distribution of x1bar - x2bar.
How do we handle the standard deviation? We cannot add standard deviations, but we can add variances. Variance measures variability, and the variability of two independent distributions adds up.
Also, the degrees of freedom have a much more complicated equation.
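The "variances add" rule can be written out as follows. This is a sketch assuming the two samples are independent, with \(\sigma_1\) and \(\sigma_2\) as assumed notation for the two population standard deviations:

```latex
\[
E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2,
\qquad
\operatorname{Var}(\bar{x}_1 - \bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2},
\qquad
\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
\]
```

The square root is taken only after the variances are summed, which is why the standard deviations themselves cannot simply be added.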
Example:
Sample 1: n1 = 120 balls, x1bar = 275 yards, sd1 = 15 yards
Sample 2: n2 = 80 balls, x2bar = 258 yards, sd2 = 20 yards
Construct a 95% confidence interval for the difference between the mean driving distances of the 2 products. Here the standard deviations are treated as known, so the calculation uses z rather than t.
point_estimate = 275 - 258
point_estimate
## [1] 17
z = qnorm(0.025, lower.tail=F)
margin_of_error = z * sqrt(15^2/120+20^2/80)
lowerlimit = point_estimate - margin_of_error
upperlimit = point_estimate + margin_of_error
print(paste("95% confidence interval for the difference between the driving distance of the two products is between", lowerlimit, "and", upperlimit))
## [1] "95% confidence interval for the difference between the driving distance of the two products is between 11.8609310772989 and 22.1390689227011"
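The same calculation can be wrapped in a small reusable function. This is only a sketch: `two_sample_z_ci` is a hypothetical helper name, not part of any package, and it assumes independent samples with standard deviations treated as known.

```r
# Hypothetical helper: confidence interval for mu1 - mu2,
# assuming independent samples and known standard deviations
two_sample_z_ci <- function(xbar1, xbar2, sigma1, sigma2, n1, n2, conf = 0.95) {
  alpha <- 1 - conf
  z <- qnorm(alpha / 2, lower.tail = FALSE)
  # Variances of the two sample means add; standard deviations do not
  se <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)
  point_estimate <- xbar1 - xbar2
  c(lower = point_estimate - z * se, upper = point_estimate + z * se)
}

# Golf ball example from above
ci <- two_sample_z_ci(275, 258, 15, 20, 120, 80)
print(round(ci, 4))
```

Changing `conf` (for example to 0.99) gives an interval at another confidence level without repeating the arithmetic.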
Now do a hypothesis test:
H0: \(\mu_1 - \mu_2 \le 0\) HA: \(\mu_1 - \mu_2 > 0\)
z = ((275 - 258) - 0)/sqrt(15^2/120+20^2/80)
z
## [1] 6.483546
critical value approach
# upper tail test
z_critical = qnorm(0.01, lower.tail=F)
z_critical
## [1] 2.326348
The z value is bigger than the critical value, so it falls in the rejection region; therefore reject H0 at the 1% level of significance. The mean driving distance is greater than that of the competitor.
p value approach
# again, an upper-tail test: look at the upper tail
p = pnorm(z, lower.tail=F)
p
## [1] 4.479589e-11
The p value is less than 1% (it is a very, very small number), so we reject H0 at the 1% level of significance. The mean driving distance is greater.
When the population standard deviations are unknown, we have to use the t distribution.
The calculation for the degrees of freedom is horrible!
Over five years, do male employees have a mean hourly wage exceeding that of female employees?
H0: male wage <= female wage HA: male wage > female wage
For males: n1 = 44, xbar1 = 9.25, s1 = 1.00
For females: n2 = 32, xbar2 = 8.70, s2 = 0.80
mu1 = mean hourly wage of male employees; mu2 = mean hourly wage of female employees
H0: \(\mu_1 - \mu_2 \le 0\) HA: \(\mu_1 - \mu_2 > 0\)
calculate test statistic
t = ((9.25 - 8.70)-0)/ sqrt(1^2/44+0.80^2/32)
t
## [1] 2.660787
calculate degrees of freedom
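df calculator function (degrees of freedom for two samples with unequal variances, rounded down):
df_f = function(s1, s2, n1, n2) {
  v1 = s1^2/n1; v2 = s2^2/n2   # estimated variance of each sample mean
  floor((v1 + v2)^2 / (v1^2/(n1-1) + v2^2/(n2-1)))
}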
dof = df_f(1.00, 0.80, 44, 32)
dof
## [1] 73
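For reference, the expression computed by `df_f` can be written out as below. It has the form of the Welch-Satterthwaite approximation; the floor brackets mirror the `floor()` call in the function, which rounds the result down to a whole number.

```latex
\[
df = \left\lfloor
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
     {\dfrac{1}{n_1-1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2-1}\left(\dfrac{s_2^2}{n_2}\right)^2}
\right\rfloor
\]
```

With s1 = 1.00, s2 = 0.80, n1 = 44 and n2 = 32, the ratio comes to roughly 73.3, which rounds down to the 73 shown above.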
t_critical = qt(0.01, dof, lower.tail = F)
t_critical
## [1] 2.378522
Since the t value > the t critical value, reject H0 at a 0.01 level of significance. Male employees have a mean hourly wage exceeding that of female employees. On this evidence, discrimination appears to be present at a 0.01 level of significance.
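The wage example above uses only the critical value approach. As in the earlier examples, the p value approach should lead to the same decision. This is a sketch using the test statistic (2.660787) and degrees of freedom (73) computed above; the variable names are hypothetical.

```r
# Values copied from the wage example above
t_stat <- 2.660787   # test statistic
dof <- 73            # degrees of freedom from df_f

# Upper-tail p value (HA: mu1 - mu2 > 0)
p <- pt(t_stat, dof, lower.tail = FALSE)

print(p)
print(p < 0.01)   # TRUE means reject H0 at the 0.01 level
```

Because only summary statistics are given for this example, `t.test` cannot be called on raw data here; the p value has to come from `pt` directly.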