In this lab exercise, you will learn:
The data set used for this exercise is CPS96_15 from E3.1 of Stock and Watson (2020, e4). It contains 13,201 observations on full-time workers, ages 25-34, whose highest degree is a high school diploma or a B.A./B.S. A detailed description is given in CPS96_15_Description, available in LMS.
rm(list=ls())
Let’s load all the packages needed for this lab exercise (this assumes you’ve already installed them).
#install.packages("openxlsx") # install R package "openxlsx"
library(openxlsx) # load the package
id <- "1dJfsjD9hi90QPjucV8RvKcxgm-kc0Qte"
cps <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download", id),
                 sheet = 1, startRow = 1, colNames = TRUE, rowNames = FALSE)
str(cps)
## 'data.frame': 13201 obs. of 5 variables:
## $ year : num 1996 1996 1996 1996 1996 ...
## $ ahe : num 11.17 8.65 9.62 11.22 9.62 ...
## $ bachelor: num 0 0 1 1 1 1 0 1 0 0 ...
## $ female : num 0 1 1 0 1 0 1 0 0 1 ...
## $ age : num 31 31 27 26 28 32 30 31 31 25 ...
Description of variables:
year: year
ahe: average hourly earnings
bachelor: 1 if worker has a bachelor’s degree; 0 if worker has a high school degree
female: 1 if female; 0 if male
age: age
Get the data for year 1996.
cps_96 <- subset(cps, year==1996)
str(cps_96)
## 'data.frame': 6103 obs. of 5 variables:
## $ year : num 1996 1996 1996 1996 1996 ...
## $ ahe : num 11.17 8.65 9.62 11.22 9.62 ...
## $ bachelor: num 0 0 1 1 1 1 0 1 0 0 ...
## $ female : num 0 1 1 0 1 0 1 0 0 1 ...
## $ age : num 31 31 27 26 28 32 30 31 31 25 ...
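The same row selection can also be written with logical indexing. The sketch below uses a small made-up data frame (toy, with invented values) rather than the CPS data, so it runs without the download; only the column names are borrowed from cps.

```r
# Toy data frame standing in for cps (values are invented for illustration)
toy <- data.frame(
  year     = c(1996, 1996, 2015, 2015, 1996),
  ahe      = c(11.2, 8.7, 21.4, 19.8, 9.6),
  bachelor = c(0, 1, 1, 0, 0)
)

# subset() and logical indexing select the same rows
toy_96_a <- subset(toy, year == 1996)
toy_96_b <- toy[toy$year == 1996, ]
all.equal(toy_96_a, toy_96_b)

# Conditions are combined with & (logical AND)
toy_hs_96 <- toy[toy$bachelor == 0 & toy$year == 1996, ]
nrow(toy_hs_96)
```

subset() is convenient for interactive work because column names can be used directly; bracket indexing is the more explicit form.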
Exercise-1: Get a subset of the data on high school graduates for year 1996.
cps_hs_96 <- subset(cps, bachelor==0 & year==1996)
str(cps_hs_96)
## 'data.frame': 3484 obs. of 5 variables:
## $ year : num 1996 1996 1996 1996 1996 ...
## $ ahe : num 11.17 8.65 14.79 15.89 9.13 ...
## $ bachelor: num 0 0 0 0 0 0 0 0 0 0 ...
## $ female : num 0 1 1 0 1 0 0 1 0 0 ...
## $ age : num 31 31 30 31 25 33 29 32 34 25 ...
Select the variable ahe (average hourly earnings)
ahe_96 <- cps_96$ahe
Exercise-2: Get a subset for year 2015 and select the variable ahe. Define the selected variable as ahe_15.
cps_15 <- subset(cps, year==2015)
ahe_15 <- cps_15$ahe
Obtain the summary statistics of ahe for 1996 and 2015: the sample size (n), the sample mean (\(\overline{Y}_{ahe}\), denoted by yb), the sample standard deviation (\(\hat\sigma_{ahe}\), denoted by sig), and the standard error of the sample mean (\(\hat\sigma_{ahe}/\sqrt{n}\), denoted by se).
n_96 <- length(ahe_96)
yb_96 <- mean(ahe_96)
sig_96 <- sd(ahe_96)
se_96 <- sd(ahe_96)/sqrt(n_96)
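Since the same four statistics are needed for each year, they can be wrapped in a small function. This is only a sketch: summarize_mean is a hypothetical helper, not part of the lab, and y_sim is simulated data standing in for ahe_96.

```r
# Hypothetical helper (not part of the lab): summary statistics for a numeric vector
summarize_mean <- function(y) {
  y <- y[!is.na(y)]            # drop missing values, if any
  n <- length(y)
  sig <- sd(y)                 # sample standard deviation (n - 1 denominator)
  c(n = n, yb = mean(y), sig = sig, se = sig / sqrt(n))
}

set.seed(1)
y_sim <- rnorm(500, mean = 12.7, sd = 6)   # simulated stand-in for ahe_96
stats_sim <- summarize_mean(y_sim)
print(stats_sim)
```

With the real data, summarize_mean(ahe_96) and summarize_mean(ahe_15) would return the values computed step by step in this section.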
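The formula for the CI for the population mean is \[\overline{Y} \pm z_{\alpha/2} \cdot \frac{{\hat\sigma_{Y}}}{\sqrt{n}}\] where \(z_{\alpha/2}\) is the critical value of the standard normal distribution. In R, qnorm(\(\alpha/2\)) returns the lower-tail quantile \(-z_{\alpha/2}\), which is negative, so in the code that follows yb + z*se is the lower bound and yb - z*se is the upper bound. For example, for a 95% CI, \(\alpha=0.05\), then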
z <- qnorm(0.05/2)
print(z)
## [1] -1.959964
Construct a 95% CI for the population mean of ahe in 1996:
yb_96+z*se_96
## [1] 12.53372
yb_96-z*se_96
## [1] 12.8528
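If you prefer a positive critical value, so that the lower bound is written with a minus sign, qnorm(1 - \(\alpha/2\)) can be used instead. The sketch below wraps this in a hypothetical function, ci_mean, and applies it to simulated data (y_sim is not the CPS sample).

```r
# Hypothetical helper: large-sample CI for a population mean
ci_mean <- function(y, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z_crit <- qnorm(1 - alpha / 2)     # positive critical value, about 1.96 for 95%
  se <- sd(y) / sqrt(length(y))      # standard error of the sample mean
  c(lower = mean(y) - z_crit * se, upper = mean(y) + z_crit * se)
}

set.seed(1)
y_sim <- rnorm(500, mean = 12.7, sd = 6)   # simulated stand-in for ahe_96
ci_sim <- ci_mean(y_sim)
print(ci_sim)
```

By symmetry of the normal distribution, qnorm(1 - alpha/2) equals -qnorm(alpha/2), so both approaches give the same interval.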
Exercise-3: Construct a 95% CI for ahe in 2015.
n_15 <- length(ahe_15) # sample size: n
yb_15 <- mean(ahe_15) # sample mean: y_bar
sig_15 <- sd(ahe_15) # sample sd of y: sigma
se_15 <- sd(ahe_15)/sqrt(n_15) # sample se of y_bar: sigma/sqrt(n)
# 95% CI #
yb_15+z*se_15
## [1] 20.95538
yb_15-z*se_15
## [1] 21.5195
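CIs can also be computed directly in R using t.test. It takes its critical value from the t-distribution with \(n-1\) degrees of freedom rather than the standard normal distribution, so the bounds differ slightly from the manual calculation.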
t.test(ahe_96)$conf.int
## [1] 12.53369 12.85283
## attr(,"conf.level")
## [1] 0.95
The default confidence level in t.test is 95%. If you want a confidence level different from 95%, you can set the confidence level in t.test using conf.level=. For example, construct a 90% CI for ahe in 1996:
t.test(ahe_96, conf.level = 0.9)$conf.int
## [1] 12.55935 12.82717
## attr(,"conf.level")
## [1] 0.9
Exercise-4: Using t.test, construct a 90% CI for ahe in 2015. Compare the 90% CI with the 95% CI: which one is wider?
t.test(ahe_15, conf.level = 0.9)$conf.int
## [1] 21.00069 21.47418
## attr(,"conf.level")
## [1] 0.9
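One way to compare two intervals is to compute their widths with diff. The sketch below does this on simulated data (y_sim is an assumption, not the CPS sample).

```r
set.seed(1)
y_sim <- rnorm(500, mean = 21.2, sd = 12)   # simulated stand-in for ahe_15

ci_90 <- t.test(y_sim, conf.level = 0.90)$conf.int
ci_95 <- t.test(y_sim, conf.level = 0.95)$conf.int

# diff() returns upper bound minus lower bound
width_90 <- diff(ci_90)
width_95 <- diff(ci_95)
c(width_90 = width_90, width_95 = width_95)
```

A higher confidence level requires a larger critical value, so the 95% interval is always wider than the 90% interval computed from the same sample.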
Given that \(Y_a\) and \(Y_b\) are independent, the formula for the CI for the difference in population means, \(\mu_{Y_b} - \mu_{Y_a}\), is \[(\overline{Y}_b - \overline{Y}_a) \pm z_{\alpha/2} \cdot \sqrt{\frac{{\hat\sigma_{Y_b}^2}}{n_b} + \frac{{\hat\sigma_{Y_a}^2}}{n_a}} \] where \(\hat{\sigma}_{Y_a}^2 = \frac{1}{n_a-1}\sum_{i=1}^{n_a} (Y_{a,i} - \overline{Y}_a)^2\) and \(\hat{\sigma}_{Y_b}^2 = \frac{1}{n_b-1}\sum_{i=1}^{n_b} (Y_{b,i} - \overline{Y}_b)^2\).
Using t.test, construct a 95% CI for the change in the population means of ahe between 1996 and 2015:
t.test(ahe_15, ahe_96)$conf.int
## [1] 8.220087 8.868268
## attr(,"conf.level")
## [1] 0.95
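The formula for the difference in means can also be applied by hand and compared with t.test. The sketch below uses two simulated samples (y_a and y_b are assumptions standing in for ahe_96 and ahe_15). By default t.test does not assume equal variances (Welch's test), so it uses the same standard error as the formula, but with a t critical value instead of \(z_{\alpha/2}\).

```r
set.seed(1)
y_a <- rnorm(600, mean = 12.7, sd = 6)    # simulated stand-in for ahe_96
y_b <- rnorm(700, mean = 21.2, sd = 12)   # simulated stand-in for ahe_15

diff_means <- mean(y_b) - mean(y_a)
se_diff <- sqrt(var(y_b) / length(y_b) + var(y_a) / length(y_a))
z_crit <- qnorm(0.975)                    # positive 95% critical value

ci_manual <- c(diff_means - z_crit * se_diff, diff_means + z_crit * se_diff)
ci_welch <- as.numeric(t.test(y_b, y_a)$conf.int)   # default: var.equal = FALSE

print(ci_manual)
print(ci_welch)
```

With samples this large the two intervals are nearly identical; the t.test interval is marginally wider because the t critical value exceeds the normal one.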
For any hypothesis test, you first need to specify the null hypothesis and the alternative hypothesis. In the example of testing for the population mean, we have \(H_0: \mu_Y = \mu_0\) vs \(H_a: \mu_Y \neq \mu_0\), where \(\mu_0\) is the hypothesized value (here \(\mu_0 = 0\)). Under the null hypothesis, the t-statistic is: \[t = \frac{\overline{Y}-\mu_{0}}{{\hat\sigma_{\overline{Y}}}}\]
For ahe in 1996, the t-statistic is \[t = \frac{\overline{Y}_{ahe,96} - 0}{\hat\sigma_{\overline{Y}_{ahe,96}}}, \text{ under } H_0.\]
mu_0 <- 0 # the value of mu_Y under the null hypothesis
t_96 <- (yb_96-mu_0)/se_96
print(t_96)
## [1] 155.9386
The p-value is \[p\text{-value} = 2\cdot \Pr(Z\leq-|t^{act}|) = 2\cdot \Pr(Z\leq-155.94).\]
p_96 <- 2*pnorm(-abs(t_96))
print(p_96)
## [1] 0
For a standard normal distribution, \(\Pr(Z \leq z)\) can be computed using pnorm(z). For a \(t\)-distribution with degrees of freedom \(n-k\), \(\Pr(X \leq x)\) can be computed using pt(x, n-k). For an \(F\)-distribution with degrees of freedom \(d_1\) and \(d_2\), pf(x, d1, d2) can be used.
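A short sketch of these three functions side by side may help. The value 1.96 and the degrees of freedom (20, and 1 and 20) are arbitrary choices for illustration.

```r
x <- 1.96

p_norm <- pnorm(x)                     # Pr(Z <= 1.96), standard normal
p_t    <- pt(x, df = 20)               # Pr(X <= 1.96), t with 20 degrees of freedom
p_f    <- pf(x^2, df1 = 1, df2 = 20)   # Pr(F <= 1.96^2), F with (1, 20) degrees of freedom

# Two-sided tail probabilities: the t distribution has heavier tails than the normal
two_sided_norm <- 2 * pnorm(-x)
two_sided_t    <- 2 * pt(-x, df = 20)
c(normal = two_sided_norm, t20 = two_sided_t)

# The square of a t(20) variable follows F(1, 20), so this matches two_sided_t
1 - p_f
```

As the degrees of freedom grow, pt approaches pnorm, which is why the normal approximation works well for a sample as large as this one.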
Conclusion: Since the p-value is numerically 0 (\(<0.05\)), we reject the null hypothesis at the 5% significance level (\(\alpha=0.05\)).
Use t.test:
t.test(ahe_96)
##
## One Sample t-test
##
## data: ahe_96
## t = 155.94, df = 6102, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 12.53369 12.85283
## sample estimates:
## mean of x
## 12.69326
t.test(ahe_96, conf.level=0.9) # 90% CI, corresponding to the 10% significance level
##
## One Sample t-test
##
## data: ahe_96
## t = 155.94, df = 6102, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 12.55935 12.82717
## sample estimates:
## mean of x
## 12.69326
Exercise-5: Conduct the following hypothesis test: \(H_0: \mu_{ahe,96} = 12.7\) vs \(H_a: \mu_{ahe,96} \neq 12.7\).
Hint: The t-statistic is: \[t = \frac{\overline{Y}_{ahe,96} - 12.7}{\hat\sigma_{\overline{Y}_{ahe,96}}}, \text{ under } H_0.\]
mu_0 <- 12.7 # the value of mu_Y under the null hypothesis
t_96 <- (yb_96-mu_0)/se_96
print(t_96)
## [1] -0.08279351
The p-value is
p_96 <- 2*pnorm(-abs(t_96))
print(p_96)
## [1] 0.9340157
Conclusion: Since the p-value (0.934) is much greater than \(0.05\), there is not enough evidence to support the alternative hypothesis. Therefore, we do not reject the null hypothesis at the 5% significance level.
Use t.test:
t.test(ahe_96, mu=12.7)
##
## One Sample t-test
##
## data: ahe_96
## t = -0.082794, df = 6102, p-value = 0.934
## alternative hypothesis: true mean is not equal to 12.7
## 95 percent confidence interval:
## 12.53369 12.85283
## sample estimates:
## mean of x
## 12.69326
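A two-sided test and a CI from the same t.test call always agree: the null value lies inside the 95% CI exactly when the test does not reject at the 5% level. In the output above, 12.7 falls inside the interval from 12.53369 to 12.85283, which matches the decision not to reject. The sketch below checks this correspondence on simulated data (y_sim is an assumption, not the CPS sample).

```r
set.seed(1)
y_sim <- rnorm(500, mean = 12.7, sd = 6)   # simulated stand-in for ahe_96
mu_0 <- 12.7

res <- t.test(y_sim, mu = mu_0)
ci <- res$conf.int

inside_ci   <- ci[1] <= mu_0 && mu_0 <= ci[2]   # is mu_0 inside the 95% CI?
reject_5pct <- res$p.value < 0.05               # does the test reject at 5%?

c(inside_ci = inside_ci, reject_5pct = reject_5pct)
```

Exactly one of the two values is TRUE, whichever sample is drawn.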
For an upper-tailed test, \[H_0: \mu_Y = \mu_0 \quad\text{vs} \quad H_a: \mu_Y > \mu_0.\] For a lower-tailed test, \[H_0: \mu_Y = \mu_0 \quad\text{vs} \quad H_a: \mu_Y < \mu_0.\] The p-value is computed as follows:
upper-tailed: \(p\text{-value}= \Pr(Z > t^{act}) = 1- \Phi(t^{act})\)
lower-tailed: \(p\text{-value}= \Pr(Z\leq t^{act}) = \Phi(t^{act})\)
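The two-tailed, upper-tailed, and lower-tailed formulas can be collected in one function. This is a sketch only: p_value_z is a hypothetical helper, and its alternative argument simply mirrors the option names used by t.test.

```r
# Hypothetical helper: large-sample p-value from an observed t-statistic
p_value_z <- function(t_act, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pnorm(-abs(t_act)),              # 2 * Pr(Z <= -|t_act|)
    greater   = pnorm(t_act, lower.tail = FALSE),    # Pr(Z > t_act)
    less      = pnorm(t_act)                         # Pr(Z <= t_act)
  )
}

p_two     <- p_value_z(1.5, "two.sided")
p_greater <- p_value_z(1.5, "greater")
p_less    <- p_value_z(1.5, "less")
c(two.sided = p_two, greater = p_greater, less = p_less)
```

For a positive t-statistic the upper-tailed p-value is half the two-tailed one, and the upper-tailed and lower-tailed p-values always sum to 1.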
The t-statistic is the same for a two-tailed test and a one-tailed test. From Example 1, we have \(t_{96}=155.94\). Thus, the p-value for the upper-tailed test of \(H_0: \mu_{ahe,96} = 0\) vs \(H_a: \mu_{ahe,96} > 0\) is \[p\text{-value} = \Pr(Z\geq t^{act}) = \Pr(Z\geq 155.94).\]
p_96 <- 1-pnorm(155.94)
print(p_96)
## [1] 0
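A printed p-value of 0 is a floating-point result, not an exact zero. As a side note, the sketch below shows two standard arguments of pnorm that help with very small tail probabilities: lower.tail = FALSE computes the upper tail directly, and log.p = TRUE returns the probability on the log scale. The value t_big = 10 is an arbitrary illustration.

```r
t_big <- 10

p_subtract <- 1 - pnorm(t_big)                    # rounds to exactly 0 in double precision
p_tail     <- pnorm(t_big, lower.tail = FALSE)    # tiny but strictly positive

# For a statistic as extreme as 155.94 even the direct tail underflows to 0,
# but its logarithm is still representable
log_p <- pnorm(155.94, lower.tail = FALSE, log.p = TRUE)

c(p_subtract = p_subtract, p_tail = p_tail, log_p = log_p)
```

None of this changes the conclusion of the test; it only explains why R prints 0 here while t.test reports p-value < 2.2e-16.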
Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level (\(\alpha=0.05\)).
Use t.test.
t.test(ahe_96, alternative="greater")
##
## One Sample t-test
##
## data: ahe_96
## t = 155.94, df = 6102, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 12.55935 Inf
## sample estimates:
## mean of x
## 12.69326
Exercise-6: Conduct the following hypothesis test: \(H_0: \mu_{ahe,96} = 13\) vs \(H_a: \mu_{ahe,96} < 13\).
Hint: In t.test, we set alternative="less".
The t-statistic is
mu_0 <- 13 # the value of mu_Y under the null hypothesis
t_96 <- (yb_96-mu_0)/se_96
print(t_96)
## [1] -3.768339
The p-value for this lower-tailed test is \[p\text{-value} = \Pr(Z\leq t^{act}) = \Pr(Z\leq -3.77).\]
p_96 <- pnorm(t_96)
print(p_96)
## [1] 8.216877e-05
Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level (\(\alpha=0.05\)).
Use t.test.
t.test(ahe_96, mu=13, alternative="less")
##
## One Sample t-test
##
## data: ahe_96
## t = -3.7683, df = 6102, p-value = 8.294e-05
## alternative hypothesis: true mean is less than 13
## 95 percent confidence interval:
## -Inf 12.82717
## sample estimates:
## mean of x
## 12.69326
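The manual p-value (8.216877e-05) and the one reported by t.test (8.294e-05) differ slightly because the manual calculation uses the standard normal distribution, while t.test uses the t-distribution with \(n-1\) degrees of freedom. The sketch below reproduces both from the t-statistic and degrees of freedom printed above; the variable names are chosen for illustration.

```r
t_act <- -3.768339   # t-statistic printed above
df <- 6102           # degrees of freedom reported by t.test (n - 1)

p_normal <- pnorm(t_act)         # normal approximation: Pr(Z <= t_act)
p_t      <- pt(t_act, df = df)   # t distribution: Pr(X <= t_act)

c(p_normal = p_normal, p_t = p_t)
```

With more than 6,000 observations the two distributions are almost indistinguishable, so the choice does not affect the conclusion.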