Lab_Ch3 - Review of Statistics (with solutions)

In this lab exercise, you will learn:

How to select a subset of data
- subset
How to compute confidence intervals using R
- qnorm, qt, qf
- t.test
How to conduct a statistical test using R
- pnorm, pt, pf
- t.test

The data set used for this exercise is CPS96_15 from E3.1 of Stock and Watson (2020, e4). It contains 13,201 observations on full-time workers, ages 25-34, with a high school diploma or a B.A./B.S. as their highest degree. A detailed description is given in CPS96_15_Description, available on the LMS.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this lab exercise (this assumes you’ve already installed them).

#install.packages("openxlsx")   # install R package "openxlsx"
library(openxlsx)               # load the package

1. Data Preparation

1.1 Import the data

id <- "1dJfsjD9hi90QPjucV8RvKcxgm-kc0Qte"
cps <- read.xlsx(sprintf("https://docs.google.com/uc?id=%s&export=download",id),
                          sheet=1,startRow=1,colNames=TRUE,rowNames=FALSE)
str(cps)

## 'data.frame':    13201 obs. of  5 variables:
##  $ year    : num  1996 1996 1996 1996 1996 ...
##  $ ahe     : num  11.17 8.65 9.62 11.22 9.62 ...
##  $ bachelor: num  0 0 1 1 1 1 0 1 0 0 ...
##  $ female  : num  0 1 1 0 1 0 1 0 0 1 ...
##  $ age     : num  31 31 27 26 28 32 30 31 31 25 ...

Description of variables:

year: year
ahe: average hourly earnings
bachelor: 1 if worker has a bachelor’s degree; 0 if worker has a high school degree
female: 1 if female; 0 if male
age: age

1.2 Subsetting Data

Get the data for year 1996.

cps_96 <- subset(cps, year==1996)

str(cps_96)

## 'data.frame':    6103 obs. of  5 variables:
##  $ year    : num  1996 1996 1996 1996 1996 ...
##  $ ahe     : num  11.17 8.65 9.62 11.22 9.62 ...
##  $ bachelor: num  0 0 1 1 1 1 0 1 0 0 ...
##  $ female  : num  0 1 1 0 1 0 1 0 0 1 ...
##  $ age     : num  31 31 27 26 28 32 30 31 31 25 ...

Exercise-1: Get a subset of the data on high school graduates for year 1996.

cps_hs_96 <- subset(cps, bachelor==0 & year==1996)
str(cps_hs_96)

## 'data.frame':    3484 obs. of  5 variables:
##  $ year    : num  1996 1996 1996 1996 1996 ...
##  $ ahe     : num  11.17 8.65 14.79 15.89 9.13 ...
##  $ bachelor: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ female  : num  0 1 1 0 1 0 0 1 0 0 ...
##  $ age     : num  31 31 30 31 25 33 29 32 34 25 ...
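As a cross-check on how subset works, the sketch below compares it with logical indexing using square brackets. The data frame toy and its values are made up for illustration and are not part of the CPS data.

```r
# Toy data frame (hypothetical values, not the CPS data)
toy <- data.frame(
  year     = c(1996, 1996, 2015, 2015, 1996),
  ahe      = c(11.2, 8.7, 20.1, 25.3, 14.8),
  bachelor = c(0, 1, 0, 1, 0)
)

# subset() keeps the rows for which the condition is TRUE
hs_96_a <- subset(toy, bachelor == 0 & year == 1996)

# The same selection written with logical indexing
hs_96_b <- toy[toy$bachelor == 0 & toy$year == 1996, ]

nrow(hs_96_a)                 # number of rows kept
all.equal(hs_96_a, hs_96_b)   # do the two approaches agree?
```

Inside subset, column names can be used directly, whereas bracket indexing needs the toy$ prefix.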

1.3 Selecting Variables

Select the variable ahe (average hourly earnings)

ahe_96 <- cps_96$ahe

Exercise-2: Get a subset for year 2015 and select the variable ahe. Define the selected variable as ahe_15.

cps_15 <- subset(cps, year==2015)
ahe_15 <- cps_15$ahe

2. Summary Statistics

Obtain the summary statistics of ahe for 1996: the sample size (\(n\), denoted by n), the sample mean (\(\overline{Y}_{ahe}\), denoted by yb), the sample standard deviation (\(\hat\sigma_{ahe}\), denoted by sig), and the standard error of the sample mean (\(\hat\sigma_{ahe}/\sqrt{n}\), denoted by se). The same quantities for 2015 are computed in Exercise-3.

Sample size:

n_96 <- length(ahe_96)

Sample mean:

yb_96 <- mean(ahe_96)

Sample standard deviation:

sig_96 <- sd(ahe_96)

Standard error of the sample mean:

se_96 <- sd(ahe_96)/sqrt(n_96)
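To make the formulas behind sd and the standard error explicit, here is a minimal sketch on a short made-up vector y (hypothetical values, not the CPS data). It relies on the fact that sd uses the \(n-1\) denominator.

```r
# Hypothetical hourly earnings, for illustration only
y <- c(11.2, 8.7, 14.8, 15.9, 9.1)

n  <- length(y)                                 # sample size
yb <- mean(y)                                   # sample mean
sig_manual <- sqrt(sum((y - yb)^2) / (n - 1))   # sample sd, n - 1 denominator
se <- sig_manual / sqrt(n)                      # standard error of the sample mean

all.equal(sig_manual, sd(y))   # compare the manual sd with sd()
se
```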

3. Confidence Intervals

3.1 CI for the Population Mean

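The formula of the CI for the population mean is \[\overline{Y} \pm z_{\alpha/2} \cdot \frac{{\hat\sigma_{Y}}}{\sqrt{n}}\] where \(z_{\alpha/2}\) is the standard normal critical value. In R, qnorm(\(\alpha/2\)) returns the lower-tail quantile \(-z_{\alpha/2}\), so the value computed below is negative: adding z times the standard error gives the lower bound, and subtracting it gives the upper bound. For example, for a 95% CI, \(\alpha=0.05\), then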

z <- qnorm(0.05/2)
print(z)

## [1] -1.959964

Construct a 95% CI for the population mean of ahe in 1996:

yb_96+z*se_96

## [1] 12.53372

yb_96-z*se_96

## [1] 12.8528
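The steps above can be wrapped in a small function. This is only a sketch: the helper name ci_mean_z and the simulated data y_sim are assumptions made for illustration, and the helper uses the positive critical value qnorm(1 - alpha/2) so that the lower bound is the one with the minus sign.

```r
# Hypothetical helper: large-sample (normal) CI for a population mean
ci_mean_z <- function(y, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z  <- qnorm(1 - alpha / 2)        # positive critical value
  se <- sd(y) / sqrt(length(y))     # standard error of the sample mean
  c(lower = mean(y) - z * se, upper = mean(y) + z * se)
}

# Simulated data standing in for ahe (not the CPS data)
set.seed(1)
y_sim <- rnorm(500, mean = 12.7, sd = 6)

ci_mean_z(y_sim)                      # 95% CI
ci_mean_z(y_sim, conf.level = 0.90)   # 90% CI
```

Because the normal distribution is symmetric, qnorm(0.975) and qnorm(0.025) differ only in sign, so both conventions give the same interval.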

Exercise-3: Construct a 95% CI for ahe in 2015.

n_15 <- length(ahe_15)  # sample size: n
yb_15 <- mean(ahe_15)   # sample mean: y_bar
sig_15 <- sd(ahe_15)    # sample sd of y: sigma
se_15 <- sd(ahe_15)/sqrt(n_15)   # sample se of y_bar: sigma/sqrt(n)

# 95% CI #
yb_15+z*se_15

## [1] 20.95538

yb_15-z*se_15

## [1] 21.5195

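CIs can also be computed directly in R using t.test. Note that t.test uses critical values from the \(t\)-distribution with \(n-1\) degrees of freedom rather than from the standard normal, so its bounds differ slightly from those computed above.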

t.test(ahe_96)$conf.int

## [1] 12.53369 12.85283
## attr(,"conf.level")
## [1] 0.95
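The small gap between this interval and the one built with qnorm comes from the critical value. The sketch below, on simulated data (y_sim is an assumption, not the CPS data), rebuilds the t.test interval by hand with qt.

```r
# Simulated data standing in for ahe (not the CPS data)
set.seed(2)
y_sim <- rnorm(200, mean = 12.7, sd = 6)

n  <- length(y_sim)
se <- sd(y_sim) / sqrt(n)
tc <- qt(0.975, df = n - 1)          # t critical value with n - 1 degrees of freedom

ci_manual <- c(mean(y_sim) - tc * se, mean(y_sim) + tc * se)
ci_ttest  <- as.numeric(t.test(y_sim)$conf.int)

all.equal(ci_manual, ci_ttest)       # compare manual and t.test intervals
qt(0.975, df = n - 1) - qnorm(0.975) # gap between t and normal critical values
```

The t critical value is always larger than the normal one, and the gap shrinks as \(n\) grows, which is why the two approaches nearly coincide for a sample as large as ahe_96.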

The default confidence level in t.test is 95%. If you want a confidence level different from 95%, you can set the confidence level in t.test using conf.level=. For example, construct a 90% CI for ahe in 1996:

t.test(ahe_96, conf.level = 0.9)$conf.int

## [1] 12.55935 12.82717
## attr(,"conf.level")
## [1] 0.9

Exercise-4: Using t.test, construct a 90% CI for ahe in 2015. Compare the 90% CI with the 95% CI from Exercise-3. Which one is wider?

t.test(ahe_15, conf.level = 0.9)$conf.int

## [1] 21.00069 21.47418
## attr(,"conf.level")
## [1] 0.9
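Comparing the results, the 95% CI from Exercise-3 (20.95538 to 21.5195) is wider than the 90% CI (21.00069 to 21.47418): a higher confidence level requires a larger critical value. A minimal sketch of the same comparison on simulated data (y_sim is an assumption, not the CPS data):

```r
# Simulated data standing in for ahe_15 (not the CPS data)
set.seed(3)
y_sim <- rnorm(300, mean = 21, sd = 12)

ci_90 <- as.numeric(t.test(y_sim, conf.level = 0.90)$conf.int)
ci_95 <- as.numeric(t.test(y_sim, conf.level = 0.95)$conf.int)

width_90 <- ci_90[2] - ci_90[1]   # width of the 90% CI
width_95 <- ci_95[2] - ci_95[1]   # width of the 95% CI

width_90
width_95
```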

3.2 CI for the Difference in the Population Means

Given that \(Y_a\) and \(Y_b\) are independent, the formula of the CI for the difference in the population means is \[(\overline{Y}_b - \overline{Y}_a) \pm z_{\alpha/2} \cdot \sqrt{\frac{{\hat\sigma_{Y_b}^2}}{n_b} + \frac{{\hat\sigma_{Y_a}^2}}{n_a}} \] where \(\hat{\sigma}_{Y_a}^2 = \frac{1}{n_a-1}\sum_{i=1}^{n_a} (Y_{a,i} - \overline{Y}_a)^2\) and \(\hat{\sigma}_{Y_b}^2 = \frac{1}{n_b-1}\sum_{i=1}^{n_b} (Y_{b,i} - \overline{Y}_b)^2\).

Using t.test, construct a 95% CI for the change in the population means of ahe between 1996 and 2015:

t.test(ahe_15, ahe_96)$conf.int

## [1] 8.220087 8.868268
## attr(,"conf.level")
## [1] 0.95
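The formula above can be applied by hand and compared with t.test. In this sketch, y_a and y_b are simulated samples that play the roles of ahe_96 and ahe_15 (assumptions for illustration, not the CPS data). By default, t.test with two samples does not assume equal variances, so its standard error matches the one in the formula, but it uses a \(t\) critical value instead of \(z_{\alpha/2}\).

```r
# Simulated samples (not the CPS data)
set.seed(4)
y_a <- rnorm(400, mean = 12.7, sd = 6)    # plays the role of ahe_96
y_b <- rnorm(500, mean = 21.2, sd = 12)   # plays the role of ahe_15

d    <- mean(y_b) - mean(y_a)                                   # difference in sample means
se_d <- sqrt(var(y_b) / length(y_b) + var(y_a) / length(y_a))   # standard error of the difference
z    <- qnorm(0.975)

ci_z <- c(d - z * se_d, d + z * se_d)     # normal-based 95% CI
tt   <- t.test(y_b, y_a)                  # unequal-variance (Welch) t-test

ci_z
as.numeric(tt$conf.int)
```

The two intervals share the same center and standard error; the t.test interval is slightly wider because its critical value is larger than qnorm(0.975).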

4. Hypothesis Testing

4.1 The Two-Sided Test

For any hypothesis test, you first need to specify the null hypothesis and the alternative hypothesis. When testing the population mean, we have \(H_0: \mu_{Y} = \mu_0\) vs \(H_a: \mu_{Y} \neq \mu_0\). The t-statistic, evaluated under the null hypothesis, is \[t = \frac{\overline{Y}-\mu_{0}}{{\hat\sigma_{\overline{Y}}}}\] where \(\hat\sigma_{\overline{Y}} = \hat\sigma_Y/\sqrt{n}\) is the standard error of the sample mean.

Example 1: \(H_0: \mu_{ahe,96} = 0\) vs \(H_a: \mu_{ahe,96} \neq 0\)

The t-statistic is \[t = \frac{\overline{Y}_{ahe,96} - 0}{\hat\sigma_{\overline{Y}_{ahe,96}}}, \text{ under } H_0.\]

mu_0 <- 0        # the value of mu_Y under the null hypothesis
t_96 <- (yb_96-mu_0)/se_96
print(t_96)

## [1] 155.9386

The p-value is \[p\text{-value} = 2\cdot \Pr(Z\leq-|t^{act}|) = 2\cdot \Pr(Z\leq-155.94).\]

p_96 <- 2*pnorm(-abs(t_96))
print(p_96)

## [1] 0

For a standard normal distribution, \(\Pr(Z \leq z)\) can be computed using pnorm(z). For a \(t\)-distribution with degrees of freedom \(n-k\), \(\Pr(X \leq x)\) can be computed using pt(x, n-k). For an \(F\)-distribution with degrees of freedom \(d_1\) and \(d_2\), pf(x, d1, d2) can be used.
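A short sketch of the three distribution functions, using a made-up test statistic t_act and degrees of freedom dof (both hypothetical). It also illustrates a standard relationship: the square of a \(t\)-distributed variable with dof degrees of freedom follows an \(F\)-distribution with 1 and dof degrees of freedom, so the two-sided \(t\) p-value equals the upper-tail \(F\) probability of \(t^2\).

```r
t_act <- -1.96   # hypothetical test statistic
dof   <- 30      # hypothetical degrees of freedom

p_norm <- 2 * pnorm(-abs(t_act))                     # two-sided p-value, standard normal
p_t    <- 2 * pt(-abs(t_act), dof)                   # two-sided p-value, t-distribution
p_f    <- pf(t_act^2, 1, dof, lower.tail = FALSE)    # upper-tail F probability of t^2

p_norm
p_t
p_f
```

With few degrees of freedom the \(t\)-based p-value is larger than the normal one; the two get closer as the degrees of freedom grow.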

Conclusion: Since the p-value is numerically zero (\(<0.05\)), we reject the null hypothesis at the 5% significance level (\(\alpha=0.05\)).
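The printed value of 0 is a floating-point underflow, not an exact zero: the true tail probability is positive but far too small to be stored as a double. One way to see this, sketched below, is to ask pnorm for the logarithm of the probability with log.p = TRUE.

```r
t_act <- 155.94   # t-statistic from Example 1

p_two     <- 2 * pnorm(-abs(t_act))                       # underflows to 0
log_p_two <- log(2) + pnorm(-abs(t_act), log.p = TRUE)    # natural log of the p-value

p_two
log_p_two
```

The log p-value is a large negative but finite number, which confirms that the p-value is positive yet negligibly small.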

Use t.test:

t.test(ahe_96)

## 
##  One Sample t-test
## 
## data:  ahe_96
## t = 155.94, df = 6102, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  12.53369 12.85283
## sample estimates:
## mean of x 
##  12.69326

t.test(ahe_96, conf.level=0.9)    # at 10% significance level

## 
##  One Sample t-test
## 
## data:  ahe_96
## t = 155.94, df = 6102, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  12.55935 12.82717
## sample estimates:
## mean of x 
##  12.69326

Exercise-5: Conduct the following hypothesis test: \(H_0: \mu_{ahe,96} = 12.7\) vs \(H_a: \mu_{ahe,96} \neq 12.7\).

Hint: The t-statistic is: \[t = \frac{\overline{Y}_{ahe,96} - 12.7}{\hat\sigma_{\overline{Y}_{ahe,96}}}, \text{ under } H_0.\]

mu_0 <- 12.7        # the value of mu_Y under the null hypothesis
t_96 <- (yb_96-mu_0)/se_96
print(t_96)

## [1] -0.08279351

The p-value is

p_96 <- 2*pnorm(-abs(t_96))
print(p_96)

## [1] 0.9340157

Conclusion: Since the p-value is much greater than \(0.05\), there is not enough evidence to support the alternative hypothesis. Therefore, we do not reject the null hypothesis at the 5% significance level.

Use t.test:

t.test(ahe_96, mu=12.7)

## 
##  One Sample t-test
## 
## data:  ahe_96
## t = -0.082794, df = 6102, p-value = 0.934
## alternative hypothesis: true mean is not equal to 12.7
## 95 percent confidence interval:
##  12.53369 12.85283
## sample estimates:
## mean of x 
##  12.69326
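The output also shows the link between a two-sided test and a CI: the hypothesized value 12.7 lies inside the 95% CI (12.53369 to 12.85283), which matches the decision not to reject at the 5% level. A minimal sketch of this equivalence on simulated data (y_sim is an assumption, not the CPS data):

```r
# Simulated data standing in for ahe_96 (not the CPS data)
set.seed(5)
y_sim <- rnorm(250, mean = 12.7, sd = 6)
mu_0  <- 12.7

tt <- t.test(y_sim, mu = mu_0)

inside_ci    <- tt$conf.int[1] <= mu_0 && mu_0 <= tt$conf.int[2]   # is mu_0 in the 95% CI?
not_rejected <- tt$p.value >= 0.05                                  # is H0 kept at the 5% level?

inside_ci
not_rejected
```

The two logical values always agree, since both are based on the same t-statistic and critical value.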

4.2 The One-Sided Test

For an upper-tailed test, \[H_0: \mu_{Y} = \mu_0 \quad\text{vs} \quad H_a: \mu_{Y} > \mu_0.\] For a lower-tailed test, \[H_0: \mu_{Y} = \mu_0 \quad\text{vs} \quad H_a: \mu_{Y} < \mu_0.\] The p-value can be computed by:

upper-tailed: \(p\text{-value}= \Pr(Z > t^{act}) = 1- \Phi(t^{act})\)
lower-tailed: \(p\text{-value}= \Pr(Z\leq t^{act}) = \Phi(t^{act})\)
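These two formulas map directly onto pnorm. The sketch below uses a made-up test statistic t_act (hypothetical) and the lower.tail argument of pnorm, which returns the upper-tail probability when set to FALSE.

```r
t_act <- 1.5   # hypothetical test statistic

p_upper <- pnorm(t_act, lower.tail = FALSE)   # Pr(Z > t_act) = 1 - Phi(t_act)
p_lower <- pnorm(t_act)                       # Pr(Z <= t_act) = Phi(t_act)
p_two   <- 2 * pnorm(-abs(t_act))             # two-sided p-value, for comparison

p_upper
p_lower
p_two
```

The two one-sided p-values sum to one, and the two-sided p-value is twice the smaller of them.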

Example 2: \(H_0: \mu_{ahe,96} = 0\) vs \(H_a: \mu_{ahe,96} > 0\)

The t-statistic is the same for the two-tailed test and the one-tailed test. From Example 1, we have \(t_{96}=155.94\). Thus, the p-value for this upper-tailed test is \[p\text{-value} = \Pr(Z > t^{act}) = \Pr(Z > 155.94).\]

p_96 <- 1-pnorm(155.94)
print(p_96)

## [1] 0

Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level (\(\alpha=0.05\)).

Use t.test.

t.test(ahe_96, alternative="greater")

## 
##  One Sample t-test
## 
## data:  ahe_96
## t = 155.94, df = 6102, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  12.55935      Inf
## sample estimates:
## mean of x 
##  12.69326

Exercise-6: Conduct the following hypothesis test: \(H_0: \mu_{ahe,96} = 13\) vs \(H_a: \mu_{ahe,96} < 13\).

Hint: In t.test, we set alternative="less".

The t-statistic is

mu_0 <- 13        # the value of mu_Y under the null hypothesis
t_96 <- (yb_96-mu_0)/se_96
print(t_96)

## [1] -3.768339

The p-value for this lower-tailed test is \[p\text{-value} = \Pr(Z\leq t^{act}) = \Pr(Z\leq -3.77).\]

p_96 <- pnorm(t_96)
print(p_96)

## [1] 8.216877e-05
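Since this p-value is below 0.05, the null hypothesis is rejected at the 5% significance level. Following the hint, the same test can be run with t.test by setting mu and alternative. The sketch below does this on simulated data (y_sim is an assumption, not the CPS data) and reproduces the t.test p-value by hand with pt, since t.test uses the \(t\)-distribution with \(n-1\) degrees of freedom.

```r
# Simulated data standing in for ahe_96 (not the CPS data)
set.seed(6)
y_sim <- rnorm(300, mean = 12.7, sd = 6)
mu_0  <- 13

n     <- length(y_sim)
t_act <- (mean(y_sim) - mu_0) / (sd(y_sim) / sqrt(n))   # t-statistic

p_manual <- pt(t_act, df = n - 1)                        # lower-tail p-value, t-distribution
tt <- t.test(y_sim, mu = mu_0, alternative = "less")     # lower-tailed test

p_manual
tt$p.value
```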