In this handout, I use ChatGPT 4.0 and the KeyMate.AI Search plugin to synthesize the information from the textbook, give examples, and create R code. All prompts are exact, but some of the output may be modified for correctness.

Hanck, Arnold, Gerber, and Schmelzer (2023). Introduction to Econometrics with R. https://www.econometrics-with-r.org/index.html. This book is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Prompt

Synthesize the information in these websites. Include all relevant formulas and definitions.

https://www.econometrics-with-r.org/3.1-estimation-of-the-population-mean.html

https://www.econometrics-with-r.org/3.2-potsm.html

https://www.econometrics-with-r.org/3.3-hypothesis-tests-concerning-the-population-mean.html

3.1 Estimation of the Population Mean

  • Population Mean (μ): The average of all possible observations in a population.
  • Sample Mean (x̄): The average of observations in a sample.
  • Central Limit Theorem: For a sufficiently large sample size, the distribution of the sample mean will be approximately normally distributed, regardless of the population’s distribution.
  • Standard Error (SE): Measures the dispersion of the sample mean. It is calculated as: \[ SE = \frac{σ}{\sqrt{n}} \] Where:
    • \(σ\) is the population standard deviation.
    • \(n\) is the sample size.
  • Confidence Interval: Gives an estimated range of values which is likely to include the population mean. It is calculated as: \[ x̄ ± z \times SE \] Where:
    • \(z\) is the z-value corresponding to the desired confidence level.
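
The Central Limit Theorem and the standard error formula above can be illustrated with a small simulation. This is only a sketch under assumed settings: the exponential population, the seed, the sample size, and the number of replications are all illustrative choices, not values from the textbook.

```r
# Hypothetical simulation: sample means drawn from a skewed (exponential) population
set.seed(1)        # arbitrary seed, used only for reproducibility
n     <- 50        # size of each sample
reps  <- 10000     # number of simulated samples
sigma <- 1         # standard deviation of an Exponential(rate = 1) population

# Draw `reps` samples of size n and keep the mean of each one
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Compare the theoretical standard error with the spread of the simulated means
se_theory    <- sigma / sqrt(n)
se_simulated <- sd(sample_means)
c(theory = se_theory, simulated = se_simulated)

# Share of sample means within 1.96 standard errors of the true mean (which is 1)
coverage <- mean(abs(sample_means - 1) <= 1.96 * se_theory)
coverage
```

If the normal approximation is reasonable, the two standard errors should be close and the share should be near 0.95, even though the population itself is far from normal.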

3.2 Point of Symmetry and the Median

  • Point of Symmetry (POS): A value for which the distribution of the data is symmetric. For a normal distribution, the POS is equal to the mean.
  • Median: The value that separates the higher half from the lower half of a data sample. For a symmetric distribution, the median is equal to the POS.
  • Skewness: A measure of the asymmetry of the distribution of a dataset. Positive skewness indicates that the tail on the right side is longer than the left side, and vice versa.
  • Kurtosis: Measures the “tailedness” of the distribution. High kurtosis indicates a distribution with tails heavier than a normal distribution, and low kurtosis indicates a distribution with tails lighter than a normal distribution.

3.3 Hypothesis Tests Concerning the Population Mean

  • Null Hypothesis (H0): A statement that there is no effect or no difference. It is a starting point for statistical significance testing.
  • Alternative Hypothesis (H1): A statement that indicates the presence of an effect or difference.
  • Test Statistic: A standardized value that is calculated from sample data during a hypothesis test. It helps in deciding whether to reject the null hypothesis.
  • P-value: The probability of observing a test statistic as extreme as, or more extreme than, the statistic computed from the sample, given that the null hypothesis is true.
  • Decision Rule: If the p-value is less than the chosen significance level (α), then the null hypothesis is rejected in favor of the alternative hypothesis.
  • Type I Error: Occurs when the null hypothesis is true, but it is rejected.
  • Type II Error: Occurs when the null hypothesis is false, but it is not rejected.
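
The link between the significance level and the Type I error can be checked by simulation. The sketch below assumes a normal population whose mean really is 0, so the null hypothesis is true by construction; the seed, sample size, and number of replications are arbitrary illustrative choices.

```r
# Hypothetical simulation: how often is a true null hypothesis rejected?
set.seed(42)     # arbitrary seed, used only for reproducibility
alpha <- 0.05    # chosen significance level
reps  <- 5000    # number of simulated studies

# In every replication H0: mu = 0 is true, so each rejection is a Type I error
p_values <- replicate(reps, t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)

# Apply the decision rule and compute the share of (wrong) rejections
type1_rate <- mean(p_values < alpha)
type1_rate
```

The rejection rate should land close to α, which is what it means to control the Type I error at the 5% level.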

This synthesis provides an overview of the concepts related to the estimation of the population mean, point of symmetry, median, and hypothesis testing concerning the population mean.

Prompt

Go into more detail about the population mean, standard error, and confidence intervals.

Population Mean (μ)

The population mean, denoted by \(\mu\), represents the average of all possible observations in a population. For instance, consider an economic variable like the hourly earnings of college graduates. To precisely calculate \(\mu\), one would need to interview every working graduate in the economy. However, due to time and cost constraints, this is often not feasible. Instead, a random sample is drawn, and the sample mean is used as an estimator for the population mean.

Standard Error (SE)

The standard error measures the dispersion or variability of the sample mean. It provides an indication of how much the sample mean is expected to vary from the true population mean. The formula for the standard error is: \[ SE = \frac{σ}{\sqrt{n}} \] Where:

  • \(σ\) is the population standard deviation.
  • \(n\) is the sample size.

The standard error is crucial because it helps in determining the precision of the sample mean as an estimate of the population mean. A smaller SE indicates that the sample mean is a more precise estimator of the population mean.

Confidence Interval

A confidence interval provides a range of values within which the true population mean is likely to fall, with a certain level of confidence. The formula for a confidence interval (for large sample sizes) is: \[ x̄ ± z \times SE \] Where:

  • \(x̄\) is the sample mean.
  • \(z\) is the z-value corresponding to the desired confidence level (e.g., 1.96 for a 95% confidence level).
  • \(SE\) is the standard error.

The width of the confidence interval provides insight into the precision of the estimate. A narrower confidence interval indicates a more precise estimate of the population mean. Conversely, a wider interval suggests more uncertainty.

In the context of the provided content, consider hourly earnings of college graduates. If we were to draw a sample and calculate the sample mean, the standard error would give us an idea of how much this sample mean might vary from the true average earnings of all college graduates. Using this standard error, we can then construct a confidence interval to estimate a range within which the true average earnings likely fall.

In summary, while the population mean provides an overall average, the standard error and confidence intervals offer insights into the reliability and precision of estimates derived from sample data.
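
As a rough illustration of these formulas, the sketch below computes a sample mean, its standard error, and a 95% confidence interval. The earnings data are simulated and purely hypothetical (the mean of 25 and standard deviation of 8 are made-up values), and the sample standard deviation stands in for the unknown \(σ\).

```r
# Hypothetical data: simulated hourly earnings for 200 graduates
set.seed(123)                               # arbitrary seed
earnings <- rnorm(200, mean = 25, sd = 8)   # made-up population values

n     <- length(earnings)
x_bar <- mean(earnings)                     # sample mean
se    <- sd(earnings) / sqrt(n)             # standard error, with s in place of sigma

# z-value for a 95% confidence level (approximately 1.96)
z <- qnorm(0.975)

# Confidence interval: x_bar plus or minus z times SE
ci <- c(lower = x_bar - z * se, upper = x_bar + z * se)
ci
```

Raising `n` shrinks `se` and therefore narrows the interval, which is the precision argument made above.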

Prompt

Create a couple real-world, economics related examples of creating hypothesis tests concerning the population mean.

Example 1: Average Household Income

Background: A government agency believes that the average annual household income in a particular city is $50,000. A local economist thinks that this figure is outdated and that the average income has increased due to recent economic growth in the region.

Hypothesis Test:

  • Null Hypothesis (H0): The average annual household income \(\mu\) is $50,000. \(H0: \mu = 50{,}000\)

  • Alternative Hypothesis (H1): The average annual household income \(\mu\) is greater than $50,000. \(H1: \mu > 50{,}000\)

Procedure:

  1. Collect a random sample of annual incomes from households in the city.
  2. Calculate the sample mean and standard error.
  3. Conduct a one-tailed t-test to determine if the sample mean is significantly greater than $50,000.
  4. If the p-value is less than a significance level (e.g., 0.05), reject the null hypothesis.
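
The procedure above might look as follows in R. The income data are simulated and hypothetical (the seed, sample size, mean, and standard deviation are assumptions made for illustration); only the hypothesized value of $50,000 comes from the example.

```r
# Hypothetical sample of 100 household incomes (simulated, not real data)
set.seed(2023)
incomes <- rnorm(100, mean = 52000, sd = 12000)

# One-tailed, one-sample t-test of H0: mu = 50000 against H1: mu > 50000
result <- t.test(incomes, mu = 50000, alternative = "greater")
result$statistic   # t-statistic
result$p.value     # one-sided p-value

# Decision rule at the 5% significance level
reject_h0 <- result$p.value < 0.05
reject_h0
```

The argument `alternative = "greater"` is what makes the test one-tailed; leaving it out would run the default two-sided test.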

Example 2: Unemployment Rate

Background: A recent report suggests that the national unemployment rate is 6%. An economist believes that in a specific state, due to recent layoffs in major industries, the unemployment rate is higher than the national average.

Hypothesis Test:

  • Null Hypothesis (H0): The unemployment rate \(\mu\) in the state is 6%. \(H0: \mu = 6\%\)

  • Alternative Hypothesis (H1): The unemployment rate \(\mu\) in the state is greater than 6%. \(H1: \mu > 6\%\)

Procedure:

  1. Collect a random sample of employment statuses from residents of the state.
  2. Calculate the sample unemployment rate.
  3. Conduct a one-tailed z-test or t-test (depending on the sample size and whether the variance is known) to determine if the sample unemployment rate is significantly greater than 6%.
  4. If the p-value is less than a significance level (e.g., 0.05), reject the null hypothesis.
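
Because an unemployment rate is a proportion, one common way to carry out this procedure is a z-test that uses the standard error implied by the null hypothesis. The sketch below assumes a hypothetical survey of 1,000 labor-force participants with 75 unemployed; these counts are invented for illustration.

```r
# Hypothetical survey results (made-up numbers)
n          <- 1000    # people surveyed
unemployed <- 75      # of whom this many are unemployed

p_hat <- unemployed / n   # sample unemployment rate
p0    <- 0.06             # rate under the null hypothesis

# Standard error of the sample proportion when H0 is true
se0 <- sqrt(p0 * (1 - p0) / n)

# z-statistic and one-sided (upper-tail) p-value
z       <- (p_hat - p0) / se0
p_value <- pnorm(z, lower.tail = FALSE)
c(z = z, p_value = p_value)

# Decision rule at the 5% significance level
p_value < 0.05
```

With these invented counts the z-statistic is just under 2, so the null hypothesis would be rejected at the 5% level; a different sample could easily lead to the opposite decision.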

In both examples, the hypothesis tests help in making informed decisions based on sample data and provide insights into the economic conditions of the region or state in question.

Prompt

Create some example problems where students must calculate the z-statistic and conduct a hypothesis test from data.

Example 1: Exam Scores

Background: The national average score for a standardized exam is known to be 500 with a standard deviation of 100. A school believes that its students, on average, score above the national average. To test this claim, they randomly select 36 students and find that their average score is 520.

Problem:

  1. State the null and alternative hypotheses.
  2. Calculate the z-statistic.
  3. Using a significance level of 0.05, conduct a hypothesis test to determine if the school’s claim is valid.

Solution:

  1. Hypotheses:
    • Null Hypothesis (H0): \(\mu = 500\)
    • Alternative Hypothesis (H1): \(\mu > 500\)
  2. \(z = \frac{x̄ - \mu}{\frac{σ}{\sqrt{n}}}\)

    \(z = \frac{520 - 500}{\frac{100}{\sqrt{36}}} = \frac{20}{16.67} \approx 1.2\)

  3. For a one-tailed test at a significance level of 0.05, the critical z-value is approximately 1.645. Since 1.2 < 1.645, we fail to reject the null hypothesis. The sample does not provide sufficient evidence that the school’s students score above the national average.
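
The arithmetic for this example can be checked in R using only the numbers given in the problem. Note that \(\sqrt{36} = 6\), so the standard error is \(100/6 \approx 16.67\). The variable names below are illustrative.

```r
# Values from the exam score problem
x_bar <- 520    # sample mean
mu0   <- 500    # hypothesized population mean
sigma <- 100    # known population standard deviation
n     <- 36     # sample size

# z-statistic: distance of the sample mean from mu0 in standard errors
se <- sigma / sqrt(n)        # 100 / 6, about 16.67
z  <- (x_bar - mu0) / se     # 1.2

# Upper-tail critical value and p-value for a one-tailed test at the 5% level
z_crit  <- qnorm(0.95)                     # about 1.645
p_value <- pnorm(z, lower.tail = FALSE)    # about 0.115

# TRUE would mean "reject H0"
z > z_crit
```

Comparing `z` with `z_crit` and comparing `p_value` with 0.05 are two equivalent ways of applying the same decision rule.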

Example 2: Soda Can Filling

Background: A soda company claims that their machine fills cans with an average of 12 ounces of soda. A quality control team suspects that the machine is not filling to that level. They randomly select 49 cans and find that the average fill is 11.95 ounces with a standard deviation of 0.3 ounces.

Problem:

  1. State the null and alternative hypotheses.
  2. Calculate the z-statistic.
  3. Using a significance level of 0.05, conduct a hypothesis test to determine if the machine is filling cans with less than 12 ounces of soda.

Solution:

  1. Hypotheses:
    • Null Hypothesis (H0): \(\mu = 12\)
    • Alternative Hypothesis (H1): \(\mu < 12\)
  2. \(z = \frac{x̄ - \mu}{\frac{σ}{\sqrt{n}}}\), with the sample standard deviation standing in for \(σ\) because the sample is large.

    \(z = \frac{11.95 - 12}{\frac{0.3}{\sqrt{49}}} = \frac{-0.05}{0.0429} \approx -1.167\)

  3. For a one-tailed test at a significance level of 0.05, the critical z-value is approximately -1.645. Since -1.167 > -1.645, we fail to reject the null hypothesis. There is not enough evidence to conclude that the machine underfills the cans.
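
The same steps can be wrapped in a small function. `z_test_lower` below is a hypothetical helper written for this handout, not a function from any package; it runs a lower-tailed z-test from summary statistics.

```r
# Hypothetical helper: lower-tailed z-test for a mean from summary statistics
z_test_lower <- function(x_bar, mu0, sd, n, alpha = 0.05) {
  z        <- (x_bar - mu0) / (sd / sqrt(n))   # z-statistic
  critical <- qnorm(alpha)                     # lower-tail critical value
  list(
    z        = z,
    critical = critical,
    p_value  = pnorm(z),      # P(Z <= z) under H0
    reject   = z < critical   # TRUE means "reject H0"
  )
}

# Soda can example: sample mean 11.95, hypothesized mean 12, sd 0.3, n = 49
soda <- z_test_lower(x_bar = 11.95, mu0 = 12, sd = 0.3, n = 49)
soda
```

Here the p-value is roughly 0.12, well above 0.05, which matches the decision not to reject the null hypothesis.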

Prompt

Go into more detail about the difference between the student t-distribution and t-statistics from the normal distribution and z-statistics.

Normal Distribution and Z-statistics:

  1. Normal Distribution: It is a bell-shaped, symmetrical distribution described by its mean (μ) and standard deviation (σ). The total area under its curve is 1. The standard normal distribution has a mean of 0 and a standard deviation of 1.

  2. Z-statistic (or Z-score): It represents the number of standard deviations an observation is away from the population mean. It’s calculated as: \[ Z = \frac{X - μ}{σ} \] For a sample mean, the standard error replaces the standard deviation: \[ Z = \frac{x̄ - μ}{σ/\sqrt{n}} \] Where:

    • \(X\) is the observation.
    • \(x̄\) is the sample mean.
    • \(μ\) is the population mean.
    • \(σ\) is the population standard deviation.
    • \(n\) is the sample size.
  3. When to Use: The z-statistic is used when either of the following holds:

    • The population standard deviation (σ) is known and the underlying data is approximately normally distributed.
    • The sample size is large (typically \(n > 30\)), even if σ is unknown, in which case the sample standard deviation is used in its place.

Student’s t-distribution and t-statistics:

  1. Student’s t-distribution: It is similar in shape to the normal distribution but has heavier tails, which reflect the extra uncertainty that comes from estimating the population standard deviation from the sample. The exact shape of the t-distribution depends on a parameter called “degrees of freedom,” often denoted by \(df\). As \(df\) increases, the t-distribution approaches the normal distribution.

  2. T-statistic: It’s used when estimating the population mean from a sample, especially when the sample size is small or the population standard deviation is unknown. It’s calculated as: \[ t = \frac{x̄ - μ}{s/\sqrt{n}} \] Where:

    • \(x̄\) is the sample mean.
    • \(μ\) is the population mean (or hypothesized value).
    • \(s\) is the sample standard deviation.
    • \(n\) is the sample size.
  3. When to Use: The t-statistic and t-distribution are used when:

    • The population standard deviation (σ) is unknown.
    • The sample size is small (typically \(n ≤ 30\)).
    • The underlying data is approximately normally distributed or if the sample size is sufficiently large to invoke the Central Limit Theorem.

Key Differences:

  1. Standard Deviation: For z-statistics, the population standard deviation is used, while for t-statistics, the sample standard deviation is used.

  2. Distribution Shape: The t-distribution has thicker tails compared to the normal distribution. This accounts for the increased variability expected with smaller samples.

  3. Sample Size: Z-statistics are typically used for larger samples (or when the population standard deviation is known), while t-statistics are used for smaller samples.

  4. Degrees of Freedom: The t-distribution requires the specification of degrees of freedom, which is typically \(df = n - 1\) for a single sample. The concept of degrees of freedom doesn’t apply to the normal distribution.

In summary, while both the z and t distributions are foundational in inferential statistics, they are used in different scenarios based on the known information and sample size. The t-distribution provides a more conservative approach (due to its thicker tails) when dealing with smaller samples or unknown population standard deviations.
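
One way to see the heavier tails is to compare two-sided 5% critical values. The sketch below uses base R's `qt` and `qnorm`; the particular degrees of freedom are arbitrary choices for illustration.

```r
# Degrees of freedom to compare (arbitrary illustrative values)
df <- c(5, 10, 30, 100, 1000)

# Critical values for a two-sided test at the 5% level
t_crit <- qt(0.975, df)    # Student's t, one value per df
z_crit <- qnorm(0.975)     # standard normal

# Side-by-side comparison
data.frame(df = df,
           t_critical = round(t_crit, 3),
           z_critical = round(z_crit, 3))
```

The t critical values are always larger than the normal one and shrink toward it as \(df\) grows, which is why the t-test is the more conservative choice in small samples.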

Prompt

Go into detail about the difference between one-sample, two-sample, and paired t-tests.

1. One-sample t-test:

Purpose: Used to test if the mean of a single sample is significantly different from a known or hypothesized population mean.

Hypotheses:

  • Null Hypothesis (H0): The population mean is equal to the hypothesized value. \(H0: \mu = \mu_0\)
  • Alternative Hypothesis (H1): The population mean is not equal to the hypothesized value. \(H1: \mu \neq \mu_0\)

Example: Suppose a manufacturer claims that a type of battery lasts 100 hours on average. You take a sample of batteries, test them, and want to see if the average lifespan differs from the claimed 100 hours.

2. Two-sample t-test (Independent t-test):

Purpose: Used to compare the means of two independent samples to see if they come from populations with the same mean.

Hypotheses:

  • Null Hypothesis (H0): The means of the two populations are equal. \(H0: \mu_1 = \mu_2\)
  • Alternative Hypothesis (H1): The means of the two populations are not equal. \(H1: \mu_1 \neq \mu_2\)

Example: Imagine you want to test if there’s a difference in average test scores between students taught by two different teachers. You would collect scores from a sample of students from each teacher’s class and then compare the means using the two-sample t-test.

3. Paired t-test (Dependent t-test):

Purpose: Used to compare the means of the same group at two different times or under two different conditions. The observations are paired, meaning each observation in one sample corresponds to an observation in the other sample.

Hypotheses:

  • Null Hypothesis (H0): The mean difference between paired observations is zero. \(H0: \mu_d = 0\)
  • Alternative Hypothesis (H1): The mean difference between paired observations is not zero. \(H1: \mu_d \neq 0\)

Example: Consider a clinical trial where you want to test the effectiveness of a new drug. Patients’ health metrics are measured before taking the drug and after completing the treatment. Since the measurements before and after are from the same individuals, the data is paired.

Key Differences:

  • Data Structure:

    • One-sample t-test works with a single sample.
    • Two-sample t-test works with two independent samples.
    • Paired t-test works with two related (paired) samples.
  • Application:

    • One-sample is used when comparing a sample mean to a known or hypothesized population mean.
    • Two-sample is used when comparing the means of two independent groups.
    • Paired is used when comparing the means of the same group under two different scenarios or at two different times.
  • Assumptions: All three tests assume that the data (for the paired test, the differences between pairs) are approximately normally distributed. The classic two-sample t-test also assumes that the variances of the two groups are equal (homogeneity of variances). Welch’s t-test, which is the default in R’s t.test function, does not make this assumption.
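
The choice between a paired and an independent test matters in practice. The sketch below uses small made-up measurements on six subjects, where every subject drops by a few units but subjects differ a lot from one another; the numbers are hypothetical.

```r
# Hypothetical before/after measurements on the same six subjects
before <- c(100, 120, 90, 110, 130, 95)
after  <- c(98, 117, 89, 108, 126, 92)

# Correct analysis: paired t-test on the within-subject differences
paired <- t.test(before, after, paired = TRUE)

# Incorrect analysis for these data: treating the samples as independent
independent <- t.test(before, after)

# Compare the p-values from the two approaches
c(paired = paired$p.value, independent = independent$p.value)
```

The paired test detects the consistent drop because it removes the large between-subject variation, while the independent test is swamped by it and finds nothing.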

Prompt

Create some examples of using R to conduct hypothesis testing of means.

Example 1: One-sample t-test

Suppose we have a sample of exam scores from 10 students, and we want to test if the average score is different from 75.

# Sample data
scores <- c(78, 82, 74, 69, 77, 83, 72, 76, 80, 71)

# One-sample t-test
t.test(scores, mu = 75)

    One Sample t-test

data:  scores
t = 0.80579, df = 9, p-value = 0.4411
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
 72.83115 79.56885
sample estimates:
mean of x 
     76.2 
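
To connect this output to the formulas, the same numbers can be reproduced by hand. This is a sketch using only base R functions and the `scores` data from the example.

```r
# Same data as in the example
scores <- c(78, 82, 74, 69, 77, 83, 72, 76, 80, 71)
n      <- length(scores)
mu0    <- 75                           # hypothesized mean

# t-statistic: (sample mean - mu0) divided by the estimated standard error
se     <- sd(scores) / sqrt(n)
t_stat <- (mean(scores) - mu0) / se

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval based on the t critical value
ci <- mean(scores) + c(-1, 1) * qt(0.975, df = n - 1) * se

c(t = t_stat, p_value = p_value)
ci
```

These values should agree with the `t.test` output: t of about 0.806, a p-value of about 0.441, and an interval from about 72.83 to 79.57.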

Example 2: Two-sample t-test

Imagine we have exam scores from two different classes, and we want to test if there’s a significant difference in their means.

# Sample data
class1_scores <- c(78, 82, 74, 69, 77, 83, 72, 76, 80, 71)
class2_scores <- c(85, 88, 84, 86, 87, 89, 83, 84, 82, 81)

# Two-sample t-test
t.test(class1_scores, class2_scores)

    Welch Two Sample t-test

data:  class1_scores and class2_scores
t = -5.1137, df = 14.024, p-value = 0.0001568
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12.348348  -5.051652
sample estimates:
mean of x mean of y 
     76.2      84.9 
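
The non-integer degrees of freedom in this output come from the Welch approximation. The sketch below reproduces the statistic and degrees of freedom by hand with the standard Welch–Satterthwaite formula, using the same data as the example.

```r
# Same data as in the example
class1_scores <- c(78, 82, 74, 69, 77, 83, 72, 76, 80, 71)
class2_scores <- c(85, 88, 84, 86, 87, 89, 83, 84, 82, 81)
n1 <- length(class1_scores)
n2 <- length(class2_scores)

# Squared standard error of each sample mean
v1 <- var(class1_scores) / n1
v2 <- var(class2_scores) / n2

# Welch t-statistic: difference in means over the combined standard error
t_stat <- (mean(class1_scores) - mean(class2_scores)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
c(t = t_stat, df = df_welch, p_value = p_value)
```

Passing `var.equal = TRUE` to `t.test` would instead run the classic pooled-variance test with \(n_1 + n_2 - 2 = 18\) degrees of freedom.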

Example 3: Paired t-test

Suppose we measure the blood pressure of 5 patients before and after a treatment, and we want to test if the treatment had a significant effect.

# Sample data
before_treatment <- c(150, 165, 155, 142, 138)
after_treatment <- c(145, 160, 152, 140, 135)

# Paired t-test
t.test(before_treatment, after_treatment, paired = TRUE)

    Paired t-test

data:  before_treatment and after_treatment
t = 6, df = 4, p-value = 0.003883
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.934133 5.265867
sample estimates:
mean difference 
            3.6 
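
A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below checks this with the same blood pressure data.

```r
# Same data as in the example
before_treatment <- c(150, 165, 155, 142, 138)
after_treatment  <- c(145, 160, 152, 140, 135)

# Within-patient differences (before minus after)
differences <- before_treatment - after_treatment

# One-sample t-test of H0: mean difference = 0
one_sample <- t.test(differences, mu = 0)

# Paired t-test on the original two vectors, for comparison
paired <- t.test(before_treatment, after_treatment, paired = TRUE)

# Both approaches give the same statistic and p-value
c(one_sample = unname(one_sample$statistic), paired = unname(paired$statistic))
c(one_sample = one_sample$p.value, paired = paired$p.value)
```

This is why the paired test has \(n - 1 = 4\) degrees of freedom here: it works with five differences rather than ten separate observations.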

Example 4: One-sample z-test (using external package)

For large samples, we might prefer a z-test. Suppose we have a large sample of weights from a factory, and we want to test if the average weight is different from 50 grams. (Note: R doesn’t have a built-in function for the one-sample z-test, so we’ll use the z.test function from the BSDA package.)

# Install and load the BSDA package
install.packages("BSDA")
library(BSDA)
Loading required package: lattice

Attaching package: ‘BSDA’

The following object is masked from ‘package:datasets’:

    Orange
# Sample data (assuming a large sample)
# No seed is set, so the numbers below will differ from run to run
weights <- rnorm(1000, mean = 50.5, sd = 5)  # Simulated weights with mean 50.5 and sd 5

# One-sample z-test
z.test(weights, mu = 50, sigma.x = sd(weights))

    One-sample z-Test

data:  weights
z = 3.6077, p-value = 0.0003089
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
 50.25829 50.87274
sample estimates:
mean of x 
 50.56552 
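
The same test can also be done in base R without an extra package. The sketch below is a hand computation under the same setup; it sets an arbitrary seed, so its numbers will not match the output above, which came from a different random draw.

```r
# Simulated weights, as in the example, but with an arbitrary seed for reproducibility
set.seed(2024)
weights <- rnorm(1000, mean = 50.5, sd = 5)

mu0 <- 50                                   # hypothesized mean
n   <- length(weights)
se  <- sd(weights) / sqrt(n)                # sample sd stands in for sigma

# z-statistic and two-sided p-value from the standard normal distribution
z       <- (mean(weights) - mu0) / se
p_value <- 2 * pnorm(-abs(z))

# 95% confidence interval using the normal critical value
ci <- mean(weights) + c(-1, 1) * qnorm(0.975) * se

c(z = z, p_value = p_value)
ci
```

With 1,000 observations the t-test and the z-test give nearly identical answers, so `t.test(weights, mu = 50)` is a reasonable alternative.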
