DS 6371 - Unit 3 Homework

Required Packages and Setup

We'll need the following R packages for our analysis:

library(tidyverse)    # For data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nortest)      # For normality tests
library(car)          # For statistical tests
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggplot2)      # For visualization
library(gridExtra)    # For arranging plots
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Problem 1: Age Discrimination Analysis

1. Problem Statement:

This analysis examines potential age discrimination in the American Samoa Government firing practices. We are investigating whether there is statistical evidence that employees who were fired tend to be older than those who were retained. The analysis uses data from a random sample of 21 fired and 30 retained employees.

# Setup and Data Preparation
library(tidyverse)
library(car)
library(gridExtra)
library(ggpubr)
library(ggthemes)

# Input Data
fired_ages <- c(34, 37, 37, 38, 41, 42, 43, 44, 44, 45, 45, 45, 46, 48, 49, 53, 53, 54, 54, 55, 56)
not_fired_ages <- c(27, 33, 36, 37, 38, 38, 39, 42, 42, 43, 43, 44, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 51, 51, 52, 54)

# Create a data frame for easier plotting
age_data <- data.frame(
  Age = c(fired_ages, not_fired_ages),
  Status = factor(rep(c("Fired", "Not Fired"), c(length(fired_ages), length(not_fired_ages))))
)
# Basic summary statistics
summary_stats <- data.frame(
  Group = c("Fired", "Not Fired"),
  N = c(length(fired_ages), length(not_fired_ages)),
  Mean = c(mean(fired_ages), mean(not_fired_ages)),
  SD = c(sd(fired_ages), sd(not_fired_ages)),
  Median = c(median(fired_ages), median(not_fired_ages))
) 
summary_stats 
##       Group  N     Mean       SD Median
## 1     Fired 21 45.85714 6.521393     45
## 2 Not Fired 30 43.93333 5.883544     45

2. Data Summary:

The analysis includes 21 fired employees and 30 not-fired employees. The mean age of fired employees is 45.86 years (SD = 6.52), while the mean age of not-fired employees is 43.93 years (SD = 5.88). The median age for both groups is 45 years.

3. Assumption Verification

Our analysis requires verifying three key assumptions for the two-sample t-test:

# Function to create assumption checking plots
check_assumptions <- function(data1, data2, group1_name, group2_name) {
  # Q-Q plots
  qq1 <- ggplot(data.frame(x = data1), aes(sample = x)) +
    stat_qq() +
    stat_qq_line() +
    ggtitle(paste("Q-Q Plot:", group1_name)) +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
    theme_economist()
  
  qq2 <- ggplot(data.frame(x = data2), aes(sample = x)) +
    stat_qq() +
    stat_qq_line() +
    ggtitle(paste("Q-Q Plot:", group2_name)) +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
    theme_economist()
  
  # Histograms
  hist1 <- ggplot(data.frame(x = data1), aes(x = x)) +
    geom_histogram(aes(y = ..density..), bins = 10, fill = "blue", alpha = 0.5) +
    geom_density() +
    ggtitle(paste("Histogram:", group1_name)) +
    labs(x = "Age", y = "Density") +
    theme_economist()
  
  hist2 <- ggplot(data.frame(x = data2), aes(x = x)) +
    geom_histogram(aes(y = ..density..), bins = 10, fill = "red", alpha = 0.5) +
    geom_density() +
    ggtitle(paste("Histogram:", group2_name)) +
    labs(x = "Age", y = "Density") +
    theme_economist()
  
  # Boxplot comparison
box <- ggplot(age_data, aes(x = Status, y = Age, fill = Status)) +
    geom_boxplot(alpha = 0.7) +
    ggtitle("Age Distribution by Employment Status") +
    labs(x = "Employment Status",
         y = "Age (Years)") +
    theme_economist()
  
  # Arrange plots
  plots <- list(qq1 = qq1, qq2 = qq2, hist1 = hist1, hist2 = hist2, box = box)
  return(plots)
}
# Check assumptions
plots <- check_assumptions(fired_ages, not_fired_ages, "Fired", "Not Fired")

# Display plots
plots$qq1

plots$qq2

plots$hist1

plots$hist2

plots$box

# Formal statistical tests
# Shapiro-Wilk test for normality
shapiro_fired <- shapiro.test(fired_ages)
shapiro_not_fired <- shapiro.test(not_fired_ages)

# Levene's test for equality of variances
levene_result <- leveneTest(Age ~ Status, data = age_data)

# Two-sample t-test
t_test_result <- t.test(fired_ages, not_fired_ages, 
                        var.equal = TRUE,  # Using pooled variance based on assumption check
                        alternative = "greater")  # One-sided test

# Create summary of test results
test_results <- list(
  shapiro_fired = shapiro_fired,
  shapiro_not_fired = shapiro_not_fired,
  levene = levene_result,
  t_test = t_test_result
)

# Print results
print("Summary Statistics:")
## [1] "Summary Statistics:"
print(summary_stats)
##       Group  N     Mean       SD Median
## 1     Fired 21 45.85714 6.521393     45
## 2 Not Fired 30 43.93333 5.883544     45
print("\nTest Results:")
## [1] "\nTest Results:"
print(test_results)
## $shapiro_fired
## 
##  Shapiro-Wilk normality test
## 
## data:  fired_ages
## W = 0.94834, p-value = 0.3168
## 
## 
## $shapiro_not_fired
## 
##  Shapiro-Wilk normality test
## 
## data:  not_fired_ages
## W = 0.9513, p-value = 0.1831
## 
## 
## $levene
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.5791 0.4503
##       49               
## 
## $t_test
## 
##  Two Sample t-test
## 
## data:  fired_ages and not_fired_ages
## t = 1.0991, df = 49, p-value = 0.1385
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -1.010728       Inf
## sample estimates:
## mean of x mean of y 
##  45.85714  43.93333

Normality of Populations: Both groups’ histograms and Q-Q plots show approximately normal distributions. The Shapiro-Wilk test results support this conclusion (Fired group: p-value = 0.3168; Not Fired group: p-value = 0.1831), indicating no significant departure from normality in either group.

Equal Population Variances: Visual inspection of the boxplots shows a similar spread between groups. The sample standard deviations (Fired: 6.52 years, Not Fired: 5.88 years) are reasonably close, and Levene’s test supports this observation with a p-value of 0.4503, indicating no significant evidence of unequal variances.

Independence: The data represents a random sample from the population, and there is no indication of dependent observations between or within groups. The sampling method supports the independence assumption.

Based on these results, all assumptions for the two-sample t-test are reasonably satisfied.

4. Statistical Analysis

Given that our assumptions are met, we conducted a one-sided two-sample t-test with the following hypotheses:

H₀: μ₁ - μ₂ = 0 (mean ages are equal)
H₁: μ₁ - μ₂ > 0 (mean age of fired employees is greater)
where μ₁ and μ₂ denote the population mean ages of fired and retained employees, respectively.

Key statistics:
  • Mean age of fired employees: 45.86 years
  • Mean age of retained employees: 43.93 years
  • Observed difference in means: 1.92 years
  • Sample sizes: n₁ = 21 (fired), n₂ = 30 (retained)
  • Pooled standard deviation: 6.15 years (reproduced, along with the t statistic, in the sketch below)
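
For transparency, the pooled standard deviation and the t statistic can be reproduced by hand; a minimal sketch using the vectors defined earlier (the values in the comments are approximate):

# Hand-check of the pooled two-sample t statistic reported above
n1 <- length(fired_ages);  n2 <- length(not_fired_ages)
s1 <- sd(fired_ages);      s2 <- sd(not_fired_ages)

# Pooled standard deviation: square root of the df-weighted average variance
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))   # ~6.15 years

# Standard error of the difference and the t statistic (df = n1 + n2 - 2 = 49)
se <- sp * sqrt(1 / n1 + 1 / n2)
t_stat <- (mean(fired_ages) - mean(not_fired_ages)) / se          # ~1.10
c(pooled_sd = sp, std_error = se, t = t_stat)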

5. Conclusions

The analysis reveals a small difference in mean ages between fired and retained employees (1.92 years), with fired employees being slightly older on average. However, given the sample size and variability in the data, this difference is not statistically significant at the conventional α = 0.05 level (one-sided p-value = 0.1385). The corresponding one-sided 95% confidence interval for the difference in mean ages, (−1.01, ∞) years, includes zero, so we cannot conclude that systematic age discrimination occurred in the firing practices.
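
For completeness, a two-sided 95% interval for the mean difference can be obtained from the same pooled test; a minimal sketch (this interval is not part of the one-sided output shown above):

# Two-sided pooled t-test: omitting the "greater" alternative yields a two-sided
# 95% confidence interval for the difference in mean ages (fired minus retained)
t.test(fired_ages, not_fired_ages, var.equal = TRUE)$conf.int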

6. Scope of Inference

While our analysis is based on a random sample from the American Samoa Government workforce and could theoretically be generalized to this organization’s larger population of employees, we found no compelling statistical evidence of age discrimination in firing practices. However, several important caveats apply:

  1. This is an observational study, not a randomized experiment, so we cannot make causal conclusions about age and firing decisions.
  2. The analysis considers only age and employment status; other factors that might influence firing decisions were not included in this analysis.
  3. While we found no statistical evidence of systematic age discrimination, this does not preclude the possibility of individual cases of age discrimination that would need to be evaluated on their specific merits.

The findings suggest that if age discrimination exists in the organization’s firing practices, it is not evident in the aggregate statistical patterns of this sample. However, continued monitoring of employment practices and investigation of individual complaints remains essential for ensuring fair treatment of all employees.


Problem 2: Pocket Cash Analysis of SMU and Seattle U Students

Problem Statement

A study was conducted to determine whether vending machines at universities should include expensive bill/coin acceptors or rely solely on credit card readers. Students at both SMU and Seattle University were asked about the cash they currently had in their pockets. We need to analyze if there are significant differences in cash-carrying behaviors between the two student populations.

Data Summary

# Data preparation
smu_cash <- c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0)
seattle_cash <- c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0)

# Basic summary statistics
summary_stats <- data.frame(
  University = c("SMU", "Seattle U"),
  N = c(length(smu_cash), length(seattle_cash)),
  Mean = c(mean(smu_cash), mean(seattle_cash)),
  SD = c(sd(smu_cash), sd(seattle_cash)),
  Median = c(median(smu_cash), median(seattle_cash)),
  Range = c(paste(range(smu_cash), collapse="-"), 
            paste(range(seattle_cash), collapse="-"))
)

print(summary_stats)
##   University  N    Mean        SD Median  Range
## 1        SMU 16 141.625 304.26784     32 0-1200
## 2  Seattle U 14  27.000  36.71931     10  0-110

Assumption Verification

  1. Normality: We will use Q-Q plots and the Shapiro-Wilk test to assess normality.
# Create QQ plots and histograms
ggplot_cash <- function(data, title) {
  p1 <- ggplot(data.frame(x=data), aes(sample=x)) +
    stat_qq() + stat_qq_line() +
    ggtitle(paste("Q-Q Plot:", title)) +
    theme_economist()
  
  p2 <- ggplot(data.frame(x=data), aes(x=x)) +
    geom_histogram(aes(y = after_stat(density)), bins = 10) +
    geom_density() +
    ggtitle(paste("Histogram:", title)) +
    theme_economist()
    
  print(p1)
  print(p2)
}

ggplot_cash(smu_cash, "SMU Cash Amounts")

ggplot_cash(seattle_cash, "Seattle U Cash Amounts")

# Shapiro-Wilk test
print(shapiro.test(smu_cash))
## 
##  Shapiro-Wilk normality test
## 
## data:  smu_cash
## W = 0.51189, p-value = 2.663e-06
print(shapiro.test(seattle_cash))
## 
##  Shapiro-Wilk normality test
## 
## data:  seattle_cash
## W = 0.75308, p-value = 0.001389

Both distributions show significant right skew, particularly SMU with an extreme outlier of $1,200. The Q-Q plots indicate substantial deviation from normality.

  2. Equal Variance: The standard deviations differ dramatically (SMU: $304.27, Seattle U: $36.72), largely due to the SMU outlier.
  3. Independence: The samples appear independent as they are from different universities with no apparent connection between observations.

Impact of Outlier

# Remove outlier and recheck assumptions
smu_no_outlier <- smu_cash[smu_cash != 1200]

# Compare summary statistics
summary_stats_no_outlier <- data.frame(
  Scenario = c("SMU with outlier", "SMU without outlier", "Seattle U"),
  Mean = c(mean(smu_cash), mean(smu_no_outlier), mean(seattle_cash)),
  SD = c(sd(smu_cash), sd(smu_no_outlier), sd(seattle_cash))
)

summary_stats_no_outlier
##              Scenario      Mean        SD
## 1    SMU with outlier 141.62500 304.26784
## 2 SMU without outlier  71.06667 117.67052
## 3           Seattle U  27.00000  36.71931

Conclusion and Recommendations

  1. The t-test assumptions are violated due to:
  • Strong non-normality in both distributions
  • Unequal variances
  • The presence of an influential outlier
  2. The outlier of $1,200 in the SMU data appears to be a legitimate but unusual observation that dramatically affects our analysis. Following the outlier flowchart from Section 3.4:
  • The observation is plausible (a student could carry this amount)
  • It significantly impacts our analysis
  • We should consider analyzing the data both with and without this observation
  3. Recommended approach (see the sketch after this list):
  • Use a non-parametric test (e.g., the Mann-Whitney U / Wilcoxon rank-sum test) due to assumption violations
  • Report results both with and without the outlier for transparency
  • Focus on medians rather than means due to the skewed distributions
  4. Scope of inference:
  • Results generalize only to the student populations of these specific universities
  • The observational nature of the study limits causal conclusions
  • Findings should be considered preliminary due to small sample sizes
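
A minimal sketch of the recommended non-parametric comparison using base R's wilcox.test(), run with and without the $1,200 observation (all object names come from the chunks above):

# Wilcoxon rank-sum (Mann-Whitney U) test: compares the two cash distributions
# without assuming normality; ties among the repeated $0 values mean R will
# report an approximate (normal-approximation) p-value
wilcox.test(smu_cash, seattle_cash, alternative = "two.sided")

# Sensitivity check with the $1,200 outlier removed
wilcox.test(smu_no_outlier, seattle_cash, alternative = "two.sided")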

Problem 3 - Education and Income Analysis Report

Executive Summary

This analysis uses data from the National Longitudinal Survey of Youth (NLSY79) to examine the relationship between educational attainment and income levels. We specifically investigate whether individuals with 16 years of education have higher incomes than those with 12 years of education, using data from 2005 for individuals aged 41-49.

1. Problem Statement

Our primary research question is: Does the income distribution for individuals with 16 years of education exceed that of individuals with 12 years of education? This analysis characterizes the association between educational attainment and economic outcomes.

2. Data Description and Initial Analysis

# Load necessary libraries
library(tidyverse)    # For data manipulation and visualization
library(car)          # For statistical tests
library(gridExtra)    # For arranging plots
library(ggpubr)      # For publication-ready plots
library(ggthemes)      # For themes
library(ggplot2)      # For visualization 

# Read and prepare the data
education_data <- read.csv("EducationData.csv")

# Create separate datasets for each education level
data_12 <- education_data %>% 
  filter(Educ == 12) %>%
  pull(Income2005)

data_16 <- education_data %>%
  filter(Educ == 16) %>%
  pull(Income2005)

# Calculate summary statistics
summary_stats <- data.frame(
  Education = c("12 years", "16 years"),
  N = c(length(data_12), length(data_16)),
  Mean = c(mean(data_12), mean(data_16)),
  SD = c(sd(data_12), sd(data_16)),
  Min = c(min(data_12), min(data_16)),
  Max = c(max(data_12), max(data_16)),
  Ratio = c(max(data_12)/min(data_12), max(data_16)/min(data_16))
)

# Add formatted columns for better reporting
summary_stats <- summary_stats %>%
  mutate(
    Mean_Formatted = scales::dollar(Mean),
    SD_Formatted = scales::dollar(SD),
    Range = paste(scales::dollar(Min), "-", scales::dollar(Max))
  )
# Create visualization function for distributions
plot_distributions <- function(data_12, data_16, scale = "original") {
  data_combined <- data.frame(
    Income = c(if(scale == "log") log(data_12) else data_12,
               if(scale == "log") log(data_16) else data_16),
    Education = factor(rep(c("12 Years", "16 Years"), 
                         c(length(data_12), length(data_16))))
  )
  
  # Create histogram
  p1 <- ggplot(data_combined, aes(x = Income, fill = Education)) +
    geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
    labs(title = paste(if(scale == "log") "Log-" else "", "Income Distribution by Education Level"),
         x = if(scale == "log") "Log Income" else "Income ($)",
         y = "Count") +
    theme_economist() +
    scale_fill_brewer(palette = "Set2")
  
  # Create density plot
  p2 <- ggplot(data_combined, aes(x = Income, fill = Education)) +
    geom_density(alpha = 0.6) +
    labs(title = paste(if(scale == "log") "Log-" else "", "Income Density by Education Level"),
         x = if(scale == "log") "Log Income" else "Income ($)",
         y = "Density") +
    theme_economist() +
    scale_fill_brewer(palette = "Set2")
  
  # Create boxplot
  p3 <- ggplot(data_combined, aes(x = Education, y = Income, fill = Education)) +
    geom_boxplot() +
    labs(title = paste(if(scale == "log") "Log-" else "", "Income Distribution Boxplot"),
         y = if(scale == "log") "Log Income" else "Income ($)") +
    theme_economist() +
    scale_fill_brewer(palette = "Set2")
  
  # Plot all three
  print(p1)
  print(p2)
  print(p3)
}

# Generate both original and log-scale plots
plot_distributions(data_12, data_16, "original")

plot_distributions(data_12, data_16, "log")

# Display summary statistics
knitr::kable(summary_stats, 
             caption = "Summary Statistics by Education Level",
             digits = 2)
Summary Statistics by Education Level

Education    N     Mean       SD         Min   Max      Ratio     Mean_Formatted   SD_Formatted   Range
12 years     1020  36864.90   29369.73   300   410008   1366.69   $36,864.90       $29,369.73     $300 - $410,008
16 years     406   69996.97   64256.80   200   519340   2596.70   $69,996.97       $64,256.80     $200 - $519,340

Sample Characteristics

  • 12 Years Education: n = 1,020
    • Mean Income: $36,865
    • Standard Deviation: $29,370
    • Range: $300 - $410,008
    • Max/Min Ratio: 1,367:1
  • 16 Years Education: n = 406
    • Mean Income: $69,997
    • Standard Deviation: $64,257
    • Range: $200 - $519,340
    • Max/Min Ratio: 2,597:1

Key Observations

  1. The group with 16 years of education shows both a higher mean income and a larger variance
  2. Both groups display extreme ranges in income values
  3. The standard deviation increases roughly in proportion to the mean, suggesting multiplicative effects and motivating a log transformation

3. Assumption Verification

# Function to check assumptions
check_assumptions <- function(data_12, data_16) {
  # Create QQ plots for original data
  qq_12 <- ggplot(data.frame(x = data_12), aes(sample = x)) +
    stat_qq() + stat_qq_line() + 
    ggtitle("Q-Q Plot: 12 Years Education") + 
    theme_economist() +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
    scale_y_continuous(labels = scales::comma)  # Add comma formatting
  
  qq_16 <- ggplot(data.frame(x = data_16), aes(sample = x)) +
    stat_qq() + stat_qq_line() +
    ggtitle("Q-Q Plot: 16 Years Education") +
    theme_economist() +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
    scale_y_continuous(labels = scales::comma)  # Add comma formatting
  
  # Create QQ plots for log-transformed data
  qq_12_log <- ggplot(data.frame(x = log(data_12)), aes(sample = x)) +
    stat_qq() + stat_qq_line() +
    ggtitle("Q-Q Plot: 12 Years Education (Log Scale)") +
    theme_economist() +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
  
  qq_16_log <- ggplot(data.frame(x = log(data_16)), aes(sample = x)) +
    stat_qq() + stat_qq_line() +
    ggtitle("Q-Q Plot: 16 Years Education (Log Scale)") +
    theme_economist() +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
  
   # Plot all four
  print(qq_12)
  print(qq_16)
  print(qq_12_log)
  print(qq_16_log)
  
  # Formal tests: Shapiro-Wilk for normality; F-tests for equal variances (raw and log scale)
  tests <- list(
    shapiro_12 = shapiro.test(data_12),
    shapiro_16 = shapiro.test(data_16),
    shapiro_12_log = shapiro.test(log(data_12)),
    shapiro_16_log = shapiro.test(log(data_16)),
    var_test = var.test(data_12, data_16),
    var_test_log = var.test(log(data_12), log(data_16))
  )
  
  return(tests)
}

# Run assumption checks
assumption_results <- check_assumptions(data_12, data_16)

# Display test results in a formatted table
test_results <- data.frame(
  Test = c("Shapiro-Wilk (12 years)", "Shapiro-Wilk (16 years)",
           "Shapiro-Wilk Log (12 years)", "Shapiro-Wilk Log (16 years)",
           "F-test for Equal Variances", "F-test Log-transformed"),
  Statistic = c(assumption_results$shapiro_12$statistic,
                assumption_results$shapiro_16$statistic,
                assumption_results$shapiro_12_log$statistic,
                assumption_results$shapiro_16_log$statistic,
                assumption_results$var_test$statistic,
                assumption_results$var_test_log$statistic),
  P_Value = c(assumption_results$shapiro_12$p.value,
              assumption_results$shapiro_16$p.value,
              assumption_results$shapiro_12_log$p.value,
              assumption_results$shapiro_16_log$p.value,
              assumption_results$var_test$p.value,
              assumption_results$var_test_log$p.value)
)

knitr::kable(test_results, 
             caption = "Assumption Test Results",
             digits = 4)
Assumption Test Results

Test                          Statistic   P_Value
Shapiro-Wilk (12 years)          0.7648    0.0000
Shapiro-Wilk (16 years)          0.7391    0.0000
Shapiro-Wilk Log (12 years)      0.9283    0.0000
Shapiro-Wilk Log (16 years)      0.9224    0.0000
F-test for Equal Variances       0.2089    0.0000
F-test Log-transformed           0.7945    0.0047

# Perform statistical analysis on log-transformed data
t_test_log <- t.test(log(data_16), log(data_12), 
                     alternative = "greater",
                     var.equal = TRUE)

# Calculate confidence interval for the ratio of medians
ci_ratio <- exp(t_test_log$conf.int)

# Create results summary
analysis_results <- data.frame(
  Metric = c("Test Statistic", "Degrees of Freedom", "P-value",
             "Lower CI for Ratio", "Upper CI for Ratio",
             "Estimated Median Ratio"),
  Value = c(t_test_log$statistic,
            t_test_log$parameter,
            t_test_log$p.value,
            ci_ratio[1],
            ci_ratio[2],
            exp(t_test_log$estimate[1] - t_test_log$estimate[2]))  # ratio of medians: 16 yrs / 12 yrs
)

knitr::kable(analysis_results, 
             caption = "Statistical Analysis Results",
             digits = 4)
Statistical Analysis Results

Metric                       Value
Test Statistic             10.9752
Degrees of Freedom       1424.0000
P-value                     0.0000
Lower CI for Ratio          1.6232
Upper CI for Ratio             Inf
Estimated Median Ratio      1.7680

Original Scale Analysis

  1. Normality: Both groups show significant deviation from normality:
    • Strong right skew in both distributions
    • QQ plots show a substantial departure from the diagonal
    • High max/min ratios indicate extreme spread
  2. Equal Variances: The variance ratio (64,257²/29,370²) ≈ 4.8 indicates substantial heteroscedasticity:
    • The F-test for equal variances gives a p-value < 0.0001, confirming unequal variances
  3. Independence: Observations are independent based on sampling methodology
    • No apparent dependence between groups

Log-Transformed Analysis

  1. Normality: Log transformation substantially improves normality:
    • More symmetric distributions
    • Better alignment in QQ plots
    • Consistent with theoretical expectations for income data
  2. Equal Variances: Log transformation stabilizes variances:
    • The SD ratio shrinks to roughly 0.96/0.85 ≈ 1.13 (checked in the sketch after this list)
    • Boxplots show a similar spread
  3. Independence: Observations are independent based on sampling methodology
    • No apparent dependence between groups
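
The log-scale spread quoted in item 2 can be checked directly; a minimal sketch assuming data_12 and data_16 from the earlier chunk (the quoted SDs of roughly 0.96 and 0.85 are taken from the text above, not re-derived here):

# Standard deviations of log income for each education group and their ratio
sd_log_12 <- sd(log(data_12))
sd_log_16 <- sd(log(data_16))
c(sd_12 = sd_log_12, sd_16 = sd_log_16, ratio_16_vs_12 = sd_log_16 / sd_log_12)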

4. Statistical Analysis

Based on the assumptions analysis, we proceed with a t-test on log-transformed data:

  • H₀: μ₁₆ − μ₁₂ = 0, where μ₁₆ and μ₁₂ are the population means of log income (equivalently, no difference in median incomes)
  • H₁: μ₁₆ − μ₁₂ > 0 (the 16-year education group has a higher median income)
# Display t-test results
print("t-test results:")
## [1] "t-test results:"
t_test_log
## 
##  Two Sample t-test
## 
## data:  log(data_16) and log(data_12)
## t = 10.975, df = 1424, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4844088       Inf
## sample estimates:
## mean of x mean of y 
##  10.79709  10.22721
print("Estimated median ratio:")
## [1] "Estimated median ratio:"
exp(unname(t_test_log$estimate[1] - t_test_log$estimate[2]))
## [1] 1.768039
print("t-statistic:")
## [1] "t-statistic:"
t_test_log$statistic
##       t 
## 10.9752
print("Degrees of freedom:")
## [1] "Degrees of freedom:"
t_test_log$parameter
##   df 
## 1424
print("p-value:")
## [1] "p-value:"
t_test_log$p.value
## [1] 2.921866e-27
print("95% CI for the ratio of medians:")
## [1] "95% CI for the ratio of medians:"
ci_ratio
## [1] 1.623215      Inf
## attr(,"conf.level")
## [1] 0.95

Results:

  • t-statistic = 10.975
  • Degrees of freedom = 1424
  • p-value = 2.921866e-27
  • Estimated ratio of medians (16 years vs. 12 years) ≈ 1.77
  • 95% CI for ratio of medians: (1.623215, ∞)

5. Conclusions and Recommendations

  1. Key Finding: Individuals with 16 years of education earn significantly higher incomes than those with 12 years of education. The estimated ratio of median incomes is about 1.77, and with 95% confidence the median income for the 16-year group is at least 1.62 times that of the 12-year group.

  2. Statistical Evidence: The analysis provides strong evidence (p ≈ 2.92e-27) that the income distribution for individuals with 16 years of education significantly exceeds that of individuals with 12 years of education.

  3. Magnitude: With 95% confidence, we estimate that the median income for individuals with 16 years of education is at least 1.62 times that of individuals with 12 years of education; the one-sided interval places no upper bound on the ratio (see the back-transformation below).
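
The magnitude statement follows from back-transforming the log-scale results reported above: the difference in mean log incomes and the one-sided confidence bound exponentiate to a ratio of medians (under the log-normal working assumption used here):

\[
\frac{\widehat{\text{Median}}_{16}}{\widehat{\text{Median}}_{12}} = e^{\,\bar{x}_{\log,16} - \bar{x}_{\log,12}} = e^{\,10.797 - 10.227} \approx e^{0.570} \approx 1.77,
\qquad \text{95\% lower bound} = e^{0.4844} \approx 1.62.
\]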

6. Scope of Inference

  1. Population: Results generalize to working individuals aged 41-49 in 2005 who completed either 12 or 16 years of education.

  2. Limitations:

    • This is an observational study; we cannot make causal claims about education’s effect on income
    • Results are specific to the 2005 time period
    • Other factors affecting income (e.g., field of study, geographic location) were not considered
  3. External Validity: While the sample is large and representative, caution should be exercised in generalizing to other age groups or periods.


Bonus Problem:

(From The Statistical Sleuth) Problem 20: College Tuition Q-Q Plot Analysis

Data Description

We analyze tuition data from five colleges with both in-state and out-of-state rates:
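
The tuition vectors themselves are not shown in the original chunk; a minimal definition reconstructed from the Q-Q tables further below (values in dollars):

# Tuition values (in dollars) recovered from the Q-Q plot tables below;
# these supply the instate/outstate objects used by the later code
instate  <- c(1000, 4000, 5000, 8000, 40000)
outstate <- c(3000, 8000, 30000, 32000, 40000)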

Mathematical Framework

For each dataset (In-State and Out-of-State), we follow these steps:

  1. Calculate percentiles (\(p_i\)) for ordered data: \[p_i = \frac{i - 0.5}{n}, \text{ for } i = 1,\ldots,n\]

  2. Convert percentiles to Z-scores: \[Z_i = \Phi^{-1}(p_i)\] where \(\Phi^{-1}\) is the inverse standard normal distribution function

  3. Calculate standardized data: \[Z_{data} = \frac{x - \bar{x}}{s}\] where \(\bar{x}\) is the sample mean and \(s\) is the sample standard deviation
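
As a quick worked instance of steps 1 and 2 for these samples of size n = 5 (the resulting theoretical Z-scores reappear in the tables below):

# Percentiles (i - 0.5)/n for i = 1..5 and their standard-normal quantiles
p <- (1:5 - 0.5) / 5    # 0.1 0.3 0.5 0.7 0.9
qnorm(p)                # approximately -1.282 -0.524 0.000 0.524 1.282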

# Function to create Q-Q plot table
create_qq_table <- function(data, name) {
  n <- length(data)
  ordered_data <- sort(data)
  
  # Calculate percentiles
  percentiles <- (1:n - 0.5) / n
  
  # Calculate theoretical Z-scores
  z_scores_theoretical <- qnorm(percentiles)
  
  # Calculate standardized data
  z_scores_data <- scale(ordered_data)
  
  # Create table
  table <- data.frame(
    Original_Data = ordered_data,
    Percentile = percentiles,
    Z_score_data = as.vector(z_scores_data),
    Z_score_theoretical = z_scores_theoretical
  )
  
  return(table)
}

In-State Tuition Q-Q Plot Table

instate_qq <- create_qq_table(instate, "In-State")
knitr::kable(instate_qq, 
             digits = 3,
             col.names = c("Original Data", 
                          "Percentile",
                          "Z-score of Data",
                          "Z-score Theoretical"))
Original Data   Percentile   Z-score of Data   Z-score Theoretical
         1000          0.1            -0.660                -1.282
         4000          0.3            -0.473                -0.524
         5000          0.5            -0.411                 0.000
         8000          0.7            -0.224                 0.524
        40000          0.9             1.767                 1.282

Out-of-State Tuition Q-Q Plot Table

outstate_qq <- create_qq_table(outstate, "Out-of-State")
knitr::kable(outstate_qq, 
             digits = 3,
             col.names = c("Original Data", 
                          "Percentile",
                          "Z-score of Data",
                          "Z-score Theoretical"))
Original Data   Percentile   Z-score of Data   Z-score Theoretical
         3000          0.1            -1.214                -1.282
         8000          0.3            -0.904                -0.524
        30000          0.5             0.458                 0.000
        32000          0.7             0.582                 0.524
        40000          0.9             1.077                 1.282

Q-Q Plot Visualization

# Load required libraries
library(ggplot2)
library(gridExtra)
library(ggthemes)

# Function to create Q-Q plots
qq_plot <- function(data, title_name) {
  n <- length(data)
  ordered_data <- sort(data) 
  
  # Calculate percentiles
  percentiles <- (1:n - 0.5) / n
  
  # Calculate theoretical Z-scores
  z_scores_theoretical <- qnorm(percentiles)
  
  # Calculate standardized data
  z_scores_data <- scale(ordered_data)
  
  # Create Q-Q plot
  plot <- ggplot(data.frame(Z_score_Theoretical = z_scores_theoretical, 
                           Z_score_Data = as.vector(z_scores_data)), 
                aes(x = Z_score_Theoretical, y = Z_score_Data)) +
    geom_point(size = 3, color = "steelblue") +
    geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
    labs(title = paste("Q-Q Plot:", title_name),
         x = "Theoretical Quantiles",
         y = "Sample Quantiles") +
    theme_economist() +
    coord_equal()  # Makes the plot square for better visualization
  
  return(plot)
}

# Create Q-Q plots
qq_instate_plot <- qq_plot(instate, "In-State Tuition")
qq_outstate_plot <- qq_plot(outstate, "Out-of-State Tuition")

# Print both Q-Q plots
print(qq_instate_plot)

print(qq_outstate_plot)

Statistical Analysis of College Tuition Distributions

Our quantile-quantile analysis identifies clear patterns in the distribution of college tuition rates for both in-state and out-of-state categories. The visual examination of normality through Q-Q plots offers valuable insights into the data’s underlying structure.

The in-state tuition distribution departs markedly from normality: the four lower values cluster near the reference line while the largest observation ($40,000) falls well above it, indicating a strong right skew, which aligns with the typical characteristics of educational cost data.

In contrast, the out-of-state tuition distribution tracks the reference line more closely at the extremes, but the middle of the distribution sits above the line (the median exceeds the mean), suggesting a mild left skew driven by the cluster of three higher tuition values. Out-of-state rates are also shifted upward overall, consistent with the common understanding that they are generally higher than in-state rates.

Implications for Statistical Inference

The identified distributional characteristics have important implications for statistical inference.

Firstly, the observed deviations from normality indicate that traditional parametric methods should be applied with caution. While the t-test demonstrates some robustness to normality violations, the pronounced skewness, particularly the single extreme value in the in-state data, necessitates the exploration of alternative analytical approaches.

Secondly, the distinct patterns in the distributions suggest that applying transformations, such as logarithmic scaling, could aid in achieving a more normal distribution of the data before proceeding with further statistical analyses.
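
As a minimal illustration of that suggestion, the qq_plot() helper defined earlier can be reapplied to log-transformed tuition (assuming the instate/outstate vectors introduced above):

# Q-Q plots of log tuition; a straighter pattern here would support using a
# log transformation before any parametric analysis
print(qq_plot(log(instate), "Log In-State Tuition"))
print(qq_plot(log(outstate), "Log Out-of-State Tuition"))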

Conclusion

In summary, the Q-Q plots serve as a powerful visual tool for illustrating the distributional features of college tuition data. They highlight the importance of thorough distributional analysis in studies related to educational costs and suggest that future investigations of tuition data may benefit from methods that accommodate non-normal distributions or appropriate data transformations.