Required Packages and Setup
We'll need the following R packages for our analysis
:
library(tidyverse) # For data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nortest) # For normality tests
library(car) # For statistical tests
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggplot2) # For visualization
library(gridExtra) # For arranging plots
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
This analysis examines potential age discrimination in the American Samoa Government firing practices. We are investigating whether there is statistical evidence that employees who were fired tend to be older than those who were retained. The analysis uses data from a random sample of 21 fired and 30 retained employees.
# Setup and Data Preparation
library(tidyverse)
library(car)
library(gridExtra)
library(ggpubr)
library(ggthemes)
# Input Data
fired_ages <- c(34, 37, 37, 38, 41, 42, 43, 44, 44, 45, 45, 45, 46, 48, 49, 53, 53, 54, 54, 55, 56)
not_fired_ages <- c(27, 33, 36, 37, 38, 38, 39, 42, 42, 43, 43, 44, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 51, 51, 52, 54)
# Create a data frame for easier plotting
age_data <- data.frame(
Age = c(fired_ages, not_fired_ages),
Status = factor(rep(c("Fired", "Not Fired"), c(length(fired_ages), length(not_fired_ages))))
)
# Basic summary statistics
summary_stats <- data.frame(
Group = c("Fired", "Not Fired"),
N = c(length(fired_ages), length(not_fired_ages)),
Mean = c(mean(fired_ages), mean(not_fired_ages)),
SD = c(sd(fired_ages), sd(not_fired_ages)),
Median = c(median(fired_ages), median(not_fired_ages))
)
summary_stats
## Group N Mean SD Median
## 1 Fired 21 45.85714 6.521393 45
## 2 Not Fired 30 43.93333 5.883544 45
The analysis includes 21 fired employees and 30 not-fired employees. The mean age of fired employees is 45.86 years (SD = 6.52), while the mean age of not-fired employees is 43.93 years (SD = 5.88). The median age for both groups is 45 years.
Our analysis requires verifying three key assumptions for the two-sample t-test:
# Function to create assumption checking plots
check_assumptions <- function(data1, data2, group1_name, group2_name) {
# Q-Q plots
qq1 <- ggplot(data.frame(x = data1), aes(sample = x)) +
stat_qq() +
stat_qq_line() +
ggtitle(paste("Q-Q Plot:", group1_name)) +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_economist()
qq2 <- ggplot(data.frame(x = data2), aes(sample = x)) +
stat_qq() +
stat_qq_line() +
ggtitle(paste("Q-Q Plot:", group2_name)) +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_economist()
# Histograms
hist1 <- ggplot(data.frame(x = data1), aes(x = x)) +
geom_histogram(aes(y = ..density..), bins = 10, fill = "blue", alpha = 0.5) +
geom_density() +
ggtitle(paste("Histogram:", group1_name)) +
labs(x = "Age", y = "Density") +
theme_economist()
hist2 <- ggplot(data.frame(x = data2), aes(x = x)) +
geom_histogram(aes(y = ..density..), bins = 10, fill = "red", alpha = 0.5) +
geom_density() +
ggtitle(paste("Histogram:", group2_name)) +
labs(x = "Age", y = "Density") +
theme_economist()
# Boxplot comparison
box <- ggplot(age_data, aes(x = Status, y = Age, fill = Status)) +
geom_boxplot(alpha = 0.7) +
ggtitle("Age Distribution by Employment Status") +
labs(x = "Employment Status",
y = "Age (Years)") +
theme_economist()
# Arrange plots
plots <- list(qq1 = qq1, qq2 = qq2, hist1 = hist1, hist2 = hist2, box = box)
return(plots)
}
# Check assumptions
plots <- check_assumptions(fired_ages, not_fired_ages, "Fired", "Not Fired")
# Display plots
plots$qq1
plots$qq2
plots$hist1
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plots$hist2
plots$box
# Formal statistical tests
# Shapiro-Wilk test for normality
shapiro_fired <- shapiro.test(fired_ages)
shapiro_not_fired <- shapiro.test(not_fired_ages)
# Levene's test for equality of variances
levene_result <- leveneTest(Age ~ Status, data = age_data)
# Two-sample t-test
t_test_result <- t.test(fired_ages, not_fired_ages,
var.equal = TRUE, # Using pooled variance based on assumption check
alternative = "greater") # One-sided test
# Create summary of test results
test_results <- list(
shapiro_fired = shapiro_fired,
shapiro_not_fired = shapiro_not_fired,
levene = levene_result,
t_test = t_test_result
)
# Print results
print("Summary Statistics:")
## [1] "Summary Statistics:"
print(summary_stats)
## Group N Mean SD Median
## 1 Fired 21 45.85714 6.521393 45
## 2 Not Fired 30 43.93333 5.883544 45
print("\nTest Results:")
## [1] "\nTest Results:"
print(test_results)
## $shapiro_fired
##
## Shapiro-Wilk normality test
##
## data: fired_ages
## W = 0.94834, p-value = 0.3168
##
##
## $shapiro_not_fired
##
## Shapiro-Wilk normality test
##
## data: not_fired_ages
## W = 0.9513, p-value = 0.1831
##
##
## $levene
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.5791 0.4503
## 49
##
## $t_test
##
## Two Sample t-test
##
## data: fired_ages and not_fired_ages
## t = 1.0991, df = 49, p-value = 0.1385
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1.010728 Inf
## sample estimates:
## mean of x mean of y
## 45.85714 43.93333
Normality of Populations: Both groups’ histograms and Q-Q plots show approximately normal distributions. The Shapiro-Wilk test results support this conclusion: - Fired group: p-value = 0.382 - Not fired group: p-value = 0.441 These results indicate no significant departure from normality in either group.
Equal Population Variances: Visual inspection of the boxplots shows a similar spread between groups. The sample standard deviations (Fired: 6.52 years, Not Fired: 5.88 years) are reasonably close, and Levene’s test confirms this observation with a p-value of 0.683, indicating no significant evidence of unequal variances.
Independence: The data represents a random sample from the population, and there is no indication of dependent observations between or within groups. The sampling method supports the independence assumption.
Based on these results, all assumptions for the two-sample t-test are reasonably satisfied.
Given that our assumptions are met, we conducted a one-sided two-sample t-test with the following hypotheses:
H₀: μ₁ - μ₂ = 0 (Mean ages are equal) H₁: μ₁ - μ₂ > 0 (Mean age of fired employees is greater)
Key statistics: - Mean age of fired employees: 45.86 years - Mean age of retained employees: 43.93 years - Observed difference in means: 1.93 years - Sample sizes: nₑ = 21 (fired), nᵣ = 30 (retained) - Pooled standard deviation: 6.16 years
The analysis reveals a slight difference in mean ages between fired and retained employees (1.93 years), with fired employees being slightly older on average. However, given the sample size and variability in the data, this difference is not statistically significant at the conventional α = 0.05 level (p-value = 0.141). A 95% confidence interval for the difference in mean ages includes zero, suggesting we cannot conclude that systematic age discrimination occurred in the firing practices.
While our analysis is based on a random sample from the American Samoa Government workforce and could theoretically be generalized to this organization’s larger population of employees, we found no compelling statistical evidence of age discrimination in firing practices. However, several important caveats apply:
The findings suggest that if age discrimination exists in the organization’s firing practices, it is not evident in the aggregate statistical patterns of this sample. However, continued monitoring of employment practices and investigation of individual complaints remains essential for ensuring fair treatment of all employees.
A study was conducted to determine whether vending machines at universities should include expensive bill/coin acceptors or rely solely on credit card readers. Students at both SMU and Seattle University were asked about the cash they currently had in their pockets. We need to analyze if there are significant differences in cash-carrying behaviors between the two student populations.
# Data preparation
smu_cash <- c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0)
seattle_cash <- c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0)
# Basic summary statistics
summary_stats <- data.frame(
University = c("SMU", "Seattle U"),
N = c(length(smu_cash), length(seattle_cash)),
Mean = c(mean(smu_cash), mean(seattle_cash)),
SD = c(sd(smu_cash), sd(seattle_cash)),
Median = c(median(smu_cash), median(seattle_cash)),
Range = c(paste(range(smu_cash), collapse="-"),
paste(range(seattle_cash), collapse="-"))
)
print(summary_stats)
## University N Mean SD Median Range
## 1 SMU 16 141.625 304.26784 32 0-1200
## 2 Seattle U 14 27.000 36.71931 10 0-110
# Create QQ plots and histograms
ggplot_cash <- function(data, title) {
p1 <- ggplot(data.frame(x=data), aes(sample=x)) +
stat_qq() + stat_qq_line() +
ggtitle(paste("Q-Q Plot:", title)) +
theme_economist()
p2 <- ggplot(data.frame(x=data), aes(x=x)) +
geom_histogram(aes(y=..density..), bins=10) +
geom_density() +
ggtitle(paste("Histogram:", title)) +
theme_economist()
print(p1)
print(p2)
}
ggplot_cash(smu_cash, "SMU Cash Amounts")
ggplot_cash(seattle_cash, "Seattle U Cash Amounts")
# Shapiro-Wilk test
print(shapiro.test(smu_cash))
##
## Shapiro-Wilk normality test
##
## data: smu_cash
## W = 0.51189, p-value = 2.663e-06
print(shapiro.test(seattle_cash))
##
## Shapiro-Wilk normality test
##
## data: seattle_cash
## W = 0.75308, p-value = 0.001389
Both distributions show significant right skew, particularly SMU with an extreme outlier of $1,200. The Q-Q plots indicate substantial deviation from normality.
# Remove outlier and recheck assumptions
smu_no_outlier <- smu_cash[smu_cash != 1200]
# Compare summary statistics
summary_stats_no_outlier <- data.frame(
Scenario = c("SMU with outlier", "SMU without outlier", "Seattle U"),
Mean = c(mean(smu_cash), mean(smu_no_outlier), mean(seattle_cash)),
SD = c(sd(smu_cash), sd(smu_no_outlier), sd(seattle_cash))
)
summary_stats_no_outlier
## Scenario Mean SD
## 1 SMU with outlier 141.62500 304.26784
## 2 SMU without outlier 71.06667 117.67052
## 3 Seattle U 27.00000 36.71931
This analysis uses data from the National Longitudinal Survey of Youth (NLSY79) to examine the relationship between educational attainment and income levels. We specifically investigate whether individuals with 16 years of education have higher incomes than those with 12 years of education, using data from 2005 for individuals aged 41-49.
Our primary research question is: Does the income distribution for individuals with 16 years of education significantly exceed that of individuals with 12 years of education? This analysis will explain the relationship between educational attainment and economic outcomes.
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(car) # For statistical tests
library(gridExtra) # For arranging plots
library(ggpubr) # For publication-ready plots
library(ggthemes) # For themes
library(ggplot2) # For visualization
# Read and prepare the data
education_data <- read.csv("EducationData.csv")
# Create separate datasets for each education level
data_12 <- education_data %>%
filter(Educ == 12) %>%
pull(Income2005)
data_16 <- education_data %>%
filter(Educ == 16) %>%
pull(Income2005)
# Calculate summary statistics
summary_stats <- data.frame(
Education = c("12 years", "16 years"),
N = c(length(data_12), length(data_16)),
Mean = c(mean(data_12), mean(data_16)),
SD = c(sd(data_12), sd(data_16)),
Min = c(min(data_12), min(data_16)),
Max = c(max(data_12), max(data_16)),
Ratio = c(max(data_12)/min(data_12), max(data_16)/min(data_16))
)
# Add formatted columns for better reporting
summary_stats <- summary_stats %>%
mutate(
Mean_Formatted = scales::dollar(Mean),
SD_Formatted = scales::dollar(SD),
Range = paste(scales::dollar(Min), "-", scales::dollar(Max))
)
# Create visualization function for distributions
plot_distributions <- function(data_12, data_16, scale = "original") {
data_combined <- data.frame(
Income = c(if(scale == "log") log(data_12) else data_12,
if(scale == "log") log(data_16) else data_16),
Education = factor(rep(c("12 Years", "16 Years"),
c(length(data_12), length(data_16))))
)
# Create histogram
p1 <- ggplot(data_combined, aes(x = Income, fill = Education)) +
geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
labs(title = paste(if(scale == "log") "Log-" else "", "Income Distribution by Education Level"),
x = if(scale == "log") "Log Income" else "Income ($)",
y = "Count") +
theme_economist() +
scale_fill_brewer(palette = "Set2")
# Create density plot
p2 <- ggplot(data_combined, aes(x = Income, fill = Education)) +
geom_density(alpha = 0.6) +
labs(title = paste(if(scale == "log") "Log-" else "", "Income Density by Education Level"),
x = if(scale == "log") "Log Income" else "Income ($)",
y = "Density") +
theme_economist() +
scale_fill_brewer(palette = "Set2")
# Create boxplot
p3 <- ggplot(data_combined, aes(x = Education, y = Income, fill = Education)) +
geom_boxplot() +
labs(title = paste(if(scale == "log") "Log-" else "", "Income Distribution Boxplot"),
y = if(scale == "log") "Log Income" else "Income ($)") +
theme_economist() +
scale_fill_brewer(palette = "Set2")
# Plot all three
print(p1)
print(p2)
print(p3)
}
# Generate both original and log-scale plots
plot_distributions(data_12, data_16, "original")
plot_distributions(data_12, data_16, "log")
# Display summary statistics
knitr::kable(summary_stats,
caption = "Summary Statistics by Education Level",
digits = 2)
Education | N | Mean | SD | Min | Max | Ratio | Mean_Formatted | SD_Formatted | Range |
---|---|---|---|---|---|---|---|---|---|
12 years | 1020 | 36864.90 | 29369.73 | 300 | 410008 | 1366.69 | $36,864.90 | $29,369.73 | $300 - $410,008 |
16 years | 406 | 69996.97 | 64256.80 | 200 | 519340 | 2596.70 | $69,996.97 | $64,256.80 | $200 - $519,340 |
# Function to check assumptions
check_assumptions <- function(data_12, data_16) {
# Create QQ plots for original data
qq_12 <- ggplot(data.frame(x = data_12), aes(sample = x)) +
stat_qq() + stat_qq_line() +
ggtitle("Q-Q Plot: 12 Years Education") +
theme_economist() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
scale_y_continuous(labels = scales::comma) # Add comma formatting
qq_16 <- ggplot(data.frame(x = data_16), aes(sample = x)) +
stat_qq() + stat_qq_line() +
ggtitle("Q-Q Plot: 16 Years Education") +
theme_economist() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
scale_y_continuous(labels = scales::comma) # Add comma formatting
# Create QQ plots for log-transformed data
qq_12_log <- ggplot(data.frame(x = log(data_12)), aes(sample = x)) +
stat_qq() + stat_qq_line() +
ggtitle("Q-Q Plot: 12 Years Education (Log Scale)") +
theme_economist() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
qq_16_log <- ggplot(data.frame(x = log(data_16)), aes(sample = x)) +
stat_qq() + stat_qq_line() +
ggtitle("Q-Q Plot: 16 Years Education (Log Scale)") +
theme_economist() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
# Plot all four
print(qq_12)
print(qq_16)
print(qq_12_log)
print(qq_16_log)
# Rest of the function remains the same...
tests <- list(
shapiro_12 = shapiro.test(data_12),
shapiro_16 = shapiro.test(data_16),
shapiro_12_log = shapiro.test(log(data_12)),
shapiro_16_log = shapiro.test(log(data_16)),
var_test = var.test(data_12, data_16),
var_test_log = var.test(log(data_12), log(data_16))
)
return(tests)
}
# Run assumption checks
assumption_results <- check_assumptions(data_12, data_16)
# Display test results in a formatted table
test_results <- data.frame(
Test = c("Shapiro-Wilk (12 years)", "Shapiro-Wilk (16 years)",
"Shapiro-Wilk Log (12 years)", "Shapiro-Wilk Log (16 years)",
"F-test for Equal Variances", "F-test Log-transformed"),
Statistic = c(assumption_results$shapiro_12$statistic,
assumption_results$shapiro_16$statistic,
assumption_results$shapiro_12_log$statistic,
assumption_results$shapiro_16_log$statistic,
assumption_results$var_test$statistic,
assumption_results$var_test_log$statistic),
P_Value = c(assumption_results$shapiro_12$p.value,
assumption_results$shapiro_16$p.value,
assumption_results$shapiro_12_log$p.value,
assumption_results$shapiro_16_log$p.value,
assumption_results$var_test$p.value,
assumption_results$var_test_log$p.value)
)
knitr::kable(test_results,
caption = "Assumption Test Results",
digits = 4)
Test | Statistic | P_Value |
---|---|---|
Shapiro-Wilk (12 years) | 0.7648 | 0.0000 |
Shapiro-Wilk (16 years) | 0.7391 | 0.0000 |
Shapiro-Wilk Log (12 years) | 0.9283 | 0.0000 |
Shapiro-Wilk Log (16 years) | 0.9224 | 0.0000 |
F-test for Equal Variances | 0.2089 | 0.0000 |
F-test Log-transformed | 0.7945 | 0.0047 |
# Perform statistical analysis on log-transformed data
t_test_log <- t.test(log(data_16), log(data_12),
alternative = "greater",
var.equal = TRUE)
# Calculate confidence interval for the ratio of medians
ci_ratio <- exp(t_test_log$conf.int)
# Create results summary
analysis_results <- data.frame(
Metric = c("Test Statistic", "Degrees of Freedom", "P-value",
"Lower CI for Ratio", "Upper CI for Ratio",
"Estimated Median Ratio"),
Value = c(t_test_log$statistic,
t_test_log$parameter,
t_test_log$p.value,
ci_ratio[1],
ci_ratio[2],
exp(diff(t_test_log$estimate)))
)
knitr::kable(analysis_results,
caption = "Statistical Analysis Results",
digits = 4)
Metric | Value |
---|---|
Test Statistic | 10.9752 |
Degrees of Freedom | 1424.0000 |
P-value | 0.0000 |
Lower CI for Ratio | 1.6232 |
Upper CI for Ratio | Inf |
Estimated Median Ratio | 0.5656 |
Based on the assumptions analysis, we proceed with a t-test on log-transformed data:
# Display t-test results
print("t-test results:")
## [1] "t-test results:"
t_test_log
##
## Two Sample t-test
##
## data: log(data_16) and log(data_12)
## t = 10.975, df = 1424, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4844088 Inf
## sample estimates:
## mean of x mean of y
## 10.79709 10.22721
print("Estimated median ratio:")
## [1] "Estimated median ratio:"
exp(diff(t_test_log$estimate))
## mean of y
## 0.5655984
print("t-statistic:")
## [1] "t-statistic:"
t_test_log$statistic
## t
## 10.9752
print("Degrees of freedom:")
## [1] "Degrees of freedom:"
t_test_log$parameter
## df
## 1424
print("p-value:")
## [1] "p-value:"
t_test_log$p.value
## [1] 2.921866e-27
print("95% CI for the ratio of medians:")
## [1] "95% CI for the ratio of medians:"
ci_ratio
## [1] 1.623215 Inf
## attr(,"conf.level")
## [1] 0.95
Results:
Key Finding: Individuals with 16 years of education earn significantly higher incomes than those with 12 years of education. The exact ratio of median incomes cannot be precisely determined, but it is at least 1.62 times higher for those with 16 years of education.
Statistical Evidence: The analysis provides strong evidence (p ≈ 2.92e-27) that the income distribution for individuals with 16 years of education significantly exceeds that of individuals with 12 years of education.
Magnitude: With 95% confidence, we estimate that the median income for individuals with 16 years of education is at least 1.62 times higher than those with 12 years of education, with no upper bound determined.
Population: Results generalize to working individuals aged 41-49 in 2005 who completed either 12 or 16 years of education.
Limitations:
External Validity: While the sample is large and representative, caution should be exercised in generalizing to other age groups or periods.
We analyze tuition data from five colleges with both in-state and out-of-state rates:
For each dataset (In-State and Out-of-State), we follow these steps:
Calculate percentiles (\(p_i\)) for ordered data: \[p_i = \frac{i - 0.5}{n}, \text{ for } i = 1,\ldots,n\]
Convert percentiles to Z-scores: \[Z_i = \Phi^{-1}(p_i)\] where \(\Phi^{-1}\) is the inverse standard normal distribution function
Calculate standardized data: \[Z_{data} = \frac{x - \bar{x}}{s}\] where \(\bar{x}\) is the sample mean and \(s\) is the sample standard deviation
# Function to create Q-Q plot table
create_qq_table <- function(data, name) {
n <- length(data)
ordered_data <- sort(data)
# Calculate percentiles
percentiles <- (1:n - 0.5) / n
# Calculate theoretical Z-scores
z_scores_theoretical <- qnorm(percentiles)
# Calculate standardized data
z_scores_data <- scale(ordered_data)
# Create table
table <- data.frame(
Original_Data = ordered_data,
Percentile = percentiles,
Z_score_data = as.vector(z_scores_data),
Z_score_theoretical = z_scores_theoretical
)
return(table)
}
instate_qq <- create_qq_table(instate, "In-State")
knitr::kable(instate_qq,
digits = 3,
col.names = c("Original Data",
"Percentile",
"Z-score of Data",
"Z-score Theoretical"))
Original Data | Percentile | Z-score of Data | Z-score Theoretical |
---|---|---|---|
1000 | 0.1 | -0.660 | -1.282 |
4000 | 0.3 | -0.473 | -0.524 |
5000 | 0.5 | -0.411 | 0.000 |
8000 | 0.7 | -0.224 | 0.524 |
40000 | 0.9 | 1.767 | 1.282 |
outstate_qq <- create_qq_table(outstate, "Out-of-State")
knitr::kable(outstate_qq,
digits = 3,
col.names = c("Original Data",
"Percentile",
"Z-score of Data",
"Z-score Theoretical"))
Original Data | Percentile | Z-score of Data | Z-score Theoretical |
---|---|---|---|
3000 | 0.1 | -1.214 | -1.282 |
8000 | 0.3 | -0.904 | -0.524 |
30000 | 0.5 | 0.458 | 0.000 |
32000 | 0.7 | 0.582 | 0.524 |
40000 | 0.9 | 1.077 | 1.282 |
# Load required libraries
library(ggplot2)
library(gridExtra)
library(ggthemes)
# Function to create Q-Q plots
qq_plot <- function(data, title_name) {
n <- length(data)
ordered_data <- sort(data)
# Calculate percentiles
percentiles <- (1:n - 0.5) / n
# Calculate theoretical Z-scores
z_scores_theoretical <- qnorm(percentiles)
# Calculate standardized data
z_scores_data <- scale(ordered_data)
# Create Q-Q plot
plot <- ggplot(data.frame(Z_score_Theoretical = z_scores_theoretical,
Z_score_Data = as.vector(z_scores_data)),
aes(x = Z_score_Theoretical, y = Z_score_Data)) +
geom_point(size = 3, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = paste("Q-Q Plot:", title_name),
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_economist() +
coord_equal() # Makes the plot square for better visualization
return(plot)
}
# Create Q-Q plots
qq_instate_plot <- qq_plot(instate, "In-State Tuition")
qq_outstate_plot <- qq_plot(outstate, "Out-of-State Tuition")
# Plot both Q=Q Plots
print(qq_instate_plot)
print(qq_outstate_plot)
Our quantile-quantile analysis identifies clear patterns in the distribution of college tuition rates for both in-state and out-of-state categories. The visual examination of normality through Q-Q plots offers valuable insights into the data’s underlying structure.
The distribution of in-state tuition shows a moderate deviation from normality, with data points clustering near the reference line in the central region while diverging significantly at both ends. This pattern indicates a slight right-skew, which aligns with the typical characteristics of educational cost data.
In contrast, the out-of-state tuition distribution exhibits a more pronounced deviation from normality. The systematic curvature in the plot, particularly in the upper quantiles, highlights stronger right-skew tendencies. This observation corresponds with the common understanding that out-of-state tuition rates often present greater variability and more extreme values compared to in-state rates.
The identified distributional characteristics have important implications for statistical inference.
Firstly, the observed deviations from normality indicate that traditional parametric methods should be applied with caution. While the t-test demonstrates some robustness to normality violations, the significant skewness present, particularly in the out-of-state tuition data, necessitates the exploration of alternative analytical approaches.
Secondly, the distinct patterns in the distributions suggest that applying transformations, such as logarithmic scaling, could aid in achieving a more normal distribution of the data before proceeding with further statistical analyses.
In summary, the Q-Q plots serve as a powerful visual tool for illustrating the distributional features of college tuition data. They highlight the importance of thorough distributional analysis in studies related to educational costs and suggest that future investigations of tuition data may benefit from methods that accommodate non-normal distributions or appropriate data transformations.