Welcome! In our last session, we learned to cast a “net” around a population parameter with confidence intervals. Today, we take a more decisive step. We’re moving from estimating to deciding. This is the powerful world of Hypothesis Testing.
Think of hypothesis testing as a courtroom trial.
The crucial question is: “If the null hypothesis were true, how surprising is my evidence?” If your sample result is extremely unlikely to occur just by random chance under the null hypothesis, you have enough proof to reject the null hypothesis and declare that your alternative theory is more plausible.
In this judicial process, we can make two kinds of mistakes:
| | Truth: \(H_0\) is True (Innocent) | Truth: \(H_0\) is False (Guilty) |
|---|---|---|
| Decision: Don’t Reject \(H_0\) | Correct Decision (Acquit innocent), Probability = \(1-\alpha\) | Type II Error (Acquit guilty), Probability = \(\beta\) |
| Decision: Reject \(H_0\) | Type I Error (Convict innocent), Probability = \(\alpha\) | Correct Decision (Convict guilty), Probability = \(1-\beta\) (Power) |
(Content from previous lecture on single and two-population tests would be here. The following sections are the new, extended content.)
So far, we’ve tested claims about population parameters. Now, we’ll shift gears to one of the most powerful tools in all of statistics: modeling the relationship between variables. We want to move beyond a simple correlation to build a model that can be used for interpretation and prediction.
The goal of Simple Linear Regression (SLR) is to model the linear relationship between a single independent (or predictor) variable, \(X\), and a single dependent (or response) variable, \(Y\). We express this relationship with a simple line:
\[Y = \beta_0 + \beta_1 X + \epsilon\]
Our job is to use sample data \((x_i, y_i)\) to find the “best” possible estimates for the unknown population parameters \(\beta_0\) and \(\beta_1\). We call these estimates \(b_0\) and \(b_1\).
How do we find the best line? OLS is the most common method. It finds the unique line that minimizes the sum of the squared vertical distances between the observed data points and the line itself. These distances are called residuals (\(e_i\)).
\[e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)\] OLS finds the \(b_0\) and \(b_1\) that minimize \(\sum e_i^2\). The resulting formulas are: \[b_1 = \frac{S_{xy}}{S_x^2} = r_{xy} \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}\]
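As a quick sanity check, these formulas can be computed by hand and compared with lm(). This is a minimal sketch on simulated data (the vectors x and y below are illustrative, not part of the lecture's datasets):

# Illustrative check of the OLS formulas on a small simulated sample
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 + 0.5 * x + rnorm(50)

b1 <- cov(x, y) / var(x)      # b1 = S_xy / S_x^2 (the (n-1) factors cancel)
b0 <- mean(y) - b1 * mean(x)  # b0 = ybar - b1 * xbar
c(b0 = b0, b1 = b1)

coef(lm(y ~ x))               # lm() returns the same estimates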
Once we have our line, how do we know if it’s any good? The Coefficient of Determination (\(R^2\)) tells us what percentage of the total variation in \(Y\) is explained by our linear model.
We can decompose the total variation in \(Y\) (Total Sum of Squares, SST) into two parts: the variation explained by the regression (Regression Sum of Squares, SSR) and the unexplained variation, or residuals (Error Sum of Squares, SSE).
It’s a beautiful identity that SST = SSR + SSE. The R-squared is then simply: \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\] \(R^2\) ranges from 0 (the model explains nothing) to 1 (the model explains everything perfectly). For SLR, it’s simply the square of the correlation coefficient, \(r_{xy}^2\).
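Continuing the small simulated example above, we can verify the decomposition and the identity \(R^2 = r_{xy}^2\) numerically (a sketch, assuming the same x and y):

# Verify SST = SSR + SSE and R^2 = 1 - SSE/SST
fit   <- lm(y ~ x)
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)      # total variation
SSR <- sum((y_hat - mean(y))^2)  # variation explained by the regression
SSE <- sum(residuals(fit)^2)     # unexplained (residual) variation

c(SST = SST, SSR_plus_SSE = SSR + SSE)      # equal up to rounding
c(R2 = SSR / SST, r_squared = cor(x, y)^2)  # identical for SLR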
The most important question we can ask about our model is: “Is there actually a relationship between X and Y?” This translates to a hypothesis test on the slope coefficient:
To perform this test, we need the sampling distribution of our estimator \(b_1\). Under the regression assumptions, the test statistic follows a t-distribution with \(n-2\) degrees of freedom: \[t = \frac{b_1 - 0}{s_{b_1}} \sim t_{n-2}\] where \(s_{b_1}\) is the standard error of the slope estimate. Luckily, R calculates all of this for us!
A retailer wants to understand the relationship between the average basket size (in dollars) and the customer’s income (in dollars). The Basket dataframe contains basket and income for a sample of customers. First, let’s visualize the relationship.
The scatterplot shows a positive, though somewhat noisy, linear trend.
Now, let’s build the model.
# The lm() function stands for "linear model"
sreg.model <- lm(basket ~ income, data = Basket)
summary(sreg.model)
##
## Call:
## lm(formula = basket ~ income, data = Basket)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0429 -7.2305 0.0133 6.3991 23.6122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1638282 3.9037148 0.81 0.42
## income 0.0109748 0.0006467 16.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.793 on 97 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7454
## F-statistic: 288 on 1 and 97 DF, p-value: < 2.2e-16
Interpreting the Output:
Coefficients:
Intercept (b0) = 3.16: The estimated average basket size for a customer with zero income. Note that it is not statistically significant here (p = 0.42).

income (b1) = 0.0110: This is the slope. For every one additional dollar of income, we expect the average basket size to increase by about $0.011. Or, for every $1,000 increase in income, the basket size increases by about $11.

Significance of the Slope (t-test):

income shows a t value of 16.97. This is a very large t-statistic. Pr(>|t|) is the p-value for the test \(H_0: \beta_1 = 0\). The value < 2e-16 is scientific notation for a number that is virtually zero.

Goodness of Fit (\(R^2\)):

Multiple R-squared: 0.748: This tells us that 74.8% of the variability in basket size can be explained by customer income.

Confidence Interval for the Slope: We can get a 99% confidence interval for the true slope \(\beta_1\).
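The code chunk for this interval was not shown; it was presumably a call like the following:

# 99% confidence intervals for the intercept and the slope
confint(sreg.model, level = 0.99)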
## 0.5 % 99.5 %
## (Intercept) -7.093056538 13.42071290
## income 0.009275569 0.01267409
We are 99% confident that the true increase in basket size for each additional dollar of income is between $0.0093 and $0.0127 (that is, between about $9.30 and $12.70 per $1,000 of income).
Let’s predict the basket size for a new customer with an income of $7,500.
# Create a new dataframe with the value(s) we want to predict
newdata <- data.frame(income = 7500)
# Confidence Interval for the mean
ci_pred <- predict(sreg.model, newdata, interval = "confidence", level = 0.95)
print("Confidence Interval for Mean Basket Size:")## [1] "Confidence Interval for Mean Basket Size:"
## fit lwr upr
## 1 85.47505 82.58536 88.36475
# Prediction Interval for a single customer
pi_pred <- predict(sreg.model, newdata, interval = "prediction", level = 0.95)
print("Prediction Interval for Single Basket Size:")## [1] "Prediction Interval for Single Basket Size:"
## fit lwr upr
## 1 85.47505 65.82438 105.1257
Our point prediction is about $85.5. We can be 95% confident that the average basket size for all such customers is between $82.6 and $88.4, while the basket size for one particular customer is likely between $65.8 and $105.1.
The world is complex. Usually, more than one factor influences our dependent variable. Multiple Linear Regression (MLR) extends SLR by allowing us to include multiple independent variables.
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K + \epsilon\]
How do we handle categorical predictors? For a categorical variable with c categories, we create c-1 binary (0/1) dummy variables. The category left out becomes the “baseline” or “reference” category. The coefficient on a dummy variable represents the average difference in \(Y\) between that category and the baseline category, holding all other variables constant.

A music company wants to understand what drives the Sales of a song (in thousands of units). The Downloads dataframe has data on songs, with variables like Ads.Exp (advertising expenses), Radio (airplay), Popularity of the artist, and the categorical variables International and Genre.

Model 1: The Continuous Predictors
Let’s start with the numeric predictors: Ads.Exp,
Radio, and Popularity.
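The chunk that produced the summary below was not included; it was presumably a call along these lines (the object name model_music_1 is assumed):

model_music_1 <- lm(Sales ~ Ads.Exp + Radio + Popularity, data = Downloads)
summary(model_music_1)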
##
## Call:
## lm(formula = Sales ~ Ads.Exp + Radio + Popularity, data = Downloads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.360 -27.678 -0.814 28.355 141.758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.467024 17.863068 -1.762 0.0797 .
## Ads.Exp 0.085323 0.006897 12.370 < 2e-16 ***
## Radio 3.361137 0.276994 12.134 < 2e-16 ***
## Popularity 1.104823 0.235972 4.682 5.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.95 on 196 degrees of freedom
## Multiple R-squared: 0.6666, Adjusted R-squared: 0.6615
## F-statistic: 130.6 on 3 and 196 DF, p-value: < 2.2e-16
Interpreting the Output:
The F-statistic is 130.6 with a p-value < 2.2e-16. The model is globally significant.

Ads.Exp, Radio, and Popularity are all highly significant predictors of sales, even when controlling for each other.

Model 2: Adding Categorical Variables
Now let’s see if being an international artist or the genre of the song matters.
model_music_2 <- lm(Sales ~ Ads.Exp + Radio + Popularity + International + Genre, data = Downloads)
summary(model_music_2)
##
## Call:
## lm(formula = Sales ~ Ads.Exp + Radio + Popularity + International +
## Genre, data = Downloads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.388 -26.892 0.487 27.891 152.183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.221914 16.993992 -2.131 0.034317 *
## Ads.Exp 0.072678 0.006535 11.121 < 2e-16 ***
## Radio 2.702598 0.267029 10.121 < 2e-16 ***
## Popularity 1.059742 0.211760 5.004 1.26e-06 ***
## InternationalYes 41.114589 8.360607 4.918 1.87e-06 ***
## Genrepop 47.852953 10.435878 4.585 8.13e-06 ***
## Genreelectronic 27.751379 8.132335 3.412 0.000784 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.97 on 193 degrees of freedom
## Multiple R-squared: 0.7376, Adjusted R-squared: 0.7295
## F-statistic: 90.44 on 6 and 193 DF, p-value: < 2.2e-16
Interpreting the New Output:
InternationalYes: The coefficient is 41.1. This means an international artist’s song is expected to sell about 41,100 more units on average than a non-international artist’s song, holding all other factors constant. This effect is highly significant (p-value = 1.87e-06).

Genrepop, Genreelectronic: R chose dance as the baseline genre. The coefficient for Genrepop (47.9) means a pop song is expected to sell about 47,900 more units than a dance song, on average; the coefficient for Genreelectronic (27.8) means an electronic song is expected to sell about 27,800 more units than a dance song. Both pop and electronic are significantly different from dance.

Is our final model (model_music_2) reliable? Let’s check the residuals.
Linearity & Homoscedasticity:
The red line is nearly flat and the points are randomly scattered. This
plot looks good, suggesting the assumptions of linearity and constant
variance are met.
Normality of Residuals:
The points fall very closely along the diagonal line. The normality
assumption is well-satisfied. Our model appears robust and the
inferences are reliable.
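The diagnostic plots described above can be reproduced with base R’s plot() method for lm objects; the original chunk was not shown, so this is a sketch:

# Residuals vs. fitted values: checks linearity and constant variance
plot(model_music_2, which = 1)
# Normal Q-Q plot of the residuals: checks the normality assumption
plot(model_music_2, which = 2)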
Let’s apply everything we’ve learned to a comprehensive case study. A consulting company wants to analyze employee satisfaction, productivity, and compensation.
Note on Data: The original data file
Data_G_250129.Rdata was not provided. To create a complete,
runnable lecture, we will first simulate a realistic
Employee dataframe that mirrors the structure
described in the exercises. The numerical results will be illustrative
of the methods.
# Set a seed for reproducibility
set.seed(250129)
# Number of employees
n_employees <- 300
# Create the dataframe
Employee <- data.frame(
Department = sample(c("HR", "IT", "Operations", "Sales"), n_employees, replace = TRUE, prob = c(0.15, 0.3, 0.35, 0.2)),
Role = sample(c("Junior", "Senior", "Manager"), n_employees, replace = TRUE, prob = c(0.55, 0.35, 0.10)),
Tenure = round(runif(n_employees, 1, 15)),
Hours_Worked = round(rnorm(n_employees, 42, 5)),
Training_Attended = sample(0:1, n_employees, replace = TRUE, prob = c(0.6, 0.4)),
Remote_Work = sample(0:1, n_employees, replace = TRUE, prob = c(0.4, 0.6)),
Satisfaction = round(rnorm(n_employees, 7, 1.5), 1)
)
# Simulate Salary based on other factors
Employee$Salary <- 1500 +
(Employee$Department == "IT") * 500 +
(Employee$Department == "Sales") * 300 -
(Employee$Department == "HR") * 200 +
(Employee$Role == "Senior") * 800 +
(Employee$Role == "Manager") * 2000 +
Employee$Tenure * 50 +
rnorm(n_employees, 0, 400)
Employee$Salary <- round(Employee$Salary)
# Simulate Productivity based on other factors
Employee$Productivity <- 60 +
Employee$Satisfaction * 2 +
Employee$Training_Attended * 5 +
(Employee$Role == "Senior") * 3 +
(Employee$Role == "Manager") * 5 -
(Employee$Hours_Worked - 40) * 0.5 +
rnorm(n_employees, 0, 8)
Employee$Productivity <- round(pmin(pmax(Employee$Productivity, 20), 100))
# Ensure factors are correctly typed
Employee$Department <- as.factor(Employee$Department)
Employee$Role <- as.factor(Employee$Role)

To compare salary distributions across departments, a boxplot is an excellent choice. We use the
distr.plot.xy function from UBStats.
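The UBStats call itself is not reproduced here; an equivalent base-R sketch of the same plot is:

# Salary distribution by department (base-R equivalent of the UBStats boxplot)
boxplot(Salary ~ Department, data = Employee,
        main = "Salary by Department", ylab = "Salary")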
Justification: The boxplot clearly shows that the
median salary (the thick line inside the box) and the overall range of
salaries differ across departments. For instance, the IT department
appears to have the highest median salary, while HR has the lowest.
We can use distr.summary.x to get the quantiles for the
HR department’s salaries.
# Filter data for HR department first
hr_employees <- subset(Employee, Department == "HR")
# Get summary statistics, which include quantiles
summary_hr <- distr.summary.x(x = hr_employees$Salary)
## n n.a min q1 median mean q3 max sd var
## 48 0 736 1576 2091 2230.52 2864.75 4418 849.32 721346.1
## $`Summary measures`
## n n.a min q1 median mean q3 max sd var
## 1 48 0 736 1576 2091 2230.521 2864.75 4418 849.321 721346.1
# The 85th percentile marks the start of the top 15% of earners.
# The summary above does not report this percentile directly,
# so we compute it (and the maximum) with base R.
top_15_start <- quantile(hr_employees$Salary, probs = 0.85)
max_salary <- max(hr_employees$Salary)
cat("\nThe range for the top 15% of earners in HR is from", round(top_15_start), "to", round(max_salary), "\n")
The mean and the median are both
useful. We can calculate these for each department using R’s built-in
aggregate function.
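The tables below were presumably produced with calls along these lines (a sketch; the original chunk was not shown):

# Mean and median salary by department
print("Mean Salary by Department:")
aggregate(Salary ~ Department, data = Employee, FUN = mean)
print("Median Salary by Department:")
aggregate(Salary ~ Department, data = Employee, FUN = median)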
## [1] "Mean Salary by Department:"
## [1] "Median Salary by Department:"
Considerations: The output tables provide the mean and median for each department. We can see that for most departments, the mean is higher than the median, indicating a right-skewed distribution (a few high earners pulling the average up). The median is a more robust measure of the “typical” salary in this case.
The company wants to verify if the HR salary is lower than the national average of 2500. This is a lower-tail test.
We assume \(\sigma = 850\) and test
at \(\alpha = 0.05\). We use the
TEST.mean function from UBStats.
# The TEST.mean function handles the Z-test when sigma is provided.
# We save the results to an object to use them later.
hr_test_result <- TEST.mean(x = hr_employees$Salary, mu0 = 2500, sigma = 850, alternative = "less")
## n xbar sigma_X SE stat p-value
## 48 2230.52 850 122.69 -2.2 0.01
## n xbar sigma_X SE stat p-value
## 1 48 2230.521 850 122.6869 -2.196478 0.01402887
# Extract the numeric values into simple variables for inline reporting.
z_stat_value <- as.numeric(hr_test_result$stat)
p_value_hr <- as.numeric(hr_test_result$p.value)

Procedure and Conclusion: The TEST.mean function calculates the Z-statistic (stat) and the p-value. The rejection region for this lower-tail test at \(\alpha=0.05\) is any Z-statistic less than -1.645. The output shows a test statistic of -2.2 and a p-value of about 0.014.
Since the p-value is less than 0.05, we reject the null hypothesis.
There is significant evidence that the average HR salary is lower than
the national average. The company should evaluate
increases in salaries.
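For reference, the reported statistic corresponds to the usual Z formula:

\[z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{2230.52 - 2500}{850/\sqrt{48}} \approx -2.20\]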
A Type II error occurs if we fail to reject a null hypothesis that is actually false. In this context, it would mean concluding that the HR salaries are not lower than the national average, when in reality they are lower. The probability of this error is \(\beta\).
What is the probability that the test leads us to conclude the
average salary is not lower than 2500, when it is
actually 2400? This requires manual calculation as
UBStats does not have a direct power/beta function.
n_hr <- nrow(hr_employees)
sigma <- 850
mu0 <- 2500
alpha <- 0.05
# Find the critical value of the sample mean
x_crit <- mu0 + qnorm(alpha) * (sigma / sqrt(n_hr))
cat("We fail to reject H0 if our sample mean is greater than:", round(x_crit, 2), "\n")## We fail to reject H0 if our sample mean is greater than: 2298.2
# Now, calculate the probability of this happening if the true mean is 2400
mu_true <- 2400
z_for_beta <- (x_crit - mu_true) / (sigma / sqrt(n_hr))
beta <- 1 - pnorm(z_for_beta)
cat("The probability of a Type II error (beta) is:", round(beta, 4), "\n")## The probability of a Type II error (beta) is: 0.7967
We use distr.table.xy to get conditional
proportions.
## y|x: Proportions
## Role
## Department Junior Manager Senior TOTAL
## HR 0.46 0.12 0.42 1.00
## IT 0.53 0.00 0.47 1.00
## Operations 0.59 0.06 0.35 1.00
## Sales 0.40 0.10 0.50 1.00
By inspecting the ‘Senior’ column in the output table, we can see that Sales has the highest proportion of Senior employees (0.50).
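The same conditional (row) proportions can be cross-checked with base R (a sketch; the original table came from distr.table.xy):

# Distribution of Role within each Department (row proportions)
round(prop.table(table(Employee$Department, Employee$Role), margin = 1), 2)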
We use the CI.diffmean function from
UBStats.
seniors <- subset(Employee, Role == "Senior")
juniors <- subset(Employee, Role == "Junior")
CI.diffmean(x = seniors$Salary, y = juniors$Salary, conf.level = 0.90, type = "independent")
## n_x n_y xbar ybar xbar-ybar s_X s_Y se Lower
## Normal.Approx 126 156 2871.49 2081.72 789.77 552.57 548.84 65.94 681.31
## Student-t 126 156 2871.49 2081.72 789.77 552.57 548.84 65.94 680.95
## Upper
## Normal.Approx 898.23
## Student-t 898.59
## n_x n_y xbar ybar xbar-ybar s_X s_Y se Lower
## Normal.Approx 126 156 2871.49 2081.72 789.77 552.57 548.84 65.99 681.24
## Student-t 126 156 2871.49 2081.72 789.77 552.57 548.84 65.99 680.86
## Upper
## Normal.Approx 898.31
## Student-t 898.69
Interpretation: We are 90% confident that the true average salary for Senior employees is between about $681 and $898 higher than the true average salary for Junior employees. Since the entire interval is positive and does not contain 0, this provides statistically significant evidence that Seniors earn more than Juniors on average.
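The Normal-approximation row follows the familiar large-sample form:

\[(\bar{x}_S - \bar{x}_J) \pm z_{0.95}\, SE = 789.77 \pm 1.645 \times 65.94 \approx (681,\ 898)\]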
We use distr.table.x to get the observed counts.
## Employee$Role Count Prop
## Junior 156 0.52
## Manager 18 0.06
## Senior 126 0.42
## TOTAL 300 1.00
## [1] "Observed Frequencies:"
## Employee$Role Count Prop
## 1 Junior 156 0.52
## 2 Manager 18 0.06
## 3 Senior 126 0.42
## Sum TOTAL 300 1.00
# Expected frequencies
target_dist <- c("Junior" = 0.50, "Manager" = 0.10, "Senior" = 0.40)
expected_counts <- n_employees * target_dist
print("Expected Frequencies:")## [1] "Expected Frequencies:"
## Junior Manager Senior
## 150 30 120
This is a Chi-Squared Goodness-of-Fit test. We use the standard
chisq.test function.
# Create the observed_counts vector using the standard table() function.
observed_counts <- table(Employee$Role)
# Define the target distribution probabilities
target_dist <- c("Junior" = 0.50, "Manager" = 0.10, "Senior" = 0.40)
# CRITICAL FIX: Ensure the 'p' vector has the same elements in the same order as 'x'.
# We do this by reordering our target_dist vector to match the names/order of observed_counts.
target_dist_ordered <- target_dist[names(observed_counts)]
# Perform the test using the correctly ordered probability vector
test_result <- chisq.test(x = observed_counts, p = target_dist_ordered)
print(test_result)
##
## Chi-squared test for given probabilities
##
## data: observed_counts
## X-squared = 5.34, df = 2, p-value = 0.06925
# Conclusion
alpha <- 0.01
if(test_result$p.value < alpha) {
cat("\nConclusion: The p-value is less than 0.01. We reject H0.\n")
cat("There is significant evidence that the company is deviating from its target role distribution.\n")
} else {
cat("\nConclusion: The p-value is not less than 0.01. We fail to reject H0.\n")
cat("There is not significant evidence that the company is deviating from its target role distribution.\n")
}
##
## Conclusion: The p-value is not less than 0.01. We fail to reject H0.
## There is not significant evidence that the company is deviating from its target role distribution.
Analytic Expression: The test statistic is calculated as \(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\). The p-value is \(P(\chi^2_{df} \ge \chi^2_{obs})\), where \(df = k-1 = 3-1=2\).
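Plugging in the observed and expected counts from above:

\[\chi^2 = \frac{(156-150)^2}{150} + \frac{(18-30)^2}{30} + \frac{(126-120)^2}{120} = 0.24 + 4.8 + 0.3 = 5.34\]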
We want to assess if the proportion of remote workers has
decreased compared to two years ago (\(\hat{p}_y = 0.5\) from \(n_y=100\)). We use
TEST.diffprop from UBStats.
# We need to create two vectors for the function to represent the two samples
current_sample <- Employee$Remote_Work
past_sample <- c(rep(1, 50), rep(0, 50)) # Recreate the past sample based on the info
TEST.diffprop(x = current_sample, y = past_sample, success.x = 1, alternative = "less")
## n_x n_y phat_x phat_y phat_x-phat_y s_X s_Y se_0 stat p-value
## 300 100 0.65 0.5 0.15 0.48 0.5 0.06 2.6 1
Conclusion: The p-value is essentially 1. The current proportion of remote workers (0.65) is actually higher than two years ago (0.50), so the data provide no evidence of a decrease; we fail to reject \(H_0\).
First, we estimate a multiple regression for Productivity, modA. We use the standard lm function, as UBStats does not replace core modeling functions.
modA <- lm(Productivity ~ Training_Attended + Satisfaction + Hours_Worked + Tenure + Remote_Work + Salary, data = Employee)
summary(modA)
##
## Call:
## lm(formula = Productivity ~ Training_Attended + Satisfaction +
## Hours_Worked + Tenure + Remote_Work + Salary, data = Employee)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.9870 -5.5782 0.5455 5.1784 19.6860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.9309249 4.4986360 17.768 < 2e-16 ***
## Training_Attended 3.9203650 0.9022887 4.345 1.92e-05 ***
## Satisfaction 2.1945659 0.2888923 7.596 4.12e-13 ***
## Hours_Worked -0.5578584 0.0872537 -6.394 6.38e-10 ***
## Tenure -0.1566415 0.1119991 -1.399 0.1630
## Remote_Work 0.8462149 0.9245213 0.915 0.3608
## Salary 0.0016084 0.0006269 2.566 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.631 on 293 degrees of freedom
## Multiple R-squared: 0.3194, Adjusted R-squared: 0.3055
## F-statistic: 22.92 on 6 and 293 DF, p-value: < 2.2e-16
a1. Interpretation: The coefficient for Training_Attended (3.92) is the estimated average increase in productivity for attending training, holding all other variables constant. The t-statistic (4.34) and its p-value (1.9231e-05) show it is a significant predictor.

a2. Overall Significance: The F-statistic’s p-value (< 2.2e-16) is extremely small, so the model modA is overall statistically significant.
Next, we add Department to the model and call it modB, then report the coefficient and a 99% CI for Satisfaction.

modB <- lm(Productivity ~ Training_Attended + Satisfaction + Hours_Worked + Tenure + Remote_Work + Salary + Department, data = Employee)
cat("Coefficient for Satisfaction in modB:", round(coef(modB)["Satisfaction"], 4), "\n")
## Coefficient for Satisfaction in modB: 2.2123
## 0.5 % 99.5 %
## Satisfaction 1.459903 2.964736
it_effect <- coef(modB)["DepartmentIT"]
operations_effect <- coef(modB)["DepartmentOperations"]
avg_diff <- it_effect - operations_effect
cat("Estimated average difference in productivity (IT - Operations):", round(avg_diff, 2), "points.\n")## Estimated average difference in productivity (IT - Operations): -2.48 points.
Significance: A direct test of this difference is not provided in the summary. It requires a specific linear hypothesis test.
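One simple way to obtain such a test (a sketch, not part of the original analysis) is to refit modB with Operations as the reference level; the DepartmentIT coefficient then directly estimates the IT-vs-Operations difference, and its t-test gives the significance:

# Refit with "Operations" as the baseline department (illustrative sketch)
Employee_ref <- Employee
Employee_ref$Department <- relevel(Employee_ref$Department, ref = "Operations")
modB_ref <- lm(Productivity ~ Training_Attended + Satisfaction + Hours_Worked +
                 Tenure + Remote_Work + Salary + Department, data = Employee_ref)
summary(modB_ref)$coefficients["DepartmentIT", ]  # estimate, SE, t value, p-value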
We check the assumption that the model’s residuals are normally distributed using a Q-Q plot.
Conclusion on Normality: The points on the plot fall
reasonably close to the diagonal line, suggesting the normality
assumption is met.
🎓 End of Lecture Series