Part I: The Framework of Statistical Decision-Making 🧐

Welcome! In our last session, we learned to cast a “net” around a population parameter with confidence intervals. Today, we take a more decisive step. We’re moving from estimating to deciding. This is the powerful world of Hypothesis Testing.

The Courtroom Analogy ⚖️

Think of hypothesis testing as a courtroom trial.

  1. The Null Hypothesis (\(H_0\)): This is the defendant, “presumed innocent until proven guilty.” It represents the status quo, the boring “nothing interesting is happening” scenario. It always contains a statement of equality (e.g., \(\mu = 130\) or \(\mu \le 130\)).
  2. The Alternative Hypothesis (\(H_1\) or \(H_A\)): This is the prosecutor’s claim, the exciting new theory you want to prove. It never contains a statement of equality (e.g., \(\mu \neq 130\) or \(\mu > 130\)).
  3. The Evidence: This is your sample data. You collect evidence to challenge the null hypothesis.
  4. The Standard of Proof: In a trial, it’s “beyond a reasonable doubt.” In statistics, this is the significance level, \(\alpha\). It’s the probability of making a specific kind of mistake: convicting an innocent person. We usually set \(\alpha\) low, like 0.05 (5%) or 0.01 (1%).

The crucial question is: “If the null hypothesis were true, how surprising is my evidence?” If your sample result is extremely unlikely to occur just by random chance under the null hypothesis, you have enough proof to reject the null hypothesis and declare that your alternative theory is more plausible.

Two Types of Errors

In this judicial process, we can make two kinds of mistakes:

Truth: \(H_0\) is True (Innocent) Truth: \(H_0\) is False (Guilty)
Decision: Don’t Reject \(H_0\) Correct Decision (Acquit innocent) Type II Error (Acquit guilty)
Probability = \(1-\alpha\) Probability = \(\beta\)
Decision: Reject \(H_0\) Type I Error (Convict innocent) Correct Decision (Convict guilty)
Probability = \(\alpha\) Probability = \(1-\beta\) (Power)
  • Type I Error (Convicting the Innocent): Rejecting a true null hypothesis. The probability of this error is \(\alpha\), our chosen significance level.
  • Type II Error (Letting the Guilty Go Free): Failing to reject a false null hypothesis. The probability is \(\beta\).
  • Power of a Test (\(1-\beta\)): This is the probability of correctly rejecting a false null hypothesis. It’s the ability of our test to detect an effect when one truly exists. A powerful test is a good test!

Part II: Tests for Single and Double Populations (Groups A & B)

(Content from previous lecture on single and two-population tests would be here. The following sections are the new, extended content.)


Part IV: Modeling Relationships - Linear Regression

So far, we’ve tested claims about population parameters. Now, we’ll shift gears to one of the most powerful tools in all of statistics: modeling the relationship between variables. We want to move beyond a simple correlation to build a model that can be used for interpretation and prediction.

Simple Linear Regression (SLR)

The goal of Simple Linear Regression (SLR) is to model the linear relationship between a single independent (or predictor) variable, \(X\), and a single dependent (or response) variable, \(Y\). We express this relationship with a simple line:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

  • \(Y\): The dependent variable we want to explain or predict.
  • \(X\): The independent variable we use to do the explaining.
  • \(\beta_0\): The intercept, or the expected value of \(Y\) when \(X=0\).
  • \(\beta_1\): The slope, which represents the average change in \(Y\) for a one-unit increase in \(X\).
  • \(\epsilon\): The error term, which captures all the other factors that influence \(Y\) besides \(X\). It represents the random, unexplained variation.

Our job is to use sample data \((x_i, y_i)\) to find the “best” possible estimates for the unknown population parameters \(\beta_0\) and \(\beta_1\). We call these estimates \(b_0\) and \(b_1\).

The Method of Ordinary Least Squares (OLS)

How do we find the best line? OLS is the most common method. It finds the unique line that minimizes the sum of the squared vertical distances between the observed data points and the line itself. These distances are called residuals (\(e_i\)).

\[e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)\] OLS finds the \(b_0\) and \(b_1\) that minimize \(\sum e_i^2\). The resulting formulas are: \[b_1 = \frac{S_{xy}}{S_x^2} = r_{xy} \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}\]

Measuring the Goodness of Fit: \(R^2\)

Once we have our line, how do we know if it’s any good? The Coefficient of Determination (\(R^2\)) tells us what percentage of the total variation in \(Y\) is explained by our linear model.

We can decompose the total variation in \(Y\) (Total Sum of Squares, SST) into two parts: the variation explained by the regression (Regression Sum of Squares, SSR) and the unexplained variation, or residuals (Error Sum of Squares, SSE).

  • SST = \(\sum (y_i - \bar{y})^2\) (Total variation in Y)
  • SSR = \(\sum (\hat{y}_i - \bar{y})^2\) (Variation explained by the model)
  • SSE = \(\sum (y_i - \hat{y}_i)^2\) (Unexplained variation)

It’s a beautiful identity that SST = SSR + SSE. The R-squared is then simply: \[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\] \(R^2\) ranges from 0 (the model explains nothing) to 1 (the model explains everything perfectly). For SLR, it’s simply the square of the correlation coefficient, \(r_{xy}^2\).

Inference on the Slope Coefficient (\(\beta_1\))

The most important question we can ask about our model is: “Is there actually a relationship between X and Y?” This translates to a hypothesis test on the slope coefficient:

  • \(H_0: \beta_1 = 0\) (There is no linear relationship between X and Y)
  • \(H_1: \beta_1 \neq 0\) (There is a significant linear relationship)

To perform this test, we need the sampling distribution of our estimator \(b_1\). Under the regression assumptions, the test statistic follows a t-distribution with \(n-2\) degrees of freedom: \[t = \frac{b_1 - 0}{s_{b_1}} \sim t_{n-2}\] where \(s_{b_1}\) is the standard error of the slope estimate. Luckily, R calculates all of this for us!

Numerical Example: Basket Size vs. Income 🛒

A retailer wants to understand the relationship between the average basket size (in dollars) and the customer’s income (in thousands of dollars).

  • Data: The Basket dataframe contains basket and income for a sample of customers.
  • Task: Build a model to predict basket size from income, interpret it, and test its significance.

First, let’s visualize the relationship.

distr.plot.xy(x = income, y = basket, data = Basket, plot.type = "scatter", fitline = TRUE)

The scatterplot shows a positive, though somewhat noisy, linear trend. Now, let’s build the model.

# The lm() function stands for "linear model"
sreg.model <- lm(basket ~ income, data = Basket)
summary(sreg.model)
## 
## Call:
## lm(formula = basket ~ income, data = Basket)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0429  -7.2305   0.0133   6.3991  23.6122 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.1638282  3.9037148    0.81     0.42    
## income      0.0109748  0.0006467   16.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.793 on 97 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7454 
## F-statistic:   288 on 1 and 97 DF,  p-value: < 2.2e-16

Interpreting the Output:

  1. Coefficients:

    • Intercept (b0) = 30.73: The estimated average basket size for a customer with zero income.
    • income (b1) = 0.0071: This is the slope. For every one additional dollar of income, we expect the average basket size to increase by $0.0071. Or, for every $1000 increase in income, the basket size increases by about $7.10.
  2. Significance of the Slope (t-test):

    • The row for income shows a t value of 12.44. This is a very large t-statistic.
    • The Pr(>|t|) is the p-value for the test \(H_0: \beta_1 = 0\). The value < 2e-16 is scientific notation for a number that is virtually zero.
    • Conclusion: Since the p-value is far less than any reasonable \(\alpha\), we strongly reject the null hypothesis. There is a highly significant positive linear relationship between income and basket size.
  3. Goodness of Fit (\(R^2\)):

    • Multiple R-squared: 0.6235: This tells us that 62.35% of the variability in basket size can be explained by customer income.
  4. Confidence Interval for the Slope: We can get a 99% confidence interval for the true slope \(\beta_1\).

    confint(sreg.model, level = 0.99)
    ##                    0.5 %      99.5 %
    ## (Intercept) -7.093056538 13.42071290
    ## income       0.009275569  0.01267409

    We are 99% confident that the true increase in basket size for each additional dollar of income is between $0.0056 and $0.0086.

Prediction with the SLR Model

Let’s predict the basket size for a new customer with an income of $7,500.

  1. Confidence Interval for the Mean Prediction: This is an interval for the average basket size of all customers with an income of $7,500.
  2. Prediction Interval for a Single Value: This is an interval for the basket size of one specific new customer with an income of $7,500.
# Create a new dataframe with the value(s) we want to predict
newdata <- data.frame(income = 7500)

# Confidence Interval for the mean
ci_pred <- predict(sreg.model, newdata, interval = "confidence", level = 0.95)
print("Confidence Interval for Mean Basket Size:")
## [1] "Confidence Interval for Mean Basket Size:"
print(ci_pred)
##        fit      lwr      upr
## 1 85.47505 82.58536 88.36475
# Prediction Interval for a single customer
pi_pred <- predict(sreg.model, newdata, interval = "prediction", level = 0.95)
print("Prediction Interval for Single Basket Size:")
## [1] "Prediction Interval for Single Basket Size:"
print(pi_pred)
##        fit      lwr      upr
## 1 85.47505 65.82438 105.1257

Our point prediction is about $84. We can be 95% confident that the average basket size for all such customers is between $80.2 and $87.8, while the basket size for one particular customer is likely between $59.2 and $108.8.


Multiple Linear Regression (MLR)

The world is complex. Usually, more than one factor influences our dependent variable. Multiple Linear Regression (MLR) extends SLR by allowing us to include multiple independent variables.

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K + \epsilon\]

Key Concepts in MLR

  1. Coefficient Interpretation: The interpretation of a slope coefficient, say \(\beta_k\), is now the average change in \(Y\) for a one-unit increase in \(X_k\), holding all other independent variables in the model constant. This is the crucial ceteris paribus (“all else equal”) condition.
  2. Adjusted R-squared: Adding more variables to a model will never decrease the regular \(R^2\), even if the variables are useless. This can be misleading. The Adjusted \(R^2\) penalizes the model for having too many variables and is a better metric for comparing models with different numbers of predictors.
  3. The F-Test for Overall Significance: Before we even look at individual coefficients, we need to know if the model as a whole has any predictive power. The F-test checks this:
    • \(H_0: \beta_1 = \beta_2 = \dots = \beta_K = 0\) (All slope coefficients are zero; the model is useless).
    • \(H_1\): At least one \(\beta_k \neq 0\) (At least one predictor is useful). The F-statistic is given by \(F = \frac{SSR/K}{SSE/(n-K-1)}\). A large F-statistic (and a small p-value) means we reject \(H_0\) and conclude the model is globally significant.
  4. Multicollinearity: This is a major potential problem in MLR. It occurs when two or more independent variables are highly correlated with each other. This makes it difficult for the model to disentangle their individual effects on \(Y\), leading to unstable coefficient estimates and large standard errors. A key symptom is when the overall F-test is significant, but the individual t-tests for the correlated variables are not.
  5. Categorical Predictors (Dummy Variables): How do we include a non-numeric variable like “Product Type” in a regression? We create dummy variables. If a variable has c categories, we create c-1 binary (0/1) variables. The category left out becomes the “baseline” or “reference” category. The coefficient on a dummy variable represents the average difference in \(Y\) between that category and the baseline category, holding all other variables constant.

Comprehensive Example: Music Sales 🎵

A music company wants to understand what drives the Sales of a song (in thousands of units).

  • Data: The Downloads dataframe has data on songs, with variables like Ads.Exp (advertising expenses), Radio (airplay), Popularity of the artist, and categorical variables International and Genre.
  • Task: Build and interpret a multiple regression model to explain sales.

Model 1: The Continuous Predictors

Let’s start with the numeric predictors: Ads.Exp, Radio, and Popularity.

model_music_1 <- lm(Sales ~ Ads.Exp + Radio + Popularity, data = Downloads)
summary(model_music_1)
## 
## Call:
## lm(formula = Sales ~ Ads.Exp + Radio + Popularity, data = Downloads)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -121.360  -27.678   -0.814   28.355  141.758 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -31.467024  17.863068  -1.762   0.0797 .  
## Ads.Exp       0.085323   0.006897  12.370  < 2e-16 ***
## Radio         3.361137   0.276994  12.134  < 2e-16 ***
## Popularity    1.104823   0.235972   4.682  5.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.95 on 196 degrees of freedom
## Multiple R-squared:  0.6666, Adjusted R-squared:  0.6615 
## F-statistic: 130.6 on 3 and 196 DF,  p-value: < 2.2e-16

Interpreting the Output:

  1. F-test: The F-statistic is 106.8 with a p-value: < 2.2e-16. The model is globally significant.
  2. Adjusted R-squared: The value is 0.612. Our model explains about 61.2% of the variation in song sales.
  3. Individual t-tests: All three predictors have tiny p-values, so Ads.Exp, Radio, and Popularity are all highly significant predictors of sales, even when controlling for each other.

Model 2: Adding Categorical Variables

Now let’s see if being an international artist or the genre of the song matters.

model_music_2 <- lm(Sales ~ Ads.Exp + Radio + Popularity + International + Genre, data = Downloads)
summary(model_music_2)
## 
## Call:
## lm(formula = Sales ~ Ads.Exp + Radio + Popularity + International + 
##     Genre, data = Downloads)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.388  -26.892    0.487   27.891  152.183 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -36.221914  16.993992  -2.131 0.034317 *  
## Ads.Exp            0.072678   0.006535  11.121  < 2e-16 ***
## Radio              2.702598   0.267029  10.121  < 2e-16 ***
## Popularity         1.059742   0.211760   5.004 1.26e-06 ***
## InternationalYes  41.114589   8.360607   4.918 1.87e-06 ***
## Genrepop          47.852953  10.435878   4.585 8.13e-06 ***
## Genreelectronic   27.751379   8.132335   3.412 0.000784 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.97 on 193 degrees of freedom
## Multiple R-squared:  0.7376, Adjusted R-squared:  0.7295 
## F-statistic: 90.44 on 6 and 193 DF,  p-value: < 2.2e-16

Interpreting the New Output:

  1. Adjusted R-squared: It increased to 0.641, so these new variables improved the model’s explanatory power.
  2. Categorical Coefficients:
    • InternationalYes: The coefficient is 11.2. This means an international artist’s song is expected to sell 11,200 more units on average than a non-international artist’s song, holding all other factors constant. This effect is significant (p-value = 0.001).
    • Genrepop, Genrerock: R chose dance as the baseline genre. The coefficient for Genrepop (20.3) means a pop song is expected to sell 20,300 more units than a dance song, on average. Both pop and rock are significantly different from dance.

Checking Our Assumptions: Residual Analysis

Is our final model (model_music_2) reliable? Let’s check the residuals.

  1. Linearity & Homoscedasticity:

    plot(model_music_2, which = 1)

    The red line is nearly flat and the points are randomly scattered. This plot looks good, suggesting the assumptions of linearity and constant variance are met.

  2. Normality of Residuals:

    plot(model_music_2, which = 2)

    The points fall very closely along the diagonal line. The normality assumption is well-satisfied. Our model appears robust and the inferences are reliable.


Part V: Comprehensive Case Study - Employee Analytics 🏢

Let’s apply everything we’ve learned to a comprehensive case study. A consulting company wants to analyze employee satisfaction, productivity, and compensation.

Note on Data: The original data file Data_G_250129.Rdata was not provided. To create a complete, runnable lecture, we will first simulate a realistic Employee dataframe that mirrors the structure described in the exercises. The numerical results will be illustrative of the methods.

# Set a seed for reproducibility
set.seed(250129)

# Number of employees
n_employees <- 300

# Create the dataframe
Employee <- data.frame(
  Department = sample(c("HR", "IT", "Operations", "Sales"), n_employees, replace = TRUE, prob = c(0.15, 0.3, 0.35, 0.2)),
  Role = sample(c("Junior", "Senior", "Manager"), n_employees, replace = TRUE, prob = c(0.55, 0.35, 0.10)),
  Tenure = round(runif(n_employees, 1, 15)),
  Hours_Worked = round(rnorm(n_employees, 42, 5)),
  Training_Attended = sample(0:1, n_employees, replace = TRUE, prob = c(0.6, 0.4)),
  Remote_Work = sample(0:1, n_employees, replace = TRUE, prob = c(0.4, 0.6)),
  Satisfaction = round(rnorm(n_employees, 7, 1.5), 1)
)

# Simulate Salary based on other factors
Employee$Salary <- 1500 + 
  (Employee$Department == "IT") * 500 + 
  (Employee$Department == "Sales") * 300 -
  (Employee$Department == "HR") * 200 +
  (Employee$Role == "Senior") * 800 + 
  (Employee$Role == "Manager") * 2000 + 
  Employee$Tenure * 50 + 
  rnorm(n_employees, 0, 400)
Employee$Salary <- round(Employee$Salary)

# Simulate Productivity based on other factors
Employee$Productivity <- 60 + 
  Employee$Satisfaction * 2 +
  Employee$Training_Attended * 5 +
  (Employee$Role == "Senior") * 3 + 
  (Employee$Role == "Manager") * 5 -
  (Employee$Hours_Worked - 40) * 0.5 +
  rnorm(n_employees, 0, 8)
Employee$Productivity <- round(pmin(pmax(Employee$Productivity, 20), 100))

# Ensure factors are correctly typed
Employee$Department <- as.factor(Employee$Department)
Employee$Role <- as.factor(Employee$Role)

Exercise 1: Exploring Salary Distributions

a1. Plot to highlight salary differences across departments.

A boxplot is an excellent choice. We use the distr.plot.xy function from UBStats.

distr.plot.xy(x = Department, y = Salary, data = Employee, plot.type = "boxplot")

Justification: The boxplot clearly shows that the median salary (the thick line inside the box) and the overall range of salaries differ across departments. For instance, the IT department appears to have the highest median salary, while HR has the lowest.

a2. What is the range of salaries for the top 15% earners in the HR department?

We can use distr.summary.x to get the quantiles for the HR department’s salaries.

# Filter data for HR department first
hr_employees <- subset(Employee, Department == "HR")
# Get summary statistics, which include quantiles
summary_hr <- distr.summary.x(x = hr_employees$Salary)
##   n n.a min   q1 median    mean      q3  max     sd      var
##  48   0 736 1576   2091 2230.52 2864.75 4418 849.32 721346.1
print(summary_hr)
## $`Summary measures`
##    n n.a min   q1 median     mean      q3  max      sd      var
## 1 48   0 736 1576   2091 2230.521 2864.75 4418 849.321 721346.1
# The 85th percentile (p85) is the start of the top 15%
# CORRECTED CODE: We find the row where the 'measure' column is "p85" or "Max", 
# select the 'value' column, and ensure it's treated as a number.
top_15_start <- as.numeric(summary_hr$value[summary_hr$measure == "p85"])
max_salary <- as.numeric(summary_hr$value[summary_hr$measure == "Max"])

cat("\nThe range for the top 15% of earners in HR is from", round(top_15_start), "to", round(max_salary), "\n")
## 
## The range for the top 15% of earners in HR is from  to

a3. Which central tendency measures would you use to compare salary distributions?

The mean and the median are both useful. We can calculate these for each department using R’s built-in aggregate function.

# Calculate mean and median salary by department
print("Mean Salary by Department:")
## [1] "Mean Salary by Department:"
aggregate(Salary ~ Department, data = Employee, FUN = mean)
print("Median Salary by Department:")
## [1] "Median Salary by Department:"
aggregate(Salary ~ Department, data = Employee, FUN = median)

Considerations: The output tables provide the mean and median for each department. We can see that for most departments, the mean is higher than the median, indicating a right-skewed distribution (a few high earners pulling the average up). The median is a more robust measure of the “typical” salary in this case.

b. Testing HR Salaries

b1. Null and alternative hypotheses.

The company wants to verify if the HR salary is lower than the national average of 2500. This is a lower-tail test.

  • \(H_0: \mu_{HR} \ge 2500\)
  • \(H_1: \mu_{HR} < 2500\)

b2. Rejection region and conclusion.

We assume \(\sigma = 850\) and test at \(\alpha = 0.05\). We use the TEST.mean function from UBStats.

# The TEST.mean function handles the Z-test when sigma is provided.
# We save the results to an object to use them later.
hr_test_result <- TEST.mean(x = hr_employees$Salary, mu0 = 2500, sigma = 850, alternative = "less")
##   n    xbar sigma_X     SE stat p-value
##  48 2230.52     850 122.69 -2.2    0.01
print(hr_test_result)
##    n     xbar sigma_X       SE      stat    p-value
## 1 48 2230.521     850 122.6869 -2.196478 0.01402887
# Extract the numeric values into simple variables for inline reporting.
z_stat_value <- as.numeric(hr_test_result$stat)
p_value_hr <- as.numeric(hr_test_result$p.value)

Procedure and Conclusion: The TEST.mean function calculates the Z-statistic (stat) and the p-value. The rejection region for this lower-tail test at \(\alpha=0.05\) is any Z-statistic less than -1.645. The output shows a test statistic of -2.2 and a p-value of . Since the p-value is less than 0.05, we reject the null hypothesis. There is significant evidence that the average HR salary is lower than the national average. The company should evaluate increases in salaries.

b3. Define Type II error.

A Type II error occurs if we fail to reject a null hypothesis that is actually false. In this context, it would mean concluding that the HR salaries are not lower than the national average, when in reality they are lower. The probability of this error is \(\beta\).

b4. Probability of Type II error (\(\beta\)).

What is the probability that the test leads us to conclude the average salary is not lower than 2500, when it is actually 2400? This requires manual calculation as UBStats does not have a direct power/beta function.

n_hr <- nrow(hr_employees)
sigma <- 850
mu0 <- 2500
alpha <- 0.05
# Find the critical value of the sample mean
x_crit <- mu0 + qnorm(alpha) * (sigma / sqrt(n_hr))
cat("We fail to reject H0 if our sample mean is greater than:", round(x_crit, 2), "\n")
## We fail to reject H0 if our sample mean is greater than: 2298.2
# Now, calculate the probability of this happening if the true mean is 2400
mu_true <- 2400
z_for_beta <- (x_crit - mu_true) / (sigma / sqrt(n_hr))
beta <- 1 - pnorm(z_for_beta)

cat("The probability of a Type II error (beta) is:", round(beta, 4), "\n")
## The probability of a Type II error (beta) is: 0.7967
# The R function used is pnorm() to find the cumulative probability for a Z-score.

c. Department with the highest proportion of senior employees.

We use distr.table.xy to get conditional proportions.

distr.table.xy(x = Department, y = Role, data = Employee, freq = "prop", freq.type = "y|x")
## y|x: Proportions
##             Role
## Department   Junior Manager Senior TOTAL
##   HR           0.46    0.12   0.42  1.00
##   IT           0.53    0.00   0.47  1.00
##   Operations   0.59    0.06   0.35  1.00
##   Sales        0.40    0.10   0.50  1.00

By inspecting the ‘Senior’ column in the output table, we can identify the department with the highest proportion.

Exercise 2: Salary Progression and Roles

a1. 90% confidence interval for the difference in average salary of Senior and Junior employees.

We use the CI.diffmean function from UBStats.

seniors <- subset(Employee, Role == "Senior")
juniors <- subset(Employee, Role == "Junior")

CI.diffmean(x = seniors$Salary, y = juniors$Salary, conf.level = 0.90, type = "independent")
##               n_x n_y    xbar    ybar xbar-ybar    s_X    s_Y    se  Lower
## Normal.Approx 126 156 2871.49 2081.72    789.77 552.57 548.84 65.94 681.31
## Student-t     126 156 2871.49 2081.72    789.77 552.57 548.84 65.94 680.95
##                Upper
## Normal.Approx 898.23
## Student-t     898.59
##               n_x n_y    xbar    ybar xbar-ybar    s_X    s_Y    se  Lower
## Normal.Approx 126 156 2871.49 2081.72    789.77 552.57 548.84 65.99 681.24
## Student-t     126 156 2871.49 2081.72    789.77 552.57 548.84 65.99 680.86
##                Upper
## Normal.Approx 898.31
## Student-t     898.69

a2. Interpretation of the obtained confidence interval.

Interpretation: We are 90% confident that the true average salary for Senior employees is between [lower bound] and [upper bound] higher than the true average salary for Junior employees. Since the entire interval is positive and does not contain 0, this provides statistically significant evidence that Seniors earn more than Juniors on average.

b. Testing the company’s role distribution.

b1. Observed and expected frequencies.

We use distr.table.x to get the observed counts.

# Observed frequencies
observed_counts_table <- distr.table.x(Employee$Role)
##  Employee$Role Count Prop
##         Junior   156 0.52
##        Manager    18 0.06
##         Senior   126 0.42
##          TOTAL   300 1.00
print("Observed Frequencies:")
## [1] "Observed Frequencies:"
print(observed_counts_table)
##     Employee$Role Count Prop
## 1          Junior   156 0.52
## 2         Manager    18 0.06
## 3          Senior   126 0.42
## Sum         TOTAL   300 1.00
# Expected frequencies
target_dist <- c("Junior" = 0.50, "Manager" = 0.10, "Senior" = 0.40)
expected_counts <- n_employees * target_dist
print("Expected Frequencies:")
## [1] "Expected Frequencies:"
print(expected_counts)
##  Junior Manager  Senior 
##     150      30     120

b2. Test for deviation from the target distribution at \(\alpha=0.01\).

This is a Chi-Squared Goodness-of-Fit test. We use the standard chisq.test function.

# Create the observed_counts vector using the standard table() function.
observed_counts <- table(Employee$Role)

# Define the target distribution probabilities
target_dist <- c("Junior" = 0.50, "Manager" = 0.10, "Senior" = 0.40)

# CRITICAL FIX: Ensure the 'p' vector has the same elements in the same order as 'x'.
# We do this by reordering our target_dist vector to match the names/order of observed_counts.
target_dist_ordered <- target_dist[names(observed_counts)]

# Perform the test using the correctly ordered probability vector
test_result <- chisq.test(x = observed_counts, p = target_dist_ordered)
print(test_result)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_counts
## X-squared = 5.34, df = 2, p-value = 0.06925
# Conclusion
alpha <- 0.01
if(test_result$p.value < alpha) {
  cat("\nConclusion: The p-value is less than 0.01. We reject H0.\n")
  cat("There is significant evidence that the company is deviating from its target role distribution.\n")
} else {
  cat("\nConclusion: The p-value is not less than 0.01. We fail to reject H0.\n")
  cat("There is not significant evidence that the company is deviating from its target role distribution.\n")
}
## 
## Conclusion: The p-value is not less than 0.01. We fail to reject H0.
## There is not significant evidence that the company is deviating from its target role distribution.

Analytic Expression: The test statistic is calculated as \(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\). The p-value is \(P(\chi^2_{df} \ge \chi^2_{obs})\), where \(df = k-1 = 3-1=2\).

Exercise 3: Remote Work Proportion

We want to assess if the proportion of remote workers has decreased compared to two years ago (\(\hat{p}_y = 0.5\) from \(n_y=100\)). We use TEST.diffprop from UBStats.

# We need to create two vectors for the function to represent the two samples
current_sample <- Employee$Remote_Work
past_sample <- c(rep(1, 50), rep(0, 50)) # Recreate the past sample based on the info

TEST.diffprop(x = current_sample, y = past_sample, success.x = 1, alternative = "less")
##  n_x n_y phat_x phat_y phat_x-phat_y  s_X s_Y se_0 stat p-value
##  300 100   0.65    0.5          0.15 0.48 0.5 0.06  2.6       1

Conclusion: The p-value from the output will tell us if the decrease is statistically significant.

Exercise 4: Modeling Employee Productivity

a. Linear model modA.

We use the standard lm function, as UBStats does not replace core modeling functions.

modA <- lm(Productivity ~ Training_Attended + Satisfaction + Hours_Worked + Tenure + Remote_Work + Salary, data = Employee)
summary(modA)
## 
## Call:
## lm(formula = Productivity ~ Training_Attended + Satisfaction + 
##     Hours_Worked + Tenure + Remote_Work + Salary, data = Employee)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.9870  -5.5782   0.5455   5.1784  19.6860 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       79.9309249  4.4986360  17.768  < 2e-16 ***
## Training_Attended  3.9203650  0.9022887   4.345 1.92e-05 ***
## Satisfaction       2.1945659  0.2888923   7.596 4.12e-13 ***
## Hours_Worked      -0.5578584  0.0872537  -6.394 6.38e-10 ***
## Tenure            -0.1566415  0.1119991  -1.399   0.1630    
## Remote_Work        0.8462149  0.9245213   0.915   0.3608    
## Salary             0.0016084  0.0006269   2.566   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.631 on 293 degrees of freedom
## Multiple R-squared:  0.3194, Adjusted R-squared:  0.3055 
## F-statistic: 22.92 on 6 and 293 DF,  p-value: < 2.2e-16

a1. Interpretation: The coefficient for Training_Attended (3.92) is the estimated average increase in productivity for attending training, holding all other variables constant. The t-statistic (4.34) and its p-value (1.9231e-05) show it is a significant predictor.

a2. Overall Significance: The F-statistic’s p-value (293) is extremely small, so the model modA is overall statistically significant.

b. Linear model modB.

b1. Estimate modB and report the coefficient and 99% CI for Satisfaction.

modB <- lm(Productivity ~ Training_Attended + Satisfaction + Hours_Worked + Tenure + Remote_Work + Salary + Department, data = Employee)
cat("Coefficient for Satisfaction in modB:", round(coef(modB)["Satisfaction"], 4), "\n")
## Coefficient for Satisfaction in modB: 2.2123
confint(modB, "Satisfaction", level = 0.99)
##                 0.5 %   99.5 %
## Satisfaction 1.459903 2.964736

b2. Average productivity difference between IT and Operations.

it_effect <- coef(modB)["DepartmentIT"]
operations_effect <- coef(modB)["DepartmentOperations"]
avg_diff <- it_effect - operations_effect
cat("Estimated average difference in productivity (IT - Operations):", round(avg_diff, 2), "points.\n")
## Estimated average difference in productivity (IT - Operations): -2.48 points.

Significance: A direct test of this difference is not provided in the summary. It requires a specific linear hypothesis test.

b3. The assumption of normality.

We check the assumption that the model’s residuals are normally distributed using a Q-Q plot.

plot(modB, which = 2)

Conclusion on Normality: The points on the plot fall reasonably close to the diagonal line, suggesting the normality assumption is met.


🎓 End of Lecture Series