BIOS 507 – Homework 3

Author

Elaina Sinclair

Published

February 10, 2026

Background

library(tidyverse)
library(ggplot2)
library(prettyR)

Problem 1

For each of the situations below, define the predictor variable and the response variable, write out the population-level model, and interpret the slope. Note that since you will not have “numbers” for the slope, you can just interpret everything in terms of “β1”.

A researcher is interested in the relationship between salary and number of years of experience in software development jobs. They collect data on 100 developers with a range of 5–20 years of experience, and measure each person’s salary in thousands of dollars.

Answer: Y_i = \beta_0 +\beta_1*X_i +\epsilon_i

Response variable = Y_i: Software developer salary

Predictor variable = X_i: Software developer experience (in years, from 5 to 20)

Slope = \beta_1; the average change in salary (in thousands of dollars) associated with a one-unit (one-year) increase in experience X_i (from 5-20 years), for software developers in the population of interest.

A hardware store chain is interested in running a new sales promotion on refrigerators, and they are trying to assess the relationship between the sale price and total refrigerator units sold. The company selects 40 stores across the country and has each of them apply a different discount with values ranging from 100 to 1000 dollars. The sales promotion is continued for 8 weeks, and at the end of the 8 weeks the total number of refrigerators sold is measured.

Answer: Y_i = \beta_0 +\beta_1*X_i +\epsilon_i

Response variable = Y_i; the number of refrigerators sold

Predictor variable = X_i; the sales price discount (in dollars, from 100 to 1,000)

Slope = \beta_1, the average change in refrigerator units sold associated with a one-unit (one-dollar) increase in discount value X_i (from 100-1,000 dollars), for refrigerators sold in the 40 participating stores over the 8-week sales promotion period (study period).

Problem 2

On Canvas, you have a data set called solar.txt containing data collected during a solar energy project at Georgia Tech. The data contain several columns, but for now we are going to focus on heat flux (column labeled Y) measured in kilowatts and radial deflection of the deflected rays (column labeled X4) measured in milliradians. The researchers are interested in using the radial deflection to predict the heat flux.

Read in the data:

solar <- read.table("C:/Users/esincl3/OneDrive - Emory/Documents/PhD Spring 2026/BIOS 507/solar.txt", header = TRUE, sep = " ")

What exploratory analyses should you do using the data? Conduct these and report your findings as well as any supporting figures.

Answer: Exploratory analyses should include the mean, median, and range of the dependent (Heat Flux) and independent (Deflected Rays) variables, as well as a scatter plot to visualize their relationship.

Heat Flux (solar$y)

mean(solar$y)

[1] 249.6379

median(solar$y)

[1] 257.9

range(solar$y)

[1] 181.5 278.7

hist(solar$y)

Deflected Rays (solar$x4)

mean(solar$x4)

[1] 16.70207

median(solar$x4)

[1] 16.45

range(solar$x4)

[1] 15.54 19.05

hist(solar$x4)

Scatter Plot: Heat Flux and Deflected Rays

ggplot(data = solar, aes(x = x4, y = y)) +
  geom_point() +
  labs(
    x = "Deflected Rays",
    y = "Heat Flux",
    title = "Scatter plot of Heat Flux vs Deflected Rays"
  ) +
  theme_minimal()

Write out the assumed regression model for Y. What are your assumptions about the model error?

Answer: Assumed regression model: Y_i = \beta_0 +\beta_1*X_i +\epsilon_i
Y_i = Heat Flux
X_i = Deflected Rays
\beta_0 = Intercept
\beta_1 = Slope, the average change in heat flux (kW) associated with a one-unit (one-milliradian) increase in deflected rays
\epsilon_i = error term

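Assumptions about error:
1. We assume the variance of \epsilon_i is the same (\sigma^2) at every value of x4 (homoscedasticity)
2. For each value of x4 (Deflected Rays), Y (Heat Flux) is a univariate random variable with a finite mean and variance
3. The errors \epsilon_i, and therefore the observations, are independent
4. A linear model is appropriate, meaning the errors have mean 0 at every value of x4
5. The errors \epsilon_i are normally distributed, so Y (Heat Flux) is normally distributed at each value of x4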

Fit the model using R or SAS. Write out the estimated model.

model1 <- lm(y ~ x4, data = solar)

summary(model1)


Call:
lm(formula = y ~ x4, data = solar)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.2487  -4.5029   0.5202   7.9093  24.5080 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  607.103     42.906  14.150 5.24e-14 ***
x4           -21.402      2.565  -8.343 5.94e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.33 on 27 degrees of freedom
Multiple R-squared:  0.7205,    Adjusted R-squared:  0.7102 
F-statistic: 69.61 on 1 and 27 DF,  p-value: 5.935e-09

Answer: \hat{\beta}_0 = 607.103, \hat{\beta}_1 = -21.402

By “hand”:

Slope estimate:

# Get the coef estimates
X = cbind(1, solar$x4) # create the model matrix
Y = solar$y # call this Y to be consistent with our notation
betahat = solve(t(X) %*% X) %*% t(X) %*% Y # MLE
betahat

          [,1]
[1,] 607.10327
[2,] -21.40246

Estimated regression model: \hat{Y}_i = 607.103 - 21.402*X_i
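For simple linear regression, the matrix formula used above reduces to the familiar closed forms: slope = S_xy / S_xx and intercept = \bar{y} - slope * \bar{x}. The sketch below checks that equivalence on a small made-up data set. The vectors x and y are hypothetical values chosen only for illustration; they are not the solar data, and the names Sxy, Sxx, b0, and b1 are introduced here.

```r
# Hypothetical toy data (NOT the solar data), used only to illustrate the algebra
x <- c(15.5, 16.0, 16.5, 17.0, 18.0, 19.0)
y <- c(275, 262, 255, 240, 222, 200)

# Closed-form OLS estimates for simple linear regression
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx                  # slope
b0 <- mean(y) - b1 * mean(x)     # intercept

# Same matrix formula as in the "by hand" step above
X <- cbind(1, x)
betahat <- solve(t(X) %*% X) %*% t(X) %*% y

# Both comparisons should report TRUE (up to floating-point tolerance)
all.equal(c(b0, b1), as.numeric(betahat))
all.equal(c(b0, b1), unname(coef(lm(y ~ x))))
```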

Carry out a hypothesis test to test the null hypothesis that the slope is 0. Be sure to write out the α, the null and alternative hypothesis, the test statistic, critical value, and your final decision. Interpret the result in the context of the study.

Get the critical value:

N = length(solar$y) # sample size
df = N-2
qt(0.975, df = df)

[1] 2.051831

Answer:
\alpha = 0.05
H_0: \beta_1 = 0
H_A: \beta_1 \neq 0
Test statistic (T) = -8.343 (obtained from summary() above)
Critical value (t) = 2.051831
Decision: Given that T is not within the interval [-t, t], we reject the null hypothesis.

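Interpretation: At \alpha = 0.05, there is statistically significant evidence of a linear association between radial deflection and heat flux. Each one-milliradian increase in radial deflection is associated with an estimated decrease of about 21.4 kW in mean heat flux.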
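As a check on the test above, the test statistic can be recomputed as the slope estimate divided by its standard error. This is a minimal sketch that hard-codes the rounded values printed by summary(model1), so the result agrees with the reported t value only up to rounding. The names b1_hat, se_b1, t_stat, t_crit, and p_val are introduced here.

```r
# Values copied (rounded) from the summary(model1) output above
b1_hat <- -21.402   # slope estimate
se_b1  <- 2.565     # standard error of the slope
df     <- 27        # residual degrees of freedom (N - 2)

# Test statistic for H_0: beta_1 = 0
t_stat <- b1_hat / se_b1

# Two-sided critical value and p-value at alpha = 0.05
t_crit <- qt(0.975, df = df)
p_val  <- 2 * pt(-abs(t_stat), df = df)

# TRUE means |T| exceeds the critical value, so H_0 is rejected
abs(t_stat) > t_crit
```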

Find and interpret a 99% confidence interval for the slope.

confint(model1, level = 0.99)

                0.5 %    99.5 %
(Intercept) 488.22411 725.98242
x4          -28.50995 -14.29497

Interpretation: We can be 99% confident that the true slope is between -28.51 and -14.29; that is, for every one-milliradian increase in radial deflection, mean heat flux decreases by between 14.29 and 28.51 kW.
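The interval from confint() can be reproduced as estimate ± t multiplier × standard error. The sketch below again uses the rounded values from summary(model1), so it should match the confint() output only to about two decimal places. The names b1_hat, se_b1, t_mult, and ci are introduced here.

```r
# Values copied (rounded) from the summary(model1) output above
b1_hat <- -21.402
se_b1  <- 2.565
df     <- 27

# 99% confidence level: alpha = 0.01, so use the 0.995 quantile
alpha  <- 0.01
t_mult <- qt(1 - alpha / 2, df = df)

# Lower and upper limits of the interval for the slope
ci <- b1_hat + c(-1, 1) * t_mult * se_b1
ci
```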

Find and interpret a 95% confidence interval for the mean heat flux when the radial deflection is 16.5 milliradians.

predict(model1, newdata = data.frame(x4 = 16.5), interval = "confidence", level = 0.95)

       fit      lwr      upr
1 253.9627 249.1468 258.7787

Interpretation: We can be 95% confident that when radial deflection is 16.5 milliradians, the mean heat flux is between 249.15 kW and 258.78 kW.

The lab would like to predict the heat flux when the radial deflection is 16.5 milliradians for a new measurement. Give a 95% prediction interval on the kilowatts.

predict(model1, newdata = data.frame(x4 = 16.5), interval = "prediction", level = 0.95)

       fit     lwr      upr
1 253.9627 228.214 279.7114

Interpretation: For a new measurement taken at a radial deflection of 16.5 milliradians, the model predicts a heat flux of 253.96 kW, and we are 95% confident that the heat flux for that measurement will be between 228.21 kW and 279.71 kW.

Which interval is wider? Why?

Answer: The prediction interval is wider. The confidence interval estimates the mean heat flux at a radial deflection of 16.5 milliradians, so it reflects only the uncertainty in estimating the regression line. The prediction interval targets a single new measurement at 16.5 milliradians, so it must account for that same uncertainty plus the variation of an individual observation around the mean (the error variance \sigma^2). Because there is less certainty about one specific measurement than about an average outcome, the prediction interval is always wider than the confidence interval at the same x and confidence level.
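The extra width can be made concrete using the numbers already reported. In simple linear regression the squared standard error for predicting a new observation equals the squared standard error of the estimated mean plus the residual variance s^2. The sketch below backs both standard errors out of the two printed intervals and checks that relationship. It relies on rounded output, so agreement is approximate, and the names se_mean, se_pred, conf_int, and pred_int are introduced here.

```r
# Values copied from the output above
t_crit   <- qt(0.975, df = 27)
s        <- 12.33                    # residual standard error from summary(model1)
conf_int <- c(249.1468, 258.7787)    # 95% CI for the mean at x4 = 16.5
pred_int <- c(228.214, 279.7114)     # 95% PI for a new observation at x4 = 16.5

# Half-width divided by the t multiplier recovers each standard error
se_mean <- (diff(conf_int) / 2) / t_crit
se_pred <- (diff(pred_int) / 2) / t_crit

# The two printed values should be close to each other, up to rounding
c(se_pred = se_pred, from_decomposition = sqrt(se_mean^2 + s^2))
```

Most of the prediction interval's width here comes from s rather than from uncertainty in the fitted line, which is why collecting more data would narrow the confidence interval much more than the prediction interval.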

Problem 3

This example is adapted from “A modern approach to regression with R” by Simon Sheather. The manager of the purchasing department of a large company is interested in developing a regression model to predict the average amount of time it takes to process a given number of invoices. Data were collected over a period of 30 days. For each data point, information was collected on:

• The number of invoices processed (Invoices in the dataset)

• The number of hours it took to process the set of invoices (Time in the dataset)

The data are provided on Canvas in invoices.txt.

Read in the data:

invoice <- read.table("C:/Users/esincl3/OneDrive - Emory/Documents/PhD Spring 2026/BIOS 507/invoices.txt", header = TRUE, sep = "\t")

What exploratory analyses should you do using the data? Conduct these and report your findings as well as any supporting figures.

Answer: Exploratory analyses should include the mean, median, and range of the dependent (Time) and independent (Invoices) variables, as well as a scatter plot to visualize their relationship.

Invoices (invoice$Invoices)

mean(invoice$Invoices)

[1] 130.0333

median(invoice$Invoices)

[1] 127.5

range(invoice$Invoices)

[1]  23 289

hist(invoice$Invoices)

Processing Time (invoice$Time)

mean(invoice$Time)

[1] 2.11

median(invoice$Time)

[1] 2

range(invoice$Time)

[1] 0.8 4.1

hist(invoice$Time)

Scatter Plot: Invoices and Processing Time

ggplot(data = invoice, aes(x = Invoices, y = Time)) +
  geom_point() +
  labs(
    x = "Invoices",
    y = "Time to Process",
    title = "Scatter plot of Time to Process vs Invoices"
  ) +
  theme_minimal()

Write out the assumed regression model for Y. What are your assumptions about the model error?

Answer: Assumed regression model: Y_i = \beta_0 +\beta_1*X_i +\epsilon_i
Y_i = Time to process
X_i = Invoices
\beta_0 = Intercept
\beta_1 = Slope, the average change in time to process (hours) associated with a one-unit (one-invoice) increase in invoices
\epsilon_i = error term

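Assumptions about error:
1. We assume the variance of \epsilon_i is the same (\sigma^2) at every value of x (homoscedasticity)
2. For each value of x (Invoices), Y (Time to process) is a univariate random variable with a finite mean and variance
3. The errors \epsilon_i, and therefore the observations, are independent
4. A linear model is appropriate, meaning the errors have mean 0 at every value of x
5. The errors \epsilon_i are normally distributed, so Y (Time to process) is normally distributed at each value of x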

Fit the model using R or SAS. Write out the estimated model.

model2 <- lm(Time ~ Invoices, data = invoice)

summary(model2)


Call:
lm(formula = Time ~ Invoices, data = invoice)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59516 -0.27851  0.03485  0.19346  0.53083 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.6417099  0.1222707   5.248 1.41e-05 ***
Invoices    0.0112916  0.0008184  13.797 5.17e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3298 on 28 degrees of freedom
Multiple R-squared:  0.8718,    Adjusted R-squared:  0.8672 
F-statistic: 190.4 on 1 and 28 DF,  p-value: 5.175e-14

Answer: \hat{\beta}_0 = 0.6417, \hat{\beta}_1 = 0.01129

By “hand”:

Slope estimate:

# Get the coef estimates
X = cbind(1, invoice$Invoices) # create the model matrix
Y = invoice$Time # call this Y to be consistent with our notation
betahat = solve(t(X) %*% X) %*% t(X) %*% Y # MLE
betahat

           [,1]
[1,] 0.64170988
[2,] 0.01129164

Estimated regression model: \hat{Y}_i = 0.6417 + 0.01129*X_i

Carry out a hypothesis test to test the null hypothesis that the slope is 0. Be sure to write out the α, the null and alternative hypothesis, the test statistic, critical value, and your final decision. Interpret the result in the context of the study.

Get the critical value:

N = length(invoice$Invoices) # sample size
df = N-2
qt(0.975, df = df)

[1] 2.048407

Answer:
\alpha = 0.05
H_0: \beta_1 = 0
H_A: \beta_1 \neq 0
Test statistic (T) = 13.797 (obtained from summary() above)
Critical value (t) = 2.048407
Decision: Given that T is not within the interval [-t, t], we reject the null hypothesis.

Interpretation: At \alpha = 0.05, there is statistically significant evidence of a linear association between the number of invoices and the processing time. Each additional invoice is associated with an estimated increase of about 0.0113 hours (roughly 0.68 minutes) in mean processing time.

Find and interpret a 99% confidence interval for the slope.

confint(model2, level = 0.99)

                  0.5 %    99.5 %
(Intercept) 0.303843730 0.9795760
Invoices    0.009030185 0.0135531

Interpretation: We can be 99% confident that the true slope is between 0.0090 and 0.0136; that is, for every one-unit increase in Invoices, the mean Time to Process increases by between 0.0090 and 0.0136 hours.

Find and interpret a 95% confidence interval for the amount of time it would take to process a stack of 160 invoices.

predict(model2, newdata = data.frame(Invoices = 160), interval = "confidence", level = 0.95)

       fit      lwr      upr
1 2.448373 2.315203 2.581543

Interpretation: We can be 95% confident that when Invoices = 160, the true mean Processing Time is between 2.32 and 2.58 hours.
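The fitted value reported by predict() is simply the estimated line evaluated at 160 invoices. A minimal sketch, using the coefficient values printed in the "by hand" step above and converting the interval to minutes for readability; the names b0_hat, b1_hat, fit_160, and ci_minutes are introduced here.

```r
# Coefficient estimates copied from the betahat output above
b0_hat <- 0.64170988
b1_hat <- 0.01129164

# Estimated mean processing time (hours) for a stack of 160 invoices
fit_160 <- b0_hat + b1_hat * 160
fit_160

# The 95% confidence limits reported above, converted from hours to minutes
ci_hours   <- c(2.315203, 2.581543)
ci_minutes <- ci_hours * 60
ci_minutes
```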

Find and interpret a 95% prediction interval for the amount of time it would take to process a new stack of 160 invoices.

predict(model2, newdata = data.frame(Invoices = 160), interval = "prediction", level = 0.95)

       fit      lwr      upr
1 2.448373 1.759861 3.136884

Interpretation: For a new stack of 160 invoices, the model predicts a Processing Time of 2.45 hours, and we can be 95% confident that the Processing Time for that stack will be between 1.76 and 3.14 hours.

Which interval is wider? Why?

Answer: The prediction interval is wider than the confidence interval. The confidence interval estimates the mean Processing Time for stacks of 160 invoices, so it reflects only the uncertainty in estimating the regression line. The prediction interval targets the Processing Time of a single new stack of 160 invoices, so it must account for that same uncertainty plus the variation of an individual observation around the mean (the error variance \sigma^2). Because there is less certainty about one specific measurement than about an average outcome, the prediction interval is always wider at the same x and confidence level.

OPTIONAL Problem 4

If you complete this problem, you will get 1 bonus point on your first midterm for each part you answer correctly (2 points possible).

Consider the simple linear model

y_i = \beta_0 + \beta_1 x_i + \epsilon_i, for i = 1, ..., N, with E(\epsilon_i) = 0 and Var(\epsilon_i) = \sigma^2, and the \epsilon_i are uncorrelated. Assume that we are using the OLS estimators for \beta_0 and \beta_1.

It can be shown that Cov(\bar{y}, \hat{\beta}_1) = 0. Provide some explanation as to why this must be true. Alternatively, you may prove it using properties of covariances.

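In simple linear regression, the fitted line is determined by two pieces of information: the overall level of the responses, summarized by \bar{y}, and the linear trend of y against x, summarized by the slope estimator \hat{\beta}_1. The slope estimator is calculated from deviations about the mean: it depends on the y's only through y_i - \bar{y} (the difference between an observation and \bar{y}), not on the value of \bar{y} itself. Adding the same constant to every y_i shifts \bar{y} but leaves \hat{\beta}_1 unchanged. Equivalently, \hat{\beta}_1 is a weighted sum of the y_i whose weights, (x_i - \bar{x}) / \sum_j (x_j - \bar{x})^2, sum to zero, while \bar{y} gives every y_i the same weight 1/N. Because the errors are uncorrelated with common variance \sigma^2, the covariance of these two linear combinations is (\sigma^2 / N) times the sum of the slope weights, which is 0. Therefore Cov(\bar{y}, \hat{\beta}_1) = 0: the two estimators are uncorrelated. (They are independent as well if the errors are also assumed to be normal, but that is not needed here.)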
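For completeness, the covariance argument can be written out as follows. This is a sketch using only the stated assumptions (uncorrelated errors with common variance \sigma^2) and treating the x_i as fixed. The weights k_i and the shorthand S_{xx} are notation introduced here.

```latex
\begin{align*}
\hat{\beta}_1 &= \sum_{i=1}^{N} k_i y_i,
  \qquad k_i = \frac{x_i - \bar{x}}{S_{xx}},
  \qquad S_{xx} = \sum_{i=1}^{N} (x_i - \bar{x})^2, \\
\mathrm{Cov}(\bar{y}, \hat{\beta}_1)
  &= \mathrm{Cov}\!\left( \frac{1}{N} \sum_{i=1}^{N} y_i,\; \sum_{j=1}^{N} k_j y_j \right)
   = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} k_j \,\mathrm{Cov}(y_i, y_j) \\
  &= \frac{\sigma^2}{N} \sum_{i=1}^{N} k_i
   = \frac{\sigma^2}{N S_{xx}} \sum_{i=1}^{N} (x_i - \bar{x})
   = 0.
\end{align*}
```

The third equality uses Cov(y_i, y_j) = \sigma^2 when i = j and 0 otherwise, and the last uses the fact that deviations from the mean sum to zero.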

It can be shown that Cov(\hat{\beta}_0, \hat{\beta}_1) = -\sigma^2 \bar{x} / \sum_{i=1}^N (x_i - \bar{x})^2 = -\bar{x} Var(\hat{\beta}_1).

What does this say about the relationship between the mean value for the predictor and the covariance between \hat{\beta}_0 and \hat{\beta}_1?

This expression shows that the covariance between the intercept and slope estimators is determined by the mean of the predictor (\bar{x}). If \bar{x} = 0, the covariance is 0 and the estimators \hat{\beta}_0 and \hat{\beta}_1 are uncorrelated. When \bar{x} is not 0, the two estimators are correlated, with the sign of the covariance opposite to the sign of \bar{x}. A larger positive \bar{x} implies a more negative covariance: over repeated samples, an increase in the estimated slope (\hat{\beta}_1) tends to be accompanied by a decrease in the estimated intercept (\hat{\beta}_0), and vice versa. Therefore, for a given spread of the x's, the farther the predictor mean (\bar{x}) is from zero, the stronger the linear dependence between \hat{\beta}_0 and \hat{\beta}_1. Said another way, when the data are centered far from zero, any change in the slope estimate has to be offset by a change in the intercept estimate so that the fitted line still passes through the observed range of x.

The expression above implies that it is possible to “force” the covariance between \hat{\beta}_0 and \hat{\beta}_1 to be zero. How would you do this? (Hint: it involves processing the X in some way prior to fitting the model.)

Building on part a), you can make the covariance of \hat{\beta}_0 and \hat{\beta}_1 equal 0 by centering the predictor variable, x, before fitting the model. Centering means replacing each x_i with x_i - \bar{x}, so that the mean of the new predictor is 0. When you do this, the slope estimator is uncorrelated with the intercept estimator, because the intercept now represents the mean response at x = \bar{x} rather than at x = 0. Its estimator is then \bar{y}, which is uncorrelated with \hat{\beta}_1 as discussed in the first part of this problem.
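The effect of centering can be checked numerically. Under the stated assumptions the covariance matrix of the OLS estimators is \sigma^2 (X'X)^{-1}, so the off-diagonal entry of (X'X)^{-1} is Cov(\hat{\beta}_0, \hat{\beta}_1) / \sigma^2. The sketch below compares that entry before and after centering. The predictor values are hypothetical, chosen only so that \bar{x} is far from zero, and the names cov_raw and cov_ctr are introduced here.

```r
# Hypothetical predictor values with a mean far from zero (not from either data set)
x   <- c(15.5, 16.0, 16.5, 17.0, 18.0, 19.0)
Sxx <- sum((x - mean(x))^2)

# Off-diagonal entry of (X'X)^{-1} with the raw predictor
X_raw   <- cbind(1, x)
cov_raw <- solve(crossprod(X_raw))[1, 2]

# Same entry after centering the predictor at its mean
X_ctr   <- cbind(1, x - mean(x))
cov_ctr <- solve(crossprod(X_ctr))[1, 2]

# cov_raw should match -mean(x) / Sxx, and cov_ctr should be numerically zero
c(raw = cov_raw, formula = -mean(x) / Sxx, centered = cov_ctr)
```

Centering changes only the meaning of the intercept; the slope estimate and its standard error are unaffected.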