
Assignment 2

Dennis Espejo
EDUC 7667: Regression and Analysis of Variance
Eric Grau
February 20, 2026

library("ggpubr")
library("ggplot2")
library("sf")
library("dplyr")
library("tidyr")
library("viridis")
library(patchwork)

Question 2

Question 2A

Figure 1: Regression line handdrawn sketches

All three diagrams represent a positive linear relationship, though the strength of the relationship varies across them. The first diagram shows the strongest positive linear relationship, the second a weaker one, and the third the weakest. Here, a stronger relationship means the points cluster more tightly around the regression line, and a weaker one means they are more widely scattered about it.

Question 2B Part 1

library(knitr)

df <- read.csv("/Users/dennisespejo/Desktop/STATS HW - Sheet2.csv")

model <- lm(SBP ~ QUET, data = df)

kable(summary(model)$coefficients,
      caption = "Table 1: R Output for Regression of SBP on QUET")

Table 1: R Output for Regression of SBP on QUET
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	70.57640	12.321868	5.727736	3.0e-06
QUET	21.49167	3.545147	6.062278	1.2e-06


After fitting the model in R with the lm function (output above), I get a slope of 21.49 and a y-intercept of 70.58.

Making the regression line: \[ \widehat{\text{SBP}} = 70.58 + 21.49(\text{QUET}) \]

Question 2B Part 2

The following scatter plot, produced in R, shows SBP against QUET with the fitted regression line overlaid.

plot(df$QUET, df$SBP, main = "First Scatter plot with regression line",
     xlab = "QUET", ylab = "SBP")




abline(model, col = "red", lwd = 2)

Figure 2: Plot with regression of SBP on QUET

The resulting scatterplot roughly matches the regression line I initially sketched. Both this regression line and my sketch depict a linear relationship with a positive slope.

Question 2B Part 3

The null hypothesis for the slope is \[ H_0: \beta_1 = \beta_1^{(0)}, \] where $\beta_1^{(0)}$ denotes the hypothesized value of the slope parameter (in this case, $0$). This hypothesis can be tested using the following test statistic:

\[ T = \frac{\hat{\beta}_1 - \beta_1^{(0)}} {\dfrac{S_{Y \mid X}}{S_X \sqrt{n - 1}}}. \]

predict_y <- function(x, B0, B1) {
  B0 + B1 * x
}



B0 <- coef(model)[1]
B1 <- coef(model)[2]

df$predictedvalues <- predict_y(df$QUET, B0, B1)



df$squareddifference <- (df$SBP - df$predictedvalues)^2



SSE <- sum(df$squareddifference)

Sofyatxsquared <- SSE/30


Sofyatx <- sqrt(Sofyatxsquared)

meanofx <- mean(df$QUET)


df$QUETvariancefrommean <- (df$QUET - meanofx)^2


Squaredxvariance <- sum(df$QUETvariancefrommean) / 31

Xstandarderror <- sqrt(Squaredxvariance)

denom <- Xstandarderror * sqrt(31)

Standarderrorofslope <-Sofyatx/denom

tvalue <- 21.49 / Standarderrorofslope  # slope hard-coded from Table 1, rounded to 21.49

tvalue

## [1] 6.061808

After performing the calculations above, the resulting test statistic was $t \approx 6.06$, which matches the value reported in the regression output, confirming the correctness of the manual computation.

The degrees of freedom for the test are \[ \text{df} = n - 2 = 32 - 2 = 30. \]

For a two-tailed test at the $\alpha = 0.05$ level, the critical value is \[ t_{0.025, 30} \approx 2.042. \]

Because $|6.06| > 2.042$, we reject $H_0$ and conclude that the slope is statistically significantly different from zero.
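The critical value and decision above can also be reproduced without a t table. The following is a minimal sketch, assuming only the slope t statistic reported in Table 1 and $n = 32$; the object names (`t_stat`, `df_res`, `t_crit`, `p_val`, `reject`) are illustrative and do not appear in the original analysis.

```r
# Values taken from Table 1 (slope row) and the text; n = 32 observations.
t_stat <- 6.062278
n      <- 32
df_res <- n - 2                      # residual degrees of freedom for simple regression

# Two-sided critical value at alpha = 0.05 (about 2.042)
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = df_res)

# Two-sided p-value for the observed statistic
p_val  <- 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

# Decision rule: reject H0 when |t| exceeds the critical value
reject <- abs(t_stat) > t_crit
reject
```

Using `qt()` rather than a rounded table value keeps the critical value consistent with what `predict()` and `confint()` use internally.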

For the $y$-intercept, the test statistic is computed using

\[ t = \frac{\hat{\beta}_0 - \beta_0^{(0)}} {\text{SE}(\hat{\beta}_0)}, \]

where the numerator represents the estimated intercept, $\hat{\beta}_0$, minus the hypothesized intercept value, $\beta_0^{(0)}$ (in this case, $0$).

The standard error of the estimated intercept is given by

\[ \text{SE}(\hat{\beta}_0) = S_{Y \mid X} \sqrt{ \frac{1}{n} + \frac{\bar{x}^2}{(n-1) S_X^2} }, \]

where $S_{Y \mid X}$ denotes the estimated standard deviation of $Y$ given $X$, $\bar{x}$ is the sample mean of $X$, $S_X^2$ is the sample variance of $X$, and $n$ is the sample size.

tvalueforyintercept <- (70.57640-0)/(Sofyatx * sqrt( (1/32) + (meanofx^2/ ((32-1) * Squaredxvariance ) 
) ) )

tvalueforyintercept

## [1] 5.727735

After performing the calculations above, the resulting test statistic was $t \approx 5.73$, which matches the value reported in the regression output, confirming the correctness of the manual computation.

The degrees of freedom for the test are \[ \text{df} = n - 2 = 32 - 2 = 30. \]

For a two-tailed test at the $\alpha = 0.05$ level, the critical value is \[ t_{0.025, 30} \approx 2.042. \]

Because $|5.73| > 2.042$, we reject $H_0$ and conclude that the y-intercept is statistically significantly different from zero.

Question 2B Part 4

The estimated slope of $\hat{\beta}_1 = 21.49$ indicates that for each one-unit increase in body size (QUET), predicted systolic blood pressure (SBP) increases by approximately $21.49$ units.

A hypothesis test of \[ H_0: \beta_1 = 0 \] versus \[ H_a: \beta_1 \neq 0 \] yields a p-value less than $0.05$. Therefore, we reject the null hypothesis and conclude that there is statistically significant evidence of a positive linear association between body size and blood pressure in the population.

Question 2B Part 5

The prediction band is calculated using the following formula:

\[ \hat{y}_0 \pm t_{\alpha/2,\,n-2} \; s_{Y|X} \sqrt{ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} } \]

newdata <- data.frame(
  QUET = seq(min(df$QUET), max(df$QUET), length.out = 100)
)


pred_intervals <- predict(model, newdata,
                          interval = "prediction",
                          level = 0.95)


plot(df$QUET, df$SBP,
     pch = 16,
     col = "darkgray",
     xlab = "QUET",
     ylab = "SBP",
     main = "Regression Line with 95% Prediction Band")


polygon(
  c(newdata$QUET, rev(newdata$QUET)),
  c(pred_intervals[, "lwr"], rev(pred_intervals[, "upr"])),
  col = rgb(0, 0, 1, 0.2),
  border = NA
)


lines(newdata$QUET, pred_intervals[, "fit"],
      col = "red", lwd = 2)

Figure 3: Plot with 95% Prediction Band

Using the R functions above, we obtain the 95% prediction band plotted around the previous regression line.

Question 2B Part 6

Using the prediction band equation given above, I coded each term directly, reusing previously stored objects such as “Sofyatx”, and obtained the prediction interval displayed below.

predictionx <- 3.4

yestimate <- predict_y(predictionx, B0, B1)

# Exact critical value t_{0.025, 30} rather than the rounded 2.042
criticalt <- qt(0.975, df = 30)

# Sum of squares of QUET about its mean: (n - 1) * S_X^2
Sxx <- 31 * Squaredxvariance

halfwidth <- criticalt * Sofyatx * sqrt(1 + (1/32) + ((predictionx - meanofx)^2 / Sxx))

upperband <- yestimate + halfwidth

lowerband <- yestimate - halfwidth

upperband

## (Intercept) 
##    163.9989

lowerband

## (Intercept) 
##    123.2973

To check my calculations, I used R's built-in predict function to get the lower and upper limits.

x0 <- data.frame(QUET = 3.4)

pred_x0 <- predict(model, newdata = x0, interval = "prediction", level = 0.95)

pred_x0

##        fit      lwr      upr
## 1 143.6481 123.2973 163.9989

R's built-in output matches my calculation, giving a 95% prediction interval at QUET = 3.4 of about 123.30 (lower) to 164.00 (upper).

Question 2B Part 7

As shown in the scatterplot of SBP versus QUET, the relationship appears approximately linear.

model <- lm(SBP ~ QUET, data = df)


plot(model$fitted.values, resid(model),
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values",
     ylab = "Residuals",
     pch = 19)

abline(h = 0, col = "red", lwd = 2)

The residuals versus fitted values plot does not display systematic curvature or a funnel-shaped pattern.

model <- lm(SBP ~ QUET, data = df)


qqnorm(resid(model),
       main = "Normal Q-Q Plot of Residuals",
       pch = 19)

qqline(resid(model), col = "red", lwd = 2)

Additionally, the Q–Q plot indicates approximate normality of the residuals. Based on these diagnostics, none of the straight-line regression assumptions (linearity, constant variance, normality of errors) appear to be violated for the regression of SBP on QUET.

Question 2C Part 1

library(knitr)

df <- read.csv("/Users/dennisespejo/Desktop/STATS HW - Sheet2.csv")

model <- lm(QUET ~ Age, data = df)

kable(summary(model)$coefficients,
      caption = "Table 2: R Output for Regression of QUET on Age")

Table 2: R Output for Regression of QUET on Age
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	0.3864519	0.4176903	0.9252115	0.362239
Age	0.0573642	0.0077799	7.3733738	0.000000

After fitting the model in R with the lm function (output above), I get a slope of 0.057 and a y-intercept of 0.39.

Making the regression line: \[ \widehat{\text{QUET}} = 0.39 + 0.057(\text{Age}) \]

Question 2C Part 2

The following scatter plot, produced in R, shows QUET against Age with the fitted regression line overlaid.

plot(df$Age, df$QUET, main = "Scatter plot with regression line",
     xlab = "Age", ylab = "QUET")




abline(model, col = "red", lwd = 2)

Figure 4: Plot with regression of QUET on Age

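The resulting scatterplot roughly matches the regression line I initially sketched. Both depict a linear relationship with a positive slope, although the slope here is numerically much smaller than in the first plot of SBP regressed on QUET. Because the two slopes are measured in different units, this difference does not by itself indicate a weaker relationship.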

Question 2C Part 3

The null hypothesis for the slope is \[ H_0: \beta_1 = \beta_1^{(0)}, \] where $\beta_1^{(0)}$ denotes the hypothesized value of the slope parameter (in this case, $0$).

This hypothesis can be tested using the following test statistic:

\[ T = \frac{\hat{\beta}_1 - \beta_1^{(0)}} {\dfrac{S_{Y \mid X}}{S_X \sqrt{n - 1}}}. \]

predict_y <- function(x, B0, B1) {
  B0 + B1 * x
}

B0 <- coef(model)[1]
B1 <- coef(model)[2]

df$predictedvalues <- predict_y(df$Age, B0, B1)



df$squareddifference <- (df$QUET - df$predictedvalues)^2



SSE <- sum(df$squareddifference)

Sofyatxsquared <- SSE/30


Sofyatx <- sqrt(Sofyatxsquared)

meanofx <- mean(df$Age)


df$Agevariancefrommean <- (df$Age - meanofx)^2


Squaredxvariance <- sum(df$Agevariancefrommean) / 31

Xstandarderror <- sqrt(Squaredxvariance)

denom <- Xstandarderror * sqrt(31)

Standarderrorofslope <-Sofyatx/denom

tvalue= B1/Standarderrorofslope

tvalue

##      Age 
## 7.373374

After performing the calculations above, the resulting test statistic was $t \approx 7.37$, which matches the value reported in the regression output, confirming the correctness of the manual computation.

The degrees of freedom for the test are \[ \text{df} = n - 2 = 32 - 2 = 30. \]

For a two-tailed test at the $\alpha = 0.05$ level, the critical value is \[ t_{0.025, 30} \approx 2.042. \]

Because $|7.37| > 2.042$, we reject $H_0$ and conclude that the slope is statistically significantly different from zero, although the slope is numerically small (about 0.057).

For the $y$-intercept, the test statistic is computed using

\[ t = \frac{\hat{\beta}_0 - \beta_0^{(0)}}{\mathrm{SE}(\hat{\beta}_0)}. \]

The standard error of the estimated intercept is given by

\[ \mathrm{SE}(\hat{\beta}_0) = S_{Y\mid X} \sqrt{ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2} }, \]

where $S_{Y\mid X}$ is the estimated standard deviation of $Y$ given $X$, $\bar{x}$ is the sample mean of $X$, and $n$ is the sample size.

tvalueforyintercept <- (B0-0)/(Sofyatx * sqrt( (1/32) + (meanofx^2/ ((32-1) * Squaredxvariance ) 
) ) )

tvalueforyintercept

## (Intercept) 
##   0.9252115

After performing the calculations above, the resulting test statistic was $t \approx .925$, which matches the value reported in the regression output, confirming the correctness of the manual computation.

The degrees of freedom for the test are \[ \text{df} = n - 2 = 32 - 2 = 30. \]

For a two-tailed test at the $\alpha = 0.05$ level, the critical value is \[ t_{0.025, 30} \approx 2.042. \]

Because $|0.925| < 2.042$, we fail to reject $H_0$ and conclude that the y-intercept is not statistically significantly different from zero.

Question 2C Part 4

In the regression of QUET on Age, the estimated slope coefficient is $\hat{\beta}_1 = 0.0574$ with $p < .001$. This indicates a statistically significant positive association between Age and QUET. However, the magnitude of the slope suggests that the relationship is substantively small, as each one-unit increase in Age is associated with only a 0.0574-unit increase in QUET.

Question 2D Part 1

The least squares estimate of the slope ($\hat{\beta}_1$) for the straight line regression of SBP (Y) on AGE (X) can be found using the following formula:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=1}^{n}(x_i - \bar{x})^2} \]

and the y-intercept ($\hat{\beta}_0$) can be calculated with:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
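The two formulas above can be checked numerically against `lm()`. The following is a minimal sketch that uses R's built-in `cars` dataset purely as a stand-in, since the assignment data file is not reproduced here; the names `x`, `y`, `b0`, `b1`, and `fit` are illustrative.

```r
# `cars` ships with R and is used only as a stand-in for the assignment data.
x <- cars$speed
y <- cars$dist

# Least squares slope: sum of cross-products over sum of squares of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Least squares intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Reference fit from lm() for comparison
fit <- lm(dist ~ speed, data = cars)

# TRUE when the hand-computed estimates agree with lm()
all.equal(unname(coef(fit)), c(b0, b1))
```

For the assignment itself, the same two lines would apply with `x <- df$Age` and `y <- df$SBP`.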

library(knitr)

df <- read.csv("/Users/dennisespejo/Desktop/STATS HW - Sheet2.csv")

model <- lm(SBP ~ Age, data = df)

kable(summary(model)$coefficients,
      caption = "Table 3: R Output for Regression of SBP on Age")

Table 3: R Output for Regression of SBP on Age
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	59.09163	12.8162615	4.610676	6.98e-05
Age	1.60450	0.2387159	6.721378	2.00e-07

After fitting the model in R with the lm function (output above), I get a slope of 1.60 and a y-intercept of 59.09.

Making the regression line: \[ \widehat{\text{SBP}} = 59.09 + 1.6(\text{Age}) \]

Question 2D Part 2

The following scatter plot, produced in R, shows SBP against Age with the fitted regression line overlaid.

plot(df$Age, df$SBP, main = "Scatter plot with regression line",
     xlab = "Age", ylab = "SBP")




abline(model, col = "red", lwd = 2)

Figure 5: Plot with regression of SBP on Age

The resulting scatterplot roughly matches the regression line I initially sketched. Both this regression line and my sketch depict a linear relationship with a positive slope.

Question 2D Part 3

As before, the null hypothesis $H_0: \beta_1 = 0$ is tested using the statistic

\[ T = \frac{\hat{\beta}_1 - \beta_1^{(0)}} {\dfrac{S_{Y \mid X}}{S_X \sqrt{n - 1}}}. \]

The numerator represents the difference between the estimated slope and the hypothesized value of the slope. In this case, the hypothesized value is $0$, so the numerator becomes $\hat{\beta}_1 - 0$.

The quantity $S_{Y \mid X}$ represents the estimated standard deviation of $Y$ given $X$, calculated as

\[ S_{Y \mid X} = \sqrt{ \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} }. \]

The term $S_X$ denotes the sample standard deviation of $X$, given by

\[ S_X = \sqrt{ \frac{\sum (x_i - \bar{x})^2}{n - 1} }. \]

predict_y <- function(x, B0, B1) {
  B0 + B1 * x
}

B0 <- coef(model)[1]
B1 <- coef(model)[2]

df$predictedvalues <- predict_y(df$Age, B0, B1)



df$squareddifference <- (df$SBP - df$predictedvalues)^2



SSE <- sum(df$squareddifference)

Sofyatxsquared <- SSE/30


Sofyatx <- sqrt(Sofyatxsquared)

meanofx <- mean(df$Age)


df$Agevariancefrommean <- (df$Age - meanofx)^2


Squaredxvariance <- sum(df$Agevariancefrommean) / 31

Xstandarderror <- sqrt(Squaredxvariance)

denom <- Xstandarderror * sqrt(31)

Standarderrorofslope <-Sofyatx/denom

tvalue= B1/Standarderrorofslope

tvalue

##      Age 
## 6.721378

After performing the calculations above, the resulting test statistic was $t \approx 6.72$, which matches the value reported in the regression output, confirming the correctness of the manual computation.

The degrees of freedom for the test are \[ \text{df} = n - 2 = 32 - 2 = 30. \]

For a two-tailed test at the $\alpha = 0.05$ level, the critical value is \[ t_{0.025, 30} \approx 2.042. \]

Because $|6.72| > 2.042$, we reject $H_0$ and conclude that the slope is statistically significantly different from zero.

For the $y$-intercept, the test statistic is computed using

\[ t = \frac{\hat{\beta}_0 - \beta_0^{(0)}} {\text{SE}(\hat{\beta}_0)}, \]

where the numerator represents the estimated intercept, $\hat{\beta}_0$, minus the hypothesized intercept value, $\beta_0^{(0)}$ (in this case, $0$).

The standard error of the estimated intercept is given by

\[ \text{SE}(\hat{\beta}_0) = S_{Y \mid X} \sqrt{ \frac{1}{n} + \frac{\bar{x}^2}{(n-1) S_X^2} }, \]

where $S_{Y \mid X}$ denotes the estimated standard deviation of $Y$ given $X$, $\bar{x}$ is the sample mean of $X$, $S_X^2$ is the sample variance of $X$, and $n$ is the sample size.

tvalueforyintercept <- (B0-0)/(Sofyatx * sqrt( (1/32) + (meanofx^2/ ((32-1) * Squaredxvariance ) 
) ) )

tvalueforyintercept

## (Intercept) 
##    4.610676

After performing the calculations above, the resulting test statistic was $t \approx 4.61$, which matches the value reported in the regression output, confirming the correctness of the manual computation.

The degrees of freedom for the test are \[ \text{df} = n - 2 = 32 - 2 = 30. \]

For a two-tailed test at the $\alpha = 0.05$ level, the critical value is \[ t_{0.025, 30} \approx 2.042. \]

Because $|4.61| > 2.042$, we reject $H_0$ and conclude that the y-intercept is significantly different from zero.

Question 2D Part 4

In the regression of SBP on Age, the estimated slope coefficient is $\hat{\beta}_1 = 1.6045$ with $p < .001$, indicating a statistically significant positive association. The coefficient implies that each one-unit increase in Age is associated with an average increase of 1.6045 units in SBP.

Question 2E Part 1

model <- lm(SBP ~ SMK, data = df)

kable(summary(model)$coefficients,
      caption = "Table 4: R Output for Regression of SBP on SMK")

Table 4: R Output for Regression of SBP on SMK
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	140.800000	3.661472	38.454477	0.0000000
SMK	7.023529	5.023498	1.398135	0.1723245

The least-squares estimates from the regression of SBP on SMK are:

\[ \hat{\beta}_1 = 7.02 \]

\[ \hat{\beta}_0 = 140.8 \]

Thus, the estimated regression equation is

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]

Substituting the estimated values:

\[ \widehat{\text{SBP}} = 140.8 + 7.02(\text{SMK}) \]

Question 2E Part 2

tapply(df$SBP, df$SMK, mean, na.rm = TRUE)

##        0        1 
## 140.8000 147.8235

The value of $\hat{\beta}_0$ is equal to the mean SBP for non-smokers (140.8), and the mean SBP for smokers (147.8235) is equal to $\hat{\beta}_0 + \hat{\beta}_1$. When regressing SBP on the binary variable SMK (where $0 =$ non-smoker and $1 =$ smoker), the estimated regression model is

\[ \widehat{\text{SBP}} = \hat{\beta}_0 + \hat{\beta}_1(\text{SMK}). \]

Because SMK is coded as 0 and 1, the intercept represents the predicted value of SBP when $\text{SMK} = 0$. Thus,

\[ \hat{\beta}_0 = \bar{Y}_{SMK=0}, \]

which means the intercept equals the mean SBP for non-smokers.

When $\text{SMK} = 1$, the predicted value becomes

\[ \widehat{\text{SBP}} = \hat{\beta}_0 + \hat{\beta}_1, \]

which equals the mean SBP for smokers. Therefore,

\[ \hat{\beta}_1 = \bar{Y}_{SMK=1} - \bar{Y}_{SMK=0}. \]

Thus, when a binary predictor is used in linear regression, the intercept equals the mean of the reference group ($SMK=0$), and the slope equals the difference in group means.
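The relationship between the coefficients and the group means can be illustrated on a small example. The following is a minimal sketch on hypothetical toy data (not the assignment data); the data frame `toy` and its columns `y` and `g` are made up for illustration.

```r
# Hypothetical toy data: y is an outcome, g is a 0/1 group indicator.
toy <- data.frame(
  y = c(120, 130, 125, 140, 150, 145, 155),
  g = c(0, 0, 0, 1, 1, 1, 1)
)

# Regression of the outcome on the binary indicator
fit_toy <- lm(y ~ g, data = toy)

# Group means computed directly
group_means <- tapply(toy$y, toy$g, mean)

intercept <- unname(coef(fit_toy)[1])   # equals the mean of the g = 0 group
slope     <- unname(coef(fit_toy)[2])   # equals mean(g = 1) minus mean(g = 0)

c(intercept = intercept, slope = slope)
group_means
```

The same comparison for the assignment is what the `tapply(df$SBP, df$SMK, mean, na.rm = TRUE)` call above provides.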

Question 2E Part 3

For the regression of systolic blood pressure (SBP) on smoking status (SMK), we test whether the slope for SMK is zero:

\[ H_0:\beta_1 = 0 \qquad \text{vs.} \qquad H_a:\beta_1 \neq 0. \]

Using the simple linear regression model

\[ \text{SBP}_i = \beta_0 + \beta_1 \text{SMK}_i + \varepsilon_i, \]

the test statistic for the slope is

\[ t=\frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}. \]

From the fitted model (with $n=32$ observations), the estimated slope and its standard error are

\[ \hat{\beta}_1 = 7.0235, \qquad SE(\hat{\beta}_1)=5.023. \]

Therefore,

\[ t=\frac{7.0235}{5.023}=1.398. \]

The test uses a $t$ distribution with

\[ df = n-2 = 32-2 = 30. \]

The corresponding two-sided p-value is

\[ p = 0.172. \]

Because $p > 0.05$, we fail to reject $H_0$ at the $\alpha=0.05$ level. Thus, there is insufficient evidence that SBP differs by smoking status in this sample (i.e., insufficient evidence that the slope relating SBP to SMK is nonzero).

Question 2E Part 4

Yes. The test in part (e)(3) for the regression of SBP on SMK is equivalent to the usual two-sample pooled $t$ test for equality of two population means (assuming equal but unknown variances), because when $\text{SMK}\in\{0,1\}$ the simple linear regression model

\[ \text{SBP}_i=\beta_0+\beta_1\text{SMK}_i+\varepsilon_i \]

implies two fitted group means:

\[ \widehat{\mu}_0=\hat{\beta}_0 \quad (\text{when } \text{SMK}=0), \qquad \widehat{\mu}_1=\hat{\beta}_0+\hat{\beta}_1 \quad (\text{when } \text{SMK}=1), \]

so that

\[ \hat{\beta}_1=\widehat{\mu}_1-\widehat{\mu}_0=\bar{Y}_1-\bar{Y}_0. \]

Therefore, testing $H_0:\beta_1=0$ is the same as testing $H_0:\mu_1-\mu_0=0$.

From the data:

\[ n_0=15,\quad \bar{Y}_0=140.8, \qquad n_1=17,\quad \bar{Y}_1=147.8235. \]

Thus the difference in sample means is

\[ \bar{Y}_1-\bar{Y}_0 = 147.8235-140.8 = 7.0235, \]

which matches the regression slope $\hat{\beta}_1=7.0235$.

The pooled standard deviation is

\[ s_p=\sqrt{\frac{(n_0-1)s_0^2+(n_1-1)s_1^2}{n_0+n_1-2}} =\sqrt{\frac{(15-1)(12.9018)^2+(17-1)(15.2120)^2}{30}} =14.1808. \]

The standard error of the difference in means is

\[ SE(\bar{Y}_1-\bar{Y}_0)=s_p\sqrt{\frac{1}{n_0}+\frac{1}{n_1}} =14.1808\sqrt{\frac{1}{15}+\frac{1}{17}} =5.0235. \]

Hence, the pooled two-sample $t$ statistic is

\[ t=\frac{\bar{Y}_1-\bar{Y}_0}{SE(\bar{Y}_1-\bar{Y}_0)} =\frac{7.0235}{5.0235} =1.3981, \]

with degrees of freedom

\[ df=n_0+n_1-2=15+17-2=30, \]

and the two-sided p-value is

\[ p=0.1723. \]

In the regression slope test,

\[ t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} =\frac{7.0235}{5.0235} =1.3981, \qquad df=30, \qquad p=0.1723. \]

Since the test statistic, degrees of freedom, and p-value match exactly, the regression test for $H_0:\beta_1=0$ is equivalent to the pooled two-sample $t$ test for $H_0:\mu_1=\mu_0$.
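The pooled two-sample calculation can be reproduced from the summary statistics alone. The following is a minimal sketch, assuming the group sizes, means, and standard deviations quoted above; the object names (`sp`, `se_diff`, `t_pooled`, `p_pooled`) are illustrative.

```r
# Summary statistics quoted in the text (group 0 = non-smokers, group 1 = smokers)
n0 <- 15; mean0 <- 140.8;    sd0 <- 12.9018
n1 <- 17; mean1 <- 147.8235; sd1 <- 15.2120

# Pooled standard deviation under the equal-variance assumption
sp <- sqrt(((n0 - 1) * sd0^2 + (n1 - 1) * sd1^2) / (n0 + n1 - 2))

# Standard error of the difference in sample means
se_diff <- sp * sqrt(1 / n0 + 1 / n1)

# Pooled t statistic and its two-sided p-value on n0 + n1 - 2 degrees of freedom
t_pooled <- (mean1 - mean0) / se_diff
p_pooled <- 2 * pt(abs(t_pooled), df = n0 + n1 - 2, lower.tail = FALSE)

c(sp = sp, se = se_diff, t = t_pooled, p = p_pooled)
```

With the raw data available, `t.test(SBP ~ SMK, data = df, var.equal = TRUE)` should run the same pooled test; the sign of its t statistic may be reversed because it subtracts the groups in factor-level order.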

Question 4

Question 4 Part A

The figure above is my sketched regression line, with the line hitting the y-axis at around 52.273 (the y-intercept), and a negative slope.

Question 4 Part B

\[ \hat{Y} = 52.273 \text{ when } X = 0 \]

Since IQ = 0 is outside the observed range of the data, this represents extrapolation and has no practical interpretation in this context.

Question 4 Part C

Given \[ \hat{\beta}_1 = -0.249, \quad S_{Y \mid X} = 7.704, \quad S_X = 16.192, \quad n = 18, \]

the degrees of freedom are

\[ df = n - 2 = 18 - 2 = 16. \]

The standard error of the slope is

\[ SE(\hat{\beta}_1) = \frac{S_{Y \mid X}}{S_X \sqrt{n - 1}} = \frac{7.704}{16.192\sqrt{17}}. \]

Since

\[ \sqrt{17} = 4.123, \]

we obtain

\[ SE(\hat{\beta}_1) = \frac{7.704}{16.192(4.123)} = \frac{7.704}{66.76} = 0.115. \]

For a 95% confidence interval, the critical value is

\[ t_{0.025,16} = 2.12. \]

The margin of error is therefore

\[ 2.12(0.115) = 0.244. \]

Thus, the 95% confidence interval for the slope is

\[ \hat{\beta}_1 \pm t_{0.025,16} \, SE(\hat{\beta}_1) = -0.249 \pm 0.244, \]

which yields

\[ (-0.493,\,-0.005). \]
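The interval above can be recomputed without rounding the intermediate steps. The following is a minimal sketch, assuming only the summary values given in this part; the object names are illustrative. Because it carries full precision, the endpoints may differ from the hand calculation in the third decimal place.

```r
# Summary values given in the problem
b1   <- -0.249    # estimated slope
s_yx <- 7.704     # S_{Y|X}
s_x  <- 16.192    # S_X
n    <- 18

# Standard error of the slope: S_{Y|X} / (S_X * sqrt(n - 1))
se_b1 <- s_yx / (s_x * sqrt(n - 1))

# Two-sided 95% critical value on n - 2 degrees of freedom (about 2.12)
t_crit <- qt(0.975, df = n - 2)

# Confidence interval: estimate plus or minus critical value times standard error
ci <- b1 + c(-1, 1) * t_crit * se_b1
ci
```

Both endpoints remain below zero, so the conclusion drawn from the hand-computed interval is unchanged.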

Question 4 Part D

Since $0$ is not contained in the interval $(-0.493,\,-0.005)$, we reject $H_0 : \beta_1 = 0$ at $\alpha = 0.05$.

There is statistically significant evidence that IQ is negatively associated with the delinquency index.

Question 4 Part E

When the outlier is removed:

\[ \hat{\beta}_0 = 70.846, \quad \hat{\beta}_1 = -0.444. \]

The slope becomes more negative (from -0.249 to -0.444), indicating a stronger negative relationship.

Thus, the outlier weakens the magnitude of the estimated IQ–DI relationship.

Question 4 Part F

Given: \[ \hat{\beta}_1 = -0.444, \quad S_{Y|X} = 4.933, \quad S_X = 14.693, \quad n = 17. \]

Degrees of freedom: \[ df = 15. \]

Standard error: \[ SE(\hat{\beta}_1) = \frac{4.933}{14.693\sqrt{16}} = \frac{4.933}{14.693(4)} = \frac{4.933}{58.772} = 0.084. \]

Test statistic: \[ t = \frac{-0.444}{0.084} = -5.29. \]

Critical value: \[ t_{.025,15} = 2.13. \]

Since $|t| = 5.29 > 2.13$, we reject $H_0$.

There is strong evidence of a negative relationship.
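The test with the outlier removed can be checked the same way. The following is a minimal sketch, assuming only the summary values given in this part; the object names are illustrative.

```r
# Summary values given in the problem (outlier removed)
b1   <- -0.444    # estimated slope
s_yx <- 4.933     # S_{Y|X}
s_x  <- 14.693    # S_X
n    <- 17

# Standard error of the slope and the t statistic for H0: beta_1 = 0
se_b1  <- s_yx / (s_x * sqrt(n - 1))
t_stat <- (b1 - 0) / se_b1

# Two-sided critical value and p-value on n - 2 = 15 degrees of freedom
t_crit <- qt(0.975, df = n - 2)
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(se = se_b1, t = t_stat, t_crit = t_crit, p = p_val)
```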

Question 4 Part G

Since the slope is negative and statistically significant, we conclude that the delinquency index decreases as IQ increases.

The evidence becomes even stronger when the outlier is removed.

Question 8

Question 8 Part A

library(knitr)

df2 <- read.csv("/Users/dennisespejo/Desktop/question8_sal_cgpa 2.csv")

model2 <- lm(SAL ~ CGPA, data = df2)



kable(summary(model2)$coefficients,
      caption = "Table: R Output for Regression of SAL on CGPA")

Table: R Output for Regression of SAL on CGPA
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	435.9236	1337.8597	0.3258365	0.7469707
CGPA	3630.5613	465.7687	7.7947722	0.0000000

\[ \hat{\beta}_1 = 3630.561, \qquad \hat{\beta}_0 = 435.924. \]

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]

\[ \boxed{\hat{Y} = 435.924 + 3630.561X} \]

The positive slope indicates that salary increases as CGPA increases.

plot(df2$CGPA, df2$SAL, main = "First Scatter plot with regression line",
     xlab = "CGPA", ylab = "SAL")




abline(model2, col = "red", lwd = 2)

Figure: Plot with regression of SAL on CGPA

The scatterplot shows a strong positive relationship between cumulative grade-point average (CGPA) and starting salary.

Question 8 Part B

Based on the scatterplot above, the assumption of linearity does not appear to be fully satisfied. Although there is a positive relationship between CGPA and starting salary, the pattern of points shows noticeable curvature. The salaries increase quickly at lower CGPA values, but the rate of increase appears to slow down at higher CGPA levels, creating a leveling effect.

This pattern suggests that the true relationship may be nonlinear rather than strictly linear. A curved model may better capture the relationship between CGPA and starting salary.

Question 8 Part C

confint(model2, "CGPA", level = 0.95)

##         2.5 %   97.5 %
## CGPA 2676.477 4584.645

The 95% confidence interval for the slope $\beta_1$ is computed using

\[ \hat{\beta}_1 \pm t_{.025,\,n-2} \, SE(\hat{\beta}_1). \]

For this model,

\[ n = 30, \]

so the degrees of freedom are

\[ df = n - 2 = 28. \]

The standard error of the slope is

\[ SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}, \qquad s = \sqrt{\frac{SSE}{n-2}}, \qquad S_{xx} = \sum (x_i - \bar{x})^2. \]

Using the computed values from the dataset,

\[ \hat{\beta}_1 = 3630.561, \qquad SE(\hat{\beta}_1) = 465.769, \qquad t_{.025,28} = 2.048. \]

Thus, the 95% confidence interval is

\[ 3630.561 \pm 2.0484(465.769) \approx 3630.561 \pm 954.08, \]

\[ (2676.477,\;4584.645). \]

This means we are 95% confident that for each one-point increase in CGPA, the mean salary increases by between approximately $2,676 and $4,585.

Question 8 Part D

b1  <- coef(model2)["CGPA"]
se1 <- summary(model2)$coefficients["CGPA","Std. Error"]

resid_df <- df.residual(model2)  # named resid_df so the data frame df is not overwritten


t_value <- (b1 - 4000) / se1


p_value <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)

t_value

##       CGPA 
## -0.7931806

p_value

##      CGPA 
## 0.4343428

\[ H_0:\beta_1 = 4000 \qquad H_a:\beta_1 \neq 4000 \]

\[ \hat{\beta}_1 = 3630.561, \qquad SE(\hat{\beta}_1) = 465.769, \qquad df = 28. \]

The test statistic is

\[ t = \frac{\hat{\beta}_1 - 4000}{SE(\hat{\beta}_1)} = \frac{3630.561 - 4000}{465.769} = -0.793. \]

The critical value at $\alpha = 0.05$ is

\[ t_{.025,28} = 2.048. \]

Since

\[ |t| = 0.793 < 2.048, \]

we fail to reject $H_0$: there is insufficient evidence that the slope differs from 4000.

Question 8 Part E

x_seq <- seq(min(df2$CGPA), max(df2$CGPA), length.out = 100)

conf_band <- predict(model2,
                     newdata = data.frame(CGPA = x_seq),
                     interval = "confidence",
                     level = 0.95)


pred_band <- predict(model2,
                     newdata = data.frame(CGPA = x_seq),
                     interval = "prediction",
                     level = 0.95)


plot(df2$CGPA, df2$SAL,
     main = "Regression with 95% Confidence and Prediction Bands",
     xlab = "CGPA",
     ylab = "SAL",
     pch = 19)


lines(x_seq, conf_band[,"fit"], col = "red", lwd = 2)


lines(x_seq, conf_band[,"lwr"], col = "blue", lty = 2, lwd = 2)
lines(x_seq, conf_band[,"upr"], col = "blue", lty = 2, lwd = 2)


lines(x_seq, pred_band[,"lwr"], col = "darkgreen", lty = 3, lwd = 2)
lines(x_seq, pred_band[,"upr"], col = "darkgreen", lty = 3, lwd = 2)

legend("topleft",
       legend = c("Regression Line",
                  "95% Confidence Band",
                  "95% Prediction Band"),
       col = c("red", "blue", "darkgreen"),
       lty = c(1,2,3),
       lwd = 2)

The 95% confidence band for the mean response $\mu_{Y|X=x}$ is

\[ \hat{y}(x) \pm t_{.025,\,n-2}\, s \sqrt{ \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} }. \]

The 95% prediction band for a new observation $Y_{\text{new}}$ at $X=x$ is

\[ \hat{y}(x) \pm t_{.025,\,n-2}\, s \sqrt{ 1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}} }. \]

Where

\[ \hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad s = \sqrt{\frac{SSE}{n-2}}, \qquad S_{xx} = \sum (x_i - \bar{x})^2. \]

For this model,

\[ n = 30, \qquad df = 28, \qquad t_{.025,28} = 2.048. \]

The confidence band is narrower because it estimates the mean salary at a given CGPA.

The prediction band is wider because it accounts for both the uncertainty in estimating the mean and the individual variability of salaries around the regression line.

Question 8 Part F

Testing

\[ H_0 : \mu_{Y|X_0} = 11{,}500 \qquad \text{versus} \qquad H_a : \mu_{Y|X_0} \neq 11{,}500, \]

\[ X_0 = 2.75. \]

Using the fitted regression equation,

\[ \hat{Y} = 435.924 + 3630.561X, \]

the estimated mean starting salary at $X_0 = 2.75$ is

\[ \hat{y}(2.75) = 435.924 + 3630.561(2.75) = 10419.967. \]

The standard error of the estimated mean response at $X_0$ is

\[ SE\!\left(\hat{y}(2.75)\right) = 209.425. \]

Thus, the test statistic is

\[ t = \frac{\hat{y}(2.75) - 11500}{SE\!\left(\hat{y}(2.75)\right)} = \frac{10419.967 - 11500}{209.425} = -5.157. \]

Since $n = 30$, the degrees of freedom are

\[ df = n - 2 = 28, \]

and the critical value at the $\alpha = 0.05$ level is

\[ t_{.025,28} = 2.048. \]

Because

\[ |t| = 5.157 > 2.048, \]

we reject $H_0$ at the 0.05 level.

There is sufficient statistical evidence to conclude that the mean starting salary at a CGPA of 2.75 is not equal to $11,500. The fitted model suggests that the expected salary at this CGPA is substantially lower than $11,500.
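The standard error of the estimated mean response used above comes from the confidence-band formula $s\sqrt{1/n + (x_0-\bar{x})^2/S_{xx}}$. The following is a minimal sketch of that calculation and the resulting test, using R's built-in `cars` dataset as a stand-in because the salary data are not reproduced here; `x0` and `mu0` are hypothetical values chosen only for illustration.

```r
# `cars` ships with R and is used only as a stand-in for the salary data.
fit <- lm(dist ~ speed, data = cars)

x0  <- 15    # hypothetical predictor value at which the mean response is tested
mu0 <- 40    # hypothetical null value for the mean response at x0

# Fitted mean response and its standard error from predict()
pred <- predict(fit, newdata = data.frame(speed = x0), se.fit = TRUE)

# Same standard error from the formula s * sqrt(1/n + (x0 - xbar)^2 / Sxx)
n   <- nrow(cars)
s   <- sigma(fit)
sxx <- sum((cars$speed - mean(cars$speed))^2)
se_manual <- s * sqrt(1 / n + (x0 - mean(cars$speed))^2 / sxx)

# t test of H0: mean response at x0 equals mu0, on n - 2 degrees of freedom
t_stat <- (unname(pred$fit) - mu0) / unname(pred$se.fit)
p_val  <- 2 * pt(abs(t_stat), df = pred$df, lower.tail = FALSE)

# TRUE when the formula agrees with predict()
all.equal(unname(pred$se.fit), se_manual)
```

For the assignment, the analogous call would be `predict(model2, newdata = data.frame(CGPA = 2.75), se.fit = TRUE)`, with 11500 as the null value.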