Assignment A

Simple linear regression

Instructions

You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.
Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.
Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.
The assignment is worth 100 points, and is due on 7th October 2023 at 11:59 pm.
Five points are for properly formatting the assignment. The breakdown is as follows:

Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
Final answers of each question are written clearly (1 pt).
The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

A.1

The first step in using the capital asset pricing model (CAPM) is to estimate the stock’s beta \((\beta)\) using the market model. The market model can be written as:

\(R_{it} = \alpha_i + \beta_iR_{mt} + \epsilon_{it},\)

where \(R_{it}\) is the excess return for security \(i\) at time \(t\), \(R_{mt}\) is the excess return on a proxy for the market portfolio at time \(t\), and \(\epsilon_{it}\) is an iid random disturbance term. The coefficient beta in this case is the CAPM beta for security \(i\).

Suppose that you had estimated \(\beta\) for a stock as \(\hat{\beta}=1.147\). The standard error associated with this coefficient \(SE(\hat{\beta})\) is estimated to be 0.0548. A city analyst has told you that this security closely follows the market, but that it is no more risky, on average, than the market. This can be tested by the null hypothesis that the value of beta \((\beta)\) is one. The model is estimated over 62 daily observations. Test this hypothesis against a one-sided alternative that the security is more risky than the market \((\beta>1)\). Consider Type 1 error \((\alpha)\) as \(1\%\). Write down the null and alternative hypotheses. What do you conclude?

Does your conclusion change if you consider the Type 1 error \((\alpha)\) as \(0.1\%\)?

(4 + 1 = 5 points)

This question was completed on paper.
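The test described in A.1 can be sketched numerically as follows. This is only an illustration built from the numbers quoted in the question; it assumes the usual t-test for a regression slope with \(n-2\) degrees of freedom, and the variable names are made up for the sketch.

```r
# Values quoted in the question
beta_hat <- 1.147   # estimated beta
se_beta  <- 0.0548  # standard error of the estimate
n        <- 62      # number of daily observations

# H0: beta = 1 versus H1: beta > 1 (one-sided)
t_stat <- (beta_hat - 1) / se_beta
df     <- n - 2     # assumption: two estimated parameters (alpha_i and beta_i)

# Upper-tail critical values and p-value from the t distribution
crit_1pct  <- qt(1 - 0.01,  df)
crit_01pct <- qt(1 - 0.001, df)
p_value    <- pt(t_stat, df, lower.tail = FALSE)

reject_1pct  <- t_stat > crit_1pct    # decision at alpha = 1%
reject_01pct <- t_stat > crit_01pct   # decision at alpha = 0.1%

print(c(t = t_stat, crit_1pct = crit_1pct, crit_01pct = crit_01pct, p = p_value))
```

Under these assumptions the test statistic lies between the two critical values, so the null hypothesis is rejected at the 1% level but not at the 0.1% level.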

A.2

When asked to state the simple linear regression model, a student wrote it as follows: \(E(Y_i) = \beta_0 + \beta_1X_1 + \epsilon_i\). Is this correct? Justify your answer.

(2 points)

This question was completed on paper.

A.3

Consider the simple linear regression model below:

\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i, i = 1,...,n\)

where:

\(\beta_0 = 100, \beta_1 = 20,\) and \(\sigma^2 =5\). The following assumptions are made for the model:

A. \(E(\epsilon_i) = 0,\)

B. \(Var(\epsilon_i) = \sigma^2,\)

C. \(Cov(\epsilon_i, \epsilon_j)=0 \ \forall i, j; i\ne j\)

An observation \(Y\) is made for \(X=5\)

Can you state the exact probability that \(Y\) will fall between 195 and 205? If yes, then compute the probability. If not, then state any reasonable assumption(s) you need to make to compute the probability, and then compute the probability.

(1 + 1 + 4 = 6 points)

This question was completed on paper.
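Assumptions A to C fix only the mean, variance and covariance of the errors, not their distribution, so a probability cannot be computed from them alone. A minimal sketch of the calculation, under the added assumption (not stated in the question) that the errors are normally distributed:

```r
# Parameter values given in the question
beta0  <- 100
beta1  <- 20
sigma2 <- 5
x      <- 5

mu_y <- beta0 + beta1 * x   # E(Y | X = 5)
sd_y <- sqrt(sigma2)        # standard deviation of Y around its mean

# Added assumption: epsilon_i ~ Normal(0, sigma^2), so Y ~ Normal(mu_y, sigma^2)
p <- pnorm(205, mean = mu_y, sd = sd_y) - pnorm(195, mean = mu_y, sd = sd_y)
print(p)
```

With a different error distribution satisfying A to C, the probability would generally be different.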

A.4

The Toluca Company manufactures refrigeration equipment in lots of varying sizes. The dataset toluca.txt consists of two columns: LotSize and WorkHours, the work hours required to produce the lot.

When asked for a point estimate of the expected work hours for lot sizes of 30 pieces, a person gave the estimate 202 because that is the mean number of WorkHours for the three observations of LotSize = 30 pieces in the dataset. Is there an issue with this approach? Explain. If there is an issue, then suggest a better approach and use it to estimate the expected work hours for lot sizes of 30 pieces.

(2 + 2 + 4 = 8 points)

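Yes, there is an issue with this approach. The mean of the three observations at LotSize = 30 uses only those three data points and ignores the information in the rest of the dataset. A better approach is to fit the regression of WorkHours on LotSize using all observations and use the fitted value \(\hat{Y}\) as the point estimator of the mean response, where \(\hat{Y}_i=\hat{\beta}_0+\hat{\beta}_1X_i\) for \(i=1,...,n\).

We need the least squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to plug into the equation \(\hat{Y}=\hat{\beta}_0+\hat{\beta}_1(30)\).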

# Read the data; the first row of the file holds the column names
toluca <- read.table('./toluca.txt', header = TRUE)
colnames(toluca) <- c("LotSize", "WorkHours")
# Regress work hours on lot size using all observations
model <- lm(WorkHours ~ LotSize, data = toluca)
model$coefficients

(Intercept)     LotSize 
  62.365859    3.570202

\(\hat{\beta}_0\) = 62.365859

\(\hat{\beta}_1\) = 3.570202

We can then plug these estimates into the equation \(\hat{Y}=\hat{\beta}_0+\hat{\beta}_1(30)\) to find the point estimate: \(62.365859 + 3.570202(30) \approx 169.472\) work hours.
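The plug-in step can be reproduced directly from the coefficients printed above. The variable names below are chosen for this sketch only.

```r
# Least squares estimates copied from the printed lm() output
b0 <- 62.365859
b1 <- 3.570202

lot_size <- 30
y_hat <- b0 + b1 * lot_size   # estimated mean work hours at LotSize = 30

print(round(y_hat, 3))
```

With the fitted `model` object still in the session, `predict(model, newdata = data.frame(LotSize = 30))` should return the same value without retyping the coefficients.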

A.5

Consider the simple linear regression model below:

\(\log(Y)=\beta_0+\beta_1\log(X)+\epsilon\)

Interpret the coefficient \(\beta_1\), where you mention the approximate expected percentage increase in \(Y\) given an increase of \(1\%\) in \(X\).

Use the approximation: \(\log(1+x) \approx x\) if \(|x| \ll 1\)

(5 points)

This question was completed on paper.
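A sketch of the reasoning behind the log-log interpretation, assuming natural logarithms and comparing the systematic part of the model at \(X\) and at \(1.01X\) (the subscript "new" is notation introduced here):

```latex
\begin{aligned}
\log(Y_{\text{new}}) - \log(Y)
  &= \beta_1\left[\log(1.01X) - \log(X)\right]
   = \beta_1 \log(1 + 0.01) \approx 0.01\,\beta_1, \\
\log\!\left(\frac{Y_{\text{new}}}{Y}\right)
  &= \log\!\left(1 + \frac{Y_{\text{new}} - Y}{Y}\right)
   \approx \frac{Y_{\text{new}} - Y}{Y}
  \;\Longrightarrow\;
  \frac{Y_{\text{new}} - Y}{Y} \approx 0.01\,\beta_1 .
\end{aligned}
```

So an increase of \(1\%\) in \(X\) is associated with an approximate increase of \(\beta_1\%\) in \(Y\), provided \(0.01\,\beta_1\) is itself small.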

A.6

The dataset ACT_GPA contains the GPA at the end of freshman year (\(Y\)) and the ACT score (\(X\)) of students of a college; the GPA is to be predicted from the ACT score.

A.6.1

Obtain the least squares estimates of the regression coefficients, and error standard deviation, and state the estimated regression function.

(5 points)

# Read the data; the first row of the file holds the column names
act_gpa <- read.table('./ACT_GPA.txt', header = TRUE)
colnames(act_gpa) <- c("GPA", "ACT")
# Regress freshman GPA on ACT score
model2 <- lm(GPA~ACT, data = act_gpa)
model2$coefficients

(Intercept)         ACT 
 2.11404929  0.03882713

\(\beta_0\) is estimated to be 2.11405. \(\beta_1\) is estimated to be 0.03883.

The error standard deviation is estimated to be 0.6231 (the residual standard error of the fitted model).

The estimated regression function is \(\hat{Y}_i=2.11405+0.03883X_i\)
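The coefficient printout does not show where the error standard deviation comes from. A minimal sketch of how it can be obtained from a fitted `lm` object, using a small hypothetical data frame (`toy`, invented here so the code runs without the ACT_GPA file):

```r
# Hypothetical toy data, NOT the ACT_GPA file
toy <- data.frame(
  ACT = c(18, 21, 24, 25, 28, 30, 32, 35),
  GPA = c(2.6, 3.1, 2.9, 3.4, 3.0, 3.6, 3.3, 3.9)
)
fit <- lm(GPA ~ ACT, data = toy)

n   <- nrow(toy)
sse <- sum(residuals(fit)^2)      # residual (error) sum of squares

s_manual  <- sqrt(sse / (n - 2))  # divide by n - 2: two coefficients are estimated
s_summary <- summary(fit)$sigma   # residual standard error reported by summary()

print(c(manual = s_manual, summary = s_summary))
```

Applied to `model2`, `summary(model2)$sigma` should give the estimate of the error standard deviation quoted above.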

A.6.2

Obtain the maximum likelihood estimate of the error standard deviation. Is it the same as that obtained in the previous question? Why or why not? If it isn’t same, which estimate will you prefer - the MLE or the one obtained in the previous question and why?

(2 + 2 + 4 = 8 points)

# act_gpa was read in and cleaned in A.6.1
ACT <- act_gpa[, "ACT"]
GPA <- act_gpa[, "GPA"]
# Negative log-likelihood under normal errors; mu is the error mean
LL <- function(beta0, beta1, mu, sigma){
  R = GPA - ACT * beta1 - beta0
  R = suppressWarnings(dnorm(R, mu, sigma, log = TRUE))
  -sum(R)
}
fit <- stats4::mle(LL, start=list(beta0 = 2, beta1 = .04, mu = 20, sigma = .6))
fit


Call:
stats4::mle(minuslogl = LL, start = list(beta0 = 2, beta1 = 0.04, 
    mu = 20, sigma = 0.6))

Coefficients:
     beta0      beta1         mu      sigma 
-7.9429750  0.0388271 10.0570250  0.6179119

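The MLE of the error standard deviation is 0.6179119, which differs from the estimate obtained in the previous question (0.6231). Note that beta0 and mu are not separately identifiable in this likelihood, since only their sum enters the mean of the response; the sum \(-7.9429750 + 10.0570250 = 2.11405\) reproduces the least squares intercept. The two estimates of the standard deviation differ because the MLE divides the residual sum of squares by \(n\), with no correction for the two estimated regression coefficients, whereas the estimate in the previous question divides by \(n-2\). The MLE of \(\sigma^2\) is therefore biased downward, while the estimate based on \(n-2\) is unbiased for \(\sigma^2\). As a result, I would prefer the estimate obtained in the previous question.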
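The relationship between the two estimates can be checked without a redundant mean parameter. The sketch below uses a small hypothetical data frame (`toy`, invented here) and relies on the fact that, under normal errors, the least squares coefficients are also the maximum likelihood estimates, so only \(\sigma\) needs to be optimised numerically.

```r
# Hypothetical toy data, NOT the ACT_GPA file
toy <- data.frame(
  ACT = c(18, 21, 24, 25, 28, 30, 32, 35),
  GPA = c(2.6, 3.1, 2.9, 3.4, 3.0, 3.6, 3.3, 3.9)
)
fit <- lm(GPA ~ ACT, data = toy)
res <- residuals(fit)
n   <- length(res)

# Negative log-likelihood in sigma alone; the error mean is fixed at 0
neg_loglik <- function(sigma) -sum(dnorm(res, mean = 0, sd = sigma, log = TRUE))

sigma_numeric <- optimize(neg_loglik, interval = c(0.01, 5))$minimum  # numerical MLE
sigma_closed  <- sqrt(sum(res^2) / n)                                 # closed-form MLE
s_lm          <- summary(fit)$sigma                                   # divides by n - 2

print(c(numeric = sigma_numeric, closed = sigma_closed, lm = s_lm))
```

The closed-form MLE equals the `lm` estimate multiplied by \(\sqrt{(n-2)/n}\), which is why it is always the smaller of the two.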

A.6.3

Interpret the estimates of the regression coefficients and the error standard deviation as obtained in A.6.1. What is the increase in expected GPA for an increase of 2 points in the ACT score?

(6 + 2 = 8 points)

The estimate of \(\beta_0\) is 2.11405, with a standard error of 0.32089, a t-value of 6.588, and a p-value of 1.3e-09. This means that a student with an ACT score of 0 would be expected to have a GPA of 2.11405. This value is not entirely meaningful, however, as a score of 0 is not possible on the ACT (the lowest score is 1). The small p-value indicates that the intercept is significantly different from zero; it does not, by itself, say anything about the relationship between ACT and GPA.

The estimate of \(\beta_1\) is 0.03883, with a standard error of 0.01277, a t-value of 3.040, and a p-value of 0.00292. This means that each additional point on the ACT is associated with an increase of 0.03883 in expected GPA. The p-value is below 0.01, so the slope is significantly different from zero, indicating that there is a linear relationship between ACT and GPA.

The error standard deviation of 0.6231 means that observed GPAs typically deviate from the regression line by about 0.62 grade points.

For an increase of 2 points in ACT score, one could expect an increase of \(2 \times 0.03883 = 0.07766\) points in GPA.

A.6.4

Does ACT have a statistically significant relationship with the GPA? Justify your answer.

(1 + 2 = 3 points)

Yes, ACT does have a statistically significant relationship with GPA. The test of \(H_0: \beta_1 = 0\) in the regression model fitted above has a p-value of 0.002917. This value is less than both 0.05 and 0.01, so the null hypothesis is rejected and we conclude that a significant linear relationship exists between ACT and GPA.

A.6.5

Plot the estimated regression function and the data. Does the estimated regression function appear to fit the data well?

(3 + 1 = 4 points)

library(ggplot2)
# act_gpa and model2 were created in A.6.1; compute the 95% prediction and confidence intervals
PI <- predict(model2, interval = "prediction")

Warning in predict.lm(model2, interval = "prediction"): predictions on current data refer to _future_ responses

CI <- predict(model2, interval = "confidence")
lwr <- PI[,2]
upr <- PI[,3]
# Shaded band from geom_smooth(): 95% confidence interval; dashed lines: 95% prediction interval
ggplot(data = act_gpa, aes(x = ACT, y = GPA)) +
  geom_point() +
  geom_smooth(method = "lm", color = 'blue', se = TRUE) +
  geom_line(aes(y = lwr), linetype = "dashed") +
  geom_line(aes(y = upr), linetype = "dashed")

`geom_smooth()` using formula = 'y ~ x'

While the function seems to adequately capture the general trend of the data, as indicated by the significant p-values of the model and coefficients, some of the data points fall relatively far away from the regression line. In other words, the prediction interval is wide and the residual standard error is relatively large, so the line does not fit individual data points closely.

A.6.6

Include the 95% confidence and prediction intervals in the above plot.

(6 points)

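See the plot in A.6.5: the shaded band around the regression line is the 95% confidence interval, and the dashed lines mark the 95% prediction interval.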

A.6.7

Obtain a point estimate, and the 95% confidence and prediction intervals of the freshman GPA for students with an ACT score of \(30\).

(1 + 2 + 2 = 5 points)

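The point estimate can be obtained by plugging \(X = 30\) into the equation \(\hat{Y}=2.11405+0.03883(X)\). The resulting estimate is \(GPA \approx 3.279\) for students with an ACT score of \(30\) (3.27895 with the rounded coefficients; 3.27886 with the unrounded coefficients, which is the midpoint of both intervals below).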

The 95% confidence interval for this estimate is (3.104246, 3.453481).

The 95% prediction interval for this estimate is (2.032612, 4.525114).
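The code that produced these intervals is not shown. One way to obtain a point estimate together with both intervals is `predict()` with a `newdata` argument, sketched here on a small hypothetical data frame (`toy`, invented here so the code runs without the ACT_GPA file):

```r
# Hypothetical toy data, NOT the ACT_GPA file
toy <- data.frame(
  ACT = c(18, 21, 24, 25, 28, 30, 32, 35),
  GPA = c(2.6, 3.1, 2.9, 3.4, 3.0, 3.6, 3.3, 3.9)
)
fit <- lm(GPA ~ ACT, data = toy)

new_student <- data.frame(ACT = 30)

# Each call returns a one-row matrix with columns fit, lwr, upr
conf_int <- predict(fit, newdata = new_student, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_student, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Replacing `fit` with `model2` should reproduce the intervals reported above. The prediction interval is always wider because it also accounts for the variability of an individual student's GPA around the mean response.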

A.6.8

The intercept of the model developed in A.6.1 is the expected GPA when the ACT score is zero. However, the ACT score can never be zero as the minimum possible ACT score is 1. So, should the intercept be removed from the model? Why or why not?

(2 + 4 = 6 points)

The intercept should not be removed from the model. While its estimate of GPA when the ACT score is zero is not meaningful because it cannot happen, the intercept still is an important parameter that contributes to accurate point estimates throughout the rest of the model. The intercept is key to having an unbiased estimate of the slope of the regression line.

A.7

Consider the regression model in A.3, where the parameters are estimated using Maximum likelihood estimation. Let \(e_i\) denote the \(i^{th}\) residual. Prove that:

\(\sum_{i = 1}^n \hat{Y}_ie_i = 0\)
\(\sum_{i=1}^n Y_i = \sum_{i=1}^n \hat{Y}_i\)
The regression line passes through the point (\(\bar{X}, \bar{Y}\))

(3 + 3 + 3 = 9 points)

This question was completed on paper.
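The three identities in A.7 can be sanity-checked numerically before attempting the proof. This is not a proof, only a check on a small hypothetical data frame (`toy`, invented here); it uses least squares, which gives the same coefficient estimates as maximum likelihood under normal errors.

```r
# Hypothetical toy data, used only for a numerical check
toy <- data.frame(
  ACT = c(18, 21, 24, 25, 28, 30, 32, 35),
  GPA = c(2.6, 3.1, 2.9, 3.4, 3.0, 3.6, 3.3, 3.9)
)
fit <- lm(GPA ~ ACT, data = toy)

y     <- toy$GPA
y_hat <- fitted(fit)
e     <- residuals(fit)

check1 <- sum(y_hat * e)        # sum of fitted values times residuals
check2 <- sum(y) - sum(y_hat)   # observed total minus fitted total
# Fitted value at the mean of X minus the mean of Y
check3 <- predict(fit, newdata = data.frame(ACT = mean(toy$ACT))) - mean(y)

print(c(check1, check2, unname(check3)))
```

All three quantities should be zero up to floating-point rounding.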

A.8

Consider the regression model:

\(Y_i = \beta_0 + \epsilon_i, i = 1,...,n\), where \(\epsilon_i \sim N(0, \sigma^2)\)

Derive the maximum likelihood estimate of \(\beta_0\), and show whether it is a biased or unbiased estimate of \(\beta_0\).

(3 + 3 = 6 points)

This question was completed on paper.
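An outline of the derivation for the intercept-only model, assuming the errors are independent so that the likelihood factorises (\(\ell\) denotes the log-likelihood, notation introduced here):

```latex
\begin{aligned}
\ell(\beta_0, \sigma^2)
  &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta_0)^2, \\
\frac{\partial \ell}{\partial \beta_0}
  &= \frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \beta_0) = 0
  \;\Longrightarrow\;
  \hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n Y_i = \bar{Y}, \\
E(\hat{\beta}_0)
  &= \frac{1}{n}\sum_{i=1}^n E(Y_i) = \frac{1}{n}\, n\beta_0 = \beta_0 .
\end{aligned}
```

Under these assumptions the estimate is the sample mean, and it is unbiased.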

A.9

Consider the regression model:

\(Y_i = \beta_1X_i + \epsilon_i, i = 1,...,n\), where \(\epsilon_i \sim N(0, \sigma^2)\)

Derive the maximum likelihood estimate of \(\beta_1\), and show whether it is a biased or unbiased estimate of \(\beta_1\).

(4 + 5 = 9 points)

This question was completed on paper.
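An outline of the derivation for the regression through the origin, assuming independent errors and treating the \(X_i\) as fixed constants (\(\ell\) denotes the log-likelihood, notation introduced here):

```latex
\begin{aligned}
\ell(\beta_1, \sigma^2)
  &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta_1 X_i)^2, \\
\frac{\partial \ell}{\partial \beta_1}
  &= \frac{1}{\sigma^2}\sum_{i=1}^n X_i (Y_i - \beta_1 X_i) = 0
  \;\Longrightarrow\;
  \hat{\beta}_1 = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}, \\
E(\hat{\beta}_1)
  &= \frac{\sum_{i=1}^n X_i E(Y_i)}{\sum_{i=1}^n X_i^2}
   = \frac{\sum_{i=1}^n X_i (\beta_1 X_i)}{\sum_{i=1}^n X_i^2}
   = \beta_1 .
\end{aligned}
```

Under these assumptions the estimate is unbiased, provided at least one \(X_i\) is nonzero so that the denominator is positive.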