Assignment B Allison Kenney

Inference & Correlation

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.

  4. The assignment is worth 100 points, and is due on 14th October 2023 at 11:59 pm.

  5. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
  • Final answers of each question are written clearly (1 pt).
  • The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

All questions are based on the normal linear regression model below, unless otherwise stated:

\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i,\ i = 1,\dots,n\), where \(\epsilon_i \sim N(0, \sigma^2)\) … (1)

B.1

The dataset crime.txt consists of the number of crimes per 100,000 residents and the percentage of individuals with a high school diploma in each of 84 counties.

B.1.1

Estimate the expected decrease in number of crimes per 100,000 residents when the proportion of individuals having high school diploma increases by 1%. Use a 99% confidence interval. Interpret the interval estimate.

(4 + 2 = 6 points)

# Read the raw file; the first row holds the column headers, so drop it,
# name the columns, and convert the character columns to numeric
crime <- read.table('./crime.txt')
crime <- crime[-1, ]
colnames(crime) <- c("Crimes", "Diplomas")
chars <- sapply(crime, is.character)
crime[, chars] <- as.data.frame(apply(crime[, chars], 2, as.numeric))

# Fit the simple linear regression of crime rate on diploma percentage
fit <- lm(Crimes ~ Diplomas, data = crime)
summary(fit)

Call:
lm(formula = Crimes ~ Diplomas, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-5278.3 -1757.5  -210.5  1575.3  6803.3 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
Diplomas     -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared:  0.1703,    Adjusted R-squared:  0.1602 
F-statistic: 16.83 on 1 and 82 DF,  p-value: 9.571e-05
confint(fit, level = 0.99)  # 99% CI for both coefficients
                 0.5 %      99.5 %
(Intercept) 11874.0517 29161.14822
Diplomas     -280.2118   -60.93856

The expected number of crimes per 100,000 residents decreases by an estimated 170.58 for each 1% increase in the proportion of individuals with a high school diploma, since this is the slope (\(\hat{\beta}_1\)) of the fitted regression line. The 99% confidence interval for the expected decrease is approximately \(170.58 \pm 109.64\), i.e., (60.94, 280.21). In other words, when the proportion of individuals with a high school diploma increases by 1%, we can be 99% confident that the expected number of crimes per 100,000 residents decreases by between 60.94 and 280.21.

B.1.2

For what percentage of individuals having a high school diploma in a county will you be the most confident in estimating the crime rate with a confidence interval of a given width?

(2 points)

For a confidence interval of a given width, we are most confident when estimating the crime rate at the mean of the predictor: here, when the percentage of individuals with a high school diploma is \(\bar{x} = 78.59524\). This is because \(Var(\hat{Y}_h) = \sigma^2\left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right]\), and the term \((x_h - \bar{x})^2\) is minimized (equal to 0) exactly when \(x_h = \bar{x}\).
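A quick numerical check (assuming the crime data frame and fit from B.1.1 are still in memory): the width of the confidence interval for the mean response should be smallest at \(\bar{x}\).

xbar <- mean(crime$Diplomas)  # ~78.6
grid <- data.frame(Diplomas = c(65, xbar, 90))
ci <- predict(fit, newdata = grid, interval = "confidence", level = 0.99)
ci[, "upr"] - ci[, "lwr"]     # narrowest width should be the middle (xbar) entry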

B.1.3

For what range of values of the percentage of individuals having a high school diploma in a county will it be inappropriate to use the developed model (in B.1.1) to predict the crime rate?

(2 points)

It would be inappropriate to use the developed model to predict the crime rate for percentages of individuals with a high school diploma that fall outside the range observed in the data set: here, values below 61% or above 91%.
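This can be checked directly (assuming the crime data frame from B.1.1 is still in memory):

# Observed range of the predictor; extrapolation beyond it is inappropriate
range(crime$Diplomas)  # expected to be about 61 to 91, per the answer above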

B.1.4

Predict the number of crimes per 100,000 residents of a county using a 90% prediction interval if 75% of the individuals have a high school diploma in the county. Interpret the prediction interval.

(2 + 2 = 4 points)

crime <- read.table('crime.txt', header = TRUE)
fit <- lm(crimes ~ high_school_diploma, data = crime)
# 90% prediction interval for a single county where 75% hold a diploma
newx <- data.frame(high_school_diploma = 75)
predict(fit, newdata = newx, level = 0.9, interval = "prediction")
       fit     lwr     upr
1 7724.461 3773.32 11675.6

We can be 90% confident that a single county in which 75% of individuals have a high school diploma will have between 3,773.32 and 11,675.60 crimes per 100,000 residents (\(7724.461 \pm 3951.14\)). Note that this is a prediction interval for an individual county, so it is wider than the corresponding confidence interval for the mean response.

B.1.5

A consultant had stated that the expected number of crimes per 100,000 should reduce by at least 1000 for a 4% increase in the proportion of people having high school diplomas. Conduct a hypothesis test to verify the statement. Consider the probability of type 1 error as 5%. State the null and alternate hypotheses, p-value, and conclusion from the test.

(2 + 2 + 2 = 6 points)

crime <- read.table('crime.txt', header = TRUE)
fit <- lm(crimes ~ high_school_diploma, data = crime)

# A decrease of at least 1000 crimes per 4% increase is at least 250 per 1%,
# so test H0: beta1 <= -250 against Ha: beta1 > -250 at alpha = 0.05
b1 <- coef(summary(fit))["high_school_diploma", "Estimate"]    # -170.58
se <- coef(summary(fit))["high_school_diploma", "Std. Error"]  # 41.57
t_stat <- (b1 + 250) / se                # ~1.91
pt(t_stat, df = 82, lower.tail = FALSE)  # p-value, ~0.030

Null Hypothesis: \(H_0: \beta_1 \le -250\) (the consultant's claim, expressed per 1% increase)

Alternative Hypothesis: \(H_a: \beta_1 > -250\)

The t-statistic is \((-170.58 + 250)/41.57 \approx 1.91\), which gives a p-value of approximately 0.030. Since 0.030 < 0.05, we reject the null hypothesis. Equivalently, the rejection boundary for \(\hat{\beta}_1\) is \(-250 + t_{0.95,82} \times 41.57 \approx -180.84\), and the observed \(\hat{\beta}_1 = -170.58\) exceeds it. We conclude that the data do not support the consultant's statement: the expected reduction in crimes per 100,000 residents for a 4% increase in the proportion of high school graduates is significantly smaller than 1000.

B.1.6

What is the probability that you will reject the null hypothesis in the previous question if it is actually false, and the true value of \(\beta_1\) is \(\beta_1 = -200\)?

(7 points)

The decision rule from B.1.5 rejects \(H_0\) when \(\hat{\beta}_1 > -250 + t_{0.95,82} \times 41.57 \approx -180.84\). If the true value is \(\beta_1 = -200\), then \(\hat{\beta}_1\) is centered at \(-200\), and (approximating by treating the estimated standard error as fixed) the power is \(P\left(T_{82} > \frac{-180.84 - (-200)}{41.57}\right) = P(T_{82} > 0.46) \approx 0.32\). So the probability of rejecting the false null hypothesis is roughly 32%.
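A minimal sketch of this computation in R, under the same fixed-SE approximation (41.57 is the estimated slope standard error from B.1.1):

se <- 41.57
bound <- -250 + qt(0.95, df = 82) * se               # rejection boundary, ~-180.84
pt((bound + 200) / se, df = 82, lower.tail = FALSE)  # power, ~0.32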

B.1.7

By how much is the total variation in crime rate reduced when percentage of high school graduates is introduced into the analysis?

(2 points)

crime <- read.table('crime.txt', header = TRUE)
fit <- lm(crimes ~ high_school_diploma, data = crime)
summary(fit)

Call:
lm(formula = crimes ~ high_school_diploma, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-5278.3 -1757.5  -210.5  1575.3  6803.3 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         20517.60    3277.64   6.260 1.67e-08 ***
high_school_diploma  -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared:  0.1703,    Adjusted R-squared:  0.1602 
F-statistic: 16.83 on 1 and 82 DF,  p-value: 9.571e-05

R-squared = 0.1703 = SSR/SSTO, so introducing the percentage of high school graduates reduces the total variation in crime rate by 17.03% (the error sum of squares is 17.03% smaller than the total sum of squares).

B.2

Suppose that the normal error regression model is applicable except that the error variance is not constant; rather the variance is larger, the larger is \(X\). Does \(\beta_1=0\) still imply that there is no linear association between \(X\) and \(Y\)? Explain.

Does it also imply that there is no association between \(X\) and \(Y\)? Or, does it also imply that there is an association between \(X\) and \(Y\)? Explain.

(2 + 2 + 2 = 6 points)

\(\beta_1=0\) still implies that there is no linear association between \(X\) and \(Y\): the mean of \(Y\) does not change with \(X\), so changes in \(X\) produce no linear response in the expected value of \(Y\). However, it does not imply that there is no association at all. Because the error variance grows with \(X\), the distribution of \(Y\) (specifically, its spread) depends on \(X\), so \(X\) and \(Y\) are associated; the association is through the variance rather than the mean. If there were no association whatsoever, the variance of \(Y\) could not grow with \(X\).
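A small illustration of this point (a sketch; the choice of \(sd = x\) as the variance function is arbitrary and only for demonstration): even with \(\beta_1 = 0\), the spread of \(Y\) tracks \(X\), so the two variables are associated.

# beta1 = 0: the mean of y is constant, but its spread grows with x
set.seed(1)
x <- runif(500)
y <- 5 + rnorm(500, mean = 0, sd = x)
cor(x, abs(y - 5))  # clearly positive: the size of the deviations tracks x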

B.3

Show that the regression sum of squares has only one degree of freedom for model (1). You may use the normal equations obtained from minimizing of sum of squared errors.

(4 points)

See attached scanned file.
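For reference alongside the scanned proof, the key identity is the following (a sketch): the normal equations give \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\), so the fitted values satisfy \(\hat{Y}_i - \bar{Y} = \hat{\beta}_1(X_i - \bar{X})\), and therefore

\[SSR = \sum_i (\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 \sum_i (X_i - \bar{X})^2,\]

which depends on the random data only through the single quantity \(\hat{\beta}_1\); hence SSR has one degree of freedom.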

B.4

In a small-scale regression study, five observations on \(Y\) were obtained corresponding to \(X=1, 4, 10, 11,\) and \(14\). Assume that \(\sigma = 0.6, \beta_0 = 5,\) and \(\beta_1=3\).

B.4.1

What are the expected values of MSR and MSE?

(3 + 2 = 5 points)

# E(MSR) = sigma^2 + beta1^2 * sum((x - xbar)^2), and E(MSE) = sigma^2
x <- c(1, 4, 10, 11, 14)
ssx <- sum((x - mean(x))^2)  # sum of squared deviations = 114
0.6^2 + 3^2 * ssx            # E(MSR)

[1] 1026.36

Since SSR has one degree of freedom, \(MSR = SSR\), and from the textbook \(E\{MSR\} = \sigma^2 + \beta_1^2\sum_i(X_i - \bar{X})^2 = 0.36 + 9(114) = 1026.36\), so the expected value of \(MSR\) is 1026.36.

From the textbook, \(E\{MSE\}=\sigma^2\), so the expected value of \(MSE\) is 0.36.

B.4.2

For determining whether or not a regression relation exists, would it have been better or worse to have made the five observations at \(X=6,7,8,9,10\)? Why? Would the same answer apply if the principal purpose were to estimate the mean response for \(X=8\). Explain.

(3 + 3 = 6 points)

It would have been worse for determining whether a regression relation exists: \(Var(\hat{\beta}_1) = \sigma^2/\sum_i(X_i - \bar{X})^2\), and clustering the observations at \(X = 6, 7, 8, 9, 10\) shrinks \(\sum_i(X_i - \bar{X})^2\) from 114 to 10, which inflates the variance of the slope estimate and reduces the power of the test for \(\beta_1 = 0\). The answer changes if the principal purpose is to estimate the mean response at \(X = 8\): both designs have \(\bar{X} = 8\), and at \(x_h = \bar{x}\) the variance of \(\hat{Y}_h\) is \(\sigma^2/n\) regardless of the spread of the \(X\) values, so the two designs are equally good for that purpose.
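A quick check of the spread term \(\sum_i(X_i - \bar{X})^2\) under the two designs (both have \(\bar{X} = 8\)):

sum((c(1, 4, 10, 11, 14) - 8)^2)  # original design: 114
sum((c(6, 7, 8, 9, 10) - 8)^2)    # clustered design: 10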

B.5

Consider the following code and its output below.

#Setting seed for reproducibility
set.seed(10)

#Simulating data
x <- seq(0, 1, 0.01)
y <- 0.5 + 2*x + rnorm(length(x))
data <- data.frame(x = x, y = y)

#Developing linear regression model
model <- lm(y~x, data = data)
summary(model)

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8866 -0.6294  0.0103  0.6819  2.2644 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.08838    0.18370   0.481    0.631    
x            2.53776    0.31738   7.996 2.45e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9299 on 99 degrees of freedom
Multiple R-squared:  0.3924,    Adjusted R-squared:  0.3863 
F-statistic: 63.94 on 1 and 99 DF,  p-value: 2.448e-12

We see that the intercept is statistically insignificant. Because it is insignificant, a student suggested that it would not matter if it were removed from the developed model, i.e., if the developed model is changed to:

\(Y_i = \beta_1X_i + \epsilon_i,\ i = 1,\dots,n\), where \(\epsilon_i \sim N(0, \sigma^2)\) … (2)

B.5.1

Given that \(\beta_0 \ne 0\), derive the expression for the bias in the estimate of \(\beta_1\) if model (2) is considered.

(4 points)

Shown on attached paper.
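For reference alongside the scanned derivation, a sketch: under model (2) the least-squares estimator is \(\hat{\beta}_1 = \sum X_iY_i/\sum X_i^2\), so if the data actually follow model (1),

\[E(\hat{\beta}_1) = \frac{\sum X_i E(Y_i)}{\sum X_i^2} = \frac{\sum X_i(\beta_0 + \beta_1 X_i)}{\sum X_i^2} = \beta_1 + \beta_0\frac{\sum X_i}{\sum X_i^2},\]

giving \(Bias(\hat{\beta}_1) = \beta_0\sum X_i/\sum X_i^2\).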

B.5.2

Use the simulated data to compute the value of the bias obtained from the expression derived in the previous question (B.5.1).

(2 points)

\(Bias(\hat{\beta}_1) = \beta_0\left(\sum X_i/\sum X_i^2\right)\)

We know from the data-generating code above that the true value is \(\beta_0 = 0.5\).

# Bias = beta0 * sum(x) / sum(x^2), with beta0 = 0.5 and x = seq(0, 1, 0.01)
x <- seq(0, 1, 0.01)
bias <- 0.5 * sum(x) / sum(x^2)
bias
[1] 0.7462687

The computed value of the bias is 0.7463.

B.5.3

Use simulations to verify the value of bias computed in the previous question (B.5.2). Simulate the data 1000 times. Set a unique value of seed to generate a unique dataset in every iteration. In each iteration, estimate \(\beta_1\). Report the bias in the estimate of \(\beta_1\) based on the 1000 simulations. Is it the same as obtained analytically in the previous question (B.5.2)?

Note that: \(Bias(\hat{\beta}_1) = E(\hat{\beta}_1 - \beta_1)\)

(5 + 1 = 6 points)

# Simulate data from model (1) 1000 times with a new seed each iteration,
# fit the no-intercept model (2), and record the slope estimate
s <- 1
beta1_values <- rep(0, 1000)
for (i in 1:1000) {
  set.seed(s)
  x <- seq(0, 1, 0.01)
  y <- 0.5 + 2*x + rnorm(length(x))
  data <- data.frame(x = x, y = y)
  model <- lm(y ~ x - 1, data = data)  # "- 1" drops the intercept
  beta1_values[i] <- model$coefficients['x']
  s <- s + 2
}

# Empirical bias: mean of the estimates minus the true beta1 = 2
mean(beta1_values) - 2
[1] 0.7408974

The bias estimate after 1000 simulations is 0.7408974. This is extremely close to, but not identical to, the analytical value from the previous question, as expected: the simulation average is computed from a finite number of replications, so it varies around the true expected value due to the random error in the simulated data.

B.5.4

Plot the 1000 regression lines based on the 1000 coefficient estimates of the no-intercept model (as obtained in B.5.3) over a scatterplot of the data points.

(2 points)

# Build one data frame holding all 1000 fitted no-intercept lines, then draw
# them over a scatterplot of the last simulated dataset (held in `data`)
library(ggplot2)

x <- seq(0, 1, 0.01)
lines_df <- data.frame(
  x   = rep(x, times = 1000),
  y   = as.vector(outer(x, beta1_values)),  # column i is line i: beta1_i * x
  sim = rep(1:1000, each = length(x))
)

ggplot() +
  geom_line(data = lines_df, aes(x = x, y = y, group = sim),
            colour = "grey70", alpha = 0.2) +
  geom_point(data = data, aes(x = x, y = y)) +
  labs(x = "x", y = "y")

B.5.5

Now consider the simple linear regression model with the intercept (model 1). Simulate the data 1000 times. In each iteration, estimate \(\beta_1\). Report the bias in the estimate of \(\beta_1\) based on the 1000 simulations.

(4 points)

# Same simulation as B.5.3, now fitting model (1) with the intercept included
s <- 1
beta1_values <- rep(0, 1000)
for (i in 1:1000) {
  set.seed(s)
  x <- seq(0, 1, 0.01)
  y <- 0.5 + 2*x + rnorm(length(x))
  data <- data.frame(x = x, y = y)
  model <- lm(y ~ x, data = data)
  beta1_values[i] <- model$coefficients['x']
  s <- s + 2
}

# Empirical bias relative to the true beta1 = 2
mean(beta1_values) - 2
[1] 0.01029314

The bias estimate after 1000 simulations is 0.01029314, which is close to zero: with the intercept included, \(\hat{\beta}_1\) is an unbiased estimator, and the small nonzero value is attributable to Monte Carlo error.
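As a quick check that this is consistent with zero bias (a sketch, assuming beta1_values from the chunk above is still in memory):

# Monte Carlo standard error of the bias estimate; the observed bias of
# ~0.0103 should be within about two of these of zero
sd(beta1_values) / sqrt(1000)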

B.5.6

Plot the 1000 regression lines based on the 1000 coefficient estimates of the model with intercept (as obtained in B.5.5) over a scatterplot of the data points.

(2 points)
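One way to produce this plot (a sketch, re-running the B.5.5 simulation so that the intercept estimates are stored alongside the slopes):

library(ggplot2)

# Re-run the B.5.5 simulation, keeping both coefficients this time
s <- 1
coefs <- matrix(0, nrow = 1000, ncol = 2)  # columns: intercept, slope
for (i in 1:1000) {
  set.seed(s)
  x <- seq(0, 1, 0.01)
  y <- 0.5 + 2*x + rnorm(length(x))
  coefs[i, ] <- coef(lm(y ~ x))
  s <- s + 2
}

# One line per simulation: y = intercept_i + slope_i * x
lines_df <- data.frame(
  x   = rep(x, times = 1000),
  y   = as.vector(outer(x, coefs[, 2])) + rep(coefs[, 1], each = length(x)),
  sim = rep(1:1000, each = length(x))
)

ggplot() +
  geom_line(data = lines_df, aes(x = x, y = y, group = sim),
            colour = "grey70", alpha = 0.2) +
  geom_point(data = data.frame(x = x, y = y), aes(x = x, y = y)) +
  labs(x = "x", y = "y")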

B.5.7

Based on the simulations and analytical results, answer the following:

B.5.7.1

If the intercept is found to be statistically insignificant in model (1) as shown in the model summary, should it be removed from the model, and model (2) considered instead? Explain.

(3 points)

No. In order to justify removing the intercept, the resulting no-intercept estimator would need to remain unbiased, but it has been shown above that the bias is nonzero when \(\beta_0 \ne 0\) (here approximately 0.75). Statistical insignificance of \(\hat{\beta}_0\) only means we cannot rule out \(\beta_0 = 0\); it does not establish that \(\beta_0 = 0\). The intercept should therefore not be removed from the model.

B.5.7.2

Suppose we know that the true model passes through the origin \((X = 0, Y = 0)\). Should we consider model (1) or model (2) in this case? Explain.

(2 points)

If the true model passes through the origin, then \(\beta_0 = 0\) and model (2) should be chosen. With \(\beta_0 = 0\), the no-intercept estimator of \(\beta_1\) is unbiased, and it has a smaller variance than the model (1) estimator (\(\sigma^2/\sum X_i^2\) versus \(\sigma^2/\sum(X_i - \bar{X})^2\), and \(\sum X_i^2 \ge \sum(X_i - \bar{X})^2\)), so nothing is lost in bias and precision is gained.

B.5.8

In terms of bias and variance, mention an advantage and a disadvantage of the estimate of \(\beta_1\) from model (2) as compared to that obtained from model (1), if \(\beta_0 \ne 0\). The plots developed in earlier questions may also help visualize the advantage and disadvantage.

(2 + 2 = 4 points)

An advantage of the estimate from model (2) is that it has a lower variance than that of model (1); the disadvantage is that it has a higher bias than the model (1) estimate (which is unbiased) whenever \(\beta_0 \ne 0\). The plots developed in the earlier questions help visualize this trade-off.
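In symbols (a standard identity, not specific to this dataset), with \(S_{xx} = \sum_i(X_i - \bar{X})^2\):

\[Var_{(2)}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i X_i^2} \le \frac{\sigma^2}{S_{xx}} = Var_{(1)}(\hat{\beta}_1), \quad \text{since } \sum_i X_i^2 = S_{xx} + n\bar{X}^2 \ge S_{xx},\]

while the bias of the model (2) estimate is \(\beta_0\sum X_i/\sum X_i^2 \ne 0\) whenever \(\beta_0 \ne 0\) (and \(\sum X_i \ne 0\)).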

B.6

Prove that the estimate of the intercept \(\beta_0\), i.e., \(\hat{\beta}_0\) for model (1) has minimum variance among all unbiased linear estimators.

(5 points)

Shown on attached paper.

B.7

Derive the variance of the estimators obtained in questions A.8 and A.9 of Assignment A.

(2 + 2 = 4 points)

Shown on attached paper.

B.8

Suppose the true value of \(\beta_0\) is known, while that of \(\beta_1\) is unknown and needs to be estimated. The stakeholder wishes to choose the model (from model (1) and model (2)) such that the expected squared error in the estimate of \(\beta_1\) is minimized, i.e., they wish to select the model that has the lesser value of \(E(\hat{\beta}_1 - \beta_1)^2\). Derive the condition (based on the value of \(\beta_0\)) which, if true, implies that model (2) should be chosen instead of model (1).

Hint:

\(E(X^2) = [E(X)]^2 + Var(X)\)

\(\implies E(\hat{\beta}_1 - \beta_1)^2 = [E(\hat{\beta}_1 - \beta_1)]^2 + Var(\hat{\beta}_1 - \beta_1)\)

Model (2) should be selected when \(E(\hat{\beta}_1 - \beta_1)^2\) for model (2) is less than \(E(\hat{\beta}_1 - \beta_1)^2\) for model (1).

(7 points)

Shown on attached paper.
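For reference alongside the scanned derivation, a sketch of where the hint leads (with \(S_{xx} = \sum_i(X_i - \bar{X})^2\), and assuming \(\bar{X} \ne 0\)): model (1) gives \(E(\hat{\beta}_1 - \beta_1)^2 = \sigma^2/S_{xx}\) (no bias), while model (2) gives bias squared plus variance, so model (2) is preferred when

\[\left(\beta_0\frac{\sum X_i}{\sum X_i^2}\right)^2 + \frac{\sigma^2}{\sum X_i^2} < \frac{\sigma^2}{S_{xx}}.\]

Using \(\sum X_i = n\bar{X}\) and \(\sum X_i^2 - S_{xx} = n\bar{X}^2\), this simplifies to

\[\beta_0^2 < \frac{\sigma^2 \sum X_i^2}{n\,S_{xx}}.\]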