Instructions

Exercise 1: Finding Eigenvalues/Eigenvectors With Iterative Methods.

Often, when analyzing data, we need to analyze the structure of numeric symmetric matrices. We say that a matrix A is symmetric if it satisfies the following two properties:

  1. It has the same number of rows and columns, i.e. nrow(A) == ncol(A), and

  2. For any entry [i,j], we have that A[i, j] == A[j, i] is true.

A crucial feature of numeric symmetric matrices is that they have eigenvalues and eigenvectors. Consider a symmetric matrix \(A\in\mathbb{R}^{p\times p}\) of dimension \(p\times p\), let \(\lambda \in\mathbb{R}\) be a numeric constant, and let \(v\in\mathbb{R}^{p\times 1}\) be a \(p\)-dimensional column vector. We say that \((\lambda, v)\) is an eigenvalue-eigenvector pair for the matrix \(A\) if: \[ A\cdot v = \lambda v,\] where \(A\cdot v\) is the usual matrix-vector multiplication. Notice that, for an eigenvector \(v\), matrix multiplication simplifies to multiplying each entry of the vector by a constant. In this exercise, we create a function for calculating the largest absolute eigenvalue and its eigenvector for a symmetric matrix A. We will do this by breaking the algorithm into steps, and then joining them together to create a more complex function. In the following, we use \(\|x\|_2\) to represent the Euclidean norm of a vector, given by \[\|x\|_2 = \sqrt{x_1^2 + x_2^2 + ... + x_p^2}.\] In particular, for two \(p\)-dimensional vectors \(x,y\in\mathbb{R}^{p}\), the Euclidean distance between them is given by \(\|x - y\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 -y_2)^2 + ... + (x_p - y_p)^2}\).
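As a quick sanity check (a small 2 x 2 example, separate from the exercise itself), we can verify an eigenpair numerically in R:

A = matrix(c(2, 1,
             1, 2), nrow = 2)   # a small symmetric matrix chosen for illustration
v = c(1, 1) / sqrt(2)           # a unit-length vector
A %*% v                         # equals 3 * v, so (lambda = 3, v) is an eigenpair of A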

A simple algorithm for finding the eigenpair associated with the largest absolute eigenvalue is to use an iterative method, similar to the one we explored in DA6 for calculating the square root. In this case, the algorithm has as inputs a matrix \(A\), a tolerance level \(\varepsilon > 0\) and a maximum number of iterations \(N\). The algorithm consists of the following steps:

  • Step 1: Check if the matrix A is symmetric. If not, stop the procedure.

  • Step 2: Randomly select a \(p\)-dimensional initial vector \(v_{0}\in\mathbb{R}^{p}\).

  • Step 3: Starting with \(v_i = v_0\), we define the next iteration \(v_{i+1}\) with the following formula: \[ v_{i+1} = \frac{A\cdot v_i}{\|A\cdot v_{i}\|_2}\]

  • Step 4: We stop our algorithm when the number of performed iterations reaches the limit \(N\) or when the distance between the last 2 updates is below the tolerance level: \(\|v_{i} - v_{i-1}\|_2 < \varepsilon\). Say the algorithm stops at the \(n\)-th iteration, with vector \(v_n\).

  • Step 5: After the iterations, our estimation of the eigenvector is \(v_n\), and the eigenvalue is given by \(\lambda = \pm\|A\cdot v_n\|_2\), where the sign depends on whether \(A\) flips the sign of \(v_n\) or not.


1(A) Create a function my.isSymmetric(A, tol = 0.000001) which takes a matrix A and a tolerance parameter tol = 0.000001. The function determines whether the matrix is symmetric with error tolerance up to the parameter tol. For this,

  1. Check that the matrix is a square matrix.
  2. Check if, for any entry A[i, j], the difference to its symmetric counterpart A[j, i] has absolute value less than the tolerance established.
  3. Return TRUE if the matrix is symmetric, and FALSE if the matrix is not symmetric.

(Hint: you can use the transpose function t())

## [1] TRUE
## [1] FALSE
## [1] FALSE
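For reference, here is a minimal sketch of one possible implementation, using the transpose function t() from the hint (the test matrices that produced the outputs above are not shown here):

my.isSymmetric = function(A, tol = 0.000001){
  # A square matrix has the same number of rows and columns
  if(nrow(A) != ncol(A)){
    return(FALSE)
  }
  # Symmetric (up to tolerance) if every entry differs from its
  # transposed counterpart by less than tol in absolute value
  return(all(abs(A - t(A)) < tol))
}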

1(B) Create a function my.euclideanNorm(x = 0) which calculates the Euclidean distance of a \(p\)-dimensional vector x to the zero vector. Remember, for any p-dimensional vector \(x\), the Euclidean norm is given by \(\|x\|_2 = \sqrt{x_1^2 + x_2^2 + ... + x_p^2}.\)

## Your code here:
my.euclideanNorm = function(x = 0){
  # Euclidean norm: square root of the sum of squared entries
  return(sqrt(sum(x^2)))
}

#######################################
#######################################
## Do not modify code below this line:

my.euclideanNorm(c(1,0,0))
## [1] 1
my.euclideanNorm(c(0,0,0,5))
## [1] 5
my.euclideanNorm(c(3,4))
## [1] 5

1(C) Create a function my.eigen_step(A, v_curr) which performs a single updating step of the iterative algorithm. For this, simply:

  • Calculate the multiplication \(w = A\cdot v_{curr}\).
  • Calculate the norm for \(\|w\|_2\).
  • Return \(v_{new} = w / \|w\|_2\).
my.eigen_step = function(A, v_curr){
  # One update of the iterative method: multiply by A, then normalize
  w = A %*% v_curr
  v_new = w / my.euclideanNorm(w)
  return(v_new)
}

1(D) Create a function my.eigen(A, maxiter = 100, tol = 0.000001) where A is the matrix for which we aim to calculate eigenpairs, maxiter corresponds to the maximum number of iterations of the algorithm (100 by default) and tol corresponds to the tolerance level for convergence. A guiding breakdown of the function is as follows.

Before the while() loop:

  1. Check if the matrix is symmetric. If it is not symmetric, print "Error: matrix is not symmetric." and return FALSE, terminating the function call. If it is symmetric, define p = ncol(A) and continue.
  2. Create a random initialization vector with v_curr <- rnorm(p).
  3. Calculate the first step v_new of the iterative algorithm. Define dist to be the distance between v_curr and v_new in Euclidean norm, and iter <- 1 as the current number of iterations performed.
my.eigen = function(A, maxiter = 100, tol = 0.000001){
  # Step 1: check symmetry
  if(!my.isSymmetric(A)){
    print("Error: matrix is not symmetric.")
    return(FALSE)
  }
  p = ncol(A)
  # Step 2: random initialization and first update
  v_curr = rnorm(p)
  v_new = my.eigen_step(A = A, v_curr = v_curr)
  dist = my.euclideanNorm(v_curr - v_new)
  iter = 1
  # Steps 3-4: iterate until convergence or until the iteration limit is reached
  while(dist > tol & iter < maxiter){
    v_curr = v_new
    v_new = my.eigen_step(A = A, v_curr = v_curr)
    dist = my.euclideanNorm(v_curr - v_new)
    iter = iter + 1
  }
  v_new = round(v_new, 8)
  print(paste("Total iterations:", iter))
  # Step 5: eigenvalue from the entry of v_new with the largest absolute value
  w = A %*% v_new
  i = which.max(abs(v_new))
  lambda = w[i] / v_new[i]
  return(list(value = lambda, vector = v_new))
}

while() loop:

  1. Create a while() loop which stops only if iter surpasses maxiter or dist < tol. In this while loop, update v_curr <- v_new, and update v_new to be the next step of the iterative method. In addition, remember to update iter and dist. At the end, iter should count the total number of iterations performed, and the vector v_new should contain the eigenvector corresponding to the largest absolute eigenvalue.

After while() loop:

  1. Use the round() function on the vector v_new to round its entries to 8 decimal digits.
  2. Once the while() loop is over, print a message indicating the number of iterations performed. For example, if iter has value 43, print "Total iterations: 43."
  3. Calculate the vector \(w = A\cdot v_{new}\). Find \(i\), the entry of \(v_{new}\) with the largest absolute value. Then, the eigenvalue is given by \(\lambda = \frac{w_i}{[v_{new}]_i}\).
  4. Return a named list object, with the first element named value containing the found eigenvalue, and the second element named vector containing the estimated eigenvector.

1(E) A case in which finding the eigenvalues and eigenvectors of a matrix A is easy is when the matrix is square and diagonal, meaning that \(A_{ij}= 0\) for all \(i\neq j\). In this case, the eigenvalues/eigenvectors are given by:

\[ \lambda_i = A_{ii},\hspace{10mm} v_i = (0,0,\ldots,0,\underbrace{1}_{\text{ith-entry}},0,\ldots,0)^T \] (I recommend you spend some time using the matrix multiplication formulas to verify that this is the case)

An easy way to create a diagonal matrix is with the function diag(), which receives a vector and creates a diagonal matrix with the values of the vector on the diagonal. For example, diag(c(1,2,3)) creates the matrix: \[ M = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3\end{pmatrix}.\]
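For reference, running this call in the console reproduces the matrix above:

diag(c(1, 2, 3))
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3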

Create matrices \(A\), \(B\), \(C\) and \(D\) with diagonals:

  1. x_A = c(1, 2, 3, 4, 5)
  2. x_B = c(-5, 4, -3, 2, -1)
  3. x_C = c(1, 1, 1)
  4. x_D = c(-1, -1, -1)

Calculate the eigenvalue/eigenvectors of each matrix. Then, comment below: did the algorithm give the output you were expecting? Are all the resulting outputs eigenpairs of the matrices? Why?


Answer here:

x_A = diag(c(1, 2, 3, 4, 5))
my.eigen(x_A)
## [1] "Total iterations: 57"
x_B = diag(c(-5, 4, -3, 2, -1))
my.eigen(x_B)
## [1] "Total iterations: 100"
x_C = diag(c(1, 1, 1))
my.eigen(x_C)
## [1] "Total iterations: 2"
x_D = diag(c(-1, -1, -1))
my.eigen(x_D)
## [1] "Total iterations: 100"

Exercise 2: Principal Component Analysis in 2D

One of the most common techniques for performing multivariate data visualization is Principal Component Analysis (PCA). Our human spatial intuition can easily visualize objects in 2 or 3 dimensions (so, 2 or 3 variables simultaneously), but a static graph of more than 3 variables becomes hard to interpret. PCA is a useful technique for finding the most representative directions of the data, so we can reduce, say, 10 variables to a 2-dimensional or 3-dimensional scatterplot.

In this exercise, we visualize the meaning of PCA for a dataset with 2 variables.

2(A) In the following chunk, there is code that generates a 2-dimensional dataset with \(p = 2\) variables and \(n = 30\) observations. After the provided chunk, generate a scatterplot with:

  1. Variable 1 in the x-axis,
  2. Variable 2 in the y-axis,
  3. The x-limits and y-limits must both be c(-8,8),
  4. Add a red vertical line at 0 and a red horizontal line at 0,
  5. Color with blue and pch = 15 the observations c(14, 72, 131), and
  6. Choose an appropriate title for the plot.

Answer the following questions: Are the two variables related? If they are related, is the relationship linear? Is the relationship positive or negative?

###############################################
## Do not modify this part of the code.
###############################################

p <- 2
n <- 30

set.seed(123)
V <- matrix(c( c(3,2)/sqrt(13) , c(-2,3)/sqrt(13)  ), ncol = 2)
Svals <- diag(c(9,0.5))

X <-  matrix(rnorm(n * p), ncol = p) %*% sqrt(Svals) %*% t(V)
X <- scale(X, center = TRUE, scale = FALSE)

###############################################
## After this, add your code:
###############################################

## Your code here:
plot(x= X, main=  "Scatterplot", col= "blue",pch= 15, xlim=c(-8,8), ylim=c(-8,8))

abline(v = 0, col="red")
abline(h = 0, col="red")


Answer here: It looks like there is a positive, roughly linear relationship between the two variables. —

2(B) Use the function cor to calculate the correlation between the 1st and 2nd variable in the matrix X. Does this confirm your observations in the previous part? Comment below.

cor(X)
##           [,1]      [,2]
## [1,] 1.0000000 0.9085112
## [2,] 0.9085112 1.0000000

Answer here: Yes, the correlation of about 0.91 confirms a strong positive linear relationship. —

2(C) Create a matrix cov_mat containing the matrix \(\frac{1}{n - 1} \cdot X ^T \cdot X\), where \(\cdot\) represents the matrix product. This matrix contains information about the variance/covariance behavior in the dataset. This must be a 2 x 2 matrix, so it is square. Print the resulting matrix. Is it symmetric? Use the function my.isSymmetric() from part 1(A).

Use the function eigen() which is part of the base programming of R to obtain the eigenvalues and eigenvectors of the matrix cov_mat. Save the largest eigenvalue in a variable val1 and its corresponding top eigenvector in a variable vec1.

To ensure you use the function eigen() correctly, read the documentation provided by ?eigen. Calculate the norm of the vector vec1. If you want to double check how to use it, feel free to compare the results of eigen() with the results of your function my.eigen().

n = nrow(X)  # number of observations
cov_mat = 1/(n-1) * t(X) %*% X
my.isSymmetric(cov_mat)
## [1] TRUE
PCA_results = eigen(cov_mat)
PCA_results
## eigen() decomposition
## $values
## [1] 8.6707892 0.3397214
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.8498306  0.5270560
## [2,] -0.5270560 -0.8498306
val1 = PCA_results$values[1]     # largest eigenvalue
vec1 = PCA_results$vectors[, 1]  # corresponding top eigenvector
my.euclideanNorm(vec1)           # eigen() returns unit-norm eigenvectors
## [1] 1

2(D) In PCA analysis, the top eigenvector of cov_mat represents the direction of highest variation in the data. Since the vector vec1 has norm 1, it only contains information about the direction, but not about the magnitude of variation in this direction. The information regarding the size of variation in this direction is captured by the top eigenvalue, which represents the variance explained by the direction of highest variation. From this, sqrt(val1) represents the standard deviation of the data in the direction of vec1.

To visually confirm this, in this part we create a plot that displays the direction/size of variation of our data described by the top eigenvalue/eigenvector in the data.

Recreate the plot of the data with all the elements you added in part 1(a). Now, in addition to what you have added, use the function segments() to create a "blue" line segment that connects the 2-dimensional points p0 = (x0, y0) = - 2 * sqrt(val1) * vec1 and the second point p1 = (x1, y1) = 2 * val1 * vec1, with line thickness lwd = 2. Use ?segments to learn how to use this function.

After creating the plot, comment below: How does the line segment relate to the association of the dataset? Do you consider that vec1 represents a direction in which the dataset varies highly? Is sqrt(val1) representing the level of variation in this direction?

## Your code here:

plot(x= X, main=  "Scatterplot", col= "blue",pch= 15, xlim=c(-8,8), ylim=c(-8,8))

p0 = -2 * sqrt(val1) * vec1
p1 = 2 * val1 * vec1
abline(v = 0, col="red")
abline(h = 0, col="red")
segments(x0 = p0[1], y0 = p0[2], x1 = p1[1], y1 = p1[2],col = "blue",lwd = 2)


Answer here: The blue segment lies along the direction of strongest association in the data. Yes, vec1 points in the direction in which the dataset varies the most, and sqrt(val1) reflects the size of the variation in that direction. —

2(E) So far, we have visualized the relationship between the top eigenvector of cov_mat and the variation of the dataset X. Now, we study the relationship between the 2nd (and last) eigenvalue/eigenvector of cov_mat and the variation in X. Since this is the smallest eigenvalue, this direction represents the direction of least variation in the data.

To visually confirm this, in this part we create a plot that displays both the direction/size of variation of our data described by the top and bottom eigenvalues/eigenvectors.

For this, save the 2nd eigenvalue and eigenvector of cov_mat in val2 and vec2. Then, create a new version of the plot you created in part 2(D), adding now an additional black segment connecting the opposite points p0 = (x0, y0) = - 2 * sqrt(val2) * vec2 and the second point p1 = (x1, y1) = 2 * val2 * vec2, with line thickness lwd = 2.

After creating the plot, comment below: How does the 2nd line segment relate to the association of the dataset? Do you consider that vec2 represents a direction of low variability of the data? Is sqrt(val2) representing the level of variation in this direction?

## Your code here:
plot(x= X, main=  "Scatterplot", col= "blue",pch= 15, xlim=c(-8,8), ylim=c(-8,8))
val1= PCA_results$values[1]
vec1= PCA_results$vectors[,1]
val2= PCA_results$values[2]
vec2= PCA_results$vectors[,2]
p0 = -2*sqrt(val1)*vec1
p1 = 2*(val1)*vec1
abline(v = 0, col="red")
abline(h = 0, col="red")
segments(x0 = p0[1], y0 = p0[2], x1 = p1[1], y1 = p1[2],col ="blue",lwd = 2)
p0 = -2*sqrt(val2)*vec2
p1 = 2*(val2)*vec2
segments(x0 = p0[1], y0 = p0[2], x1 = p1[1], y1 = p1[2],col = "black",lwd = 2)


Answer here: The 2nd segment lies along the direction in which the data spreads the least. Yes, vec2 represents a direction of low variability of the data, and sqrt(val2) reflects the (small) amount of variation in this direction. —

In future assignments, we will further explore how to use PCA as a method for visualizing numerical data when the number of variables p > 3.


Exercise 3: Verifying Confidence Interval Coverage with Simulations

When doing statistical analysis, constructing confidence intervals for sample means is one of the most essential procedures. Say that we have a population with distribution \(P\), with population mean \(\mu\) and standard deviation \(\sigma\). Often, we don’t have access to information from the entire population, so we randomly sample \(n\) independent observations from this population \(X_1,X_2,X_3,\ldots,X_n\sim P\), and calculate its sample mean and sample standard deviation,

\[\bar{x}_n := \frac{1}{n}(X_1 + X_2 +\ldots + X_n), \hspace{10mm} \hat{s}_n := \sqrt{\frac{1}{n-1}\bigg( (X_1-\bar{x}_n)^2 + \ldots + (X_n - \bar{x}_n)^2 \bigg)}.\]

Since both \(\bar{x}_n\) and \(\hat{s}_n\) are derived from the sample, they are random quantities that help approximate \(\mu\) and \(\sigma\) respectively. Due to this randomness, we know that the exact equality \(\bar{x} = \mu\) is rather unlikely.

Confidence intervals help us determine a range within which it is highly likely that the true population mean \(\mu\) lies, given the available information we have. More specifically, a confidence interval with confidence level 0.95 is a data-generated interval which has a 95% chance of containing the population mean \(\mu\). For given values of \(\bar{x}_n\) and \(\hat{s}_n\), the \(0.95\)-confidence interval is given by: \[ CI(95\%) = \bar{x}_n \pm T_{n-1}(0.975)\times \frac{\hat{s}_n}{\sqrt{n}},\] where \(T_{n-1}(0.975)\) is the 97.5% percentile of the T-distribution with \(n-1\) degrees of freedom.

More generally, a confidence interval of confidence level \(1-\alpha\) is an interval which has probability \(1-\alpha\) of containing the population mean \(\mu\). For given values of \(\bar{x}_n\) and \(\hat{s}_n\), the \((1-\alpha)\)-confidence interval is given by: \[ CI(1-\alpha) = \bar{x}_n \pm T_{n-1}\bigg(1 -\frac{\alpha}{2}\bigg)\times \frac{\hat{s}_n}{\sqrt{n}},\] where \(T_{n-1}\bigg(1 - \frac{\alpha}{2}\bigg)\) is the \((1 - \frac{\alpha}{2})\)-percentile of the T-distribution with \(n-1\) degrees of freedom.
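In R, the T-distribution percentile in these formulas is given by qt(); for example, with \(n = 84\) and \(\alpha = 0.05\):

qt(p = 1 - 0.05/2, df = 84 - 1)   # 97.5% percentile of the T-distribution, roughly 1.99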

In this exercise, we perform simulations that help us better understand how confidence intervals help find an interval range that is highly likely to contain the population mean of a population.

3(A) Import the dataset cdc.csv. This dataset consists of 20,000 measurements of clinical variables. In this analysis, we are interested in analyzing the variable height. We consider these 20,000 observations as our total population.

Create a vector heights which contains all of the 20,000 observations of this variable in the cdc dataset. Then, create a variable mu, which contains the mean of height, and a variable sigma containing the standard deviation of all the observations.

Print the following quantities to console:

  1. The value of mu.

  2. The standard deviation sigma.

  3. What is the average height of individuals in the common feet + inches notation? Print as a vector of 2 entries with feet and inches as its entries.

  4. What proportion of individuals in the data are taller than 6 ft?

  5. What proportion of individuals in the data are between or equal to 5 and 6 ft tall?

  6. What proportion of individuals are strictly taller than mu + 2 * sigma?

Add text to each of your prints describing what each printed quantity represents. For example, for the first, you could print "Population Mean = *insert value here*".

cdc <- read.csv("cdc.csv")
## Your code here:
heights = cdc$height
mu = mean(heights) ##population mean= 67.1829
sigma = sd(heights)
print(paste("population mean=", mu))
## [1] "population mean= 67.1829"
print(paste("population sigma=", sigma))
## [1] "population sigma= 4.12595428536699"
vec_height= c(floor(mu/12),mu%%12)
print(paste("average height=",vec_height[1],"feet", vec_height[2], "inches"))
## [1] "average height= 5 feet 7.1829 inches"
mean(heights>72) 
## [1] 0.10275
print(paste("proportion of heights > 72 inches : ",mean(heights>72)))
## [1] "proportion of heights > 72 inches :  0.10275"
mean(heights>=60 & heights<=72)
## [1] 0.88325
print(paste("proportion of heights between 5 and 6feet tall : ",mean(heights>=60 & heights<=72)))
## [1] "proportion of heights between 5 and 6feet tall :  0.88325"
mean( heights>(mu+2*sigma))
## [1] 0.01725
print(paste("proportion heights that are taller than 2 standard deviations from the mean : ", mean( heights>(mu+2*sigma))))
## [1] "proportion heights that are taller than 2 standard deviations from the mean :  0.01725"

3(B) As mentioned before, we assume that we do not have access to the full 20,000 observations of the dataset. Instead, imagine that we only have access to a random sample of 84 observations from this total dataset. We are interested in exploring how good an approximation \(\bar{x}_{84}\) is to the true population mean \(\mu\) we calculated before. We here construct a 0.95-confidence interval using this \(n=84\) random sample.

Use the function sample() to gather a sample of 84 observations from the vector heights without replacement. Then, calculate the sample mean xbar and sample standard deviation sdhat.

Use a base-R function to recover the 0.975-percentile of the T-distribution. Finally, use the information gathered to define a vector CI containing the lower/upper limits of the 95% confidence interval for the sample gathered. Do the following:

  1. Print "0.95-Confidence Interval: ( *lower-lim* , *upper-lim* )".
  2. Print True mean : *value of mu*.
  3. Print mu within the 0.95-CI: *answer*.
#######################################
## DO NOT MODIFY
set.seed(100)
#######################################
#######################################


## Your code here:
n = 84
height_exp = sample(x = heights, size = n, replace = FALSE)
xbar = mean(height_exp)
sdhat = sd(height_exp)
print(height_exp)
##  [1] 70 76 70 63 64 65 68 62 71 67 73 65 70 60 63 68 63 66 72 72 76 66 66 65 71
## [26] 63 62 73 72 69 70 64 72 65 67 66 63 72 63 63 73 63 67 62 63 69 66 62 70 66
## [51] 66 60 70 64 64 71 68 70 63 65 76 63 70 67 66 71 68 67 61 71 70 69 72 64 69
## [76] 64 67 63 60 67 65 67 62 70
print(xbar)
## [1] 66.9881
print(sdhat)
## [1] 3.922423
height_upper = xbar + qt(p = .975, df = n - 1) * sdhat / sqrt(n)
height_lower = xbar - qt(p = .975, df = n - 1) * sdhat / sqrt(n)
CI = c(height_lower, height_upper)   # lower/upper limits of the 95% confidence interval
print(paste("0.95-Confidence Interval:(", CI[1], ",", CI[2], ")"))
## [1] "0.95-Confidence Interval:( 66.1368773606768 , 67.8393131155137 )"
print(paste("True mean : ", mu))
## [1] "True mean :  67.1829"
print(paste("mu within the 0.95-CI :", ifelse(mu >= CI[1] & mu <= CI[2], "Yes", "No")))
## [1] "mu within the 0.95-CI : Yes"

3(C) Let us now study the following claim: A 0.95-confidence interval is a random interval, which has a 95% chance of containing the true value of the population mean. For this, we simulate 1000 different confidence intervals, and we see what proportion of these intervals contain the true mean mu.

  1. Create a 1000 x 2 matrix CIs. Each row will correspond to a different simulation of a confidence interval.

  2. For each row index ind, generate a random sample of \(n = 84\) observations from the vector heights without replacement.

  3. For each of these samples, calculate the confidence interval and save its values in the row CIs[ind, ].

  4. Calculate the proportion of generated intervals that contains the true mean mu. Print it as "Proportion of CIs containing mu = *calculated proportion*".

  5. Comment below: what does this proportion represent? Answer below.

CIs = matrix(-1, nrow = 1000, ncol = 2)
set.seed(465)
for(ind in 1:1000){
  height_exp = sample(x = heights, size = 84, replace = FALSE)
  xbar= mean(height_exp)
  sdhat= sd(height_exp)
  n=84
  height_upper = xbar + qt(p = .975, df = 83)*sdhat/sqrt(n)
  height_lower = xbar - qt(p = .975, df = 83)*sdhat/sqrt(n)
  CIs[ind,] = c(height_lower, height_upper)
}
mean(CIs[,1] < mu & CIs[,2] > mu)
## [1] 0.941
print(paste("proportion of CIs containing mu = ", mean(CIs[,1] < mu & CIs[,2] > mu)))
## [1] "proportion of CIs containing mu =  0.941"

Answer here: The proportion represents the fraction of the 1000 simulated confidence intervals that contain the true mean mu; as expected, it is close to the nominal 95% coverage. —

3(D) Now, lets visualize how the random confidence intervals you generated distribute when compared to the true mean mu. For this, create the following plot:

  1. Create a 100 x 2 matrix CIs_red that contains the first 100 confidence intervals you calculated before.

  2. Create a vector of length 100, CIs_cols which assigns color “blue” to intervals that contain mu, and “red” to the intervals that do not contain mu.

  3. Create a plot with limits xlim = c(0,100) and ylim = c(64,70), and a horizontal line on the value mu. (Hint: use argument type = "l").

  4. Using the function segments(), add to this plot the confidence intervals contained in the matrix CIs_red. Display it such that the confidence interval CIs_red[ind, ] is displayed vertically, with coordinates (x0,y0) = (ind, CIs_red[ind,1]) and (x1,y1) = (ind, CIs_red[ind,2]). In addition, color the interval "red" if the interval does not contain mu, and "blue" if it contains mu.

  5. Using print, calculate how many intervals in the plot do not contain the true mean mu. Add text to provide description in your print.

CIs_red = CIs[1:100,]
CIs_cols = ifelse(CIs[,1] < mu & CIs[,2] > mu, "blue", "red") 
plot(y = rep(mu,101), x= 0:100 ,xlim= c(0,100), ylim= c(64,70), type= "l")
for(ind in 1:100){
  segments(x0 = ind,x1= ind ,y0 = CIs_red[ind,1],y1 =CIs_red[ind,2] ,col= CIs_cols[ind])

}

mean(CIs_cols == "red")
## [1] 0.059
print(paste("Proportion of the 1000 simulated intervals not containing mu =", mean(CIs_cols == "red")))
## [1] "Proportion of the 1000 simulated intervals not containing mu = 0.059"

3(E) Now, use a similar procedure from 3(C) to calculate a (100 x 2) matrix CIs_n500 which contains 100 random 0.95-confidence intervals, now with a sample size of \(n = 500\). Then, follow the same steps from 3(D) to generate a confidence interval plot for these new intervals. Answer:

  1. How do the 0.95-confidence intervals with \(n=500\) compare to those of \(n = 84\)?

  2. Are there benefits to having a larger sample size in estimating the mean mu?

CIs = matrix(-1, nrow = 100, ncol = 2)
set.seed(465)
n=500
for(ind in 1:100){
  height_exp = sample(x = heights, size = n, replace = FALSE)
  xbar= mean(height_exp)
  sdhat= sd(height_exp)
  height_upper = xbar + qt(p = .975, df = n-1)*sdhat/sqrt(n)
  height_lower = xbar - qt(p = .975, df = n-1)*sdhat/sqrt(n)
  CIs[ind,] = c(height_lower, height_upper)
}
mean(CIs[,1] < mu & CIs[,2] > mu)
## [1] 0.94
print(paste("proportion of CIs containing mu = ", mean(CIs[,1] < mu & CIs[,2] > mu) ))
## [1] "proportion of CIs containing mu =  0.94"
CIs_red = CIs[1:100,]
CIs_cols = ifelse(CIs[,1] < mu & CIs[,2] > mu, "blue", "red") 
plot(y = rep(mu,101), x= 0:100 ,xlim= c(0,100), ylim= c(64,70), type= "l")

for(ind in 1:100){
  segments(x0 = ind,x1= ind ,y0 = CIs_red[ind,1],y1 =CIs_red[ind,2] ,col= CIs_cols[ind])
  
}

3(F) Now, use a similar procedure from 3(C) to calculate a (100 x 2) matrix CIs_c99 which contains 100 random 0.99-confidence intervals, now with a sample size of \(n = 84\). Then, follow the same steps from 3(D) to generate a confidence interval plot for these new intervals. Answer:

  1. How do the 0.99-confidence intervals with \(n=84\) compare to those of \(n = 84\) with confidence 0.95?

  2. What are the benefits of increasing the confidence of an interval? Are there any drawbacks?

CIs = matrix(-1, nrow = 100, ncol = 2)
set.seed(465)
n=84
for(ind in 1:100){
  height_exp = sample(x = heights, size = n, replace = FALSE)
  xbar= mean(height_exp)
  sdhat= sd(height_exp)
  height_upper = xbar + qt(p = .995, df = n-1)*sdhat/sqrt(n)
  height_lower = xbar - qt(p = .995, df = n-1)*sdhat/sqrt(n)
  CIs[ind,] = c(height_lower, height_upper)
}
mean(CIs[,1] < mu & CIs[,2] > mu)
## [1] 0.99
print(paste("proportion of CIs containing mu = ", mean(CIs[,1] < mu & CIs[,2] > mu) ))
## [1] "proportion of CIs containing mu =  0.99"
CIs_red = CIs[1:100,]
CIs_cols = ifelse(CIs[,1] < mu & CIs[,2] > mu, "blue", "red") 
plot(y = rep(mu,101), x= 0:100 ,xlim= c(0,100), ylim= c(64,70), type= "l")

for(ind in 1:100){
  segments(x0 = ind,x1= ind ,y0 = CIs_red[ind,1],y1 =CIs_red[ind,2] ,col= CIs_cols[ind])
}


Answer here:


Exercise 4: Testing Significance in Linear Regression

Linear regression is a useful tool to examine the relationship between two variables, specifically how an independent variable or predictor (i.e. \(X\)) affects the dependent variable (i.e. \(Y\)). We can also examine the relationships between multiple \(X\) and \(Y\), but we will focus our attention on using only one predictor. Given \(n\) observations of the form \((x_1, y_1),(x_2, y_2),\ldots, (x_n, y_n)\), a linear regression model assumes \(x_i\) and \(y_i\) are related through the linear equation: \[ y_i = \beta_2 x_i + \beta_1 + \epsilon_i, \hspace{5mm} i = 1,2,\ldots,n \] where \(\beta_2\) represents the effect that the variable \(X\) has on the dependent variable \(Y\), \(\beta_1\) is the \(y\)-intercept, and \(\epsilon_1,\epsilon_2,\ldots, \epsilon_n\) represent the error terms, each with standard deviation \(\sigma\). By defining the error vector \(\vec{\epsilon} = (\epsilon_1,\epsilon_2,\ldots, \epsilon_n)^T\), we can condense it into the vectorized expression \[ Y = \beta_2 X + \beta_1 + \epsilon. \] We can further simplify the expression to matrix form. We define the \(2 \times 1\) column vector \(\vec{\beta} = (\beta_1, \beta_2)^T\), the \(n\times 2\) matrix \(\mathbf{X} = [\vec{1}, X]\), where \(\vec{1} = (1,1,...,1)^T\), and the \(n\times 1\) response vector \(Y\). Then, the expression is equivalent to,

\[ Y = \mathbf{X} \cdot \vec{\beta} + \vec{\epsilon}, \]

where \(\vec{\beta}\) represents the change in \(Y\) for each \(X\) and includes the y-intercept, while \(\vec{\epsilon}\) represents the error between the true value and predicted value, and \(\cdot\) is the matrix multiplication operation. (I suggest you spend some time doing the matrix multiplication to verify that these three expressions represent the same model).

When doing linear regression, our goal is to find an estimate for \(\vec{\beta}\) that reduces the sum of squares between the true value \(Y\) and its linear approximation \(\hat{Y} = \hat\beta_2 X + \hat\beta_1\), better known as least-squares approximation. The least-squares estimator of \(\vec{\beta}\) can be calculated by the following: \[ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y \]

If a predictor is deemed statistically significant, we expect the \(\hat{\beta}\) coefficient related to the predictor in question to be substantially different from 0. We will walk through a problem covering both the manual and built-in R calculations for linear regression.

4(A) Load the following dataset into your environment, which details Major League Baseball player hitting statistics for the 2010 season. Create a scatterplot between the number of walks and number of runs, labeled as walk and run. Make sure the following elements are included:

  • walk as your independent variable.
  • Appropriate title and axis labels.
  • Colors based on their teams.

Afterward, describe the relationship between the two variables. Is there a linear relationship?

source("https://www.openintro.org/data/R/mlbbat10.R")

Answer here:


4(B) Similar to the work done in problem #5 in HW2, calculate the linear model coefficients \(\hat{\beta}\) by hand using matrix multiplications. Make sure you obtain a 2-D vector. Recreate the scatterplot, now with the least-squares line. Does it fit the data well?
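A minimal sketch of the by-hand calculation (reusing the hypothetical mlbbat10 names from the 4(A) sketch; y, Xmat and beta_hat are illustrative variable names):

y = mlbbat10$run
Xmat = cbind(1, mlbbat10$walk)                        # n x 2 design matrix [1, X]
beta_hat = solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% y  # least-squares estimate, a 2 x 1 vector
beta_hat
plot(x = mlbbat10$walk, y = y, xlab = "Walks", ylab = "Runs",
     main = "Runs vs. Walks with Least-Squares Line")
abline(a = beta_hat[1], b = beta_hat[2], col = "red", lwd = 2)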


Answer here:


4(C) In a linear regression, we want to calculate the size of the errors using the following formula: \[ \hat{\sigma} = \sqrt{\frac{RSS}{n-2}} \]

The residual sum of squares (RSS) can be found with the following formula: \[ RSS = \sum_{i = 1}^n (Y_i - \hat{Y}_i)^2 \]

Do the following:

  • Find \(\hat{Y}\) manually using the results from part (B). Hint: Recall that \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\).
  • Find the residual sum of squares (RSS) manually and report the value.
  • Find the size of the error \(\hat{\sigma}\) and report the value.
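A minimal sketch, continuing with the names introduced in the 4(B) sketch (y, Xmat, beta_hat):

y_hat = Xmat %*% beta_hat               # fitted values Y-hat
RSS = sum((y - y_hat)^2)                # residual sum of squares
n = length(y)
sigma_hat = sqrt(RSS / (n - 2))         # estimated size of the errors
print(paste("RSS =", RSS))
print(paste("sigma_hat =", sigma_hat))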

4(D) Usually, we also want to calculate the proportion of variance explained by the linear model as a value from 0 to 1. The closer to 1, the more of the variation in \(Y\) can be explained by \(X\). This can be done with the following formula: \[ R^2 = \frac{(n - 1) * var(y) - RSS}{(n - 1) * var(y)}. \]

Manually, find \(R^2\) and print the result. Is this a high number?
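A minimal sketch, continuing from the 4(C) sketch (y, RSS, n):

TSS = (n - 1) * var(y)                  # total variation in y
R2 = (TSS - RSS) / TSS                  # proportion of variance explained
print(paste("R^2 =", R2))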


Answer here:


4(E) Usually, in a linear regression problem, one wants to test \(H_0: \beta_1 = 0\) vs. \(H_a: \beta_1 \neq 0\). To test this, calculate the following test statistic: \[ T = \frac{\hat{\beta_1}}{\hat{\sigma} \cdot \sqrt{[(X^T X)^{-1}]_{22}}}. \]
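A minimal sketch, continuing from the earlier sketches (beta_hat, sigma_hat, Xmat); here \(\hat{\beta}_1\) is the slope coefficient, i.e. the second entry of beta_hat:

XtX_inv = solve(t(Xmat) %*% Xmat)
se_slope = sigma_hat * sqrt(XtX_inv[2, 2])   # standard error of the slope
T_stat = beta_hat[2] / se_slope
print(paste("T statistic =", T_stat))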

4(F) If the null hypothesis holds, we expect \(T \sim T_{n-2}\). Use this to calculate the \(p\)-value for this test.
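A minimal sketch, continuing from the 4(E) sketch (T_stat, n):

p_value = 2 * pt(abs(T_stat), df = n - 2, lower.tail = FALSE)   # two-sided p-value under T_{n-2}
print(paste("p-value =", p_value))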

4(G) Linear regression can be performed automatically in R, using the function lm(). Use the function lm() on the two variables to calculate the linear regression coefficients and assign it to mod. Display the table of results with the function summary() on mod. Do the results you obtained with the linear regression in R match those you obtained with your by-hand calculations? Is the linear relationship significant at an \(\alpha\) level of 0.05?
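A minimal sketch using the built-in lm() function (again assuming the mlbbat10 data frame from the 4(A) sketch):

mod = lm(run ~ walk, data = mlbbat10)
summary(mod)   # compare the coefficients, sigma-hat, R^2, and p-value with the manual results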


Answer here: