(josesa@ucr.edu) or Jericho Lawson (jlaws011@ucr.edu) if you have any questions, using “[STAT 107]” in the subject line.

Often, when analyzing data, we need to analyze the structure of numeric symmetric matrices. We say that a matrix A is symmetric if it satisfies the following 2 properties:

- It has the same number of rows and columns, i.e. nrow(A) == ncol(A), and
- For any entry [i, j], we have that A[i, j] == A[j, i] is true.
A crucial feature of numeric symmetric matrices is that they have eigenvalues and eigenvectors. Consider a symmetric matrix \(A\in\mathbb{R}^{p\times p}\) of dimension \(p\times p\). Let \(\lambda \in\mathbb{R}\) be a numeric constant and let \(v\in\mathbb{R}^{p\times 1}\) be a \(p\)-dimensional column vector. We say that \((\lambda, v)\) is an eigenvalue-eigenvector pair for the matrix \(A\) if \[ A\cdot v = \lambda v,\] where \(A\cdot v\) is the usual matrix-vector multiplication. Notice that, for an eigenvector \(v\), the matrix-vector multiplication simplifies to multiplying each entry of the vector by a constant. In this exercise, we create a function for calculating the eigenvalue of largest absolute value of a symmetric matrix A, together with its corresponding eigenvector. We will do this by breaking the algorithm into steps, and then joining them together to create a more complex function.

In the following, we use \(\|x\|_2\) to denote the Euclidean norm of a vector, given by \[\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \ldots + x_p^2}.\] In particular, for two \(p\)-dimensional vectors \(x,y\in\mathbb{R}^{p}\), the Euclidean distance between them is given by \(\|x - y\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_p - y_p)^2}\).
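As a quick numerical illustration of the eigenpair definition (the small matrix below is only an example chosen for this note, not part of the assignment), multiplying an eigenvector by \(A\) simply rescales it:

A_example = matrix(c(2, 1,
                     1, 2), nrow = 2)   # a small symmetric matrix
v_example = c(1, 1) / sqrt(2)           # unit-norm eigenvector of A_example
A_example %*% v_example                 # equals 3 * v_example, so (3, v_example) is an eigenpair
3 * v_example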
A simple algorithm for finding the eigenpair associated with the largest absolute eigenvalue is to use an iterative method, similar to the one we explored in DA6 for calculating the square root. In this case, the algorithm takes as inputs a matrix \(A\), a tolerance level \(\varepsilon > 0\) and a maximum number of iterations \(N\). The algorithm consists of the following steps:

Step 1: Check if the matrix A is symmetric. If not, stop the procedure.

Step 2: Randomly select a \(p\)-dimensional initial vector \(v_{0}\in\mathbb{R}^{p}\).

Step 3: Starting with \(v_i = v_0\), define the next iteration \(v_{i+1}\) with the following formula: \[ v_{i+1} = \frac{A\cdot v_i}{\|A\cdot v_{i}\|_2}\]

Step 4: Stop the algorithm once the number of performed iterations reaches the limit \(N\), or once the distance between the last 2 updates is below the tolerance level: \(\|v_{i} - v_{i-1}\|_2 < \varepsilon\). Say the algorithm stops at the \(n\)-th iteration \(v_n\).

Step 5: After the iterations, our estimate of the eigenvector is \(v_n\), and the eigenvalue is given by \(\lambda = \pm\|A\cdot v_n\|_2\), where the sign depends on whether \(A\) flips the sign of \(v_n\) or not.
1(A) Create a function my.isSymmetric(A, tol = 0.000001) which takes a matrix A and a tolerance parameter tol = 0.000001. The function determines whether the matrix is symmetric, with error tolerance up to the parameter tol: for every entry A[i, j], the difference with its symmetric counterpart A[j, i] must have absolute value less than the established tolerance. Return TRUE if the matrix is symmetric, and FALSE if the matrix is not symmetric. (Hint: you can use the transpose function t().)
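A minimal sketch of what my.isSymmetric could look like, following the description above and the transpose hint (the test calls that produced the outputs below are not shown):

my.isSymmetric = function(A, tol = 0.000001){
  # a matrix with different numbers of rows and columns cannot be symmetric
  if (nrow(A) != ncol(A)) return(FALSE)
  # every entry must match its transposed counterpart up to the tolerance
  return(all(abs(A - t(A)) < tol))
}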
## [1] TRUE
## [1] FALSE
## [1] FALSE
1(B) Create a function
my.euclideanNorm(x = 0) which calculates the Euclidean
distance of a \(p\)-dimensional vector
x to the zero vector. Remember, for any p-dimensional
vector \(x\), the Euclidean norm is
given by \(\|x\|_2 = \sqrt{x_1^2 + x_2^2 + ...
+ x_p^2}.\)
my.euclideanNorm = function(x = 0){
  # square root of the sum of squared entries
  return(sqrt(sum(x^2)))
}
## Your code here:
##
##
##
#######################################
#######################################
## Do not modify code below this line:
my.euclideanNorm(c(1,0,0))
## [1] 1
my.euclideanNorm(c(0,0,0,5))
## [1] 5
my.euclideanNorm(c(3,4))
## [1] 5
1(C) Create a function my.eigen_step(A, v_curr) which performs a single updating step of the iterative algorithm. For this, simply compute w = A %*% v_curr and return w divided by its Euclidean norm, w / my.euclideanNorm(w).
my.eigen_step = function(A, v_curr){
  w = A %*% v_curr
  # normalize w by its Euclidean norm
  v_new = w / my.euclideanNorm(w)
  return(v_new)
}
1(D) Create a function my.eigen(A, maxiter = 100, tol = 0.000001) where A is the matrix for which we aim to calculate eigenpairs, maxiter corresponds to the maximum number of iterations of the algorithm (100 by default) and tol corresponds to the tolerance level for convergence. A guiding breakdown of the function goes as follows.

Before the while() loop:

1. Check if the matrix is symmetric. If it is not symmetric, print "Error: matrix is not symmetric." and return FALSE, terminating the function call. If it is symmetric, define p = ncol(A) and continue.
2. Define v_curr <- rnorm(p). This automatically creates a random initialization vector.
3. Define v_new as the first update of the iterative algorithm. Define dist to be the distance between v_curr and v_new in Euclidean norm, and iter <- 1 as the current number of iterations performed.

my.eigen = function(A, maxiter = 100, tol = 0.000001)
{
if(!my.isSymmetric(A)){
print("Error: matrix is not symmetric")
return(FALSE)
}
p= ncol(A)
v_curr = rnorm(p)
v_new = my.eigen_step(A = A, v_curr = v_curr)
dist = sqrt(sum((v_curr-v_new)^(2)))
iter= 1
while( dist > tol & iter < maxiter){
v_curr = v_new
v_new = my.eigen_step(A = A, v_curr = v_curr)
dist = sqrt(sum((v_curr-v_new)^(2)))
iter = iter + 1
}
v_new = round(v_new,8)
print(paste("Total iterations:", iter))
w = A%*%v_new
# signed eigenvalue via the Rayleigh quotient t(v_new) %*% A %*% v_new, which for the
# unit vector v_new equals +/- ||A %*% v_new||; the original lambda = w %*% solve(v_new)
# errors because v_new is a p x 1 vector, not a square matrix
lambda = as.numeric(t(v_new) %*% w)
# check: A %*% v_new should be approximately lambda * v_new
return(list(value = lambda, vector = v_new))
}
During the while() loop:

1. Create a while() loop which stops only if iter surpasses maxiter or if dist < tol. Inside this loop, update v_curr <- v_new, and update v_new to be the next step of the iterative method. In addition, remember to update iter and dist. At the end, iter should count the total number of iterations, and the vector v_new should contain the eigenvector corresponding to the largest eigenvalue.

After the while() loop:

1. Use the round() function on the vector v_new to round its entries to 8 decimal digits.
2. Once the while() loop is over, print a message indicating the number of iterations performed. For example, if iter has value 43, print "Total iterations: 43".
3. Return a list object, with the first object named value indicating the found eigenvalue, and the second object named vector containing the estimated eigenvector.

1(E) A case in which finding the eigenvalues and eigenvectors of a matrix A is easy is when the matrix is square diagonal, meaning that \(A_{ij}= 0\) for all \(i\neq j\). In this case, the eigenvalues/eigenvectors are given by:
\[ \lambda_i = A_{ii},\hspace{10mm} v_i = (0,0,\ldots,0,\underbrace{1}_{\text{ith-entry}},0,\ldots,0)^T \] (I recommend you spend some time using the matrix multiplication formulas to verify that this is the case.)
An easy way to create a diagonal matrix is with the function diag(), which receives a vector and creates a diagonal matrix with the values of the vector as its diagonal. For example, diag(c(1,2,3)) creates the matrix: \[ M = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3\end{pmatrix}.\]
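As a quick check of the claim above (illustrative only, not required by the assignment), each standard basis vector is an eigenvector of a diagonal matrix:

M = diag(c(1, 2, 3))
v2 = c(0, 1, 0)   # 2nd standard basis vector
M %*% v2          # equals 2 * v2, so (2, v2) is an eigenpair of M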
Create matrices \(A\), \(B\), \(C\) and \(D\) with diagonals:

- x_A = c(1, 2, 3, 4, 5)
- x_B = c(-5, 4, -3, 2, -1)
- x_C = c(1, 1, 1)
- x_D = c(-1, -1, -1)

Calculate the eigenvalues/eigenvectors of each matrix. Then, comment below: did the algorithm give the output you were expecting? Are all the resulting outputs eigenpairs of the matrices? Why?
Answer here:
x_A= diag(c(1,2,3,4,5))
my.eigen(x_A)
## [1] "Total iterations: 57"
## Error in solve.default(v_new): 'a' (5 x 1) must be square
x_B = diag(c(-5, 4,-3, 2, -1))
my.eigen(x_B)
## [1] "Total iterations: 100"
## Error in solve.default(v_new): 'a' (5 x 1) must be square
x_C = diag(c(1, 1, 1))
my.eigen(x_C)
## [1] "Total iterations: 2"
## Error in solve.default(v_new): 'a' (3 x 1) must be square
x_D = diag(c(-1, -1, -1))
my.eigen(x_D)
## [1] "Total iterations: 100"
## Error in solve.default(v_new): 'a' (3 x 1) must be square
One of the most common techniques for performing multivariate data visualization is to use Principal Component Analysis (PCA). Our human spatial intuition can easily visualize objects in 2 or 3 dimensions (so, 2 or 3 variables simultaneously). In cases where we want to visualize more than 3 variables in a static graph, it becomes hard to visualize. PCA is a useful technique for finding the most representative directions of the data, so we can reduce 10 variables back to a 2-dimensional or 3-dimensional scatterplot.
In this exercise, we visualize the meaning of PCA for a dataset with 2 variables.
2(A) In the following chunk, there is code that generates a 2-dimensional dataset with \(p = 2\) variables and \(n = 30\) observations. After the provided chunk, generate a scatterplot of the data with:

- axis limits of c(-8,8),
- points in blue and pch = 15,
- the observations c(14, 72, 131), and

Answer the following questions: Are the two variables related? If they are related, is the relationship linear? Is the relationship positive or negative?
###############################################
## Do not modify this part of the code.
###############################################
p <- 2
n <- 30
set.seed(123)
V <- matrix(c( c(3,2)/sqrt(13) , c(-2,3)/sqrt(13) ), ncol = 2)
Svals <- diag(c(9,0.5))
X <- matrix(rnorm(n * p), ncol = p) %*% sqrt(Svals) %*% t(V)
X <- scale(X, center = TRUE, scale = FALSE)
###############################################
## After this, add your code:
###############################################
## Your code here:
plot(x= X, main= "Scatterplot", col= "blue",pch= 15, xlim=c(-8,8), ylim=c(-8,8))
abline(v = 0, col="red")
abline(h = 0, col="red")
Answer here: The two variables appear to be related; there is a positive, roughly linear relationship between the X and Y variables. —
2(B) Use the function cor to calculate
the correlation between the 1st and 2nd variable in the matrix
X. Does this confirm your observations in the previous
part? Comment below.
cor(X)
## [,1] [,2]
## [1,] 1.0000000 0.9085112
## [2,] 0.9085112 1.0000000
Answer here: Yes, the correlation of about 0.91 confirms a strong positive, linear relationship between the two variables. —
2(C) Create a matrix cov_mat containing the matrix \(\frac{1}{n - 1} \cdot X ^T \cdot X\), where \(\cdot\) represents the matrix product. This matrix contains information about the variance/covariance behavior in the dataset. It must be a 2 x 2 matrix, so it is square. Print the resulting matrix. Is it symmetric? Use the function my.isSymmetric() from part 1.
Use the function eigen(), which is part of base R, to obtain the eigenvalues and eigenvectors of the matrix cov_mat. Save the largest eigenvalue in a variable val1 and its corresponding top eigenvector in a variable vec1.
To ensure you use the function eigen() correctly, read
the documentation provided by ?eigen. Calculate the norm of
the vector vec1. If you want to double check how to use it,
feel free to compare the results of eigen() with the
results of your function my.eigen().
n = nrow(X) # number of rows
cov_mat = 1/(n-1)*t(X)%*%X
my.isSymmetric(cov_mat)
## [1] TRUE
eigen(cov_mat)
## eigen() decomposition
## $values
## [1] 8.6707892 0.3397214
##
## $vectors
## [,1] [,2]
## [1,] -0.8498306 0.5270560
## [2,] -0.5270560 -0.8498306
PCA_results= eigen(cov_mat)
2(D) In PCA analysis, the top eigenvector of cov_mat represents the direction of highest variation in the data. Since the vector vec1 has norm 1, it only contains information about the direction, but not about the magnitude of variation in this direction. The information regarding the size of variation in this direction is captured by the top eigenvalue, which represents the variance explained by the direction of highest variation. From this, sqrt(val1) represents the standard deviation of the data in the direction of vec1.
To visually confirm this, in this part we create a plot that displays the direction/size of variation of our data described by the top eigenvalue/eigenvector in the data.
Recreate the plot of the data with all the elements you added in part 2(A). Now, in addition to what you have added, use the function segments() to create a "blue" line segment that connects the 2-dimensional point p0 = (x0, y0) = - 2 * sqrt(val1) * vec1 and the second point p1 = (x1, y1) = 2 * val1 * vec1, with line thickness lwd = 2. Use ?segments to learn how to use this function.
After creating the plot, comment below: How does the line segment
relate to the association of the dataset? Do you consider that
vec1 represents a direction in which the dataset varies
highly? Is sqrt(val1) representing the level of variation
in this direction?
## Your code here:
plot(x= X, main= "Scatterplot", col= "blue",pch= 15, xlim=c(-8,8), ylim=c(-8,8))
# segment endpoints, using the (rounded) top eigenvalue/eigenvector from eigen(cov_mat) above
p0 = -2*sqrt(8.671)*c(-0.8498, -0.527)
p1 = 2*(8.671)*c(-0.8498, -0.527)
abline(v = 0, col="red")
abline(h = 0, col="red")
segments(x0 = p0[1], y0 = p0[2], x1 = p1[1], y1 = p1[2],col = "blue",lwd = 2)
Answer here: The blue segment runs along the main axis of the point cloud, in the same direction as the positive association seen in the scatterplot. Yes, vec1 represents a direction in which the dataset varies highly, and sqrt(val1) gives the standard deviation (level of variation) of the data along that direction. —
2(E) So far, we have visualized the relationship between the top eigenvector of cov_mat and the variation of the dataset X. Now, we study the relationship between the 2nd (and last) eigenvalue/eigenvector of cov_mat and the variation in X. Since this is the smallest eigenvalue, this direction represents the direction of least variation in the data.
To visually confirm this, in this part we create a plot that displays both the direction/size of variation of our data described by the top and bottom eigenvalues/eigenvectors.
For this, save the 2nd eigenvalue and eigenvector of cov_mat in val2 and vec2. Then, create a new version of the plot you created in part 2(D), adding now an additional black segment connecting the opposite points p0 = (x0, y0) = - 2 * sqrt(val2) * vec2 and the second point p1 = (x1, y1) = 2 * val2 * vec2, with line thickness lwd = 2.
After creating the plot, comment below: How does the 2nd line segment
relate to the association of the dataset? Do you consider that
vec2 represents a direction of low variability of the data?
Is sqrt(val2) representing the level of variation in this
direction?
## Your code here:
plot(x= X, main= "Scatterplot", col= "blue",pch= 15, xlim=c(-8,8), ylim=c(-8,8))
val1= PCA_results$values[1]
vec1= PCA_results$vectors[,1]
val2= PCA_results$values[2]
vec2= PCA_results$vectors[,2]
p0 = -2*sqrt(val1)*vec1
p1 = 2*(val1)*vec1
abline(v = 0, col="red")
abline(h = 0, col="red")
segments(x0 = p0[1], y0 = p0[2], x1 = p1[1], y1 = p1[2],col ="blue",lwd = 2)
p0 = -2*sqrt(val2)*vec2
p1 = 2*(val2)*vec2
segments(x0 = p0[1], y0 = p0[2], x1 = p1[1], y1 = p1[2],col = "black",lwd = 2)
Answer here: The black segment points along the direction of the 2nd eigenvector, roughly perpendicular to the main axis of the data. Yes, vec2 represents a direction of low variability, since the data spreads very little along that segment, and sqrt(val2) represents the (small) level of variation in that direction. —
In future assignments, we will further explore how to use PCA as a method for visualizing numerical data when the number of variables p > 3.
When doing statistical analysis, constructing confidence intervals for sample means is one of the most essential procedures. Say that we have a population with distribution \(P\), with population mean \(\mu\) and standard deviation \(\sigma\). Often, we don’t have access to information from the entire population, so we randomly sample \(n\) independent observations from this population \(X_1,X_2,X_3,\ldots,X_n\sim P\), and calculate its sample mean and sample standard deviation,
\[\bar{x}_n := \frac{1}{n}(X_1 + X_2 +\ldots + X_n), \hspace{10mm} \hat{s}_n := \sqrt{\frac{1}{n-1}\bigg( (X_1-\bar{x}_n)^2 + \ldots + (X_n - \bar{x}_n)^2 \bigg)}.\]
Since both \(\bar{x}_n\) and \(\hat{s}_n\) are derived from the sample, they are random quantities that help approximate \(\mu\) and \(\sigma\) respectively. Due to this randomness, we know that the exact equality \(\bar{x} = \mu\) is rather unlikely.
Confidence intervals help us determine a range within which it is highly likely that the true population mean \(\mu\) lies, given the available information we have. More specifically, a confidence interval with confidence level 0.95 is a data-generated interval which has a 95% chance of containing the population mean \(\mu\). For given values of \(\bar{x}_n\) and \(\hat{s}_n\), the \(0.95\)-confidence interval is given by: \[ CI(95\%) = \bar{x}_n \pm T_{n-1}(0.975)\times \frac{\hat{s}_n}{\sqrt{n}},\] where \(T_{n-1}(0.975)\) is the 97.5% percentile of the T-distribution with \(n-1\) degrees of freedom.

More generally, a confidence interval of confidence level \(1-\alpha\) is an interval which has probability \(1-\alpha\) of containing the population mean \(\mu\). For given values of \(\bar{x}_n\) and \(\hat{s}_n\), the \((1-\alpha)\)-confidence interval is given by: \[ CI(1-\alpha) = \bar{x}_n \pm T_{n-1}\bigg(1 -\frac{\alpha}{2}\bigg)\times \frac{\hat{s}_n}{\sqrt{n}},\] where \(T_{n-1}\big(1 - \frac{\alpha}{2}\big)\) is the \(\big(1 - \frac{\alpha}{2}\big)\)-percentile of the T-distribution with \(n-1\) degrees of freedom.
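As a small sketch of how this formula translates to R (the function name and arguments here are illustrative, not part of the assignment), the percentile \(T_{n-1}(1-\alpha/2)\) is obtained with qt():

ci_mean = function(x, alpha = 0.05){
  n = length(x)
  xbar = mean(x)                                    # sample mean
  shat = sd(x)                                      # sample standard deviation
  half = qt(1 - alpha/2, df = n - 1) * shat / sqrt(n)
  c(lower = xbar - half, upper = xbar + half)       # (1 - alpha) confidence interval
}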
In this exercise, we perform simulations that help us better understand how confidence intervals help find an interval range that is highly likely to contain the population mean of a population.
3(A) Import the dataset cdc.csv. This
dataset consists of 20,000 measurements of clinical variables. In this
analysis, we are interested in analyzing the variable
height. We consider these 20,000 observations as our total
population.
Create a vector heights which contains all of the 20,000 observations of this variable in the cdc dataset. Then, create a variable mu, which contains the mean of height, and a variable sigma containing the standard deviation of all the observations.
Print the following quantities to console:

- The value of mu.
- The standard deviation sigma.
- What is the average height of individuals in the common feet + inches notation? Print it as a vector of 2 entries with feet and inches as its entries.
- What proportion of individuals in the data are taller than 6 ft?
- What proportion of individuals in the data are between or equal to 5 and 6 ft tall?
- What proportion of individuals are strictly taller than mu + 2 * sigma?

Add text to each of your prints describing what each printed quantity represents. For example, for the first, you could print "Population Mean = *insert value here*".
cdc <- read.csv("C:/Users/pplec/Downloads/cdc (1).csv")
## Your code here:
heights = cdc$height
mu = mean(heights) ##population mean= 67.1829
sigma = sd(heights)
print(paste("population mean=", mu))
## [1] "population mean= 67.1829"
print(paste("population sigma=", sigma))
## [1] "population sigma= 4.12595428536699"
vec_height= c(floor(mu/12),mu%%12)
print(paste("average height=",vec_height[1],"feet", vec_height[2], "inches"))
## [1] "average height= 5 feet 7.1829 inches"
mean(heights>72)
## [1] 0.10275
print(paste("proportion of heights > 72 inches : ",mean(heights>72)))
## [1] "proportion of heights > 72 inches : 0.10275"
mean(heights>=60 & heights<=72)
## [1] 0.88325
print(paste("proportion of heights between 5 and 6feet tall : ",mean(heights>=60 & heights<=72)))
## [1] "proportion of heights between 5 and 6feet tall : 0.88325"
mean( heights>(mu+2*sigma))
## [1] 0.01725
print(paste("proportion heights that are taller than 2 standard deviations from the mean : ", mean( heights>(mu+2*sigma))))
## [1] "proportion heights that are taller than 2 standard deviations from the mean : 0.01725"
3(B) As mentioned before, we assume that we do not have access to the full 20,000 observations of the dataset. Instead, imagine that we only have access to a random sample of 84 observations from this total dataset. We are interested in exploring how good an approximation \(\bar{x}_{84}\) is to the true population mean \(\mu\) we calculated before. Here we construct a 0.95-confidence interval using this \(n=84\) random sample.
Use the function sample() to gather a sample of 84
observations from the vector height with replacement. Then,
calculate the sample mean xbar and sample standard
deviation sdhat.
Use a base-R function to recover the 0.975-percentile of the T-distribution. Finally, use the information gathered to define a vector CI containing the lower/upper limits of the 95% confidence interval for the sample gathered. Do the following:

- Print "0.95-Confidence Interval: ( *lower-lim* , *upper-lim* )".
- Print "True mean : *value of mu*".
- Print "mu within the 0.95-CI: *answer*".

#######################################
## DO NOT MODIFY
set.seed(100)
#######################################
#######################################
## Your code here:
height_exp = sample(x = heights, size = 84, replace = FALSE)
xbar= mean(height_exp)
sdhat= sd(height_exp)
print(height_exp)
## [1] 70 76 70 63 64 65 68 62 71 67 73 65 70 60 63 68 63 66 72 72 76 66 66 65 71
## [26] 63 62 73 72 69 70 64 72 65 67 66 63 72 63 63 73 63 67 62 63 69 66 62 70 66
## [51] 66 60 70 64 64 71 68 70 63 65 76 63 70 67 66 71 68 67 61 71 70 69 72 64 69
## [76] 64 67 63 60 67 65 67 62 70
print(xbar)
## [1] 66.9881
print(sdhat)
## [1] 3.922423
n=84
height_upper = xbar + qt(p = .975, df = 83)*sdhat/sqrt(n)
height_lower = xbar - qt(p = .975, df = 83)*sdhat/sqrt(n)
print(paste("0.95-Confidence Interval:(" ,height_lower ,",", height_upper , ")" ))
## [1] "0.95-Confidence Interval:( 66.1368773606768 , 67.8393131155137 )"
print(paste("True mean : ", mu))
## [1] "True mean : 67.1829"
print(paste("mu within the 0.95-CI :","Yes"))
## [1] "mu within the 0.95-CI : Yes"
3(C) Let us now study the following claim: A
0.95-confidence interval is a random interval, which has a 95% chance of
containing the true value of the population mean. For this, we
simulate 1000 different confidence intervals, and we see what proportion
of these intervals contain the true mean mu.
Create a 1000 x 2 matrix CIs. Each row will
correspond to a different simulation of a confidence interval.
For each row index ind, generate a random sample of
\(n = 84\) observations from the vector
height without replacement.
For each of these samples, calculate the confidence interval and
save its values in the row CIs[ind, ].
Calculate the proportion of generated intervals that contains the
true mean mu. Print it as
"Proportion of CIs containing mu = *calculated proportion*".
Comment below: what does this proportion represent? Answer below.
CIs = matrix(-1, nrow = 1000, ncol = 2)
set.seed(465)
for(ind in 1:1000){
height_exp = sample(x = heights, size = 84, replace = FALSE)
xbar= mean(height_exp)
sdhat= sd(height_exp)
n=84
height_upper = xbar + qt(p = .975, df = 83)*sdhat/sqrt(n)
height_lower = xbar - qt(p = .975, df = 83)*sdhat/sqrt(n)
CIs[ind,] = c(height_lower, height_upper)
}
mean(CIs[,1] < mu & CIs[,2] > mu)
## [1] 0.941
print(paste("proportion of CIs containing mu = ", mean(CIs[,1] < mu & CIs[,2] > mu) ))
## [1] "proportion of CIs containing mu = 0.941"
Answer here: The proportion is the fraction of the 1000 simulated confidence intervals whose lower and upper bounds bracket the true mean mu. It is close to the nominal 95% confidence level, which is what the claim in 3(C) describes. —
3(D) Now, lets visualize how the random confidence
intervals you generated distribute when compared to the true mean
mu. For this, create the following plot:
Create a 100 x 2 matrix CIs_red that contains the
first 100 confidence intervals you calculated before.
Create a vector of length 100, CIs_cols which
assigns color “blue” to intervals that contain mu, and
“red” to the intervals that do not contain mu.
Create a plot with limits xlim = c(0,100) and
ylim = c(64,70), and a horizontal line on the value
mu. (Hint: use argument type = "l").
Using the function segments(), add to this plot the
confidence intervals contained in the matrix CIs_red.
Display it such that the confidence interval CIs_red[ind, ]
is displayed vertically, with coordinates
(x0,y0) = (ind, CIs_red[ind,1]) and
(x1,y1) = (ind, CIs_red[ind,2]). In addition, color the
interval "red" if the interval does not contain
mu, and "blue" if it contains
mu.
Using print, calculate how many intervals in the
plot do not contain the true mean mu. Add text to provide
description in your print.
CIs_red = CIs[1:100,]
# color each interval blue if it contains mu, red otherwise
# (computed over all 1000 rows of CIs; only the first 100 entries are used in the plot below)
CIs_cols = ifelse(CIs[,1] < mu & CIs[,2] > mu, "blue", "red")
plot(y = rep(mu,101), x= 0:100 ,xlim= c(0,100), ylim= c(64,70), type= "l")
for(ind in 1:100){
segments(x0 = ind,x1= ind ,y0 = CIs_red[ind,1],y1 =CIs_red[ind,2] ,col= CIs_cols[ind])
}
mean(CIs_cols == "red")
## [1] 0.059
print(paste("The Proportion of Intervals not Containing Mu =", mean(CIs_cols == "red") ))
## [1] "The Proportion of Intervals not Containing Mu = 0.059"
3(E) Now, use a similar procedure from
3(C) to calculate a (100 x 2) matrix
CIs_n500 which contains 100 random 0.95-confidence
intervals, now with a sample size of \(n =
500\). Then, follow the same steps from 3(D) to
generate a confidence interval plot for these new intervals. Answer:
How do the 0.95-confidence intervals with \(n=500\) compare to those of \(n = 84\)?
Are there benefits to having a larger sample size in estimating
the mean mu?
CIs = matrix(-1, nrow = 100, ncol = 2)
set.seed(465)
n=500
for(ind in 1:100){
height_exp = sample(x = heights, size = n, replace = FALSE)
xbar= mean(height_exp)
sdhat= sd(height_exp)
height_upper = xbar + qt(p = .975, df = n-1)*sdhat/sqrt(n)
height_lower = xbar - qt(p = .975, df = n-1)*sdhat/sqrt(n)
CIs[ind,] = c(height_lower, height_upper)
}
mean(CIs[,1] < mu & CIs[,2] > mu)
## [1] 0.94
print(paste("proportion of CIs containing mu = ", mean(CIs[,1] < mu & CIs[,2] > mu) ))
## [1] "proportion of CIs containing mu = 0.94"
CIs_red = CIs[1:100,]
CIs_cols = ifelse(CIs[,1] < mu & CIs[,2] > mu, "blue", "red")
plot(y = rep(mu,101), x= 0:100 ,xlim= c(0,100), ylim= c(64,70), type= "l")
for(ind in 1:100){
segments(x0 = ind,x1= ind ,y0 = CIs_red[ind,1],y1 =CIs_red[ind,2] ,col= CIs_cols[ind])
}
3(F) Now, use a similar procedure from
3(C) to calculate a (100 x 2) matrix
CIs_c99 which contains 100 random 0.99-confidence
intervals, now with a sample size of \(n =
84\). Then, follow the same steps from 3(D) to
generate a confidence interval plot for these new intervals. Answer:
How do the 0.99-confidence intervals with \(n=84\) compare to those of \(n = 84\) with confidence 0.95?
What are the benefits of increasing the confidence of an interval? Are there any drawbacks?
CIs = matrix(-1, nrow = 100, ncol = 2)
set.seed(465)
n=84
for(ind in 1:100){
height_exp = sample(x = heights, size = n, replace = FALSE)
xbar= mean(height_exp)
sdhat= sd(height_exp)
height_upper = xbar + qt(p = .995, df = n-1)*sdhat/sqrt(n)
height_lower = xbar - qt(p = .995, df = n-1)*sdhat/sqrt(n)
CIs[ind,] = c(height_lower, height_upper)
}
mean(CIs[,1] < mu & CIs[,2] > mu)
## [1] 0.99
print(paste("proportion of CIs containing mu = ", mean(CIs[,1] < mu & CIs[,2] > mu) ))
## [1] "proportion of CIs containing mu = 0.99"
CIs_red = CIs[1:100,]
CIs_cols = ifelse(CIs[,1] < mu & CIs[,2] > mu, "blue", "red")
plot(y = rep(mu,101), x= 0:100 ,xlim= c(0,100), ylim= c(64,70), type= "l")
for(ind in 1:100){
segments(x0 = ind,x1= ind ,y0 = CIs_red[ind,1],y1 =CIs_red[ind,2] ,col= CIs_cols[ind])
}
Answer here:
Linear regression is a useful tool to examine the relationship between two variables, specifically how an independent variable or predictor (i.e. \(X\)) affects the dependent variable (i.e. \(Y\)). We can also examine the relationships between multiple \(X\) and \(Y\), but we will focus our attention on using only one predictor. Given \(n\) observations of the form \((x_1, y_1),(x_2, y_2),\ldots, (x_n, y_n)\), a linear regression model assumes \(x_i\) and \(y_i\) are related through the linear equation: \[ y_i = \beta_2 x_i + \beta_1 + \epsilon_i, \hspace{5mm} i = 1,2,\ldots,n \] where \(\beta_2\) represents the effect that the variable \(X\) has on the dependent variable \(Y\), \(\beta_1\) is the \(y\)-intercept, and \(\epsilon_1,\epsilon_2,\ldots, \epsilon_n\) represent the error terms, with standard deviation \(\sigma\).

By defining the error vector \(\vec{\epsilon} = (\epsilon_1,\epsilon_2,\ldots, \epsilon_n)^T\), we can condense this into the vectorized expression \[ Y = \beta_2 X + \beta_1 + \vec{\epsilon}. \] We can further simplify the expression to matrix form. We define the \(2 \times 1\) column vector \(\vec{\beta} = (\beta_1, \beta_2)^T\), the \(n\times 2\) matrix \(\mathbf{X} = [\vec{1}, X]\), where \(\vec{1} = (1,1,...,1)^T\), and the \(n\times 1\) column vector \(Y = (y_1, y_2, \ldots, y_n)^T\). Then, the expression is equivalent to,
\[ Y = \mathbf{X} \cdot \vec{\beta} + \vec{\epsilon}, \]
where \(\vec{\beta}\) contains the \(y\)-intercept and the change in \(Y\) per unit change in \(X\), \(\vec{\epsilon}\) represents the error between the true value and the predicted value, and \(\cdot\) is the matrix multiplication operation. (I suggest you spend some time doing the matrix multiplication to verify that these three expressions represent the same model.)
When doing linear regression, our goal is to find an estimate for \(\vec{\beta}\) that reduces the sum of squares between the true value \(Y\) and its linear approximation \(\hat{Y} = \hat\beta_2 X + \hat\beta_1\), better known as least-squares approximation. The least-squares estimator of \(\vec{\beta}\) can be calculated by the following: \[ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y \]
If a predictor is statistically significant, we expect the \(\hat{\beta}\) coefficient related to the predictor in question to be significantly different from 0. We will walk through a problem covering both the manual and the R calculations for linear regression.
4(A) Load the following dataset into your environment, which details Major League Baseball player hitting statistics for the 2010 season. Create a scatterplot between the number of walks and the number of runs, labeled as walk and run. Make sure the following elements are included:

- walk as your independent variable.

Afterward, describe the relationship between the two variables. Is there a linear relationship?
source("https://www.openintro.org/data/R/mlbbat10.R")
Answer here:
4(B) Similar to the work done in problem #5 in HW2, calculate the linear model coefficients \(\hat{\beta}\) by hand using matrix multiplications. Make sure you obtain a 2-D vector. Recreate the scatterplot, now with the least-squares line. Does it fit the data well?
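A minimal sketch of the by-hand least-squares computation, under the same assumption about the mlbbat10 data frame; the object names (Xmat, beta_hat) are illustrative:

x = mlbbat10$walk
y = mlbbat10$run
Xmat = cbind(1, x)                                    # n x 2 design matrix [1, X]
beta_hat = solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% y  # 2 x 1 vector (intercept, slope)
beta_hat
plot(x, y, xlab = "walk", ylab = "run")
abline(a = beta_hat[1], b = beta_hat[2])              # least-squares line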
Answer here:
4(C) In a linear regression, we want to calculate the size of the errors using the following formula: \[ \hat{\sigma} = \sqrt{\frac{RSS}{n-2}} \]
The residual sum of squares (RSS) can be found with the following formula: \[ RSS = \sum_{i = 1}^n (Y_i - \hat{Y}_i)^2 \]
Do the following: calculate the RSS, then use it to calculate and print \(\hat{\sigma}\).
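A minimal sketch of these two calculations, reusing the Xmat, beta_hat and y objects sketched in 4(B):

y_hat = Xmat %*% beta_hat         # fitted values
RSS = sum((y - y_hat)^2)          # residual sum of squares
n = length(y)
sigma_hat = sqrt(RSS / (n - 2))   # estimated error standard deviation
sigma_hat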
4(D) Usually, we also want to calculate the proportion of variance explained by the linear model as a value from 0 to 1. The closer to 1, the more of the variation in \(Y\) can be explained by \(X\). This can be done with the following formula: \[ R^2 = \frac{(n - 1) * var(y) - RSS}{(n - 1) * var(y)}. \]
Manually, find \(R^2\) and print the result. Is this a high number?
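A minimal sketch of the \(R^2\) formula above, reusing RSS, n and y from the previous sketches:

R2 = ((n - 1) * var(y) - RSS) / ((n - 1) * var(y))
R2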
Answer here:
4(E) Usually, in a linear regression problem, one wants to test whether the slope is zero, \(H_0: \beta_2 = 0\) vs. \(H_a: \beta_2 \neq 0\). To test this, calculate the following test statistic: \[ T = \frac{\hat{\beta}_2}{\hat{\sigma} \cdot \sqrt{[(X^T X)^{-1}]_{22}}}. \]
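A minimal sketch of this test statistic, reusing beta_hat, sigma_hat and Xmat from the previous sketches (beta_hat[2] is the slope coefficient, matching the [2, 2] entry of the inverse):

XtX_inv = solve(t(Xmat) %*% Xmat)                        # (X^T X)^{-1}
T_stat = beta_hat[2] / (sigma_hat * sqrt(XtX_inv[2, 2]))
T_stat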
4(F) If the null hypothesis holds, we expect \(T \sim T_{n-2}\). Use this to calculate the \(p\)-value for this test.
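A minimal sketch of the two-sided p-value under \(T \sim T_{n-2}\), reusing T_stat and n from the previous sketches:

p_value = 2 * pt(-abs(T_stat), df = n - 2)   # two-sided p-value
p_value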
4(G) Linear regression can be performed
automatically in R, using the function lm(). Use the
function lm() on the two variables to calculate the linear
regression coefficients and assign it to mod. Display the
table of results with the function summary() on
mod. Do the results you obtained with the linear regression
in R match those you obtained with your by-hand calculations? Is the
linear relationship significant at an \(\alpha\) level of 0.05?
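A minimal sketch of the built-in fit, again assuming the data frame is named mlbbat10:

mod = lm(run ~ walk, data = mlbbat10)
summary(mod)   # compare the coefficient table with the by-hand results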
Answer here: