Regression Analysis: Homework Assignment 02

Question 5

In a recent, exciting, but also controversial Science article, Tomasetti and Vogelstein attempt to explain why cancer incidence varies drastically across tissues (e.g. why one is much more likely to develop lung cancer rather than pelvic bone cancer). The authors show that a higher average lifetime risk for a cancer in a given tissue correlates with the rate of replication of stem cells in that tissue. The main inferential tool for their statistical analysis was a simple linear regression, which we will replicate here.

You can download the dataset as follows:

tomasetti = read.csv("https://stats191.stanford.edu/data/Tomasetti.csv")
head(tomasetti)

##                                            Type   Risk     Lscd       Cluster
## 1                        Acute myeloid leukemia 0.0041 1.30e+11   Replicative
## 2                          Basal cell carcinoma 0.3000 3.55e+12 Deterministic
## 3                  Chronic lymphocytic leukemia 0.0052 1.30e+11   Replicative
## 4                     Colorectal adenocarcinoma 0.0480 1.17e+12 Deterministic
## 5            Colorectal adenocarcinoma with FAP 1.0000 1.17e+12 Deterministic
## 6 Colorectal adenocarcinoma with Lynch syndrome 0.5000 1.17e+12 Deterministic

The dataset contains information about 31 tumour types. The Lscd (Lifetime stem cell divisions) column refers to the total number of stem cell divisions during the average lifetime, while Risk refers to the lifetime risk for cancer of that tissue type.

Fit a simple linear regression model to the data with log(Risk) as the response variable and log(Lscd) as the predictor variable. (2 points)

tomasetti = read.csv(“https://stats191.stanford.edu/data/Tomasetti.csv”) str(tomasetti) head(tomasetti) Risk.lm<-lm(log(Risk)~log(Lscd),data=tomasetti) print(Risk.lm) coeffs<-coefficients(Risk.lm) coeffs

Plot the estimated regression line and the data. (2 points)

tomasetti = read.csv(“https://stats191.stanford.edu/data/Tomasetti.csv”) str(tomasetti) head(tomasetti) Risk.lm<-lm(log(Risk)~log(Lscd),data=tomasetti) Risk.lm msum<-summary(Risk.lm) msum plot(log(Risk)~log(Lscd),tomasetti, col=“blue”,main=“linear regression model”,xlab=“log(Lscd)”,ylab=“log(Risk)”) fit<-lm(log(Risk)~log(Lscd),data=tomasetti) abline(fit,col=“red”)

Add upper and lower 95% prediction bands for the regression line on the plot, using predict. That is, produce one line for the upper limit of each interval over a sequence of densities, and one line for the lower limits of the intervals. (2 points)

plot((log(Risk))~log(Lscd),data=tomasetti,pch=16,col=‘red’) model2=lm(log(Risk)~log(Lscd),data=tomasetti) fit<-lm(log(Risk)~log(Lscd),data=tomasetti) abline(fit,col=“red”) data.pre=data.frame(predict(model2,interval=‘prediction’)) lines(data.pre$lwr~log(tomasetti$Lscd),col=‘blue’,lty=2) lines(data.pre$upr~log(tomasetti$Lscd),col=‘blue’,lty=2)

Interpret the above bands at a Lscd of $10^{10}$. (2 points)
Test whether the slope in this regression is equal to 0 at level $\alpha = 0.05$. State the null hypothesis, the alternative, the conclusion and the $p$-value. (2 points)
What are assumptions you made for question 5. (2 points)
Give a 95% confidence interval for the slope of the regression line. (2 points)

summary(Risk.lm) confint(Risk.lm,‘log(Lscd)’,level = 0.95)

Report the $R^2$ of the model. (2 points)

msum<-summary(Risk.lm) msum msum$r.squared

Report an estimate of the variance of the errors in the model. (2 points)
Provide an interpretation of the $R^2$ you calculated above. (2 points)

Question 6

From our textbook page 51, Exercise 2.9.

Let $Y$ and $X$ denote the labor force participation rate of women in 1972 and 1968, respectively, in each of 19 cities in the United States. The regression output for this data set is shown in the following table. It was also found that ${\rm SSR} = 0.0358$ and ${\rm SSE} = 0.0544$. Suppose that the model $Y = \beta_0 + \beta_1 X + \varepsilon$ satisfies the usual regression assumptions.

In this table s.e. is the standard error of the estimate, t-Test is the value of the test statistics under the null hypothesis, p-value is the $p$-value of the test.

Compute ${\rm Var}(Y)$ and ${\rm Cor}(Y,~ X)$. (2 points)

SSR <- 0.0358
SSE <- 0.0544
SST=SSR+SSE
n <- 19
varY=SST/(n-1)
R2 <- 0.397
Coryx=R2^0.5
cbind(varY,Coryx)

##             varY     Coryx
## [1,] 0.005011111 0.6300794

Suppose participation rate of women in 1968 in a given city is 45%. What is the estimated participation rate of women in 1972 for the same city? (2 points)

Cs <- 0.203311
Cx <- 0.656040
WER=Cs+Cx*0.45
WER

## [1] 0.498529

Suppose further that the mean and variance of the participation rate of women in 1968 are 0.5 and 0.005, respectively. Construct the 95% confidence interval for the estimate in 2. (2 points)

varx <- 0.005
meanx <- 0.5
sigma <- 0.0566
s.e.y <- sigma*(1+1/n+(0.45-meanx)^2/(varx*n)^0.5)
WER_lwr <- WER-s.e.y*qt(0.025,n-2,lower.tail = F)
WER_upr <- WER+s.e.y*qt(0.025,n-2,lower.tail = F)
cbind(WER_lwr,WER_upr)

##        WER_lwr   WER_upr
## [1,] 0.3718598 0.6251982

Construct the 95% confidence interval for the slope of the true regression line $\beta_1$. (2 points)

s.e.Cx <- 0.1961
beta.lwr <- Cx-s.e.Cx*qt(0.025,n-2,lower.tail = F)
beta.upr <- Cx+s.e.Cx*qt(0.025,n-2,lower.tail = F)
cbind(beta.lwr,beta.upr)

##       beta.lwr beta.upr
## [1,] 0.2423052 1.069775

Test the hypothesis: $H_0:~ \beta_1 = 1$ versus $H_a:~ \beta_1 > 1$ at the 5% significance level. (2 points)

tc <- (1-Cx)/s.e.Cx
t <- qt(0.025,n-2,lower.tail = F)
cbind(tc,t)

##            tc        t
## [1,] 1.754003 2.109816

If $Y$ and $X$ were reversed in the above regression, what would you expect $R^2$ to be? (2 points)

Question 7

From our textbook page 51, Exercise 2.10.

One may wonder if people of similar heights tend to marry each other. For this purpose, a sample of newly married couples was selected. Let $X$ be the height of the husband and $Y$ be the height of the wife. The heights (in centimeters) of husbands and wives are:

Heights = read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P052.txt", header = TRUE)
head(Heights)

##   Husband Wife
## 1     186  175
## 2     180  168
## 3     160  154
## 4     186  166
## 5     163  162
## 6     172  152

Compute the covariance between the heights of the husbands and wives. (2 points)

Heights = read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P052.txt", header = TRUE)
head(Heights)

##   Husband Wife
## 1     186  175
## 2     180  168
## 3     160  154
## 4     186  166
## 5     163  162
## 6     172  152

cat("df数据框行数为：",nrow(Heights),"\ndf数据框列数为：",ncol(Heights),"\n",sep="")

## df数据框行数为：96
## df数据框列数为：2

COV=0
EX=mean(Heights$Husband)
EY=mean(Heights$Wife)
for(j in 1:96){
 COV <- COV+(Heights$Husband[j]-EX)*(Heights$Wife[j]-EY)/(95)
}
COV

## [1] 69.41294

cov(Heights$Husband,Heights$Wife)

## [1] 69.41294

print("The covariance between husband and wife height was 69.41294
")

## [1] "The covariance between husband and wife height was 69.41294\n"

What would the covariance be if heights were measured in inches rather than in centimeters? (2 points)

a=Heights$Husband/30.84 b=Heights$Wife/30.84 COV=0 EXX=mean(a) EYY=mean(b) for(j in 1:96){ COV <- COV+(a[j]-EXX)*(b[j]-EYY)/(95) } COV cov(a,b)

Compute the correlation coefficient between the heights of the husband and wife. (2 points)

R_pearson1 <- cor(Heights$Husband,Heights$Wife,method = ‘pearson’)#求解相关系数 R_pearson1 R_pearson2 <- cor(Heights$Husband,Heights$Wife,method = ‘spearman’) R_pearson2 R_pearson3 <- cor(Heights$Husband,Heights$Wife,method = ‘kendall’) R_pearson3

What would the correlation be if heights were measured in inches rather than in centimeters? (2 points)

a=Heights$Husband/30.84 b=Heights$Wife/30.84 R_pearson1 <- cor(a,b,method = ‘pearson’)#求解相关系数 R_pearson1 R_pearson2 <- cor(a,b,method = ‘spearman’) R_pearson2 R_pearson3 <- cor(a,b,method = ‘kendall’) R_pearson3

What would the correlation be if every man married a woman exactly 5 centimeters shorter than him? (2 points)
We wish to fit a regression model relating the heights of husbands and wives. Which one of the two variables would you choose as the response variable? Justify your answer. (2 points)

library(ggplot2) heights_fig = ggplot(Heights, aes(Heights$Husband, Heights$Wife)) + geom_point() + theme_bw() heights_fig Height.lm = lm(Heights$Husband ~ Heights$Wife, data =Heights) print(Height.lm) beta.1.hat = cov(Heights$Wife, Heights$Husband) / var(Heights$Wife) beta.0.hat = mean(Heights$Husband) - beta.1.hat * mean(Heights$Wife) print(c(beta.0.hat, beta.1.hat)) print(coef(Heights.lm)) sigma.hat = sqrt(sum(resid(Heights.lm)^2) / Heights.lm$df.resid) c(sigma.hat, sqrt(sum((Heights$Husband - predict(Heights.lm))^2) /Heights.lm$df.resid)) summary(Heights.lm)

Using your choice of the response variable in 6, test the null hypothesis that the slope is zero. (2 points)

beta.1.hat = cov(Heights$Wife, Heights$Husband) / var(Heights$Wife) beta.0.hat = mean(Heights$Husband) - beta.1.hat * mean(Heights$Wife) print(c(beta.0.hat, beta.1.hat)) print(coef(Heights.lm)) sigma.hat = sqrt(sum(resid(Heights.lm)^2) / Heights.lm$df.resid) c(sigma.hat, sqrt(sum((Heights$Husband - predict(Heights.lm))^2) /Heights.lm$df.resid)) summary(Heights.lm) SE.beta.1.hat = (sigma.hat * sqrt(1 / sum((Heights$Wife - mean(Heights$Wife))^2))) Tstat = (beta.1.hat - 0) / SE.beta.1.hat data.frame(beta.1.hat, SE.beta.1.hat, Tstat) 2(1 - pt(Tstat,Heights.lm$df.resid)) alpha = 0.05 n = nrow(Heights) n qt(1-0.5 * alpha, n-2) qnorm(1 - 0.5*alpha) L = beta.1.hat - qt(0.975, Heights.lm$df.resid) SE.beta.1.hat U = beta.1.hat + qt(0.975, Heights.lm$df.resid) * SE.beta.1.hat data.frame(L, U) confint(Heights.lm)

Using your choice of the response variable in 6, test the null hypothesis that the intercept is zero. (2 points)

beta.1.hat = cov(Heights$Wife, Heights$Husband) / var(Heights$Wife) beta.0.hat = mean(Heights$Husband) - beta.1.hat * mean(Heights$Wife) print(c(beta.0.hat, beta.1.hat)) print(coef(Heights.lm)) sigma.hat = sqrt(sum(resid(Heights.lm)^2) / Heights.lm$df.resid) c(sigma.hat, sqrt(sum((Heights$Husband - predict(Heights.lm))^2) /Heights.lm$df.resid)) summary(Heights.lm) SE.beta.0.hat = (sigma.hat * sqrt(1/n + (mean(Heights$Wife))^2/sum((Heights$Wife - mean(Heights$Wife))^2))) Tstat = (beta.0.hat - 0) / SE.beta.0.hat data.frame(beta.0.hat, SE.beta.0.hat, Tstat) 2*(1 - pt(Tstat,Heights.lm$df.resid)) alpha = 0.05 n = nrow(Heights) n qt(1-0.5 * alpha, n-2) qnorm(1 - 0.5alpha) L = beta.0.hat - qt(0.975, Heights.lm$df.resid) * SE.beta.0.hat U = beta.0.hat + qt(0.975, Heights.lm$df.resid) SE.beta.0.hat data.frame(L, U) confint(Heights.lm)

Question 8

In this problem, we will investigate what happens when the assumptions of the simple linear regression model do not hold. When generating data below, set X to be equally spaced between 0 and 1 (i.e. X = seq(0, 1, by=0.01)) and use the regression function

\[ Y = 1 + 2 \cdot X + \varepsilon \]

Write a function (call the function as generateTstat) to generate data from the simple linear regression model with regression function as above and normally distributed errors $\varepsilon \sim N\left( 0 ,~ \sigma^2 \right)$ (can use $\sigma^2=1$), returning the $T$-statistic for testing whether the slop of the regression line is equal to 2. [That is, testing $H_0 :~ \beta_1=2$ versus $H_1 :~ \beta_1 \not= 2$.] (2 points)

The function arguments should be values for $X$ and slope of the regression line beta1.

generateTstat = function(X, beta1){
    # Y = write Y as a function of X and error
    # fit = fit regression line using lm
    # beta1hat = compute least squares estimate of slope using summary(fit)$coefficient[2,1]
    # se_beta1hat = compute standard error of slope estimate using summary(fit)$coefficients[2,2]
    # Tstat = Compute the T-statistic using the appropriate formula (for testing H0: beta_1 = 2
    # return(Tstat)
}

Using your function, run a simulation with 5000 repetitions to see if the $T$-statistic has distribution close to a $T$ distribution. How many degrees of freedom should it have (consider the length of $X$ to answer this question)? (2 points)

# X = use seq() to generate 100 X between 0 and 1
# t_stat_vec = use replicate() to compute 5000 T-statistic values
# Plot the distribution of the T-statistic and the T distribution

In part I, how often is your $T$ statistic larger than the usual 5% threshold? (2 points)

# thresholod = find the threshold for testing H0: beta_1 = 2 versus H_1: beta_1 not equal to 2
# (degrees of freedom depends on the length of X)
# Find how many of absolute t_stat_vec is greater than the threshold

Write a new function with the same regression function but errors that are t-distributed using, say, rt with 5 degrees of freedom to generate errors. Repeat I and II in part 1. Does the $T$-statistic still have close to a T distribution? How often is your $T$ statistic larger than the usual 5% threshold？(2 points)
Write a new function with the same regression function but errors that do not have the same variance though they are normally distributed. Construct errors such that the variance of the $i$-th error is $1+X_i$ (recall that $X$ is equally spaced over interval 0 to 1 and you can specify in rnorm what is the standard deviation of error). Plot the variance of error as a function of $X$. Repeat I and II in part 1. Does the $T$-statistic still have close to a $T$ distribution? How often are your $T$ statistic larger than the usual 5% threshold? (2 points)
Write a new function with the same regression function but errors that do not have the same variance though they are normally distributed. Exaggerate the effect of non-constant variance by making the variance of errors $\exp \left( 1 + 5 \cdot X_i \right)$. Plot the variance as a function of $X$. Repeat I and II in part 1. Does the $T$-statistic still have close to a $T$ distribution? How often are your $T$ statistics laarger than the usual 5% threshold? (2 points)
Write a new function with the same regression function but errors that are not independent. Do this by first generating a vector of errors error and the returning a new vector whose first entry is error[1] but for $i>1$ the $i$-th entry is error[i-1] + error[i]. Repeat I and II in part 1. Does the $T$-statistic still have close to a $T$ distribution? How often is your $T$ statistic larger than the usual 5% threshold? (2 points)
Summarize your findings in questions 1-5. Which of the departures from the assumptions for the error term of the simple linear regression model seem important? (2 points)

Regression Analysis: Homework Assignment 02

王桂兴

2022-10-08

Question 5

Question 6

Question 7

Question 8