In a recent, exciting, but also controversial Science article, Tomasetti and Vogelstein attempt to explain why cancer incidence varies drastically across tissues (e.g. why one is much more likely to develop lung cancer rather than pelvic bone cancer). The authors show that a higher average lifetime risk for a cancer in a given tissue correlates with the rate of replication of stem cells in that tissue. The main inferential tool for their statistical analysis was a simple linear regression, which we will replicate here.
You can download the dataset as follows:
tomasetti = read.csv("https://stats191.stanford.edu/data/Tomasetti.csv")
head(tomasetti)
## Type Risk Lscd Cluster
## 1 Acute myeloid leukemia 0.0041 1.30e+11 Replicative
## 2 Basal cell carcinoma 0.3000 3.55e+12 Deterministic
## 3 Chronic lymphocytic leukemia 0.0052 1.30e+11 Replicative
## 4 Colorectal adenocarcinoma 0.0480 1.17e+12 Deterministic
## 5 Colorectal adenocarcinoma with FAP 1.0000 1.17e+12 Deterministic
## 6 Colorectal adenocarcinoma with Lynch syndrome 0.5000 1.17e+12 Deterministic
The dataset contains information about 31 tumour types. The Lscd (Lifetime stem cell divisions) column refers to the total number of stem cell divisions during the average lifetime, while Risk refers to the lifetime risk for cancer of that tissue type.
(2 points)(2 points)(2 points)Interpret the above bands at a Lscd of \(10^{10}\). (2 points)
Test whether the slope in this regression is equal to 0 at level
\(\alpha = 0.05\). State the null
hypothesis, the alternative, the conclusion and the \(p\)-value. (2 points)
What are assumptions you made for question 5.
(2 points)
Give a 95% confidence interval for the slope of the regression
line. (2 points)
(2 points)Report an estimate of the variance of the errors in the model.
(2 points)
Provide an interpretation of the \(R^2\) you calculated above.
(2 points)
From our textbook page 51, Exercise 2.9.
Let \(Y\) and \(X\) denote the labor force participation rate of women in 1972 and 1968, respectively, in each of 19 cities in the United States. The regression output for this data set is shown in the following table. It was also found that \({\rm SSR} = 0.0358\) and \({\rm SSE} = 0.0544\). Suppose that the model \(Y = \beta_0 + \beta_1 X + \varepsilon\) satisfies the usual regression assumptions.
In this table s.e. is the standard error of the estimate, t-Test is the value of the test statistics under the null hypothesis, p-value is the \(p\)-value of the test.
(2 points)SSR <- 0.0358
SSE <- 0.0544
SST=SSR+SSE
n <- 19
varY=SST/(n-1)
R2 <- 0.397
Coryx=R2^0.5
cbind(varY,Coryx)
## varY Coryx
## [1,] 0.005011111 0.6300794
(2 points)Cs <- 0.203311
Cx <- 0.656040
WER=Cs+Cx*0.45
WER
## [1] 0.498529
(2 points)varx <- 0.005
meanx <- 0.5
sigma <- 0.0566
s.e.y <- sigma*(1+1/n+(0.45-meanx)^2/(varx*n)^0.5)
WER_lwr <- WER-s.e.y*qt(0.025,n-2,lower.tail = F)
WER_upr <- WER+s.e.y*qt(0.025,n-2,lower.tail = F)
cbind(WER_lwr,WER_upr)
## WER_lwr WER_upr
## [1,] 0.3718598 0.6251982
(2 points)s.e.Cx <- 0.1961
beta.lwr <- Cx-s.e.Cx*qt(0.025,n-2,lower.tail = F)
beta.upr <- Cx+s.e.Cx*qt(0.025,n-2,lower.tail = F)
cbind(beta.lwr,beta.upr)
## beta.lwr beta.upr
## [1,] 0.2423052 1.069775
(2 points)tc <- (1-Cx)/s.e.Cx
t <- qt(0.025,n-2,lower.tail = F)
cbind(tc,t)
## tc t
## [1,] 1.754003 2.109816
(2 points)From our textbook page 51, Exercise 2.10.
One may wonder if people of similar heights tend to marry each other. For this purpose, a sample of newly married couples was selected. Let \(X\) be the height of the husband and \(Y\) be the height of the wife. The heights (in centimeters) of husbands and wives are:
Heights = read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P052.txt", header = TRUE)
head(Heights)
## Husband Wife
## 1 186 175
## 2 180 168
## 3 160 154
## 4 186 166
## 5 163 162
## 6 172 152
(2 points)Heights = read.table("https://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P052.txt", header = TRUE)
head(Heights)
## Husband Wife
## 1 186 175
## 2 180 168
## 3 160 154
## 4 186 166
## 5 163 162
## 6 172 152
cat("df数据框行数为:",nrow(Heights),"\ndf数据框列数为:",ncol(Heights),"\n",sep="")
## df数据框行数为:96
## df数据框列数为:2
COV=0
EX=mean(Heights$Husband)
EY=mean(Heights$Wife)
for(j in 1:96){
COV <- COV+(Heights$Husband[j]-EX)*(Heights$Wife[j]-EY)/(95)
}
COV
## [1] 69.41294
cov(Heights$Husband,Heights$Wife)
## [1] 69.41294
print("The covariance between husband and wife height was 69.41294
")
## [1] "The covariance between husband and wife height was 69.41294\n"
(2 points)(2 points)(2 points)What would the correlation be if every man married a woman
exactly 5 centimeters shorter than him? (2 points)
We wish to fit a regression model relating the heights of
husbands and wives. Which one of the two variables would you choose as
the response variable? Justify your answer.
(2 points)
(2 points)(2 points)In this problem, we will investigate what happens when the assumptions of the simple linear regression model do not hold. When generating data below, set X to be equally spaced between 0 and 1 (i.e. X = seq(0, 1, by=0.01)) and use the regression function
\[ Y = 1 + 2 \cdot X + \varepsilon \]
(2 points)The function arguments should be values for \(X\) and slope of the regression line beta1.
generateTstat = function(X, beta1){
# Y = write Y as a function of X and error
# fit = fit regression line using lm
# beta1hat = compute least squares estimate of slope using summary(fit)$coefficient[2,1]
# se_beta1hat = compute standard error of slope estimate using summary(fit)$coefficients[2,2]
# Tstat = Compute the T-statistic using the appropriate formula (for testing H0: beta_1 = 2
# return(Tstat)
}
(2 points)# X = use seq() to generate 100 X between 0 and 1
# t_stat_vec = use replicate() to compute 5000 T-statistic values
# Plot the distribution of the T-statistic and the T distribution
(2 points)# thresholod = find the threshold for testing H0: beta_1 = 2 versus H_1: beta_1 not equal to 2
# (degrees of freedom depends on the length of X)
# Find how many of absolute t_stat_vec is greater than the threshold
Write a new function with the same regression function but errors
that are t-distributed using, say, rt with 5 degrees of
freedom to generate errors. Repeat I and II in part 1. Does the \(T\)-statistic still have close to a T
distribution? How often is your \(T\)
statistic larger than the usual 5%
threshold?(2 points)
Write a new function with the same regression function but errors
that do not have the same variance though they are normally distributed.
Construct errors such that the variance of the \(i\)-th error is \(1+X_i\) (recall that \(X\) is equally spaced over interval 0 to 1
and you can specify in rnorm what is the standard
deviation of error). Plot the variance of error as a function of \(X\). Repeat I and II in part 1. Does the
\(T\)-statistic still have close to a
\(T\) distribution? How often are your
\(T\) statistic larger than the usual
5% threshold? (2 points)
Write a new function with the same regression function but errors
that do not have the same variance though they are normally distributed.
Exaggerate the effect of non-constant variance by making the variance of
errors \(\exp \left( 1 + 5 \cdot X_i
\right)\). Plot the variance as a function of \(X\). Repeat I and II in part 1. Does the
\(T\)-statistic still have close to a
\(T\) distribution? How often are your
\(T\) statistics laarger than the usual
5% threshold? (2 points)
Write a new function with the same regression function but errors
that are not independent. Do this by first generating a vector of errors
error and the returning a new vector whose first entry
is error[1] but for \(i>1\) the \(i\)-th entry is error[i-1] +
error[i]. Repeat I and II in part 1. Does the \(T\)-statistic still have close to a \(T\) distribution? How often is your \(T\) statistic larger than the usual 5%
threshold? (2 points)
Summarize your findings in questions 1-5. Which of the departures
from the assumptions for the error term of the simple linear regression
model seem important? (2 points)