This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(ggplot2)

myIllustration <- function(data_generating_process, beta_1 = 0, n = 20, nTrials = 1000){
  Tstats <- numeric(nTrials)
  for(i in 1:nTrials){
    X <- seq(-1, 1, length.out = n)
    Y <- data_generating_process(X, beta_1)
    Tstats[i] <- summary(lm(Y ~ X))$coefficients[2, 3]
  }
  DF <- data.frame(t = Tstats)
  # diagnostic plots for the last simulated data set
  par(mfrow = c(2, 2))
  plot(lm(Y ~ X))
  cutoff <- qt(0.975, df = n - 2)
  # overlay the histogram, an estimated density, and the reference t density
  ggplot(DF, aes(x = t)) +
    geom_histogram(aes(y = after_stat(density)), bins = 30) +
    geom_density(lwd = 2, col = "blue") +
    stat_function(fun = dt, args = list(df = n - 2), lwd = 2, col = 'red') +
    annotate("rect", xmin = -Inf, xmax = -cutoff, ymin = -Inf, ymax = Inf, alpha = .2, fill = "orange") +
    annotate("rect", xmin = cutoff, xmax = Inf, ymin = -Inf, ymax = Inf, alpha = .2, fill = "orange") +
    labs(title = paste("Falsely reject the null hypothesis for about ",
                       100 * mean(abs(Tstats) > cutoff), "% of the time.", sep = ''))
}
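Before stressing the assumptions, it helps to see the baseline behavior. The sketch below (a hypothetical example, not part of the assignment) uses i.i.d. N(0, 1) errors, which satisfy every linear-model assumption, so the rejection rate should land near the nominal 5%:

# Hypothetical baseline sketch: i.i.d. standard normal errors, no violations.
set.seed(321)
well_specified_model <- function(X, beta_1){
  epsilon = rnorm(length(X))
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(well_specified_model)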
Part 1 True model: Yi = 0.5 + 0·Xi + Ei, where Ei ∼ N(0, Xi^2)
set.seed(321)
mild_unequal_var_model <- function(X, beta_1){
  epsilon = rnorm(length(X)) * X   # sd of epsilon_i is |X_i|: mild heteroskedasticity
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(mild_unequal_var_model)
Part 2 True model: Yi = 0.5 + 0·Xi + Ei, where Ei ∼ N(0, exp(1 + 10·Xi)^2)
set.seed(321)
strong_unequal_var_model <- function(X, beta_1){
  epsilon = rnorm(length(X)) * exp(1 + 10*X)   # sd of epsilon_i is exp(1 + 10 X_i)
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(strong_unequal_var_model)
Part 3 True model: Yi = 0.5 + 0·Xi + Ei, where Ei follows a t distribution with 2 degrees of freedom
No model assumption appears to have been badly violated (the t errors are, strictly speaking, non-normal, but the diagnostics barely show it):

- There is no visible pattern in the “Residuals vs Fitted” plot that suggests a different regression function.
- The standardized residuals show a few visible departures from normality in the “Normal Q-Q” plot, but not enough to invalidate the normality assumption.
- Nothing in the “Scale-Location” plot suggests unequal variance of the error terms.
- There is no obvious outlier or influential point in the “Residuals vs Leverage” plot.
- The histogram/estimated density of the simulated t-statistics is consistent with the t reference density.
- If we reject the null hypothesis when |t| > tα/2, we falsely reject about 5% of the time; we are not making false rejections more often than the nominal 5% level we signed up for.
set.seed(321)
t_error_model <- function(X, beta_1){
  epsilon = rt(length(X), df = 2)   # heavy-tailed t_2 errors
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(t_error_model)
Part 4 True model: Yi = 0.5 + 0·Xi + Ei, where Ei ∼ Cauchy(0, 1)
The normality assumption has been violated:

- There is no visible pattern in the “Residuals vs Fitted” plot that suggests a different regression function.
- The standardized residuals show several visible departures from normality in the “Normal Q-Q” plot and do not follow the 45-degree line.
- Nothing in the “Scale-Location” plot suggests unequal variance of the error terms.
- There is no obvious outlier or influential point in the “Residuals vs Leverage” plot.
- The histogram/estimated density of the simulated t-statistics is roughly consistent with the t reference density.
- If we reject the null hypothesis when |t| > tα/2, we falsely reject about 4% of the time; we are not making false rejections more often than the nominal 5% level.
set.seed(321)
cauchy_error_model <- function(X, beta_1){
  epsilon = rcauchy(length(X))   # standard Cauchy errors: no finite mean or variance
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(cauchy_error_model)
Part 5 True model: Yi = 0.5 + 0·Xi + Ei, where Ei = exp(Zi) − exp(1/2) with Zi ∼ N(0, 1) (centered lognormal errors)
The normality assumption has been violated (the errors are strongly right-skewed), and the diagnostics also suggest nonconstant variance:

- There appears to be a quadratic pattern in the “Residuals vs Fitted” plot that suggests a different regression function.
- The standardized residuals show several visible departures from normality in the “Normal Q-Q” plot and do not follow the 45-degree line.
- There is a quadratic pattern in the “Scale-Location” plot that suggests unequal variance of the error terms.
- There is no obvious outlier or influential point in the “Residuals vs Leverage” plot.
- The histogram/estimated density of the simulated t-statistics is roughly consistent with the t reference density.
- If we reject the null hypothesis when |t| > tα/2, we falsely reject about 4% of the time; we are not making false rejections more often than the nominal 5% level.
set.seed(321)
skewed_error_model <- function(X, beta_1){
  epsilon = exp(rnorm(length(X))) - exp(0.5)   # centered lognormal: mean 0, right-skewed
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(skewed_error_model)
Part 6 True model: Yi = 0.5 + 0·Xi + Ei, where Ei = Z(i−1) + Z(i) with the Z's i.i.d. N(0, 1) (so adjacent errors are correlated)
The independence assumption has been violated (adjacent errors are positively correlated):

- There is a visible quadratic-looking pattern in the “Residuals vs Fitted” plot that suggests a different regression function.
- The standardized residuals show several visible departures from normality in the “Normal Q-Q” plot and do not follow the 45-degree line.
- There is a pattern in the “Scale-Location” plot that suggests unequal variance of the error terms.
- There is no obvious outlier or influential point in the “Residuals vs Leverage” plot.
- The histogram/estimated density of the simulated t-statistics is not consistent with the t reference density.
- If we reject the null hypothesis when |t| > tα/2, we falsely reject about 16% of the time, far more often than the nominal 5% level we signed up for.
set.seed(321)
dependent_error_model <- function(X, beta_1){
  n = length(X)
  epsilon = rnorm(n)
  epsilon[2:n] = epsilon[1:(n-1)] + epsilon[2:n]   # adjacent errors share a normal term
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(dependent_error_model)
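To see the dependence directly, one can simulate a long draw of these errors and inspect their autocorrelation (a quick sketch using base R's acf; each error shares one normal term with its neighbor, so the lag-1 correlation is 0.5):

# Sketch: the constructed errors are sums of adjacent normals, hence correlated.
set.seed(321)
z <- rnorm(1000)
eps <- z
eps[2:1000] <- z[1:999] + z[2:1000]
acf(eps, lag.max = 5)       # lag-1 autocorrelation near 0.5
cor(eps[-1], eps[-1000])    # direct estimate of the lag-1 correlation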
Part 7 True model: Yi = 0.5 + 0·Xi + 2·Xi^2 + Ei, where Ei ∼ N(0, 1) (the fitted model omits the quadratic term)
The linearity assumption has been violated (the fitted model omits the quadratic term in X):

- There is no visible pattern in the “Residuals vs Fitted” plot that suggests a different regression function.
- The standardized residuals show several visible departures from normality in the “Normal Q-Q” plot, but not enough to invalidate the normality assumption.
- There is a pattern in the “Scale-Location” plot that suggests unequal variance of the error terms.
- There is no obvious outlier or influential point in the “Residuals vs Leverage” plot.
- The histogram/estimated density of the simulated t-statistics is roughly consistent with the t reference density.
- If we reject the null hypothesis when |t| > tα/2, we falsely reject about 2% of the time; we are not making false rejections more often than the nominal 5% level.
set.seed(321)
quadratic_model <- function(X, beta_1){
  epsilon = rnorm(length(X))
  Y = 0.5 + beta_1 * X + 2*X^2 + epsilon   # true mean is quadratic in X
  return(Y)
}
myIllustration(quadratic_model)
Question 2
state.data = data.frame(state.x77)
head(state.data)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
library(GGally)
Part 1
# keep Population, Income, Illiteracy, and HS.Grad
state_pairs <- state.data[c(-4,-5,-7,-8)]
ggpairs(data = state_pairs)
Part 2 Population (p = 0.015) and HS.Grad (p = 0.00014) are statistically significant at the 0.05 level; Illiteracy is not.
state_model <- lm(data = state_pairs, Income ~ .)
summary(state_model)
##
## Call:
## lm(formula = Income ~ ., data = state_pairs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -987.30 -213.67 -50.68 219.74 1430.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1940.74115 710.64526 2.731 0.008924 **
## Population 0.03786 0.01500 2.524 0.015129 *
## Illiteracy -73.57563 145.07584 -0.507 0.614470
## HS.Grad 45.57445 10.93778 4.167 0.000135 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 465.8 on 46 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.4254
## F-statistic: 13.09 on 3 and 46 DF, p-value: 2.612e-06
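The significant predictors can also be pulled from the coefficient table programmatically (a quick sketch; it just filters the summary above):

coefs <- summary(state_model)$coefficients
# terms with p-values below 0.05 (note this list includes the intercept)
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]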
Part 3

- The “Residuals vs Fitted” and “Scale-Location” plots may show a quadratic shape, which may suggest nonconstant error variance.
- The “Normal Q-Q” plot suggests that the errors are not normally distributed.
- The “Residuals vs Leverage” plot suggests that Alaska is influential.
plot(state_model)
Part 4
plot(abs(dffits(state_model)), pch=23, bg='orange', cex=2, ylab="DFFITS")
# states with |DFFITS| above 0.5
state_pairs[which(abs(dffits(state_model)) > 0.5),]
## Population Income Illiteracy HS.Grad
## Alaska 365 6315 1.5 66.7
## California 21198 5114 1.1 62.6
## Mississippi 2341 3098 2.4 41.0
## New Mexico 1144 3601 2.2 55.2
## North Dakota 637 5087 0.8 50.3
## Utah 1203 4022 0.6 67.3
Part 5
plot(cooks.distance(state_model), pch=23, bg='orange', cex=2, ylab="Cook's distance")
# states with Cook's distance above 0.1
state_pairs[which(cooks.distance(state_model) > 0.1),]
## Population Income Illiteracy HS.Grad
## Alaska 365 6315 1.5 66.7
## California 21198 5114 1.1 62.6
## New Mexico 1144 3601 2.2 55.2
## Utah 1203 4022 0.6 67.3
Part 6
plot(hatvalues(state_model), pch=23, bg='orange', cex=2, ylab='Hat values')
# states with hat values above 0.3
state_pairs[which(hatvalues(state_model) > 0.3),]
## Population Income Illiteracy HS.Grad
## California 21198 5114 1.1 62.6
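The cutoffs used above (0.5 for |DFFITS|, 0.1 for Cook's distance, 0.3 for hat values) are round numbers chosen by eye. For comparison, a sketch of the usual size-adjusted rules of thumb, with p the number of coefficients and n the number of observations:

n <- nrow(state_pairs)            # 50 states
p <- length(coef(state_model))    # 4 coefficients, including the intercept
2 * sqrt(p / n)   # common |DFFITS| cutoff, about 0.57
4 / n             # common Cook's distance cutoff, 0.08
2 * p / n         # common hat-value cutoff, 0.16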
Part 7 After removing Alaska (flagged by the outlier test below), the model's overall p-value decreased and the adjusted R-squared increased.
library(car)
## Loading required package: carData
outlierTest(state_model)
## rstudent unadjusted p-value Bonferroni p
## Alaska 3.905347 0.00031274 0.015637
# drop Alaska (row 2), flagged by the outlier test, and refit
state_prime <- state_pairs[-2,]
state_prime_model <- lm(data = state_prime, Income ~ .)
summary(state_prime_model)
##
## Call:
## lm(formula = Income ~ ., data = state_prime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -806.64 -223.85 -94.83 228.49 895.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2922.00458 669.84639 4.362 7.42e-05 ***
## Population 0.04463 0.01322 3.375 0.00153 **
## Illiteracy -248.72103 134.46160 -1.850 0.07092 .
## HS.Grad 29.75007 10.38054 2.866 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407 on 45 degrees of freedom
## Multiple R-squared: 0.4997, Adjusted R-squared: 0.4663
## F-statistic: 14.98 on 3 and 45 DF, p-value: 6.723e-07
Part 8 By these measures the influential states are California, Hawaii, Mississippi, New Mexico, North Dakota, and Utah.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# matrix of influence measures: DFBETAS, DFFITS, cov ratio, Cook's d, hat values
influence_table <- influence.measures(state_prime_model)$infmat
influence_table <- data.frame(influence_table)
influence_table %>% filter(abs(cook.d) > 0.1)
## dfb.1_ dfb.Pplt dfb.Illt dfb.HS.G dffit cov.r
## California 0.3256118 -0.7352332 -0.1341715 -0.3031662 -0.8233090 1.5772610
## Hawaii -0.8348990 -0.2902838 0.8618180 0.8356482 0.9955323 1.0639448
## Utah 0.4514076 0.2065283 -0.1867060 -0.5544962 -0.7396401 0.8100085
## cook.d hat
## California 0.1689473 0.3736192
## Hawaii 0.2356830 0.2305314
## Utah 0.1262576 0.1033623
influence_table %>% filter(abs(dffit) > 0.5)
## dfb.1_ dfb.Pplt dfb.Illt dfb.HS.G dffit cov.r
## California 0.32561184 -0.7352332 -0.1341715 -0.30316616 -0.8233090 1.5772610
## Hawaii -0.83489897 -0.2902838 0.8618180 0.83564818 0.9955323 1.0639448
## Mississippi 0.02179137 0.1533320 -0.3167685 0.02426796 -0.5255979 1.0236858
## New Mexico 0.44773506 0.1960150 -0.5618657 -0.42862890 -0.6172660 1.1633894
## North Dakota 0.40530843 -0.2451028 -0.3511830 -0.33465575 0.5658469 0.7624464
## Utah 0.45140764 0.2065283 -0.1867060 -0.55449617 -0.7396401 0.8100085
## cook.d hat
## California 0.16894731 0.3736192
## Hawaii 0.23568297 0.2305314
## Mississippi 0.06737841 0.1150322
## New Mexico 0.09390762 0.1880312
## North Dakota 0.07362874 0.0610797
## Utah 0.12625762 0.1033623
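influence.measures also flags influential cases by its own internal criteria via the is.inf matrix; a quick sketch of reading that flag directly:

infl <- influence.measures(state_prime_model)
# states flagged as influential by at least one measure
rownames(infl$infmat)[apply(infl$is.inf, 1, any)]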
Question 3
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
hills.table <- hills
# U is an indicator (dummy) variable for observation 21, Kildcon Hill
U = rep(0, nrow(hills.table))
U[21] = 1
hills.table$U = U
hills.table
## dist climb time U
## Greenmantle 2.5 650 16.083 0
## Carnethy 6.0 2500 48.350 0
## Craig Dunain 6.0 900 33.650 0
## Ben Rha 7.5 800 45.600 0
## Ben Lomond 8.0 3070 62.267 0
## Goatfell 8.0 2866 73.217 0
## Bens of Jura 16.0 7500 204.617 0
## Cairnpapple 6.0 800 36.367 0
## Scolty 5.0 800 29.750 0
## Traprain 6.0 650 39.750 0
## Lairig Ghru 28.0 2100 192.667 0
## Dollar 5.0 2000 43.050 0
## Lomonds 9.5 2200 65.000 0
## Cairn Table 6.0 500 44.133 0
## Eildon Two 4.5 1500 26.933 0
## Cairngorm 10.0 3000 72.250 0
## Seven Hills 14.0 2200 98.417 0
## Knock Hill 3.0 350 78.650 0
## Black Hill 4.5 1000 17.417 0
## Creag Beag 5.5 600 32.567 0
## Kildcon Hill 3.0 300 15.950 1
## Meall Ant-Suidhe 3.5 1500 27.900 0
## Half Ben Nevis 6.0 2200 47.633 0
## Cow Hill 2.0 900 17.933 0
## N Berwick Law 3.0 600 18.683 0
## Creag Dubh 4.0 2000 26.217 0
## Burnswark 6.0 800 34.433 0
## Largo Law 5.0 950 28.567 0
## Criffel 6.5 1750 50.500 0
## Acmony 5.0 500 20.950 0
## Ben Nevis 10.0 4400 85.583 0
## Knockfarrel 6.0 600 32.383 0
## Two Breweries 18.0 5200 170.250 0
## Cockleroi 4.5 850 28.100 0
## Moffat Chase 20.0 5000 159.833 0
Part 1 The two quantities agree: the studentized residual for Kildcon Hill in model1 and the t value for U in model2 both equal 0.2054782, which is nowhere near significant.
model1 = lm(data = hills.table, time ~ dist + climb)
rstudent(model1)[21]
## Kildcon Hill
## 0.2054782
model2 = lm(data = hills.table, time ~ .)
summary(model2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.20070756 4.48509033 -2.0513985 4.875911e-02
## dist 6.22407066 0.61107644 10.1854207 2.062760e-11
## climb 0.01108789 0.00209136 5.3017600 9.039717e-06
## U 3.15212936 15.34045674 0.2054782 8.385419e-01
summary(model2)$coef[4,3]
## [1] 0.2054782
Part 2 Both are 0.04222: the partial F statistic for adding U equals the squared studentized residual (F = t²).
model3 <- anova(model1, model2)
summary(model3)
## Res.Df RSS Df Sum of Sq F
## Min. :31.00 Min. :6882 Min. :1 Min. :9.374 Min. :0.04222
## 1st Qu.:31.25 1st Qu.:6885 1st Qu.:1 1st Qu.:9.374 1st Qu.:0.04222
## Median :31.50 Median :6887 Median :1 Median :9.374 Median :0.04222
## Mean :31.50 Mean :6887 Mean :1 Mean :9.374 Mean :0.04222
## 3rd Qu.:31.75 3rd Qu.:6890 3rd Qu.:1 3rd Qu.:9.374 3rd Qu.:0.04222
## Max. :32.00 Max. :6892 Max. :1 Max. :9.374 Max. :0.04222
## NA's :1 NA's :1 NA's :1
## Pr(>F)
## Min. :0.8385
## 1st Qu.:0.8385
## Median :0.8385
## Mean :0.8385
## 3rd Qu.:0.8385
## Max. :0.8385
## NA's :1
(rstudent(model1)[21])^2
## Kildcon Hill
## 0.04222129
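This is the general identity that a partial F-test for a single added coefficient equals the square of its t statistic; checking it from the other direction (a quick sketch):

# square of the t value for U in model2; equals the partial F above
summary(model2)$coef[4, 3]^2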
Part 3 and 4 Deleting Kildcon Hill and refitting gives exactly the same dist and climb coefficients as keeping it and adding the indicator U: the indicator absorbs that observation completely, so the two fits agree on the remaining terms.
hills.table2 <- hills.table[-21,]   # drop Kildcon Hill (observation 21)
model1 = lm(data = hills.table2, time ~ dist + climb)
model1$coefficients
## (Intercept)        dist       climb
## -9.20070756  6.22407066  0.01108789
model2$coefficients
## (Intercept) dist climb U
## -9.20070756 6.22407066 0.01108789 3.15212936
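A direct numerical confirmation of the equivalence (a sketch; the two coefficient vectors should match exactly):

# leave-one-out fit vs. dummy-variable fit: identical dist and climb terms
all.equal(model1$coefficients, model2$coefficients[1:3])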
Question 4
set.seed(321)
X1 = runif(20,-1,1)
X2 = runif(20,-1,1)
X3 = runif(20,-1,1)
epsilon = rnorm(20)*0.5
Y = 0.5 + 3 * X1 + 4 * X2^2 + 0 * X3 + epsilon   # linear in X1, quadratic in X2, X3 irrelevant
toydata = data.frame(Y, X1, X2, X3)
model = lm(Y~., data = toydata)
Part 1 X1 is the only predictor whose added-variable plot shows a clear linear trend, so it should be included as a linear term.
library(car)
avPlots(model, 'X1')
avPlots(model, 'X2')
avPlots(model, 'X3')
Part 2 X1 and X3 should be used as linear terms; the component+residual plot for X2 is curved, which suggests a quadratic term in X2. See the sketch after the plots below.
crPlots(model, "X1")
crPlots(model, "X2")
crPlots(model, "X3")
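Consistent with the curvature in the X2 component+residual plot, one can refit with a quadratic term (a sketch; since the true mean really is quadratic in X2, the fit should improve sharply):

# hypothetical follow-up fit suggested by the crPlots
model_quad <- lm(Y ~ X1 + I(X2^2) + X3, data = toydata)
summary(model_quad)$adj.r.squared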
Question 5
tomasetti = read.csv("http://people.math.binghamton.edu/qiao/data501/data/Tomasetti.csv")
# cluster_type: 1 = Replicative, 0 = Deterministic
tomasetti$cluster_type <- ifelse(tomasetti$Cluster == 'Replicative', 1, 0)
tomasetti$cluster_type <- factor(tomasetti$cluster_type)
head(tomasetti)
## Type Risk Lscd Cluster
## 1 Acute myeloid leukemia 0.0041 1.30e+11 Replicative
## 2 Basal cell carcinoma 0.3000 3.55e+12 Deterministic
## 3 Chronic lymphocytic leukemia 0.0052 1.30e+11 Replicative
## 4 Colorectal adenocarcinoma 0.0480 1.17e+12 Deterministic
## 5 Colorectal adenocarcinoma with FAP 1.0000 1.17e+12 Deterministic
## 6 Colorectal adenocarcinoma with Lynch syndrome 0.5000 1.17e+12 Deterministic
## cluster_type
## 1 1
## 2 0
## 3 1
## 4 0
## 5 0
## 6 0
Part 1 Reading off the fitted coefficients: for deterministic tumors (cluster_type = 0), predicted log(Risk) = −15.88106 + 0.44557·log(Lscd) + 4.50801·Risk; for replicative tumors (cluster_type = 1), predicted log(Risk) = −15.88106 + 0.44557·log(Lscd) − 2.40291·Risk.
tomasetti.lm <- lm(data = tomasetti, log(Risk) ~ log(Lscd) + Risk:cluster_type)
summary(tomasetti.lm)
##
## Call:
## lm(formula = log(Risk) ~ log(Lscd) + Risk:cluster_type, data = tomasetti)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1447 -1.0083 0.2467 0.7036 2.7768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.88106 1.73849 -9.135 9.53e-10 ***
## log(Lscd) 0.44557 0.08275 5.385 1.09e-05 ***
## Risk:cluster_type0 4.50801 1.69135 2.665 0.0128 *
## Risk:cluster_type1 -2.40291 67.47631 -0.036 0.9719
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.559 on 27 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.701
## F-statistic: 24.44 on 3 and 27 DF, p-value: 7.429e-08
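Note that the formula above interacts Risk itself (the response scale) with cluster_type. If the intent was separate intercepts and log(Lscd) slopes for the two clusters (an assumption on my part, not what the output above shows), the specification would look like this sketch:

# hypothetical alternative: separate intercepts and slopes by cluster
tomasetti.sep <- lm(log(Risk) ~ log(Lscd) * cluster_type, data = tomasetti)
summary(tomasetti.sep)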
Part 2
library(ggplot2)
ggplot(tomasetti, aes(x = log(Lscd), y = log(Risk), color = cluster_type, shape = cluster_type)) +
  geom_point(size = 3) +
  # the line coefficients below are hard-coded from an earlier fit
  geom_line(aes(y = -17.96976 + (0.56117 - 72.89427) * log(Lscd)), color = "green", linewidth = 0.5) +
  geom_line(aes(y = -17.96976 + 0.56117 * log(Lscd)), color = "purple", linewidth = 0.5)
Part 3 The p-value is 0.02501, which is less than 0.05, so we reject the null hypothesis that the reduced model (without the cluster terms) is adequate.
tomasetti.original <- lm(data = tomasetti, log(Risk) ~ log(Lscd))
anova(tomasetti.lm, tomasetti.original)
## Analysis of Variance Table
##
## Model 1: log(Risk) ~ log(Lscd) + Risk:cluster_type
## Model 2: log(Risk) ~ log(Lscd)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 65.649
## 2 29 86.275 -2 -20.626 4.2414 0.02501 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 6 Part 1 There are 50 workers, so SST has n − 1 = 49 degrees of freedom (used below).
Part 2
SSE = 338.449
SSR = 98.8313
SST = SSE + SSR
var_y = SST/49   # sample variance of Y uses n - 1 = 49 degrees of freedom
var_y
## [1] 8.924088
Part 3
r2 = SSR/SST
r2
## [1] 0.2260136
Part 4 B0, the intercept, is the average weekly wage for females; B1 is the male − female difference in average weekly wages; so B0 + B1 is the average weekly wage for males.
Part 5
# 95% confidence interval for B1, the male - female difference in average wages
t_val <- qt(.975, df = 49)
B1 = -2.81
se = 0.75
lower = B1 - t_val*se
upper = B1 + t_val*se
print(paste(lower, upper))
## [1] "-4.31718142784693 -1.30281857215307"
Part 6 Null: the average weekly wage of men is equal to that of women. Alternative: the average weekly wage of men is not equal to that of women. The p-value of 0.0005 comes from the t-test on B1. Since 0.0005 is less than 0.05, we reject the null hypothesis that the average weekly wages of men and women are equal.
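The quoted p-value can be reproduced from the reported coefficient and standard error (a sketch; the residual degrees of freedom n − 2 = 48 is an assumption based on n = 50 workers and two coefficients):

# two-sided p-value for H0: beta1 = 0
t_stat <- -2.81 / 0.75
2 * pt(-abs(t_stat), df = 48)   # approximately 0.0005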
Question 7 Part 1: Japan
Part 2 Null: given the horsepower of a car, the expected price is the same regardless of the country it is from (Model 1). Alternative: given the horsepower of a car, the expected price varies with the country it is from (Model 2). We reject the null hypothesis because the p-value is less than 0.05.
# F statistics computed from the mean squares given in the problem
f1 = 4604.7/18.2323
f2 = 1204.71/16.3566
pval = 1 - pf(f2, 4, 85)
pval
## [1] 0
Part 3 Null: there is no interaction between horsepower and country. Alternative: there is an interaction between horsepower and country. We reject the null hypothesis because the p-value is less than 0.05.
f3 = 698.471/16.0957   # F statistic for the interaction terms
pval = 1 - pf(f3, 7, 82)
pval
## [1] 0