R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

myIllustration <- function(data_generating_process, beta_1 = 0, n = 20, nTrials = 1000){
  # Simulate nTrials datasets from the given data-generating process, fit a
  # simple linear regression to each, and record the t-statistic for the slope.
  Tstats = numeric(nTrials)
  for(i in 1:nTrials){
    X = seq(-1, 1, length.out = n)
    Y = data_generating_process(X, beta_1)
    Tstats[i] <- summary(lm(Y ~ X))$coefficients[2, 3]
  }
  DF <- data.frame(t = Tstats)
  # standard diagnostic plots for the last simulated dataset only
  par(mfrow = c(2, 2))
  plot(lm(Y ~ X))
  cutoff <- qt(0.975, df = n - 2)
  library(ggplot2)
  # overlay the histogram, the estimated density, and the reference t density;
  # shade the nominal 5% rejection region |t| > cutoff
  ggplot(DF, aes(x = t)) +
    geom_histogram(aes(y = after_stat(density)), bins = 30) +
    geom_density(lwd = 2, col = "blue") +
    stat_function(fun = dt, args = list(df = n - 2), lwd = 2, col = "red") +
    annotate("rect", xmin = -Inf, xmax = -cutoff, ymin = -Inf, ymax = Inf, alpha = .2, fill = "orange") +
    annotate("rect", xmin = cutoff, xmax = Inf, ymin = -Inf, ymax = Inf, alpha = .2, fill = "orange") +
    labs(title = paste("Falsely reject the null hypothesis for about ",
                       100 * mean(abs(Tstats) > cutoff), "% of the time.", sep = ""))
}

Part 1 True model: Y_i = 0.5 + 0·X_i + ε_i, where ε_i ∼ N(0, X_i^2)

set.seed(321)
mild_unequal_var_model <- function(X, beta_1){
  epsilon = rnorm(length(X)) * X        # error SD grows with |X|: mild heteroscedasticity
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(mild_unequal_var_model)


Part 2 True model: Y_i = 0.5 + 0·X_i + ε_i, where ε_i ∼ N(0, exp(1 + 10·X_i)^2)

set.seed(321)
strong_unequal_var_model <- function(X, beta_1){
  epsilon = rnorm(length(X)) * exp(1 + 10 * X)   # error SD varies over orders of magnitude
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(strong_unequal_var_model)

Part 3 True model: Y_i = 0.5 + 0·X_i + ε_i, where ε_i ∼ t_2 (a t distribution with 2 degrees of freedom)

No model assumption appears to be seriously violated in the diagnostics (even though the t_2 errors are heavier-tailed than normal).

- There is no visible pattern in the "Residuals vs Fitted" plot that suggests a different regression function.
- The standardized residuals show a few visible departures from normality in the "Normal Q-Q" plot, but not enough to invalidate the normality assumption.
- There is nothing in the "Scale-Location" plot that suggests unequal error variance.
- There is no obvious outlier or influential point in the "Residuals vs Leverage" plot.
- The histogram/estimated density of the simulated t-statistics is consistent with the reference t distribution.
- If we reject the null hypothesis when |t| > t_{α/2}, we falsely reject about 5% of the time. In particular, we are not making false rejections more often than the nominal 5% level we signed up for.

set.seed(321)
t_error_model <- function(X, beta_1){
  epsilon = rt(length(X), df = 2)   # heavy-tailed t errors with 2 degrees of freedom
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(t_error_model)

Part 4 True model: Y_i = 0.5 + 0·X_i + ε_i, where ε_i ∼ Cauchy(0, 1)

The normality assumption has been violated.

- There is no visible pattern in the "Residuals vs Fitted" plot that suggests a different regression function.
- The standardized residuals show several visible departures from normality in the "Normal Q-Q" plot and do not follow the 45-degree line.
- There is nothing in the "Scale-Location" plot that suggests unequal error variance.
- There is no obvious outlier or influential point in the "Residuals vs Leverage" plot.
- The histogram/estimated density of the simulated t-statistics is roughly consistent with the reference t distribution.
- If we reject the null hypothesis when |t| > t_{α/2}, we falsely reject about 4% of the time. In particular, we are not making false rejections more often than the nominal 5% level we signed up for.

set.seed(321)
cauchy_error_model <- function(X, beta_1){
  epsilon = rcauchy(length(X))      # standard Cauchy errors: extremely heavy tails
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(cauchy_error_model)

Part 5 True model: Y_i = 0.5 + 0·X_i + ε_i, where ε_i = exp(Z_i) - exp(0.5) with Z_i ∼ N(0, 1) (a centered lognormal, right-skewed error)

The normality and constant variance assumptions appear to have been violated.

- There appears to be a quadratic pattern in the "Residuals vs Fitted" plot, suggesting a different regression function.
- The standardized residuals show several visible departures from normality in the "Normal Q-Q" plot and do not follow the 45-degree line.
- There is a quadratic pattern in the "Scale-Location" plot that suggests unequal error variance.
- There is no obvious outlier or influential point in the "Residuals vs Leverage" plot.
- The histogram/estimated density of the simulated t-statistics is roughly consistent with the reference t distribution.
- If we reject the null hypothesis when |t| > t_{α/2}, we falsely reject about 4% of the time. In particular, we are not making false rejections more often than the nominal 5% level we signed up for.

set.seed(321)
skewed_error_model <- function(X, beta_1){
  epsilon = exp(rnorm(length(X))) - exp(0.5)   # centered lognormal: mean zero, right-skewed
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(skewed_error_model)

Part 6 True model: Y_i = 0.5 + 0·X_i + ε_i, where ε_1 = Z_1 and ε_i = Z_{i-1} + Z_i for i ≥ 2, with Z_i iid N(0, 1) (autocorrelated errors)

The independence assumption has been violated: the errors are autocorrelated.

- There is a visible quadratic pattern in the "Residuals vs Fitted" plot that suggests a different regression function.
- The standardized residuals show several visible departures from normality in the "Normal Q-Q" plot and do not follow the 45-degree line.
- There is a pattern in the "Scale-Location" plot that suggests unequal error variance.
- There is no obvious outlier or influential point in the "Residuals vs Leverage" plot.
- The histogram/estimated density of the simulated t-statistics is not consistent with the reference t distribution.
- If we reject the null hypothesis when |t| > t_{α/2}, we falsely reject about 16% of the time, far more often than the nominal 5% level we signed up for.

set.seed(321)
dependent_error_model <- function(X, beta_1){
  n = length(X)
  epsilon = rnorm(n)
  epsilon[2:n] = epsilon[1:(n-1)] + epsilon[2:n]   # each error shares a draw with its neighbor
  Y = 0.5 + beta_1 * X + epsilon
  return(Y)
}
myIllustration(dependent_error_model)
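The inflated rejection rate is expected: consecutive errors share one of the underlying normal draws, so Corr(ε_i, ε_{i+1}) = 0.5, and OLS standard errors, which assume independent errors, are too small. A quick standalone check of that correlation (a sketch, separate from the illustration above):

set.seed(1)
z <- rnorm(1e5)
eps <- head(z, -1) + tail(z, -1)      # same moving-sum construction as dependent_error_model
cor(head(eps, -1), tail(eps, -1))     # close to 0.5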

Part 7 True model: Y_i = 0.5 + 0·X_i + 2·X_i^2 + ε_i, where ε_i ∼ N(0, 1)

The linearity assumption has been violated: the true regression function includes a 2X^2 term that the fitted model omits.

- There is no strong visible pattern in the "Residuals vs Fitted" plot; because the fitted line is nearly flat, the omitted quadratic structure shows up elsewhere.
- The standardized residuals show several visible departures from normality in the "Normal Q-Q" plot, but not enough to invalidate the normality assumption.
- There is a pattern in the "Scale-Location" plot that suggests unequal error variance.
- There is no obvious outlier or influential point in the "Residuals vs Leverage" plot.
- The histogram/estimated density of the simulated t-statistics is roughly consistent with the reference t distribution.
- If we reject the null hypothesis when |t| > t_{α/2}, we falsely reject only about 2% of the time, less often than the nominal 5% level.

set.seed(321)
quadratic_model <- function(X, beta_1){
  epsilon = rnorm(length(X))
  Y = 0.5 + beta_1 * X + 2 * X^2 + epsilon   # true mean is quadratic in X
  return(Y)
}
myIllustration(quadratic_model)

Question 2

state.data = data.frame(state.x77)
head(state.data)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
library(GGally)

Part 1

state_pairs <- state.data[c(-4, -5, -7, -8)]   # keep Population, Income, Illiteracy, HS.Grad
ggpairs(data = state_pairs)

Part 2 Population and HS.Grad are statistically significant at the 0.05 level.

state_model <- lm(data = state_pairs, Income ~ .)
summary(state_model)
## 
## Call:
## lm(formula = Income ~ ., data = state_pairs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -987.30 -213.67  -50.68  219.74 1430.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1940.74115  710.64526   2.731 0.008924 ** 
## Population     0.03786    0.01500   2.524 0.015129 *  
## Illiteracy   -73.57563  145.07584  -0.507 0.614470    
## HS.Grad       45.57445   10.93778   4.167 0.000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 465.8 on 46 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.4254 
## F-statistic: 13.09 on 3 and 46 DF,  p-value: 2.612e-06

Part 3

- The Residuals vs Fitted and Scale-Location plots may have a quadratic shape, which may suggest nonconstant error variance.
- The Q-Q plot suggests that the errors are not normally distributed.
- The Residuals vs Leverage plot suggests that Alaska is influential.

plot(state_model)

Part 4

plot(abs(dffits(state_model)), pch=23, bg='orange', cex=2, ylab="DFFITS")

state_pairs[which(abs(dffits(state_model)) > 0.5),]
##              Population Income Illiteracy HS.Grad
## Alaska              365   6315        1.5    66.7
## California        21198   5114        1.1    62.6
## Mississippi        2341   3098        2.4    41.0
## New Mexico         1144   3601        2.2    55.2
## North Dakota        637   5087        0.8    50.3
## Utah               1203   4022        0.6    67.3
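The 0.5 cutoff above is a convenient round number; a common size-adjusted rule of thumb instead flags |DFFITS| > 2*sqrt(p/n). A minimal check, with p = 4 coefficients and n = 50 states:

# conventional DFFITS cutoff 2 * sqrt(p / n)
2 * sqrt(length(coef(state_model)) / nrow(state_pairs))   # about 0.57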

Part 5

plot(cooks.distance(state_model), pch=23, bg='orange', cex=2, ylab="Cook's distance")

state_pairs[which(cooks.distance(state_model) > 0.1),]
##            Population Income Illiteracy HS.Grad
## Alaska            365   6315        1.5    66.7
## California      21198   5114        1.1    62.6
## New Mexico       1144   3601        2.2    55.2
## Utah             1203   4022        0.6    67.3
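Likewise, the 0.1 threshold for Cook's distance is a judgment call; one common rule of thumb flags observations with Cook's distance above 4/n:

# rule-of-thumb Cook's distance cutoff 4 / n
4 / nrow(state_pairs)   # 0.08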

Part 6

plot(hatvalues(state_model), pch=23, bg='orange', cex=2, ylab='Hat values')

state_pairs[which(hatvalues(state_model) > 0.3),]
##            Population Income Illiteracy HS.Grad
## California      21198   5114        1.1    62.6
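The 0.3 cutoff here can be compared with the usual leverage benchmark of twice the average hat value, 2p/n:

# average leverage is p / n; a common flag is anything above twice that
2 * length(coef(state_model)) / nrow(state_pairs)   # 0.16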

Part 7 After removing Alaska, the overall model p-value decreased (2.612e-06 to 6.723e-07) and the adjusted R-squared increased (0.4254 to 0.4663).

library(car)
## Loading required package: carData
outlierTest(state_model)
##        rstudent unadjusted p-value Bonferroni p
## Alaska 3.905347         0.00031274     0.015637
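# The Bonferroni p above is the unadjusted two-sided p-value multiplied by the
# number of cases tested (n = 50); a quick check by hand, with rstudent on
# n - p - 1 = 45 degrees of freedom:
2 * pt(-3.905347, df = 45) * 50   # about 0.0156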
state_prime <- state_pairs[-2,]   # drop Alaska (row 2), flagged by outlierTest
state_prime_model <- lm(data = state_prime, Income ~ .)
summary(state_prime_model)
## 
## Call:
## lm(formula = Income ~ ., data = state_prime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -806.64 -223.85  -94.83  228.49  895.23 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2922.00458  669.84639   4.362 7.42e-05 ***
## Population     0.04463    0.01322   3.375  0.00153 ** 
## Illiteracy  -248.72103  134.46160  -1.850  0.07092 .  
## HS.Grad       29.75007   10.38054   2.866  0.00630 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 407 on 45 degrees of freedom
## Multiple R-squared:  0.4997, Adjusted R-squared:  0.4663 
## F-statistic: 14.98 on 3 and 45 DF,  p-value: 6.723e-07

Part 8 California, Hawaii, Mississippi, New Mexico, North Dakota, Utah

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
influence_table <- data.frame(influence.measures(state_prime_model)$infmat)
influence_table %>% filter(abs(cook.d) > 0.1)
##                dfb.1_   dfb.Pplt   dfb.Illt   dfb.HS.G      dffit     cov.r
## California  0.3256118 -0.7352332 -0.1341715 -0.3031662 -0.8233090 1.5772610
## Hawaii     -0.8348990 -0.2902838  0.8618180  0.8356482  0.9955323 1.0639448
## Utah        0.4514076  0.2065283 -0.1867060 -0.5544962 -0.7396401 0.8100085
##               cook.d       hat
## California 0.1689473 0.3736192
## Hawaii     0.2356830 0.2305314
## Utah       0.1262576 0.1033623
influence_table %>% filter(abs(dffit) > 0.5)
##                   dfb.1_   dfb.Pplt   dfb.Illt    dfb.HS.G      dffit     cov.r
## California    0.32561184 -0.7352332 -0.1341715 -0.30316616 -0.8233090 1.5772610
## Hawaii       -0.83489897 -0.2902838  0.8618180  0.83564818  0.9955323 1.0639448
## Mississippi   0.02179137  0.1533320 -0.3167685  0.02426796 -0.5255979 1.0236858
## New Mexico    0.44773506  0.1960150 -0.5618657 -0.42862890 -0.6172660 1.1633894
## North Dakota  0.40530843 -0.2451028 -0.3511830 -0.33465575  0.5658469 0.7624464
## Utah          0.45140764  0.2065283 -0.1867060 -0.55449617 -0.7396401 0.8100085
##                  cook.d       hat
## California   0.16894731 0.3736192
## Hawaii       0.23568297 0.2305314
## Mississippi  0.06737841 0.1150322
## New Mexico   0.09390762 0.1880312
## North Dakota 0.07362874 0.0610797
## Utah         0.12625762 0.1033623

Question 3

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
hills.table <- hills
U = rep(0, nrow(hills.table))
U[21] = 1   # mean-shift indicator for case 21, Kildcon Hill
hills.table$U = U
hills.table
##                  dist climb    time U
## Greenmantle       2.5   650  16.083 0
## Carnethy          6.0  2500  48.350 0
## Craig Dunain      6.0   900  33.650 0
## Ben Rha           7.5   800  45.600 0
## Ben Lomond        8.0  3070  62.267 0
## Goatfell          8.0  2866  73.217 0
## Bens of Jura     16.0  7500 204.617 0
## Cairnpapple       6.0   800  36.367 0
## Scolty            5.0   800  29.750 0
## Traprain          6.0   650  39.750 0
## Lairig Ghru      28.0  2100 192.667 0
## Dollar            5.0  2000  43.050 0
## Lomonds           9.5  2200  65.000 0
## Cairn Table       6.0   500  44.133 0
## Eildon Two        4.5  1500  26.933 0
## Cairngorm        10.0  3000  72.250 0
## Seven Hills      14.0  2200  98.417 0
## Knock Hill        3.0   350  78.650 0
## Black Hill        4.5  1000  17.417 0
## Creag Beag        5.5   600  32.567 0
## Kildcon Hill      3.0   300  15.950 1
## Meall Ant-Suidhe  3.5  1500  27.900 0
## Half Ben Nevis    6.0  2200  47.633 0
## Cow Hill          2.0   900  17.933 0
## N Berwick Law     3.0   600  18.683 0
## Creag Dubh        4.0  2000  26.217 0
## Burnswark         6.0   800  34.433 0
## Largo Law         5.0   950  28.567 0
## Criffel           6.5  1750  50.500 0
## Acmony            5.0   500  20.950 0
## Ben Nevis        10.0  4400  85.583 0
## Knockfarrel       6.0   600  32.383 0
## Two Breweries    18.0  5200 170.250 0
## Cockleroi         4.5   850  28.100 0
## Moffat Chase     20.0  5000 159.833 0

Part 1 Both are 0.2054782: the studentized residual for Kildcon Hill under model1 equals the t statistic for its indicator U under model2. This is the standard equivalence between an externally studentized residual and the t-test of a mean-shift (case indicator) term, and here it is nowhere near significant.

model1 = lm(data = hills.table, time ~ dist + climb)
rstudent(model1)[21]
## Kildcon Hill 
##    0.2054782
model2 = lm(data = hills.table, time ~ .)
summary(model2)$coef
##                Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept) -9.20070756  4.48509033 -2.0513985 4.875911e-02
## dist         6.22407066  0.61107644 10.1854207 2.062760e-11
## climb        0.01108789  0.00209136  5.3017600 9.039717e-06
## U            3.15212936 15.34045674  0.2054782 8.385419e-01
summary(model2)$coef[4,3]
## [1] 0.2054782

Part 2 Both are 0.04222: the F statistic for comparing the two models equals the squared studentized residual (0.2054782^2 = 0.04222129).

model3 <- anova(model1, model2)
summary(model3)   # printing model3 directly would show the usual ANOVA comparison table
##      Res.Df           RSS             Df      Sum of Sq           F          
##  Min.   :31.00   Min.   :6882   Min.   :1   Min.   :9.374   Min.   :0.04222  
##  1st Qu.:31.25   1st Qu.:6885   1st Qu.:1   1st Qu.:9.374   1st Qu.:0.04222  
##  Median :31.50   Median :6887   Median :1   Median :9.374   Median :0.04222  
##  Mean   :31.50   Mean   :6887   Mean   :1   Mean   :9.374   Mean   :0.04222  
##  3rd Qu.:31.75   3rd Qu.:6890   3rd Qu.:1   3rd Qu.:9.374   3rd Qu.:0.04222  
##  Max.   :32.00   Max.   :6892   Max.   :1   Max.   :9.374   Max.   :0.04222  
##                                 NA's   :1   NA's   :1       NA's   :1        
##      Pr(>F)      
##  Min.   :0.8385  
##  1st Qu.:0.8385  
##  Median :0.8385  
##  Mean   :0.8385  
##  3rd Qu.:0.8385  
##  Max.   :0.8385  
##  NA's   :1
(rstudent(model1)[21])^2
## Kildcon Hill 
##   0.04222129

Part 3 and 4 Deleting Kildcon Hill (case 21) and refitting model1 reproduces the dist and climb coefficients of model2 exactly: fitting with a mean-shift indicator U for a case is equivalent to deleting that case.

hills.table2 <- hills.table[-21,]   # drop Kildcon Hill (case 21)
model1 = lm(data = hills.table2, time ~ dist + climb)
model1$coefficients
## (Intercept)        dist       climb 
## -9.20070756  6.22407066  0.01108789
model2$coefficients
## (Intercept)        dist       climb           U 
## -9.20070756  6.22407066  0.01108789  3.15212936

Question 4

set.seed(321)
X1 = runif(20,-1,1)
X2 = runif(20,-1,1)
X3 = runif(20,-1,1)
epsilon = rnorm(20)*0.5
Y = 0.5 + 3 * X1 + 4 * X2^2 + 0 * X3 + epsilon
toydata = data.frame(Y, X1, X2, X3)
model = lm(Y~., data = toydata)

Part 1 X1 is the only predictor whose added-variable plot shows a clear linear trend, hence it should be included as a linear term.

library(car)
avPlots(model, 'X1')

avPlots(model, 'X2')

avPlots(model, 'X3')
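For intuition, an added-variable plot can be rebuilt by hand: it is the scatter of the residuals of Y regressed on the other predictors against the residuals of the focal predictor regressed on the other predictors. A minimal sketch for X1 (what avPlots draws, up to labels and styling):

res_y  <- resid(lm(Y ~ X2 + X3, data = toydata))
res_x1 <- resid(lm(X1 ~ X2 + X3, data = toydata))
plot(res_x1, res_y, xlab = "X1 | X2, X3", ylab = "Y | X2, X3")
abline(lm(res_y ~ res_x1))   # this slope equals coef(model)["X1"]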

Part 2 X1 and X3 should be used as linear terms; the component+residual plot for X2 shows clear curvature, suggesting that X2 should enter quadratically.

crPlots(model, "X1")

crPlots(model, "X2")

crPlots(model, "X3")

Question 5

tomasetti = read.csv("http://people.math.binghamton.edu/qiao/data501/data/Tomasetti.csv")
tomasetti$cluster_type <- ifelse(tomasetti$Cluster == 'Replicative', 1, 0)   # 1 = Replicative, 0 = Deterministic
tomasetti$cluster_type <- factor(tomasetti$cluster_type)
head(tomasetti)
##                                            Type   Risk     Lscd       Cluster
## 1                        Acute myeloid leukemia 0.0041 1.30e+11   Replicative
## 2                          Basal cell carcinoma 0.3000 3.55e+12 Deterministic
## 3                  Chronic lymphocytic leukemia 0.0052 1.30e+11   Replicative
## 4                     Colorectal adenocarcinoma 0.0480 1.17e+12 Deterministic
## 5            Colorectal adenocarcinoma with FAP 1.0000 1.17e+12 Deterministic
## 6 Colorectal adenocarcinoma with Lynch syndrome 0.5000 1.17e+12 Deterministic
##   cluster_type
## 1            1
## 2            0
## 3            1
## 4            0
## 5            0
## 6            0

Part 1 Reading off the fitted coefficients: Replicative: predicted log(Risk) = -15.88106 + 0.44557·log(Lscd) - 2.40291·Risk. Deterministic: predicted log(Risk) = -15.88106 + 0.44557·log(Lscd) + 4.50801·Risk. (Note that the interaction term Risk:cluster_type puts the response-scale Risk on the right-hand side; an interaction with log(Lscd) was likely intended.)

tomasetti.lm <- lm(data = tomasetti, log(Risk) ~ log(Lscd) + Risk:cluster_type)
summary(tomasetti.lm)
## 
## Call:
## lm(formula = log(Risk) ~ log(Lscd) + Risk:cluster_type, data = tomasetti)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1447 -1.0083  0.2467  0.7036  2.7768 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -15.88106    1.73849  -9.135 9.53e-10 ***
## log(Lscd)            0.44557    0.08275   5.385 1.09e-05 ***
## Risk:cluster_type0   4.50801    1.69135   2.665   0.0128 *  
## Risk:cluster_type1  -2.40291   67.47631  -0.036   0.9719    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.559 on 27 degrees of freedom
## Multiple R-squared:  0.7309, Adjusted R-squared:  0.701 
## F-statistic: 24.44 on 3 and 27 DF,  p-value: 7.429e-08

Part 2

library(ggplot2)
ggplot(tomasetti, aes(x = log(Lscd), y = log(Risk), color = cluster_type, shape = cluster_type)) +
  geom_point(size = 3) +
  # draw the fitted values from tomasetti.lm for each cluster type
  geom_line(aes(y = fitted(tomasetti.lm), group = cluster_type), linewidth = 0.5)

Part 3 The p-value is 0.02501, which is less than 0.05, so we reject the null hypothesis that the reduced model (without the cluster terms) is adequate.

tomasetti.original <- lm(data = tomasetti, log(Risk) ~ log(Lscd))
anova(tomasetti.lm, tomasetti.original)
## Analysis of Variance Table
## 
## Model 1: log(Risk) ~ log(Lscd) + Risk:cluster_type
## Model 2: log(Risk) ~ log(Lscd)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     27 65.649                              
## 2     29 86.275 -2   -20.626 4.2414 0.02501 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 6

Part 1 There are 50 workers.

Part 2

SSE = 338.449
SSR = 98.8313
SST = SSE + SSR
var_y = SST/49
var_y
## [1] 8.924088

Part 3

r2 = SSR/SST
r2
## [1] 0.2260136

Part 4 B0, the intercept, is the mean weekly wage for females. B1 is the difference in mean weekly wages between males and females (males minus females). B0 + B1 is the mean weekly wage for males.

Part 5

t_val <- qt(.975, df = 49)
B1 = -2.81
se = 0.75
lower = B1 - t_val*se
upper = B1 + t_val*se

print(paste(lower, upper))
## [1] "-4.31718142784693 -1.30281857215307"

Part 6 Null: the average weekly wage of men equals that of women. Alternative: the average weekly wage of men differs from that of women. The p-value of 0.0005 comes from a t-test on B1. Since 0.0005 is less than 0.05, we reject the null hypothesis that the average weekly wages of men and women are equal.
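As a sanity check, this p-value is consistent with the estimate and standard error used above; a minimal sketch, assuming the t-test has n - 2 = 48 degrees of freedom:

# two-sided p-value for t = B1 / se, with B1 = -2.81 and se = 0.75
2 * pt(-abs(-2.81 / 0.75), df = 48)   # roughly 0.0005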

Question 7

Part 1 Japan

Part 2 Null: given the horsepower of a car, the mean price is the same no matter which country it is from (Model 1). Alternative: given the horsepower, the mean price varies with the country (Model 2). We reject the null hypothesis because the p-value is less than 0.05.

f1 = 4604.7/18.2323
f2 = 1204.71/16.3566
pval = 1 - pf(f2, 4, 85)
pval
## [1] 0
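A p-value that prints as exactly 0 is a display artifact: 1 - pf(...) underflows to zero in double precision once the tail probability drops below machine epsilon. The upper-tail form recovers the actual tiny value (the same applies to the interaction test below):

pf(f2, 4, 85, lower.tail = FALSE)   # very small, but nonzero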

Null: there is no interaction between horsepower and country. Alternative: there is an interaction between horsepower and country. We reject the null hypothesis because the p-value is less than 0.05.

f3 = 698.471/16.0957
pval = 1 - pf(f3, 7, 82)
pval
## [1] 0