Additional Material: ANOVA tutorial
https://datascienceplus.com/one-way-anova-in-r/
Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone because exposure of the eyes to the shifted light schedule resets the internal circadian rhythm, a process called “phase shift”. Campbell and Murphy (1998), in a highly controversial study, reported that the human circadian clock can also be reset by exposing only the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only, or to neither (control group). The effect of treatment on the circadian rhythm was measured two days later by the magnitude of phase shift (in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, the predicted effect of light treatment, while a positive number indicates an advance.
Raw data of phase shift, in hours, for the circadian rhythm experiment
Treatment | Phase Shift (hr) |
---|---|
Control | 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27 |
Knees | 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61 |
Eyes | -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83 |
Consider the following incomplete R output:
Source | Df | Sum of Squares | Mean Squares | F-statistic | p-value
---|---|---|---|---|---
Treatments | ? | ? | 3.6122 | ? | 0.004
Error | ? | 9.415 | ? | |
TOTAL | ? | ? | | |
Fill in the missing values in the analysis of variance table. Note: missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.
#Create Dataframe
Phase_Shift <- c(0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27,
0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61,
-0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83)
Treatments <- c("Control", "Control", "Control", "Control", "Control", "Control", "Control","Control",
"Knees","Knees","Knees","Knees","Knees","Knees","Knees",
"Eyes","Eyes","Eyes","Eyes","Eyes","Eyes","Eyes")
Treatments <- as.factor(Treatments)
df <- data.frame(Phase_Shift,Treatments )
head(df)
#ANOVA
res.aov <- aov(Phase_Shift ~ Treatments, data = df )
#summary
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatments 2 7.224 3.612 7.289 0.00447 **
## Residuals 19 9.415 0.496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Answer:
Note: Answers are repeated here from the ANOVA table above
\(Df_{Treatments} = k - 1 = 3 - 1 = 2\)
\(Df_{Error} = N - k = 22 - 3 = 19\)
\(Df_{Total} = Df_{Treatments} + Df_{Error} = (k - 1) + (N - k) = N - 1 = 21\)
\(SSTR = MSTR \times (k - 1) = 3.6122 \times 2 = 7.224\)
\(SST = SSTR + SSE = 7.224 + 9.415 = 16.639\)
\(MSE = SSE / (N - k) = 9.415 / 19 = 0.4955\)
\(F = MSTR / MSE = 3.6122 / 0.4955 = 7.29\)
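These entries can also be reproduced in R from just the two quantities given in the incomplete table (a quick sketch; MSTR and SSE are the names I use for the given treatment mean square and error sum of squares):
k <- 3; N <- 22                  # number of treatment groups; total subjects
MSTR <- 3.6122; SSE <- 9.415     # values given in the incomplete table
df_trt <- k - 1                  # 2
df_err <- N - k                  # 19
SSTR <- MSTR * df_trt            # 7.224
SST <- SSTR + SSE                # 16.639
MSE <- SSE / df_err              # 0.4955
Fstat <- MSTR / MSE              # 7.29
pf(Fstat, df_trt, df_err, lower.tail = FALSE)   # p-value, approx. 0.0045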
# calculating means of the parameters
model.tables(res.aov,type="means")
## Tables of means
## Grand mean
##
## -0.7127273
##
## Treatments
## Control Eyes Knees
## -0.3087 -1.551 -0.3357
## rep 8.0000 7.000 7.0000
Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.
• \(\mu_1\): the mean phase shift for subjects in the Control group. Its estimate, \(\hat{\mu}_1\), is -0.3087.
• \(\mu_2\): the mean phase shift for subjects in the Knees group. Its estimate, \(\hat{\mu}_2\), is -0.3357.
• \(\mu_3\): the mean phase shift for subjects in the Eyes group. Its estimate, \(\hat{\mu}_3\), is -1.551.
Use the ANOVA table in Question A1 to answer the following questions:
\(H_0: \mu_1 = \mu_2 = \mu_3\)
\(H_A:\) at least two of the means are not equal (\(\mu_1 \neq \mu_2\) and/or \(\mu_1 \neq \mu_3\) and/or \(\mu_2 \neq \mu_3\))
Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic: \(F\)(2, 19)
What is the p-value of the ANOVA \(F\)-test?
p-value = 0.00447, as shown from ANOVA table
Yes, we reject the null hypothesis that all three means are equal because the p-value (0.00447) is much smaller than 0.05. Therefore, the mean phase shift is not the same for all three treatment groups, and we conclude that light treatment does affect the circadian phase shift. Note that rejecting \(H_0\) only tells us that at least two of the group means differ; pairwise comparisons are needed to identify which specific treatments differ.
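To see which specific pairs differ, Tukey’s HSD can be applied to the fitted ANOVA object (a sketch using the res.aov fit above; this output is not part of the original answer):
#pairwise comparisons of treatment means with family-wise 95% confidence
TukeyHSD(res.aov, "Treatments", conf.level = 0.95)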
We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.
The data file includes the following columns:
The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if it differs from the R working directory) and read the data using the R function read.csv().
# Read in the data
data = read.csv("machine.csv", header = TRUE, sep = ",")
# Show the first few rows of data
head(data, 3)
# Your code here...
#load plotting libraries (ggplot2 for ggplot, ggthemes for theme_economist)
library(ggplot2)
library(ggthemes)
ggplot(data, aes(chmax, performance)) +
geom_point() +
stat_smooth() + theme_economist() + ggtitle("Part_B1-Fig1: Scatter Plot of Performance vs. CPU")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#Answer:
# Your code here...
# Build the model
linear_model <- lm(performance ~ chmax, data = data)
linear_model
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Coefficients:
## (Intercept) chmax
## 37.225 3.744
summary(linear_model)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.47 -42.20 -22.20 20.31 867.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2252 10.8587 3.428 0.000733 ***
## chmax 3.7441 0.3423 10.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3632
## F-statistic: 119.6 on 1 and 207 DF, p-value: < 2.2e-16
cor.test(data$performance , data$chmax, method ='pearson')
##
## Pearson's product-moment correlation
##
## data: data$performance and data$chmax
## t = 10.938, df = 207, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5115078 0.6846867
## sample estimates:
## cor
## 0.6052093
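Side note: the squared sample correlation above equals the simple regression’s Multiple R-squared, which ties the two outputs together:
#r^2 from the correlation equals Multiple R-squared from summary(linear_model)
cor(data$performance, data$chmax)^2   # 0.6052093^2 = 0.3663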
#Answer:
Yes, but with a variable transformation. There is an upward trend in the scatter plot: performance increases as chmax increases. However, the relationship is not especially strong, and because the data bunch up in the lower half of the range, the model appears to suffer from heteroscedasticity, a violation of one of the four main assumptions of the linear regression model. The linear regression model as-is is acceptable but not great; we can still run the regression, but the standard errors of the coefficients may be unreliable.
Yes, I think the data could be transformed, since it is “bunched up” in the lower range of the predictor variable, chmax. This suggests that we can “stretch out” the data and improve the model. A good transformation to try is the Box-Cox transformation.
Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:
# Your code here...
model1 = lm(performance ~ chmax, data)
print(model1)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Coefficients:
## (Intercept) chmax
## 37.225 3.744
#confidence interval @95% confidence level
ci <- confint(model1, level = .95)
colnames(ci) <- c("L", "U")
rownames(ci) <- c("beta0", "beta1")
ci
## L U
## beta0 15.817392 58.633048
## beta1 3.069251 4.418926
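For reference, the slope interval can be reproduced by hand from the summary output (estimate plus/minus t-quantile times standard error); a minimal sketch:
est <- summary(model1)$coefficients["chmax", "Estimate"]     # 3.7441
se <- summary(model1)$coefficients["chmax", "Std. Error"]    # 0.3423
tq <- qt(0.975, df = model1$df.residual)                     # 97.5% t quantile on 207 df
c(L = est - tq * se, U = est + tq * se)                      # matches confint() above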
Intercept \(\hat{\beta}_0 = 37.225\); slope \(\hat{\beta}_1 = 3.744\) (chmax); residual standard error 128.3 (i.e., estimated error variance \(128.3^2 \approx 16460.89\)) on 207 degrees of freedom.
The estimated regression equation is: performance = 37.225 + 3.744 × chmax
A one-unit increase in the predictor chmax is associated with an estimated increase of 3.744 units in the response variable, performance.
The 95% confidence interval for \(\beta_1\) is (3.069251, 4.418926).
• \(H_0: \beta_1 \le 0\)
• \(H_A: \beta_1 > 0\)
Testing \(\beta_1 > 0\) makes this a one-sided test, so we compute the p-value from the t-statistic and the residual degrees of freedom:
tval = summary(model1)$coefficients['chmax','t value']
df = model1$df.residual
pval = pt(tval, df, lower.tail = FALSE)
pval
## [1] 1.423882e-22
#Answer:
From the output above, the p-value is 1.423882e-22, which is essentially zero and statistically significant at the 0.01 level. We conclude that \(\beta_1\) is statistically significantly positive.
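Equivalently, because the t-statistic is positive, the one-sided p-value is simply half of the two-sided p-value reported by summary():
#half the two-sided p-value gives the same one-sided result
summary(model1)$coefficients["chmax", "Pr(>|t|)"] / 2   # approx. 1.42e-22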
Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.
# Your code here...
ggplot(data, aes(chmax, performance)) +
geom_point() +
stat_smooth() + theme_economist() + ggtitle("Part_B3-Fig2: Scatter Plot of Performance vs. CPU")
Model Assumption(s) it checks: Linearity/Mean Zero, Independence and Constant Variance
Interpretation:
#Answer:
The scatter plot shows an increasing trend, so the linearity/mean-zero assumption appears plausible, but the points bunch up at low chmax values and fan out at higher values, suggesting a possible violation of the constant variance assumption (consistent with the discussion in Part B1 above).
# Your code here...
#get list of residuals
res <- resid(model1)
#produce residual vs. fitted plot
plot(fitted(model1), res)
#add a horizontal line at 0
abline(h = 0)
Model Assumption(s) it checks: Independence (Uncorrelated errors) and Constant Variance
Interpretation:
#Answer:
There appears to be “bunching up” at the lower range of fitted values, which is a feature of the data set: most of the predictor values are small, which translates into low fitted values (performance). Many observations share identical chmax values at the same performance levels, producing the two distinctive bands; this does not by itself mean the data points are correlated. Correlated errors would mean that one observation’s error influences another’s (i.e., dependence), which is not the case here, so the independence assumption appears to hold. However, the plot does show heteroscedasticity, with variance increasing as the fitted values increase, indicating that the constant variance assumption has been violated.
# Your code here...
#create histogram of residuals
ggplot(data = data, aes(x = model1$residuals)) +
geom_histogram(fill = 'yellow', color = 'black', bins = 30) +
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')+theme_economist()
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res)
Model Assumption(s) it checks: Normality
Interpretation:
#Answer:
The histogram of residuals is roughly normal but with heavy tails, while the Q-Q plot shows deviations at both ends of the curve. The histogram contains some outlying points, which lead to the deviations at both ends of the Q-Q plot. This suggests the model can be improved via a Box-Cox transformation.
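As an optional numeric complement to these visual checks (my addition, not part of the original assignment output), a Shapiro-Wilk test quantifies the departure of the residuals from normality:
#formal normality test of model1's residuals; a small p-value rejects normality
shapiro.test(resid(model1))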
Use boxCox() in the car package or boxcox() in the MASS package to find the optimal \(\lambda\) value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?
# Your code here...
#boxcox() is in the MASS package
library(MASS)
hist(data$chmax)
bc <- boxcox(data$performance ~ data$chmax)
lambda <- bc$x[which.max(bc$y)]
lambda
#Answer:
The optimal \(\lambda\), rounded to the nearest half integer, is 0.
Because \(\lambda = 0\), the standard recommendation is a log transformation of the response variable (performance).
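For reference, the Box-Cox family of transformations is
\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0, \end{cases} \]
so the \(\lambda = 0\) case corresponds exactly to the log transformation.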
# Your code here...
#adding 1 to predictors before taking natural log
chmax_1 <- data$chmax+1
log_chmax_1 <- log(chmax_1)
#transformed response variable
transformed_y <- log(data$performance)
#model2 using transformed response variable
model2 <- lm(transformed_y ~ log_chmax_1 )
#define plotting area
op <- par(pty = "s", mfrow = c(1, 3))
#Q-Q plot for original model
qqnorm(model1$residuals)
qqline(model1$residuals)
#Q-Q plot for Box-Cox transformed model
qqnorm(model2$residuals)
qqline(model2$residuals)
#residual plots
res2 <- resid(model2)
#produce residual vs. fitted plot
plot(fitted(model2), res2 ,main = 'Residuals vs fitted')
#restore the original plotting parameters
par(op)
#Answer:
With the log transformations, the Q-Q and residual plots look much improved!
Normality of the residuals looks much better, and the residuals vs. fitted values are now randomly scattered around zero.
This corresponds to an improvement in the explanatory power of the model:
print(paste0(c('Adj. R-square of model1:',summary(model1)$adj.r.squared)))
## [1] "Adj. R-square of model1:" "0.363216830561439"
print(paste0(c('Adj. R-square of model2:',summary(model2)$adj.r.squared)))
## [1] "Adj. R-square of model2:" "0.407443764925801"
#Answers:
print('Summary of Model2:')
## [1] "Summary of Model2:"
summary(model2)
##
## Call:
## lm(formula = transformed_y ~ log_chmax_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22543 -0.59429 0.01065 0.59287 1.85995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47655 0.14152 17.5 <2e-16 ***
## log_chmax_1 0.64819 0.05401 12.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared: 0.4103, Adjusted R-squared: 0.4074
## F-statistic: 144 on 1 and 207 DF, p-value: < 2.2e-16
#Answer:
The transformed model is an improvement: R-squared increased, and all estimated parameters remain statistically significant.
The clustering is gone after the transformation, the heteroscedasticity is gone, and the residuals are closer to normal; yes, I think this is a much improved fit.
Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction using both model1 and model2 and provide the 95% prediction interval of each prediction on the original scale of the response, performance. What observations can you make about the result in the context of the problem?
# predict using model1
new = data.frame(chmax = 128)
predict.lm(model1, new, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 516.4685 252.2519 780.6851
# predict using model2
#note: newdata must supply the model's predictor by name (log_chmax_1)
new2 = data.frame(log_chmax_1 = log(128 + 1))
#take the anti-log => reverse the response variable to its original scale
cat('model2 prediction and its interval in the original scale\n')
## model2 prediction and its interval in the original scale
exp(predict.lm(model2, new2, interval = "prediction", level = 0.95))
## fit lwr upr
## 1 277.723 55.17907 1397.813
#Answer:
Model1 predicts a performance of 516.4685, with a 95% prediction interval of (252.2519, 780.6851). Model2 predicts a performance of 277.723, with a 95% prediction interval of (55.17907, 1397.813). Model2, with the log transformation, has a much wider prediction interval on the original scale than model1, the model without transformation. In addition, model2’s predicted value is much lower than model1’s.
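As a sanity check (my own arithmetic from the fitted coefficients reported above), both point predictions can be reproduced by hand:
#model1: prediction on the original (linear) scale
37.2252 + 3.7441 * 128                  # approx. 516.5, matching the fit above
#model2: predict on the log scale, then back-transform with exp()
exp(2.47655 + 0.64819 * log(128 + 1))   # approx. 277.7, matching the fit above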
We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.
# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
data2
Using data2, create a boxplot of performance and vendor, with performance on the vertical axis. Interpret the plots.
ggplot(data2, aes(x=vendor, y=performance, color=vendor)) +
geom_boxplot() +
ggtitle("Box Plot of Vendor vs Performance")+theme_economist()
#Answer:
The box plot shows a clear distinction between the vendors’ performance levels. The vendor nas has the highest average performance.
nas also appears to exhibit the highest variability, with the “biggest box” and longest whiskers.
model3 = aov(performance ~ vendor, data2)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## vendor 2 154494 77247 6.027 0.00553 **
## Residuals 36 461443 12818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Answer:
The p-value of the F-test is 0.00553, which is less than the \(\alpha\)-level of 0.05. So we reject the null hypothesis that the mean CPU performance is the same for all three vendors, and conclude that the mean CPU performance of at least one vendor differs.
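The reported p-value can be verified directly from the F-statistic and its degrees of freedom:
#upper-tail F probability with df1 = 2 (vendor) and df2 = 36 (residuals)
pf(6.027, df1 = 2, df2 = 36, lower.tail = FALSE)   # approx. 0.00553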
TukeyHSD(model3, "vendor", conf.level=0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = performance ~ vendor, data = data2)
##
## $vendor
## diff lwr upr p adj
## hp-honeywell -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320 16.82659 216.0398 0.0188830
## nas-hp 140.46617 18.11095 262.8214 0.0214092
#Answer:
Only the nas-honeywell and nas-hp pairwise comparisons have p-values below the \(\alpha\)-level of 0.05, and their confidence intervals do not contain zero. This indicates that these two pairs are statistically significantly different at the \(\alpha = 0.05\) level, while the hp-honeywell interval does contain zero and its p-value is above 0.05, indicating that this pair is not statistically significantly different.