Additional Material: ANOVA tutorial
https://datascienceplus.com/one-way-anova-in-r/
Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone since the exposure of the shifted light schedule to their eyes can resets the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial study reported that the human circadian clock can also be reset by only exposing the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to neither (control group). The effects of treatment to the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.
Raw data of phase shift, in hours, for the circadian rhythm experiment
| Treatment | Phase Shift (hr) |
|---|---|
| Control | 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27 |
| Knees | 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61 |
| Eyes | -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83 |
Consider the following incomplete R output:
Fill in the missing values in the analysis of the variance table.Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.
#Clear the Environment
rm(list=ls())
setwd("C:/Users/krbm878/Documents/00 Graduate School/GT/ISYE 6414/Data")
set.seed(123)
#read in the Data
jetlag <- read.table("jetlag.txt", stringsAsFactors=FALSE, header=TRUE)
anovtable <- aov(PhaseShift ~ Treatment, data = jetlag, typ=2)
dF_Treat <- 3 - 1
dF_Error <- 22 - 3
dF_Total <- dF_Treat + dF_Error
MS_T <- 3.6122
SS_T <- MS_T * dF_Treat
SS_E <- 9.415
MS_E <- SS_E / dF_Error
SS_TOTAL <- SS_T + SS_E
summary(anovtable)
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 7.224 3.612 7.289 0.00447 **
## Residuals 19 9.415 0.496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\t ","\t | Df |","Sum Sq","\nTOTAL","|",dF_Total, "\t|", SS_TOTAL)
## | Df | Sum Sq
## TOTAL | 21 | 16.6394
Answer:
| Source | Df | Sum of Squares | Mean Squares | F-statistics | p-value |
|---|---|---|---|---|---|
| Treatments | 2 | 7.224 | 3.6122 | 7.29 | 0.004 |
| Error | 19 | 9.415 | 0.4955 | ||
| TOTAL | 21 | 16.639 |
Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.
# Specify data column
aggregate(x= jetlag$PhaseShift,
# Specify group indicator
by = list(jetlag$Treatment),
# Specify function (i.e. mean)
FUN = mean)
Answer: The mean phase shift estimates for each of the three categories are:
Use the ANOVA table in Question A1 to answer the following questions:
1 pts Write the null hypothesis of the ANOVA \(F\)-test, \(H_0\):
Answer: The Null Hypothesis for the ANOVA Test is \(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) (all means are equivalent)
1 pts Write the alternative hypothesis of the ANOVA \(F\)-test, \(H_A\)
Answer: The Alternative Hypothesis for the ANOVA Test is \(H_1\): \(\mu_1\) \(\ne\) \(\mu_2\), \(\mu_1\) \(\ne\) \(\mu_3\), and/or \(\mu_2\) \(\ne\) \(\mu_3\)
1 pts Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic:
Answer:\(F\)(2, 19)
1 pts What is the p-value of the ANOVA \(F\)-test?
Answer:The p-value of the \(F\)-test is 0.004, which is shown above.
1 pts According the the results of the ANOVA \(F\)-test, does light treatment affect phase shift? Use an \(\alpha\)-level of 0.05.
Answer:We reject the null hypothesis at 95% Confidence that all means are equal because the p-value is less than 0.05. The mean phase shift is not the same for each light treatment, so therefore, the light treatment used statistically affects the phase shift of the assessment.
We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.
The data file includes the following columns:
The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().
#Read the data
setwd("C:/Users/krbm878/Documents/00 Graduate School/GT/ISYE 6414/Data")
rm(list=ls())
library(car)
data = read.csv("machine.csv", head = TRUE, sep = ",")
head(data)
#Read the data
#load library
library (ggfortify)
## Loading required package: ggplot2
library (ggplot2)
library (cowplot)
ggplot(data=data, aes(x=chmax, y=performance)) + geom_point() + stat_smooth(method = "lm", se = FALSE) +
xlab('CPU Maximum Channels') +
ylab('CPU Performance') +
ggtitle('Comparison of CPU Performance versus Maximum Number of Channels')
ggplot(data=data, aes(x=chmax, y=performance, color=vendor)) + geom_point() +
stat_smooth(method = "lm", se = FALSE) +
xlab('CPU Maximum Channels') +
ylab('CPU Performance') +
ggtitle('Comparison of CPU Performance versus Maximum Number of Channels')
Explaination: The general trend of CPU Performance versus CPU Maximum Channels appears to have a Positive increasing association. However, as the CPU Maximum Channels increases, so does the variance in CPU Performance. This form seems to showcase a megaphone effect, where likely the residuals will increase with increasing fitted values. This means that the constant variance assumption does not hold.
# Calculate the correlation coefficient
cor(data$chmax,data$performance)
## [1] 0.6052093
Explaination: The correlation coefficient of 0.6052 confirms there is a moderately positive, increasing association between CPU Maximum Channels and CPU Performance. The strength of this correlation is moderate as the correlation coefficient between 0.5 and 0.7. This aligns with the observations from Part a.
Explaination: The assumptions of the simple linear regression are Linearity, constant variance, independence, and normality. Since constance variance does not hold, I would not recommend a simple linear regression model for the relationship. Additionally, the moderate strength of the correlation coefficient, it wouldn’t be reasonable to assume a simple linear regression model for this relationship would provide the desired insights.
Explaination: Since the variance in the CPU Performance increases with increasing Number of Channels, a transformation would be appropriate to improve the linearity between the response variable. Since the constant variance assumption does not hold, a transformation that normalizes or variance-stablizes the response variable would help in this case.A power transformation of the response variable could be performed. If lambda is equal to 1, we do not transform. If lambda is equal to 0, we could use the normal logarithmic transformation. If lambda is equal to -1, we could use the inverse of y, known as a Box-Cox Transformation.
Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:
library(caret)
## Loading required package: lattice
model1 = lm(performance~ chmax, data=data)
summary(model1)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.47 -42.20 -22.20 20.31 867.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2252 10.8587 3.428 0.000733 ***
## chmax 3.7441 0.3423 10.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3632
## F-statistic: 119.6 on 1 and 207 DF, p-value: < 2.2e-16
#Coefficients (B0)
print(model1$coefficients)
## (Intercept) chmax
## 37.225220 3.744088
#Sigma
sigma <-summary(model1)$sigma
#Variance of error terms
sigmasq <- sigma^2
sigma
## [1] 128.3408
sigmasq
## [1] 16471.37
Explaination: The model parameters are provided above and the estimates are as follows: \(B_0\) (Intercept) and and estimate of \(\hat{B_0}\) is 37.225220. \(B_1\) (slope) and and estimate of \(\hat{B_1}\) is 3.744088. \(\sigma^2\) and and estimate of \(\hat{\sigma^2}\) is 16471.37.
2 pts Write down the estimated simple linear regression equation.
\(performance\) = 3.744088 x (\(chmax\)) + 37.225220
2 pts Interpret the estimated value of the \(\beta_1\) parameter in the context of the problem.
For every single unit increase in CPU maximum channels, the CPU performance increases by 3.744088 units.
2 pts Find a 95% confidence interval for the \(\beta_1\) parameter. Is \(\beta_1\) statistically significant at this level?
confint(model1,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 15.817392 58.633048
## chmax 3.069251 4.418926
t_test = summary(model1)$coefficients['chmax','t value']
df = model1$df.residual
p = pt(t_test,df,lower=F)
p
## [1] 1.423882e-22
The 95% confidence interval of \(\hat{B_1}\) is (3.069251, 4.418926), and \(\hat{B_1}\) is 3.744088. This is statistically significantly positive because the pvalue is 1.423882e-22 which is less than the \(\alpha\) level of 0.05
2 pts Is \(\beta_1\) statistically significantly positive at an \(\alpha\)-level of 0.01? What is the approximate p-value of this test?
\(\hat{B_1}\) is 3.744088, which is statistically significantly positive because the pvalue is 1.423882e-22 which is less than the \(\alpha\) level of 0.01
Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.
ggplot(data=data, aes(x=chmax, y=performance)) + geom_point() + stat_smooth(method = "lm", se = FALSE) +
xlab('CPU Maximum Channels') +
ylab('CPU Performance') +
ggtitle('Comparison of CPU Performance versus Maximum Number of Channels') +
geom_line(data=data,aes(x=chmax,y=model1$fitted.values),color='blue')
## `geom_smooth()` using formula 'y ~ x'
Model Assumption(s) it checks: Linearity and constant variance
Interpretation: There are apparent departures from the assumptions of the linear regression model for linearity, as the strength of this correlation is moderate as the correlation coefficient between 0.5 and 0.7. After approximately 40 CPU Maximum Channels, the fitted linear model is no longer centered with the data, as there are more measurements of CPU performance below the fitted than and very few measurements above the fitted line. Additionally, as the CPU Maximum Channels increases, so does the variance in CPU Performance. This form seems to showcase a megaphone effect, where likely the residuals will increase with increasing fitted values. This means that the constant variance assumption does not hold.
ggplot(data=data, aes(x=model1$fitted.values, y=model1$residuals)) + geom_point() +
geom_hline(yintercept = 0, color='red') +
xlab('Fitted Values') +
ylab('Residuals') +
ggtitle('Residual Plot of Model')
**Model Assumption(s) it checks: Constant variance and potentially Linearity as the relationship is not linear.
Interpretation:As the CPU Maximum Channels increases, so does the variance in CPU Performance. This form seems to showcase a megaphone effect, where likely the residuals will increase with increasing fitted values. This means that the constant variance assumption does not hold. Additionally, the widening of the relationship between residuals and the predicting variable showing their relationship is not linear and is a departure from the linearity assumption.
library(magrittr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggplot(data = data, aes(x = model1$residuals)) +
geom_histogram(bins = 15, fill = 'blue', color = 'black') +
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')
ggplot(data = data, aes(sample = model1$residuals)) +
stat_qq(color='darkgreen') + stat_qq_line() +
labs(title = 'Q-Q Plot of Residuals', x = 'Theoretical Quantiles', y = 'Residual Quantiles')
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.75156, p-value < 2.2e-16
Model Assumption(s) it checks: Normality
Interpretation: Since the quantiles of the residuals do not line up with the normal quantiles, thereby departing from a straight line and forming tails at both ends, there is an indication of either a skewed distribution, or heavy-tail distribution. The histogram confirms that there heavy-tails in our distribution, thereby confirming the normality asssumption does not hold. The Shapiro-Wiki, checks for Normality and reported a p-value near 0, which is less than 0.05 (95% confidence level). The null hypothesis for the test is that the data is normally distributed is rejected by this this test, confirming Normality in the data does not hold.
boxCox()) in car() package or (boxcox()) in MASS() package to find the optimal \(\lambda\) value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?library (car)
library (MASS)
box_cox_plot = boxCox(model1)
lambda <- box_cox_plot$x[which.max(box_cox_plot$y)]
cat("Optimal Lambda:", format(round(lambda/0.5)*0.5,nsmall=1),end="\n")
## Optimal Lambda: 0.0
Interpretation: Since \(\lambda\) is equal to 0, a normal logarithmic transformation could normalize or variance-stablize the response variable since the normality and constant variance assumption does not hold.
model2 = lm(log(performance)~ log(chmax+1), data=data)
summary(model2)
##
## Call:
## lm(formula = log(performance) ~ log(chmax + 1), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22543 -0.59429 0.01065 0.59287 1.85995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47655 0.14152 17.5 <2e-16 ***
## log(chmax + 1) 0.64819 0.05401 12.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared: 0.4103, Adjusted R-squared: 0.4074
## F-statistic: 144 on 1 and 207 DF, p-value: < 2.2e-16
cat("R-Squared values Before Transformation (Model 1): ", summary(model1)$r.squared)
## R-Squared values Before Transformation (Model 1): 0.3662783
cat("R-Squared values After Transformation (Model 2): ", summary(model2)$r.squared)
## R-Squared values After Transformation (Model 2): 0.4102926
Observation: After transforming our simple linear regression, we can see in improvement in our correlation coefficient, which indicates an improves the explanitory power and linearity assumption in the simple linear regression model.
# Assess all Model Assumptions
scatter_plot_2 = ggplot(data=data, aes(x=log(chmax+1), y=log(performance))) + geom_point() + stat_smooth(method = "lm", se = FALSE) +
xlab('CPU Maximum Channels') +
ylab('CPU Performance') +
ggtitle('CPU Performance vs Maximum # of Channels') +
geom_line(data=data,aes(x=log(chmax+1),y=model2$fitted.values),color='blue')
residual_2 = ggplot(data=data, aes(x=model2$fitted.values, y=model2$residuals)) + geom_point() +
geom_hline(yintercept = 0, color='red') +
xlab('Fitted Values') +
ylab('Residuals') +
ggtitle('Residual Plot of Model 2')
histo_2 = ggplot(data = data, aes(x = model2$residuals)) +
geom_histogram(bins = 12, fill = 'blue', color = 'black') +
labs(title = 'Histogram of Model 2 Residuals', x = 'Residuals', y = 'Frequency')
qqp_2 =ggplot(data = data, aes(sample = model2$residuals)) +
stat_qq(color='darkgreen') + stat_qq_line() +
labs(title = 'Q-Q Plot of Residuals', x = 'Theoretical Quantiles', y = 'Residual Quantiles')
shapiro.test(model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model2$residuals
## W = 0.99287, p-value = 0.4086
plot_grid(scatter_plot_2,residual_2,histo_2,qqp_2,ncol=2,nrow=2)
## `geom_smooth()` using formula 'y ~ x'
Analysis:
Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction using both model1 and model2 and provide the 95% prediction interval of each prediction on the original scale of the response, performance. What observations can you make about the result in the context of the problem?
#Find predicted CPU Performance when Max Channels is 128 using linear regression models
CPU_Pred = data.frame(chmax=128)
cat("Model 1 Prediction:","\n")
## Model 1 Prediction:
predict(model1,CPU_Pred,interval="prediction", level=0.95)
## fit lwr upr
## 1 516.4685 252.2519 780.6851
cat("Model 2 Prediction:","\n")
## Model 2 Prediction:
exp(predict(model2,CPU_Pred,interval="prediction", level=0.95))
## fit lwr upr
## 1 277.723 55.17907 1397.813
Analysis: Model 1’s predicted CPU Performance is 516.5 with a 95% confidence interval of (252.3,780.7). Model 2’s predicted CPU Performance is lower at 277.7 with a 95% confidence interval of (55.2,1387.8). Model 2’s confidence interval is much larger because of the log transformation applied to the data. However, each models predicted values fall within the 95% confidence interval of the other model (model 1’s prediction falls in the CI of model 2, etc) and the predictions are within the range of the data set.
We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.
# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
head(data2)
data2, create a boxplot of performance and vendor, with performance on the vertical axis. Interpret the plots.ggplot(data=data2, aes(x=vendor, y=performance, color=vendor)) + geom_boxplot() +
xlab('Vendor') +
ylab('CPU Performance') +
ggtitle('Box Plot of CPU Performance for Each Vendor')
Analysis: CPU performance does appear to differ depending on the vendor utilized. The Vendor NAS has the widest range of reported CPU Performance compared to hp and honeywell. However, there is much more variability in CPU performance in NAS, whereas HP has the lowest CPU Performance, but less variability in the CPU performance.
anovtableCPU <- aov(performance ~ vendor, data = data2)
summary(anovtableCPU)
## Df Sum Sq Mean Sq F value Pr(>F)
## vendor 2 154494 77247 6.027 0.00553 **
## Residuals 36 461443 12818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis: The null hypothesis for the \(F-test\) is means are equal for the CPU performance for all three vendors. Since the \(F-test\) \(Statistic\) is \(0.0055\), which is less than our \(\alpha\) of \(0.05\), we reject the null hypothesis and conclude that at least one of the vendors has a different mean CPU Performance.
tukey.test <- TukeyHSD(anovtableCPU, "vendor", conf.level = 0.95)
tukey.test
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = performance ~ vendor, data = data2)
##
## $vendor
## diff lwr upr p adj
## hp-honeywell -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320 16.82659 216.0398 0.0188830
## nas-hp 140.46617 18.11095 262.8214 0.0214092
plot(tukey.test)
Analysis: From the Tukey test for the pairwise comparison of the mean CPU Performance for each vendor, we can see the mean CPU performance between NAS and Honeywell/HP is significantly different and have p values less than our \(\alpha\) of \(0.05\). Therefore, the Mean CPU Performance for NAS is statistically significantly different from hp and honeywell. Only the mean CPU performance 95% confidence intervals between hp and honeywell contain 0 in the graph above, indicating we reject the hypothesis that these two vendor’s mean CPU performance is statistically different.