Additional Material: ANOVA tutorial
https://datascienceplus.com/one-way-anova-in-r/
Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone since the exposure of the shifted light schedule to their eyes can resets the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial study reported that the human circadian clock can also be reset by only exposing the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to neither (control group). The effects of treatment to the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.
Raw data of phase shift, in hours, for the circadian rhythm experiment
| Treatment | Phase Shift (hr) |
|---|---|
| Control | 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27 |
| Knees | 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61 |
| Eyes | -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83 |
#Read the table from the txt file created from the data provided.Then assign Treatment as a categorical variable, and Phase_Shift as predicting variable.
A1_data <- read.table("A1_data.txt", header=T)
#Print some random rows of the data
A1_data[sample(nrow(A1_data),5),]
treatment <- A1_data$Treatment
phase_shift <-A1_data$Phase_Shift
treatmentType = as.factor(treatment)
#Creat the boxplot to check for outlier
boxplot(phase_shift~treatmentType, xlab = "Treatment Type", ylab = "Phase Shift (hrs)")
#Use ANOVA and provide the summary. The TOTAL Df and Sum Sq are obtained by summarize the result
modelA1 <- aov(phase_shift~treatmentType)
summary(modelA1)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatmentType 2 7.224 3.612 7.289 0.00447 **
## Residuals 19 9.415 0.496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TOTAL_Df <- 2 + 19
TOTAL_Df
## [1] 21
TOTAL_SumSq <- 7.224 + 9.415
TOTAL_SumSq
## [1] 16.639
Consider the following incomplete R output:
| Source | Df | Sum of Squares | Mean Squares | F-statistics | p-value |
|---|---|---|---|---|---|
| Treatments | ? | ? | 3.6122 | ? | 0.004 |
| Error | ? | 9.415 | ? | ||
| TOTAL | ? | ? |
Fill in the missing values in the analysis of the variance table.Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.
#Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.
#Obtain estimated means by using the model.tables with type = "means"
model.tables(modelA1, type="means")
## Tables of means
## Grand mean
##
## -0.7127273
##
## treatmentType
## Control Eyes Knees
## -0.3087 -1.551 -0.3357
## rep 8.0000 7.000 7.0000
\(\mu_1\) is the mean phase shift estimated for Control, that has the value of 0.3087
\(\mu_2\) is the mean phase shift estimated for Eyes, that has the value of -1.551
\(\mu_3\) is the mean phase shift estimated for Knees, that has the value of -0.3357
Use the ANOVA table in Question A1 to answer the following questions:
The null hypothesis \(H_0\) is when all \(\mu\) are equal, or \(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)
\(H_A\): \(\mu_1\) != \(\mu_2\) != \(\mu_3\)
1 pts Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic: \(F\)(2, 19)
1 pts What is the p-value of the ANOVA \(F\)-test? p-value = 0.00447
1 pts According the the results of the ANOVA \(F\)-test, does light treatment affect phase shift? Use an \(\alpha\)-level of 0.05. The p-value obtained from the model is 0.00447, which is less than 0.05. Therefore, we reject the null hypothesis that all means are equal, it also answers that the light treatment does affect phase shift statistically.
We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.
The data file includes the following columns:
The data is in the file “machine.csv”. To read the data in
R, save the file in your working directory (make sure you
have changed the directory if different from the R working directory)
and read the data using the R function
read.csv().
# Read in the data
data = read.csv("machine.csv", head = TRUE, sep = ",")
# Show the first few rows of data
head(data, 3)
# Your code here...
plot(data$chmax,data$performance,xlab="Maximum Number of Channels",ylab="CPU performance")
According to the plot, we observe that the more channels the stronger
CPU performance is. However, this graph shows us that these variables do
not have a strong linear relationship. There are potentially some
outliers.
# Your code here...
cor(data$chmax,data$performance)
## [1] 0.6052093
This positive correlation coefficient explains a positive relationship between performance and chmax; however, the strength of this correlation is not strong.
I would not recommend a simple linear regression model for this relationship as the correlation coefficient is moderate and the scatterplot does not show a strong (straight-line) relationship enough to provide the accurate results.
A transformation would be appropriate to improve the linearity between two variables by using the log-transformation on chmax (The Maximum Number of Channels) when \(\lambda\) = 0. Then we could use the linear regression with the transformed data.
Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:
# Your code here...
model1 = lm(performance ~ chmax, data)
summary(model1)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.47 -42.20 -22.20 20.31 867.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2252 10.8587 3.428 0.000733 ***
## chmax 3.7441 0.3423 10.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3632
## F-statistic: 119.6 on 1 and 207 DF, p-value: < 2.2e-16
The intercept \(\beta0\) = 37.2252
The slope \(\beta1\) = 3.7441
Variance \(\sigma^2\) = \(128.3^2\) = 16471.37
performance = 37.2252 + 3.7441chmax
For every unit increase in Maximum Number of Channels, the performance score will be increased by 3.7441 points/units
confint(model1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 15.817392 58.633048
## chmax 3.069251 4.418926
The confidence interval of \(\beta_1\) parameter is (3.069, 4,419) that does not include 0. We can conclude that \(\beta_1\) is statistically significant at the 95% confidence level.
We can conclude that the result is statistically sigificant if the p-value of the variable less than the significant level.The p-value of \(\beta_1\) in this case is less than 2.2e-16, and is smaller than the \(\alpha\)-level of 0.01, we can conclude that \(\beta_1\) is statistically significantly possitive.
Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.
# Your code here...
plot(data$chmax,data$performance,xlab="Maximum Number of Channels",ylab="CPU performance")
abline(model1,col='red')
Model Assumption(s) it checks: Scatterplot checks the Linearity
Interpretation: The scattterplot shows not a strong linear relationship between CPU performance and maximum number of channels. With the regression line included on the plot, we can visually observe that with the increasing maximum number of channels (starting at ~20) the relationship becomes weaker with larger variance.
# Your code here...
plot(model1$fitted.values,model1$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h=0,lty=2)
Model Assumption(s) it checks: Constant variance assumption
Interpretation: The constant variance assumptoin does not hold. We can see that the spread of this residuals plot is not consistant after the approximate fitted value of 100. The residuals are increased with the increasing fitted values.
# Your code here...
hist(residuals(model1),main="Histogram of residuals",xlab="Residuals")
qqnorm(residuals(model1))
abline(0,1,lty=1,col="red")
Model Assumption(s) it checks: Normality assumption
Interpretation: On the histogram plot, we can see that it’s not showing a perfect symmetry and has a gap on the graph. In addition, both tails on the QQ plot are aparted from the straight line. Both plots confirms that the normality assumption does not hold.
boxCox()) in car() package or
(boxcox()) in MASS() package to find the
optimal \(\lambda\) value rounded to
the nearest half integer. What transformation of the response, if any,
does it suggest to perform?# Your code here...
library(car)
## Loading required package: carData
library(carData)
boxCox = boxCox(data$performance ~ data$chmax)
lambda <- boxCox$x[which.max(boxCox$y)]
lambda <- round(lambda/0.5)*0.5
lambda
## [1] 0
Because the \(\lambda\) = 0, we can perform a log transformation to normalize the response variable since both normality and constant variance assumption do not hold.
# Your code here...
model2 <- lm(log(performance)~log(chmax+1), data = data)
summary(model2)
##
## Call:
## lm(formula = log(performance) ~ log(chmax + 1), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22543 -0.59429 0.01065 0.59287 1.85995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47655 0.14152 17.5 <2e-16 ***
## log(chmax + 1) 0.64819 0.05401 12.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared: 0.4103, Adjusted R-squared: 0.4074
## F-statistic: 144 on 1 and 207 DF, p-value: < 2.2e-16
summary(model1)$r.squared
## [1] 0.3662783
The R-squared value of model1 = 0.3662 and model2 = 0.4103
We can see that there is an improvement after the log transformation as R-squared in model2 is larger.
# Your code here...
#Scatterplot
plot(log(data$chmax),log(data$performance),xlab="Maximum Number of Channels",ylab="CPU performance")
abline(model2,col='red')
#Residual plot
plot(model2$fitted.values,model2$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h=0,lty=2)
#Histogram plot
hist(residuals(model2),main="Histogram of residuals",xlab="Residuals")
# QQ plot
qqnorm(residuals(model2))
abline(0,1,lty=1,col="red")
All the plots from model2 are visually performing better than
model1’s with all assumptions hold. We can conclude that
model2 is a better fit.
Suppose we are interested in predicting CPU performance when
chmax = 128. Please make a prediction using both
model1 and model2 and provide the 95% prediction
interval of each prediction on the original scale of the response,
performance. What observations can you make about the result in
the context of the problem?
# Your code here...
#create new data points for chmax
new_data <- data.frame(chmax=128)
#Model 1 prediction
predict1<- predict(model1,new_data,interval="prediction", level = 0.95)
predict1
## fit lwr upr
## 1 516.4685 252.2519 780.6851
#Model 2 prediction
predict2<- exp(predict(model2,new_data,interval="prediction", level=0.95))
head(predict2,1)
## fit lwr upr
## 1 277.723 55.17907 1397.813
model1 predicts a value of 516.4685 with the confidence interval (252.2519,780.6851) model2 predicts a value of 277.723 with the confidence interval (55.17907,1397.813)
Both predictions are within the 95% confidence intervals. The CI’s of model2 is much larger due to the log transformation
We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.
# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
data2, create a boxplot of
performance and vendor, with performance on
the vertical axis. Interpret the plots.# Your code here...
boxplot(data2$performance~data2$vendor, xlab = "Vendors", ylab = "Performance")
The CPU performance is different among three vendors. NAS has the
largest range of performance comparing to the others two vendor while HP
has the smallest. There are outliers on Honeywell performance.
# Your code here...
modelC <- aov(data2$performance~data2$vendor)
summary(modelC)
## Df Sum Sq Mean Sq F value Pr(>F)
## data2$vendor 2 154494 77247 6.027 0.00553 **
## Residuals 36 461443 12818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the null hypothesis that the means of the three vendors are equal since the Pr(>F) = 0.00553 and less that \(\alpha\)-level of 0.05. The means are significantly different from three vendors.
# Your code here...
TukeyHSD(modelC)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data2$performance ~ data2$vendor)
##
## $`data2$vendor`
## diff lwr upr p adj
## hp-honeywell -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320 16.82659 216.0398 0.0188830
## nas-hp 140.46617 18.11095 262.8214 0.0214092
On the pairs of NAS with Honeywell and HP, they have the p-adj less than 0.05 and have the lower/upper bounds with all possitive values. We can conclude that mean performance of NAS is statistically different, i.e. higher to be specific, than Honeywell’s and HP’s. On the pair HP-Honeywell, since the bounds include 0 value, we can reject the hypothesis that these two vendors mean performance is statistically different.