Additional Material: ANOVA tutorial
https://datascienceplus.com/one-way-anova-in-r/
Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone since the exposure of the shifted light schedule to their eyes can resets the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial study reported that the human circadian clock can also be reset by only exposing the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to neither (control group). The effects of treatment to the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.
Raw data of phase shift, in hours, for the circadian rhythm experiment
| Treatment | Phase Shift (hr) |
|---|---|
| Control | 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27 |
| Knees | 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61 |
| Eyes | -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83 |
## Read data into R
shift_hours = read.table("anova_data.txt",header = T)
## Response Variable
phaseshift = shift_hours$PhaseShift
treatment = shift_hours$Treatment
## Convert into categorical variable in R
treatment = as.factor(treatment)
model = aov(phaseshift ~ treatment)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.224 3.612 7.289 0.00447 **
## Residuals 19 9.415 0.496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Consider the following incomplete R output:
| Source | Df | Sum of Squares | Mean Squares | F-statistics | p-value |
|---|---|---|---|---|---|
| Treatments | 2 | 7.224 | 3.6122 | 7.289 | 0.004 |
| Error | 19 | 9.415 | 0.496 | ||
| TOTAL | 21 | 16.639 |
Fill in the missing values in the analysis of the variance table.Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.
Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.
Solution A2:
## Obtain estimated means
model.tables(model, type="means")
## Tables of means
## Grand mean
##
## -0.7127273
##
## treatment
## Control Eyes Knees
## -0.3087 -1.551 -0.3357
## rep 8.0000 7.000 7.0000
Control: \(\mu_1 = -0.3087\)
Eyes: \(\mu_2 = -1.551\)
Knees: \(\mu_3 = -0.3357\)
Use the ANOVA table in Question A1 to answer the following questions:
Solution A3.A: The null hypothesis of the ANOVA \(F\)-test is \(H_0:\) \(\mu_1\)=\(\mu_2\)=\(\mu_3\) The means are equal.
Solution A3.B: The alternative hypothesis of the ANOVA \(F\)-test, \(H_1:\) \(\mu_1≠\mu_2\),\(\mu_1≠\mu_3\), and \(\mu_2≠\mu_3\) The means are not equal.
Solution A3.C: \(F\)(2, 19)
Solution A3.D: p-value of the ANOVA \(F\)-test = 0.00447
Solution A3.E: For a \(\alpha\)-level of 0.05 we would reject the the null hypotheses at 95% confidence that all means are equal because the p-value of 0.00447 is less than 0.05, hence the light treatments used for this study do affect phase shift.
We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.
The data file includes the following columns:
The data is in the file “machine.csv”. To read the data in
R, save the file in your working directory (make sure you
have changed the directory if different from the R working directory)
and read the data using the R function
read.csv().
# Read in the data
data = read.csv("machine.csv", head = TRUE, sep = ",")
# Show the first few rows of data
head(data, 3)
Solution B1.A:
chmax = data$chmax
vendor = data$vendor
performance = data$performance
plot(chmax,performance,ylab="Level of Performance",xlab="Max CPU Channels",
main="CPU Channels vs CPU Performance")
abline(lm(performance ~ chmax, data = data), col = "blue")
In general we can gather from the scatter plot that there is a slight
positive correlation between the number of CPU channels versus the level
of CPU performance. With respect to the form, it can be inferred from
the scatter plot that as the number of CPU channels increase, the
variance in the CPU performance also increases. Also, if you look at the
fitted vs residuals plot you can see a cone shape that can elude to a
non constant variance.
Solution B1.B:
# Correlation Coefficient
cor(chmax, performance)
## [1] 0.6052093
Because the correlation coefficient of 0.6052093 is greater than 0, there is a positive relationship between the number of CPU channels and the level of CPU performance. Because 0.5 < 0.6052093 < 0.7 the strength of the relationship is “Moderate”.
Solution B1.C: The assumptions of a simple linear regression model include Linearity/mean zero assumption, constant variance assumption, independence assumption, and normality assumption.
model1 = lm(performance ~ chmax, data)
plot(fitted(model1),resid(model1),main="Fitted vs Residuals",
xlab="Fitted Values")
Based on the fitted versus residuals plot it is clear that the constant
variance assumption is violated. In combination with a “Moderate”
strength correlation coefficient I would not recommend a simple linear
regression model for this data set.
Solution B1.D: Because the variance in CPU performance increases with increasing number of CPU channels a data transformation would be acceptable.Specifically because the constant variance assumption does not, hold a transformation that would normalize the variance would help improve the linear model.
Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:
# Your code here...
model1 = lm(performance ~ chmax, data)
summary(model1)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.47 -42.20 -22.20 20.31 867.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2252 10.8587 3.428 0.000733 ***
## chmax 3.7441 0.3423 10.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3632
## F-statistic: 119.6 on 1 and 207 DF, p-value: < 2.2e-16
# Your code here...
summary(model1)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.47 -42.20 -22.20 20.31 867.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2252 10.8587 3.428 0.000733 ***
## chmax 3.7441 0.3423 10.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3632
## F-statistic: 119.6 on 1 and 207 DF, p-value: < 2.2e-16
betas = coef(model1)
summary(betas)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.744 12.114 20.485 20.485 28.855 37.225
print(model1$coefficients)
## (Intercept) chmax
## 37.225220 3.744088
Solution B2.A: The model parameters are output of the code above and the estimates are the following \(\beta_0\) or the intercept and the estimate of \(\hat\beta_0\) is 37.225220. \(\beta_1\) or the slope and the estimate of \(\hat\beta_1\) is 3.744088. The variance \(\sigma^2\) and the estimate of \(\hat\sigma^2\) is 16471.37
Solution B2.B:
performance = 3.744088*chmax + 37.225220
Solution B2.C: The estimated value of the \(\beta_1\) parameter in the context of this problem is the slope, so for every increase in CPU channels, the CPU performance increases by 3.744088 units of performance.
## Confidence intervals for the coefficients
confint(model1,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 15.817392 58.633048
## chmax 3.069251 4.418926
summary(model1)$coefficients['chmax','t value']
## [1] 10.93809
t.test(chmax, performance = NULL,conf.level = 0.95)
##
## One Sample t-test
##
## data: chmax
## t = 10.159, df = 208, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 14.72277 21.81312
## sample estimates:
## mean of x
## 18.26794
Solution B2.D: The 95% confidence interval for the \(\hat\beta_1\) is (3.069251,4.418926). Yes, \(\hat\beta_1\) is statistically significant at this level.
## Confidence intervals for the coefficients
confint(model1,level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 8.994891 65.455549
## chmax 2.854185 4.633991
summary(model1)$coefficients['chmax','t value']
## [1] 10.93809
t_value = summary(model1)$coefficients['chmax','t value']
pvalue = pt(abs(t_value),207,lower=F)
pvalue
## [1] 1.423882e-22
Solution B2.E: Because \(\hat\beta_1\) is 3.744088, it is statistically significantly positive because the p-value is 1.423882e-22 which is <<< than alpha level of 0.01
Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.
plot(chmax,performance,ylab="Level of Performance",xlab="Max CPU Channels",
main="CPU Channels vs CPU Performance")
abline(lm(performance ~ chmax, data = data), col = "blue")
Solution B3.A: Model Assumption(s) it checks: Linearity and Constant variance
Interpretation: There is a slight positive correlation between the number of CPU channels versus the level of CPU performance. With respect to the form, it can be inferred from the scatter plot that as the number of CPU channels increase, the variance in the CPU performance also increases.Also, if you look at the the plot you can see a cone shape that can elude to a non constant variance.
plot(fitted(model1),resid(model1),main="Fitted vs Residuals",
xlab="Fitted Values")
abline(0,0)
Solution B3.B: Model Assumption(s) it checks: It checks for linearity, constant variance, and potentially if the response data are not independent
Interpretation: The residuals show larger variance as the fitted values increases. There for showing that constant variance is violated. There are some clustering of residuals which could also mean that independence assumption could possibly not hold.
qqnorm(resid(model1),main="QQ-Plot of Residuals")
qqline(resid(model1))
hist(resid(model1),main="Histogram of Residuals")
Solution B3.C: Model Assumption(s) it checks: Normality
Interpretation: Because of the curvature at the tail ends of the theoretical quantiles is a significant tell for non-normality. Also, based on the distribution of the histogram and the distribution is slightly not uniform and there are gaps in data that persist as the residuals increase hence a violation in the normality assumption.
boxCox()) in car() package or
(boxcox()) in MASS() package to find the
optimal \(\lambda\) value rounded to
the nearest half integer. What transformation of the response, if any,
does it suggest to perform?library (car)
## Loading required package: carData
library (MASS)
boxmodel = boxCox(model1)
(lambda <- boxmodel$x[which.max(boxmodel$y)])
## [1] -0.1010101
summary(boxmodel)
## Length Class Mode
## x 100 -none- numeric
## y 100 -none- numeric
Solution B4.A: The optimal \(\lambda\) value rounded to the nearest half integer is equal to 0.Because lambda is equal to 0 a normal logarithmic transformation could be used to normalize the response.
Solution B4.B:
model2 = lm(log(performance)~log(chmax+1))
summary(model2)
##
## Call:
## lm(formula = log(performance) ~ log(chmax + 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22543 -0.59429 0.01065 0.59287 1.85995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47655 0.14152 17.5 <2e-16 ***
## log(chmax + 1) 0.64819 0.05401 12.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared: 0.4103, Adjusted R-squared: 0.4074
## F-statistic: 144 on 1 and 207 DF, p-value: < 2.2e-16
Solution B4.C:
cat("R-Squared values Before Transformation (Model 1): ", summary(model1)$r.squared)
## R-Squared values Before Transformation (Model 1): 0.3662783
cat("R-Squared values After Transformation (Model 2): ", summary(model2)$r.squared)
## R-Squared values After Transformation (Model 2): 0.4102926
There is a improvement in the explanatory power of the model, which tells us that by transforming the model we were able to improve our linearity assumption of the model.
plot(log(chmax+1),log(performance),ylab="Level of Performance",xlab="Max CPU Channels",
main="CPU Channels vs CPU Performance")
abline(lm(log(performance) ~ log(chmax+1), data = data), col = "blue")
plot(fitted(model2),resid(model2),main="Fitted vs Residuals",
xlab="Fitted Values")
abline(0,0)
qqnorm(resid(model2),main="QQ-Plot of Residuals")
qqline(resid(model2))
hist(resid(model2),main="Histogram of Residuals")
Solution B4.D: Linearity Assumption: The data now is evenly distributed in a linear increasing fashion above and below the linear fitted line of the transformed data. This assumption holds True.
Constant Variance Assumption:The residuals are evenly scatter above and below the model line at random. This assumption holds True.
Independence Assumption: There are now no visible clustering effects going on in the data. From the histogram there also is no sign of clustering effects.Residual effects are likely uncorrelated. This assumption holds True.
Normality Assumption: This histogram shows a equally distributed plot around the center with no gaps in the data and there is slightly less tailing on the QQ plot. Also, the quantiles of the residuals line up well with the normal quantiles and only slightly depart from the line on the tail ends. This Assumption holds True.
Suppose we are interested in predicting CPU performance when
chmax = 128. Please make a prediction using both
model1 and model2 and provide the 95% prediction
interval of each prediction on the original scale of the response,
performance. What observations can you make about the result in
the context of the problem?
chmax_128 = data.frame(chmax=128)
predict(model1,chmax_128,interval="prediction", level=0.95)
## fit lwr upr
## 1 516.4685 252.2519 780.6851
exp(predict(model2,chmax_128,interval="prediction", level=0.95))
## fit lwr upr
## 1 277.723 55.17907 1397.813
Solution B5: Model 1s prediction of CPU performance at a chmax of 128 is 516.4685. With a 95% confidence interval of (252.2519,780.6851) Model 2s prediction of CPU performance at a chmax of 128 is 277.723.With a 95% confidence interval of (55.17907,1397.813)
Both predicted results fall in both confidence intervals i.e. 516.4685 is within (55.17907,1397.813) and 277.723 is within (252.2519,780.6851). Model 2s confidence interaval is much larger than model1 because of the logarithmic transformation.
We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.
# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
data2, create a boxplot of
performance and vendor, with performance on
the vertical axis. Interpret the plots.performance1 = data2$performance
vendor1 = data2$vendor
boxplot(performance1~vendor1, xlab = "Vendor", ylab = "CPU Performance")
Solution C1: The variability of NAS is the highest but overall they have performance data over a wider spread than both HP and Honeywell. The variability of Honeywell is the slightly more than HP but overall Honeywell performance data is around a lower performance level than that of NAS. HP has the lowest performing CPU out of all 3 vendors but they also have the least variability present in their performance data.
model3 = aov(performance1 ~ vendor1)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## vendor1 2 154494 77247 6.027 0.00553 **
## Residuals 36 461443 12818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(model3, type="means")
## Tables of means
## Grand mean
##
## 112.8718
##
## vendor1
## honeywell hp nas
## 60.46 36.43 176.9
## rep 13.00 7.00 19.0
Solution C2: Because the F-statistic test is equal to 0.00553 which is less than our \(\alpha\)-level of 0.05 we can reject the null hypothesis. Therefor at least some means are not equivalent.
TukeyHSD(model3, "vendor1", conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = performance1 ~ vendor1)
##
## $vendor1
## diff lwr upr p adj
## hp-honeywell -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320 16.82659 216.0398 0.0188830
## nas-hp 140.46617 18.11095 262.8214 0.0214092
Solution C3: We can see that nas-honeywell and nas-hp have p values less than our alpha of 0.05 which means that NAS CPU performance is statistically significantly different from HP and Honeywell. On the other hand because hp-honeywell have a p-value of 0.8934 which is greater than our alpha of 0.05 which means that HP and Honeywell have statistically similar CPU performance.