Part A. ANOVA

Additional Material: ANOVA tutorial

https://datascienceplus.com/one-way-anova-in-r/

Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone since the exposure of the shifted light schedule to their eyes can resets the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial study reported that the human circadian clock can also be reset by only exposing the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to neither (control group). The effects of treatment to the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.

Raw data of phase shift, in hours, for the circadian rhythm experiment

Treatment Phase Shift (hr)
Control 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27
Knees 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61
Eyes -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83

Question A1 - 3 pts

#Read the table from the txt file created from the data provided.Then assign Treatment as a categorical variable, and Phase_Shift as predicting variable.
A1_data <- read.table("A1_data.txt", header=T)

#Print some random rows of the data
A1_data[sample(nrow(A1_data),5),]
treatment <- A1_data$Treatment
phase_shift <-A1_data$Phase_Shift
treatmentType = as.factor(treatment)

#Creat the boxplot to check for outlier 
boxplot(phase_shift~treatmentType, xlab = "Treatment Type", ylab = "Phase Shift (hrs)")

#Use ANOVA and provide the summary. The TOTAL Df and Sum Sq are obtained by summarize the result
modelA1 <- aov(phase_shift~treatmentType)
summary(modelA1)
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## treatmentType  2  7.224   3.612   7.289 0.00447 **
## Residuals     19  9.415   0.496                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TOTAL_Df <- 2 + 19
TOTAL_Df
## [1] 21
TOTAL_SumSq <- 7.224 + 9.415
TOTAL_SumSq
## [1] 16.639

Consider the following incomplete R output:

Source Df Sum of Squares Mean Squares F-statistics p-value
Treatments ? ? 3.6122 ? 0.004
Error ? 9.415 ?
TOTAL ? ?

Fill in the missing values in the analysis of the variance table.Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.

Question A2 - 3 pts

#Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.

#Obtain estimated means by using the model.tables with type = "means"
model.tables(modelA1, type="means")
## Tables of means
## Grand mean
##            
## -0.7127273 
## 
##  treatmentType 
##     Control   Eyes   Knees
##     -0.3087 -1.551 -0.3357
## rep  8.0000  7.000  7.0000

\(\mu_1\) is the mean phase shift estimated for Control, that has the value of 0.3087

\(\mu_2\) is the mean phase shift estimated for Eyes, that has the value of -1.551

\(\mu_3\) is the mean phase shift estimated for Knees, that has the value of -0.3357

Question A3 - 5 pts

Use the ANOVA table in Question A1 to answer the following questions:

  1. 1 pts Write the null hypothesis of the ANOVA \(F\)-test, \(H_0\)

The null hypothesis \(H_0\) is when all \(\mu\) are equal, or \(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)

  1. 1 pts Write the alternative hypothesis of the ANOVA \(F\)-test, \(H_A\)

\(H_A\): \(\mu_1\) != \(\mu_2\) != \(\mu_3\)

  1. 1 pts Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic: \(F\)(2, 19)

  2. 1 pts What is the p-value of the ANOVA \(F\)-test? p-value = 0.00447

  3. 1 pts According the the results of the ANOVA \(F\)-test, does light treatment affect phase shift? Use an \(\alpha\)-level of 0.05. The p-value obtained from the model is 0.00447, which is less than 0.05. Therefore, we reject the null hypothesis that all means are equal, it also answers that the light treatment does affect phase shift statistically.

Part B. Simple Linear Regression

We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.

The data file includes the following columns:

The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().

# Read in the data
data = read.csv("machine.csv", head = TRUE, sep = ",")
# Show the first few rows of data
head(data, 3)

Question B1: Exploratory Data Analysis - 9 pts

  1. 3 pts Use a scatter plot to describe the relationship between CPU performance and the maximum number of channels. Describe the general trend (direction and form). Include plots and R-code used.
# Your code here...
plot(data$chmax,data$performance,xlab="Maximum Number of Channels",ylab="CPU performance")

According to the plot, we observe that the more channels the stronger CPU performance is. However, this graph shows us that these variables do not have a strong linear relationship. There are potentially some outliers.

  1. 3 pts What is the value of the correlation coefficient between performance and chmax? Please interpret the strength of the correlation based on the correlation coefficient.
# Your code here...
cor(data$chmax,data$performance)
## [1] 0.6052093

This positive correlation coefficient explains a positive relationship between performance and chmax; however, the strength of this correlation is not strong.

  1. 2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for the relationship?

I would not recommend a simple linear regression model for this relationship as the correlation coefficient is moderate and the scatterplot does not show a strong (straight-line) relationship enough to provide the accurate results.

  1. 1 pts Based on the analysis above, would you pursue a transformation of the data? Do not transform the data.

A transformation would be appropriate to improve the linearity between two variables by using the log-transformation on chmax (The Maximum Number of Channels) when \(\lambda\) = 0. Then we could use the linear regression with the transformed data.

Question B2: Fitting the Simple Linear Regression Model - 11 pts

Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:

# Your code here...
model1 = lm(performance ~ chmax, data)
  1. 3 pts What are the model parameters and what are their estimates?
summary(model1)
## 
## Call:
## lm(formula = performance ~ chmax, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486.47  -42.20  -22.20   20.31  867.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2252    10.8587   3.428 0.000733 ***
## chmax         3.7441     0.3423  10.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3632 
## F-statistic: 119.6 on 1 and 207 DF,  p-value: < 2.2e-16

The intercept \(\beta0\) = 37.2252

The slope \(\beta1\) = 3.7441

Variance \(\sigma^2\) = \(128.3^2\) = 16471.37

  1. 2 pts Write down the estimated simple linear regression equation.

performance = 37.2252 + 3.7441chmax

  1. 2 pts Interpret the estimated value of the \(\beta_1\) parameter in the context of the problem.

For every unit increase in Maximum Number of Channels, the performance score will be increased by 3.7441 points/units

  1. 2 pts Find a 95% confidence interval for the \(\beta_1\) parameter. Is \(\beta_1\) statistically significant at this level?
confint(model1, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 15.817392 58.633048
## chmax        3.069251  4.418926

The confidence interval of \(\beta_1\) parameter is (3.069, 4,419) that does not include 0. We can conclude that \(\beta_1\) is statistically significant at the 95% confidence level.

  1. 2 pts Is \(\beta_1\) statistically significantly positive at an \(\alpha\)-level of 0.01? What is the approximate p-value of this test?

We can conclude that the result is statistically sigificant if the p-value of the variable less than the significant level.The p-value of \(\beta_1\) in this case is less than 2.2e-16, and is smaller than the \(\alpha\)-level of 0.01, we can conclude that \(\beta_1\) is statistically significantly possitive.

Question B3: Checking the Assumptions of the Model - 8 pts

Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.

  1. 2 pts Scatterplot of the data with chmax on the x-axis and performance on the y-axis
# Your code here...
plot(data$chmax,data$performance,xlab="Maximum Number of Channels",ylab="CPU performance")
abline(model1,col='red')

Model Assumption(s) it checks: Scatterplot checks the Linearity

Interpretation: The scattterplot shows not a strong linear relationship between CPU performance and maximum number of channels. With the regression line included on the plot, we can visually observe that with the increasing maximum number of channels (starting at ~20) the relationship becomes weaker with larger variance.

  1. 3 pts Residual plot - a plot of the residuals, \(\hat\epsilon_i\), versus the fitted values, \(\hat{y}_i\)
# Your code here...
plot(model1$fitted.values,model1$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h=0,lty=2)

Model Assumption(s) it checks: Constant variance assumption

Interpretation: The constant variance assumptoin does not hold. We can see that the spread of this residuals plot is not consistant after the approximate fitted value of 100. The residuals are increased with the increasing fitted values.

  1. 3 pts Histogram and q-q plot of the residuals
# Your code here...
hist(residuals(model1),main="Histogram of residuals",xlab="Residuals")

qqnorm(residuals(model1))
abline(0,1,lty=1,col="red")

Model Assumption(s) it checks: Normality assumption

Interpretation: On the histogram plot, we can see that it’s not showing a perfect symmetry and has a gap on the graph. In addition, both tails on the QQ plot are aparted from the straight line. Both plots confirms that the normality assumption does not hold.

Question B4: Improving the Fit - 10 pts

  1. 2 pts Use a Box-Cox transformation (boxCox()) in car() package or (boxcox()) in MASS() package to find the optimal \(\lambda\) value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?
# Your code here...
library(car)
## Loading required package: carData
library(carData)

boxCox = boxCox(data$performance ~ data$chmax)

lambda <- boxCox$x[which.max(boxCox$y)]
lambda <- round(lambda/0.5)*0.5
lambda
## [1] 0

Because the \(\lambda\) = 0, we can perform a log transformation to normalize the response variable since both normality and constant variance assumption do not hold.

  1. 2 pts Create a linear regression model, named model2, that uses the log transformed performance as the response, and the log transformed chmax as the predictor. Note: The variable chmax has a couple of zero values which will cause problems when taking the natural log. Please add one to the predictor before taking the natural log of it
# Your code here...
model2 <- lm(log(performance)~log(chmax+1), data = data)
summary(model2)
## 
## Call:
## lm(formula = log(performance) ~ log(chmax + 1), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22543 -0.59429  0.01065  0.59287  1.85995 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.47655    0.14152    17.5   <2e-16 ***
## log(chmax + 1)  0.64819    0.05401    12.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared:  0.4103, Adjusted R-squared:  0.4074 
## F-statistic:   144 on 1 and 207 DF,  p-value: < 2.2e-16
  1. 2 pts Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?
summary(model1)$r.squared
## [1] 0.3662783

The R-squared value of model1 = 0.3662 and model2 = 0.4103

We can see that there is an improvement after the log transformation as R-squared in model2 is larger.

  1. 4 pts Similar to Question B3, assess and interpret all model assumptions of model2. A model is considered a good fit if all assumptions hold. Based on your interpretation of the model assumptions, is model2 a good fit?
# Your code here...
#Scatterplot
plot(log(data$chmax),log(data$performance),xlab="Maximum Number of Channels",ylab="CPU performance")
abline(model2,col='red')

#Residual plot
plot(model2$fitted.values,model2$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h=0,lty=2)

#Histogram plot
hist(residuals(model2),main="Histogram of residuals",xlab="Residuals")

# QQ plot
qqnorm(residuals(model2))
abline(0,1,lty=1,col="red")

All the plots from model2 are visually performing better than model1’s with all assumptions hold. We can conclude that model2 is a better fit.

Question B5: Prediction - 3 pts

Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction using both model1 and model2 and provide the 95% prediction interval of each prediction on the original scale of the response, performance. What observations can you make about the result in the context of the problem?

# Your code here...
#create new data points for chmax
new_data <- data.frame(chmax=128)

#Model 1 prediction
predict1<- predict(model1,new_data,interval="prediction", level = 0.95)
predict1
##        fit      lwr      upr
## 1 516.4685 252.2519 780.6851
#Model 2 prediction
predict2<- exp(predict(model2,new_data,interval="prediction", level=0.95))
head(predict2,1)
##       fit      lwr      upr
## 1 277.723 55.17907 1397.813

model1 predicts a value of 516.4685 with the confidence interval (252.2519,780.6851) model2 predicts a value of 277.723 with the confidence interval (55.17907,1397.813)

Both predictions are within the 95% confidence intervals. The CI’s of model2 is much larger due to the log transformation

Part C. ANOVA - 8 pts

We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.

# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
  1. 2 pts Using data2, create a boxplot of performance and vendor, with performance on the vertical axis. Interpret the plots.
# Your code here...
boxplot(data2$performance~data2$vendor, xlab = "Vendors", ylab = "Performance")

The CPU performance is different among three vendors. NAS has the largest range of performance comparing to the others two vendor while HP has the smallest. There are outliers on Honeywell performance.

  1. 3 pts Perform an ANOVA F-test on the means of the three vendors. Using an \(\alpha\)-level of 0.05, can we reject the null hypothesis that the means of the three vendors are equal? Please interpret.
# Your code here...
modelC <- aov(data2$performance~data2$vendor)
summary(modelC)
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## data2$vendor  2 154494   77247   6.027 0.00553 **
## Residuals    36 461443   12818                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis that the means of the three vendors are equal since the Pr(>F) = 0.00553 and less that \(\alpha\)-level of 0.05. The means are significantly different from three vendors.

  1. 3 pts Perform a Tukey pairwise comparison between the three vendors. Using an \(\alpha\)-level of 0.05, which means are statistically significantly different from each other?
# Your code here...
TukeyHSD(modelC)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data2$performance ~ data2$vendor)
## 
## $`data2$vendor`
##                    diff        lwr      upr     p adj
## hp-honeywell  -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320   16.82659 216.0398 0.0188830
## nas-hp        140.46617   18.11095 262.8214 0.0214092

On the pairs of NAS with Honeywell and HP, they have the p-adj less than 0.05 and have the lower/upper bounds with all possitive values. We can conclude that mean performance of NAS is statistically different, i.e. higher to be specific, than Honeywell’s and HP’s. On the pair HP-Honeywell, since the bounds include 0 value, we can reject the hypothesis that these two vendors mean performance is statistically different.