Part A. ANOVA

Additional Material: ANOVA tutorial

https://datascienceplus.com/one-way-anova-in-r/

Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone since the exposure of the shifted light schedule to their eyes can resets the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial study reported that the human circadian clock can also be reset by only exposing the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to neither (control group). The effects of treatment to the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.

Raw data of phase shift, in hours, for the circadian rhythm experiment

Treatment Phase Shift (hr)
Control 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27
Knees 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61
Eyes -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83

Question A1 - 3 pts

Consider the following incomplete R output:

Source Df Sum of Squares Mean Squares F-statistics p-value
Treatments ? ? 3.6122 ? 0.004
Error ? 9.415 ?
TOTAL ? ?

Fill in the missing values in the analysis of the variance table.Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.

#Create Dataframe
Phase_Shift <- c(0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27,
                 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61,
                 -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83)
Treatments  <- c("Control", "Control", "Control", "Control", "Control", "Control", "Control","Control",
  "Knees","Knees","Knees","Knees","Knees","Knees","Knees",
                 "Eyes","Eyes","Eyes","Eyes","Eyes","Eyes","Eyes")

Treatments <- as.factor(Treatments)

df <- data.frame(Phase_Shift,Treatments )
head(df)
#ANOVA
res.aov <- aov(Phase_Shift ~ Treatments, data = df )
#summary
summary(res.aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Treatments   2  7.224   3.612   7.289 0.00447 **
## Residuals   19  9.415   0.496                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Answer:

Note: Answers are repeated here from the ANOVA table above

DfT reatments = k − 1 = 2

DfError = N − k = 22 − 3 = 19

DfT otal = DfT reatments + DfErorr = (N − k) + (k − 1) = N − 1 = 21

SSTR = MSTR × (k − 1) = 3.6122 × 2 = 7.224

SST = SSE + SSTR = 7.224 + 9.415 = 16.639

MSE = SSE/(N − k) = 9.415/19 = 0.4955

F-test = MSTR/MSE = 3.6122/0.4955 = 7.29

# calculating means of the parameters
model.tables(res.aov,type="means")
## Tables of means
## Grand mean
##            
## -0.7127273 
## 
##  Treatments 
##     Control   Eyes   Knees
##     -0.3087 -1.551 -0.3357
## rep  8.0000  7.000  7.0000

Question A2 - 3 pts

Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.

• µ1: mean phase shift for subjects in Control group. Its estimate, \(\mu_1\), is -0.3087

• µ2: mean phase shift for subjects in Knees group. Its estimate, \(\mu_2\), is -0.3357

• µ3: mean phase shift for subjects in Eyes group. Its estimate, \(\mu_3\), is -1.551

Question A3 - 5 pts

Use the ANOVA table in Question A1 to answer the following questions:

  1. 1 pts Write the null hypothesis of the ANOVA \(F\)-test, \(H_0\)

H0 :µ1=µ2=µ3

  1. 1 pts Write the alternative hypothesis of the ANOVA \(F\)-test, \(H_A\)

HA : at least 2 of the means are not equal (\(\mu_1\) not equal \(\mu_2\) and/or \(\mu_1\) not equal =\(\mu_3\) and/or \(\mu_3\) not equal \(\mu_2\))

  1. 1 pts Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic: \(F\)(2, 19)

  2. 1 pts What is the p-value of the ANOVA \(F\)-test?

p-value = 0.00447, as shown from ANOVA table

  1. 1 pts According the the results of the ANOVA \(F\)-test, does light treatment affect phase shift? Use an \(\alpha\)-level of 0.05.

Yes, We reject the null hypothesis that all three means are equal because the p-value is much smaller than 0.05. Therefore, the mean of the phase shift is not the same for all three treatment groups, and we conclude that light treatment does affect phase shift. We can conclude this because if any one of the three treatments is different, then at least one of the treatments must be different from the control group.

Part B. Simple Linear Regression

We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.

The data file includes the following columns:

The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().

# Read in the data
data = read.csv("machine.csv", head = TRUE, sep = ",")
# Show the first few rows of data
head(data, 3)

Question B1: Exploratory Data Analysis - 9 pts

  1. 3 pts Use a scatter plot to describe the relationship between CPU performance and the maximum number of channels. Describe the general trend (direction and form). Include plots and R-code used.
# Your code here...
ggplot(data, aes(chmax, performance) ) +
  geom_point() +
  stat_smooth()+theme_economist()+ ggtitle("Part_B1-Fig1: Scatter Plot of Performance vs. CPU")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#Answer:

  • The data appears to have a lot more points bunched up in the lower half of the predictor variables. That is more channels in CPU and performance data points clustering under the 50 level mark which is indicative that constant variance has been violated. However, as chmax increases, there are a few performances data points that also increases, giving it a very slight upward trend
  1. 3 pts What is the value of the correlation coefficient between performance and chmax? Please interpret the strength of the correlation based on the correlation coefficient.
# Your code here...
# Build the model
linear_model <- lm(performance ~ chmax, data = data)
linear_model
## 
## Call:
## lm(formula = performance ~ chmax, data = data)
## 
## Coefficients:
## (Intercept)        chmax  
##      37.225        3.744
summary(linear_model)
## 
## Call:
## lm(formula = performance ~ chmax, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486.47  -42.20  -22.20   20.31  867.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2252    10.8587   3.428 0.000733 ***
## chmax         3.7441     0.3423  10.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3632 
## F-statistic: 119.6 on 1 and 207 DF,  p-value: < 2.2e-16
cor.test(data$performance , data$chmax, method ='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  data$performance and data$chmax
## t = 10.938, df = 207, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5115078 0.6846867
## sample estimates:
##       cor 
## 0.6052093

#Answer:

  • The correlation, r is directionally positive; meaning as chmax increase, so does performance. The 0.61 is moderately strong
  1. 2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for the relationship?

Yes but with variable transformation. Although there’s a slight upward trend in the scatter plot; higher performance level as chmax increases which means as x increases, y also increases. However, it’s not as strong as it should be. In addition, due to bunching up in the lower half of the data, this model appears to suffer from heteroscedasticity, a violation of one of the 4 main assumptions of linear regression model. The linear regression model as-is is OK but not great. We can still run the regression but bear in mind, there’s a possibility coefficients may be biased.

  1. 1 pts Based on the analysis above, would you pursue a transformation of the data? Do not transform the data.

Yes, I think the data could be transformed as it is “bunched up” in the lower range of the predictor variable, chmax.This suggest that we can “stretch” out the data more and improves the model.A good one to try is Box-Cox transformation

Question B2: Fitting the Simple Linear Regression Model - 11 pts

Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:

# Your code here...
model1 = lm(performance ~ chmax, data)
print(model1)
## 
## Call:
## lm(formula = performance ~ chmax, data = data)
## 
## Coefficients:
## (Intercept)        chmax  
##      37.225        3.744
#confidence interval @95% confidence level

ci <- confint(model1, level = .95)
colnames(ci) <- c("L", "U")
rownames(ci) <- c("beta0", "beta1")
ci
##               L         U
## beta0 15.817392 58.633048
## beta1  3.069251  4.418926
  1. 3 pts What are the model parameters and what are their estimates?

Intercept, beta0: 37.225, beta1, slope(chmax): 3.744 and Residual standard error: 128.3 or variance is 16460.89 on 207 degrees of freedom

  1. 2 pts Write down the estimated simple linear regression equation.

performance= 37.225 + 3.744chmax

  1. 2 pts Interpret the estimated value of the \(\beta_1\) parameter in the context of the problem.

A 1 unit increase in the predictor variable, x (chmax) will result in an increase of 3.744 units in the response variable (performance)

  1. 2 pts Find a 95% confidence interval for the \(\beta_1\) parameter. Is \(\beta_1\) statistically significant at this level?

\(\beta_1\)_lower=3.069251 and \(\beta_1\)_upper=4.418926

  1. 2 pts Is \(\beta_1\) statistically significantly positive at an \(\alpha\)-level of 0.01? What is the approximate p-value of this test?

• H0 : β1 ≤ 0

• HA : β1 > 0

To check β1 > 0 this essentially becomes a one-sided test, and then proceed to compute p-values from it:

tval = summary(model1)$coefficients['chmax','t value']
df = model1$df.residual
pval = pt(tval, df, lower=F)
pval
## [1] 1.423882e-22

#Answer:

From above output, the p-value is 1.423882e-22 which is essentially zero and statistically significant at the 0.01 level, we can conclude that β1 is statistically significantly positive

Question B3: Checking the Assumptions of the Model - 8 pts

Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.

  1. 2 pts Scatterplot of the data with chmax on the x-axis and performance on the y-axis
# Your code here...
ggplot(data, aes(chmax, performance) ) +
  geom_point() +
  stat_smooth()+theme_economist()+ ggtitle("Part_B3-Fig2: Scatter Plot of Performance vs. CPU")

Model Assumption(s) it checks: Linearity/Mean Zero, Independence and Constant Variance

Interpretation:

#Answer:

  • Like the question B1-b, the scater plot displayed some clustering effects at the lower end of the data. For example, between 40-150 maximum channels, nearly all the data points is below trend line. There may be issue with larger values of chmax not evenly distributed across the model line. This could be violation of linearity assumption
  1. 3 pts Residual plot - a plot of the residuals, \(\hat\epsilon_i\), versus the fitted values, \(\hat{y}_i\)
# Your code here...
#get list of residuals 
res <- resid(model1)
#produce residual vs. fitted plot
plot(fitted(model1), res)

#add a horizontal line at 0 
abline(0,0)

Model Assumption(s) it checks: Independence (Uncorrelated errors) and Constant Variance

Interpretation:

#Answer:

There appear to be “bunching up” at the lower ranges, this is a feature of the data set. Most of the x predictor values were small which translated to lower fitted values (performances). Many of the predictors, chmax, have identical maximum channel values at the same performance levels. This results in 2 distinctive bands which does not mean there were correlation of data points. Correlation of data points means the past affects the present or future which is not the case here, ie,. dependence. Since there were not correlated errors, the independence assumption appears to hold. However, the plot does show heteroscedasticity with variance increasing as the fitted values increase. This is indicative the constant variance assumption has been violated.

  1. 3 pts Histogram and q-q plot of the residuals
# Your code here...
#create histogram of residuals
ggplot(data = data, aes(x = model1$residuals)) +
    geom_histogram(fill = 'yellow', color = 'black', bins = 30) +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')+theme_economist()

#create Q-Q plot for residuals
qqnorm(res)

#add a straight diagonal line to the plot
qqline(res)

Model Assumption(s) it checks: Normality checks

Interpretation:

#Answer:

-The histogram of residuals is quite normal but with heavy tails while the Q-Q plot shows deviations on both ends of the curve. The histogram contained some outlier points which led to deviations of Q-Q plot at both ends. This means the model can be improved upon via box-cox transformations

Question B4: Improving the Fit - 10 pts

  1. 2 pts Use a Box-Cox transformation (boxCox()) in car() package or (boxcox()) in MASS() package to find the optimal \(\lambda\) value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?
# Your code here...

hist(data$chmax)

bc <- boxcox((data$performance ~ data$chmax))

lambda <- bc$x[which.max(bc$y)]

#Answer:

  • Optimal lambda, rounding to nearest half-integer is zero

  • B/C lambda is zero, the literature recommends a log transformation of the response variable (performance) is needed

  1. 2 pts Create a linear regression model, named model2, that uses the log transformed performance as the response, and the log transformed chmax as the predictor. Note: The variable chmax has a couple of zero values which will cause problems when taking the natural log. Please add one to the predictor before taking the natural log of it
# Your code here...
#adding 1 to predictors before taking natural log
chmax_1 <- data$chmax+1
log_chmax_1 <- log(chmax_1)
#transformed response variable
transformed_y <- log(data$performance)
#model2 using transformed response variable
model2 <-  lm(transformed_y ~ log_chmax_1 )

#define plotting area
op <- par(pty = "s", mfrow = c(1, 3))
#Q-Q plot for original model
qqnorm(model1$residuals)
qqline(model1$residuals)

#Q-Q plot for Box-Cox transformed model
qqnorm(model2$residuals)
qqline(model2$residuals)
#residual plots
res2 <- resid(model2)
#produce residual vs. fitted plot
plot(fitted(model2), res2 ,main = 'Residuals vs fitted')

#display both Q-Q plots
par(op)

#Answer:

  • With log transformations, the q-q and residual plots looks much improved!

  • Normality of residuals looks much better and the residuals vs. fitted is now randomly scattered around zero.

  • This means there is an improvement of explanatory power of the model

  1. 2 pts Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?
print(paste0(c('Adj. R-square of model1:',summary(model1)$adj.r.squared)))
## [1] "Adj. R-square of model1:" "0.363216830561439"
print(paste0(c('Adj. R-square of model2:',summary(model2)$adj.r.squared)))
## [1] "Adj. R-square of model2:" "0.407443764925801"

#Answers:

  • Yes, the explanatory power increased in model2 with log transformations
  1. 4 pts Similar to Question B3, assess and interpret all model assumptions of model2. A model is considered a good fit if all assumptions hold. Based on your interpretation of the model assumptions, is model2 a good fit?
print('Summary of Model2:')
## [1] "Summary of Model2:"
summary(model2)
## 
## Call:
## lm(formula = transformed_y ~ log_chmax_1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22543 -0.59429  0.01065  0.59287  1.85995 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.47655    0.14152    17.5   <2e-16 ***
## log_chmax_1  0.64819    0.05401    12.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared:  0.4103, Adjusted R-squared:  0.4074 
## F-statistic:   144 on 1 and 207 DF,  p-value: < 2.2e-16

#Answer:

  • The model with transformations improved, R-Square increased and all estimated parameters still remained statistically significant

  • The clustering is gone after transformation, heteroskedasticity is gone and the data is more normal; yes, I think this is a much improved fit

Question B5: Prediction - 3 pts

Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction using both model1 and model2 and provide the 95% prediction interval of each prediction on the original scale of the response, performance. What observations can you make about the result in the context of the problem?

# predict using model1
new = data.frame(chmax = 128)
predict.lm(model1, new, interval = "predict", level = 0.95)
##        fit      lwr      upr
## 1 516.4685 252.2519 780.6851
# predict using model2
chmax <- c(log(c(128)+1))
new=list(chmax)
#take the anti-log=> reversing response variable to original scale
cat('model2 prediction and its interval in the original scale', end="\n")
## model2 prediction and its interval in the original scale
head(exp(predict.lm(model2, new, interval="prediction", level=0.95)),1)
##       fit      lwr      upr
## 1 277.723 55.17907 1397.813

#Answer:

Model1 predicts a performance of 516.4685 with a lower bound of 252.2519 and an upper bound of 780.6851 with 95% confidence interval. Model2 predicts a performance of 277.723 with a lower bound of 55.17907 and an upper bound of 1397.813 for 95% confidence interval. Model2 with log transformation, has a larger confidence interval than model1, the model without data transformations. In addition, the predicted value is much lower in Model2 than Model1.

Part C. ANOVA - 8 pts

We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.

# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
  1. 2 pts Using data2, create a boxplot of performance and vendor, with performance on the vertical axis. Interpret the plots.
ggplot(data2, aes(x=vendor, y=performance, color=vendor)) +
geom_boxplot() +
ggtitle("Box Plot of Vendor vs Performance")+theme_economist()

#Answer:

  1. 3 pts Perform an ANOVA F-test on the means of the three vendors. Using an \(\alpha\)-level of 0.05, can we reject the null hypothesis that the means of the three vendors are equal? Please interpret.
model3 = aov(performance ~ vendor, data2)
summary(model3)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## vendor       2 154494   77247   6.027 0.00553 **
## Residuals   36 461443   12818                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Answer:

The p-value of the F-test is 0.005527, which is less than the α-level of 0.05(the one with one asterisk). So we reject the null hypothesis that the mean CPU performance of all three vendors are equal, and conclude that the mean CPU performance of at least one vendor is different.

  1. 3 pts Perform a Tukey pairwise comparison between the three vendors. Using an \(\alpha\)-level of 0.05, which means are statistically significantly different from each other?
TukeyHSD(model3, "vendor", conf.level=0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = performance ~ vendor, data = data2)
## 
## $vendor
##                    diff        lwr      upr     p adj
## hp-honeywell  -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320   16.82659 216.0398 0.0188830
## nas-hp        140.46617   18.11095 262.8214 0.0214092

#Answer:

Only nas-honeywell and nas-hp have p-values of the pairwise comparisons lower than α-level of 0.05 and do not contain zero between their lower and upper bounds. This indicated these 2 pairs are statistically significantly different from each other at the α-level of 0.05 while hp-honeywell does contain zero and it’s p-level higher than 0.05 indicating that its not statistically significantly different.