Part A. ANOVA

Additional Material: ANOVA tutorial

https://datascienceplus.com/one-way-anova-in-r/

Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone because exposing their eyes to the shifted light schedule can reset the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998), in a highly controversial study, reported that the human circadian clock can also be reset by exposing only the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only, or to neither (control group). The effects of treatment on the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.

Raw data of phase shift, in hours, for the circadian rhythm experiment

Treatment Phase Shift (hr)
Control 0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27
Knees 0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61
Eyes -0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83

Question A1 - 3 pts

## Read data into R
shift_hours = read.table("anova_data.txt", header = TRUE)

## Response and predictor variables
phaseshift = shift_hours$PhaseShift
treatment = shift_hours$Treatment

## Convert the treatment labels into a categorical variable (factor) in R
treatment = as.factor(treatment)

model = aov(phaseshift ~ treatment)
summary(model)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## treatment    2  7.224   3.612   7.289 0.00447 **
## Residuals   19  9.415   0.496                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Consider the following incomplete R output:

Source      Df   Sum of Squares   Mean Squares   F-statistic   p-value
Treatments   2            7.224          3.612         7.289     0.004
Error       19            9.415          0.496
TOTAL       21           16.639

Fill in the missing values in the analysis of variance table. Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function (a sketch follows below). Either approach will be accepted.
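For completeness, a minimal R sketch (object names are illustrative) that rebuilds the data frame from the table above and reproduces the ANOVA table with aov(); the TOTAL row then follows as Df = 2 + 19 = 21 and SS = 7.224 + 9.415 = 16.639:

control <- c(0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27)
knees   <- c(0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61)
eyes    <- c(-0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83)
shift <- data.frame(
  PhaseShift = c(control, knees, eyes),
  Treatment  = factor(rep(c("Control", "Knees", "Eyes"), times = c(8, 7, 7)))
)
# Should reproduce the Treatment and Residuals rows shown in Question A1
summary(aov(PhaseShift ~ Treatment, data = shift))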

Question A2 - 3 pts

Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.

Solution A2:

## Obtain estimated means
model.tables(model, type="means")
## Tables of means
## Grand mean
##            
## -0.7127273 
## 
##  treatment 
##     Control   Eyes   Knees
##     -0.3087 -1.551 -0.3357
## rep  8.0000  7.000  7.0000

\(\mu_1\) (Control): the true mean phase shift, in hours, of melatonin production for subjects receiving no light treatment; estimate \(\hat\mu_1 = -0.3087\).

\(\mu_2\) (Eyes): the true mean phase shift for subjects with bright light applied to the eyes; estimate \(\hat\mu_2 = -1.551\).

\(\mu_3\) (Knees): the true mean phase shift for subjects with bright light applied to the knees; estimate \(\hat\mu_3 = -0.3357\).
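The same estimates can also be read off directly as the group sample means; a one-line check using the variables defined in Question A1:

# Group sample means are the estimates of mu_1, mu_2, mu_3
tapply(phaseshift, treatment, mean)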

Question A3 - 5 pts

Use the ANOVA table in Question A1 to answer the following questions:

  1. 1 pts Write the null hypothesis of the ANOVA \(F\)-test, \(H_0\)

Solution A3.A: The null hypothesis of the ANOVA \(F\)-test is \(H_0: \mu_1 = \mu_2 = \mu_3\), i.e., the mean phase shifts of the three treatment groups are all equal.

  2. 1 pts Write the alternative hypothesis of the ANOVA \(F\)-test, \(H_A\)

Solution A3.B: The alternative hypothesis of the ANOVA \(F\)-test is \(H_A:\) not all of \(\mu_1\), \(\mu_2\), \(\mu_3\) are equal, i.e., at least two of the treatment means differ. (Note that \(H_A\) does not require every pair of means to differ.)

  3. 1 pts Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic:

Solution A3.C: \(F\)(2, 19)

  4. 1 pts What is the p-value of the ANOVA \(F\)-test?

Solution A3.D: p-value of the ANOVA \(F\)-test = 0.00447

  5. 1 pts According to the results of the ANOVA \(F\)-test, does light treatment affect phase shift? Use an \(\alpha\)-level of 0.05.

Solution A3.E: At an \(\alpha\)-level of 0.05, we reject the null hypothesis that all means are equal because the p-value of 0.00447 is less than 0.05. Hence the light treatments used in this study do affect phase shift.
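Equivalently, the rejection decision can be checked against the critical value of the F-distribution; a quick sketch:

# Reject H0 at alpha = 0.05 because the observed F = 7.289 exceeds
# the critical value below (approximately 3.52)
qf(0.95, df1 = 2, df2 = 19)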

Part B. Simple Linear Regression

We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.

The data file includes several columns; the ones used in this analysis are vendor (the CPU vendor), chmax (the maximum number of channels), and performance (CPU performance).

The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().

# Read in the data
data = read.csv("machine.csv", header = TRUE, sep = ",")
# Show the first few rows of data
head(data, 3)

Question B1: Exploratory Data Analysis - 9 pts

  1. 3 pts Use a scatter plot to describe the relationship between CPU performance and the maximum number of channels. Describe the general trend (direction and form). Include plots and R-code used.

Solution B1.A:

chmax = data$chmax
vendor = data$vendor
performance = data$performance
plot(chmax,performance,ylab="Level of Performance",xlab="Max CPU Channels",
     main="CPU Channels vs CPU Performance")
abline(lm(performance ~ chmax, data = data), col = "blue")

In general, the scatter plot shows a positive association between the maximum number of CPU channels and the level of CPU performance. With respect to form, as the number of CPU channels increases, the variance in CPU performance also increases; the fitted-versus-residuals plot in Solution B1.C below shows a cone shape that points to non-constant variance.

  2. 3 pts What is the value of the correlation coefficient between performance and chmax? Please interpret the strength of the correlation based on the correlation coefficient.

Solution B1.B:

# Correlation Coefficient
cor(chmax, performance)
## [1] 0.6052093

The correlation coefficient of 0.6052093 is positive, indicating a positive linear relationship between the maximum number of CPU channels and CPU performance. Since 0.5 < 0.6052093 < 0.7, the strength of the relationship is “moderate”. (A significance check for this correlation is sketched below.)
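As a complementary check (not required by the question), one can formally test whether this correlation differs from zero, using the variables defined above:

# Tests H0: rho = 0 and reports a confidence interval for the correlation
cor.test(chmax, performance)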

  3. 2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for the relationship?

Solution B1.C: The assumptions of a simple linear regression model are linearity/mean-zero errors, constant variance, independence of the errors, and normality of the errors.

model1 = lm(performance ~ chmax, data)
plot(fitted(model1),resid(model1),main="Fitted vs Residuals",
     xlab="Fitted Values")

Based on the fitted-versus-residuals plot, the constant variance assumption is clearly violated. Combined with the only moderate correlation, I would not recommend a simple linear regression model for this data set without further work.

  4. 1 pts Based on the analysis above, would you pursue a transformation of the data? Do not transform the data.

Solution B1.D: Because the variance in CPU performance increases with the number of CPU channels, a data transformation would be appropriate. Specifically, since the constant variance assumption does not hold, a variance-stabilizing transformation would help improve the linear model.

Question B2: Fitting the Simple Linear Regression Model - 11 pts

Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:

# Your code here...
model1 = lm(performance ~ chmax, data)
summary(model1)
## 
## Call:
## lm(formula = performance ~ chmax, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486.47  -42.20  -22.20   20.31  867.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2252    10.8587   3.428 0.000733 ***
## chmax         3.7441     0.3423  10.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3632 
## F-statistic: 119.6 on 1 and 207 DF,  p-value: < 2.2e-16
  1. 3 pts What are the model parameters and what are their estimates?
# Your code here...
# Coefficient estimates (full model summary shown above)
print(model1$coefficients)
## (Intercept)       chmax 
##   37.225220    3.744088
# Estimate of the error variance: sigma-hat^2 = (residual standard error)^2
summary(model1)$sigma^2
## [1] 16471.37

Solution B2.A: The model parameters are the intercept \(\beta_0\), the slope \(\beta_1\), and the error variance \(\sigma^2\). Their estimates are \(\hat\beta_0 = 37.225220\), \(\hat\beta_1 = 3.744088\), and \(\hat\sigma^2 = 16471.37\) (the square of the residual standard error).

  2. 2 pts Write down the estimated simple linear regression equation.

Solution B2.B:

\(\widehat{\text{performance}} = 37.225220 + 3.744088 \times \text{chmax}\)

  3. 2 pts Interpret the estimated value of the \(\beta_1\) parameter in the context of the problem.

Solution B2.C: The estimated \(\beta_1\) is the slope: for each one-unit increase in the maximum number of channels, the estimated mean CPU performance increases by 3.744088 units.
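For example, under this fit, machines that differ by 10 channels are predicted to differ in mean performance by about 37.44 units:

# Worked example: 10 * beta1-hat = 10 * 3.744088, about 37.44
10 * coef(model1)["chmax"]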

  4. 2 pts Find a 95% confidence interval for the \(\beta_1\) parameter. Is \(\beta_1\) statistically significant at this level?
## Confidence intervals for the coefficients
confint(model1, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 15.817392 58.633048
## chmax        3.069251  4.418926
## t-statistic for beta_1 (testing H_0: beta_1 = 0)
summary(model1)$coefficients['chmax','t value']
## [1] 10.93809

Solution B2.D: The 95% confidence interval for \(\beta_1\) is (3.069251, 4.418926). Because this interval does not contain 0, \(\beta_1\) is statistically significant at this level. A hand computation of the same interval is sketched below.
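The hand computation uses the usual formula \(\hat\beta_1 \pm t_{0.975,\,207}\,\mathrm{SE}(\hat\beta_1)\):

b1  <- summary(model1)$coefficients['chmax', 'Estimate']
se1 <- summary(model1)$coefficients['chmax', 'Std. Error']
# Should match confint(): approximately (3.069, 4.419)
b1 + c(-1, 1) * qt(0.975, df = 207) * se1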

  5. 2 pts Is \(\beta_1\) statistically significantly positive at an \(\alpha\)-level of 0.01? What is the approximate p-value of this test?
## 99% confidence intervals for the coefficients
confint(model1, level = 0.99)
##                0.5 %    99.5 %
## (Intercept) 8.994891 65.455549
## chmax       2.854185  4.633991
## One-sided p-value for H_A: beta_1 > 0
t_value = summary(model1)$coefficients['chmax','t value']
pvalue = pt(t_value, df = 207, lower.tail = FALSE)
pvalue
## [1] 1.423882e-22

Solution B2.E: Yes. Testing \(H_0: \beta_1 \le 0\) against \(H_A: \beta_1 > 0\) gives a one-sided p-value of 1.423882e-22, far below the \(\alpha\)-level of 0.01, so \(\beta_1\) is statistically significantly positive. Consistent with this, the 99% confidence interval (2.854185, 4.633991) lies entirely above zero.

Question B3: Checking the Assumptions of the Model - 8 pts

Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.

  1. 2 pts Scatterplot of the data with chmax on the x-axis and performance on the y-axis
plot(chmax,performance,ylab="Level of Performance",xlab="Max CPU Channels",
     main="CPU Channels vs CPU Performance")
abline(lm(performance ~ chmax, data = data), col = "blue")

Solution B3.A: Model Assumption(s) it checks: Linearity and Constant variance

Interpretation: There is a positive association between the maximum number of CPU channels and the level of CPU performance. With respect to form, as the number of CPU channels increases, the variance in CPU performance also increases; the cone shape visible in the plot points to non-constant variance.

  2. 3 pts Residual plot - a plot of the residuals, \(\hat\epsilon_i\), versus the fitted values, \(\hat{y}_i\)
plot(fitted(model1),resid(model1),main="Fitted vs Residuals",
     xlab="Fitted Values")
abline(0,0)

Solution B3.B: Model Assumption(s) it checks: linearity, constant variance, and, to some extent, independence of the errors.

Interpretation: The residuals show larger variance as the fitted values increase, which shows that the constant variance assumption is violated. There is also some clustering of residuals, suggesting the independence assumption may not hold either.

  3. 3 pts Histogram and q-q plot of the residuals
qqnorm(resid(model1),main="QQ-Plot of Residuals")
qqline(resid(model1))

hist(resid(model1),main="Histogram of Residuals")

Solution B3.C: Model Assumption(s) it checks: Normality

Interpretation: The curvature at the tail ends of the Q-Q plot is a clear sign of non-normality. The histogram is also skewed, with gaps at the larger residual values, so the normality assumption appears to be violated. (A formal companion check is sketched below.)
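A formal companion to these plots (optional; the question asks only for a graphical assessment) is the Shapiro-Wilk test, where a small p-value would corroborate the non-normality seen above:

# Shapiro-Wilk test of normality on the model1 residuals
shapiro.test(resid(model1))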

Question B4: Improving the Fit - 10 pts

  1. 2 pts Use a Box-Cox transformation (boxCox() in the car package or boxcox() in the MASS package) to find the optimal \(\lambda\) value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?
library(car)
## Loading required package: carData
library(MASS)
boxmodel = boxCox(model1)

## lambda that maximizes the profile log-likelihood
(lambda <- boxmodel$x[which.max(boxmodel$y)])
## [1] -0.1010101

Solution B4.A: The optimal \(\lambda\) value rounded to the nearest half integer is 0. Because \(\lambda = 0\), the Box-Cox procedure suggests a natural log transformation of the response.
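For reference, the Box-Cox family of transformations is

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\log y, & \lambda = 0,
\end{cases}
\]

so rounding \(\hat\lambda = -0.101\) to \(\lambda = 0\) corresponds exactly to the log transformation applied in the next part.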

  2. 2 pts Create a linear regression model, named model2, that uses the log transformed performance as the response, and the log transformed chmax as the predictor. Note: The variable chmax has a couple of zero values which will cause problems when taking the natural log. Please add one to the predictor before taking the natural log of it.

Solution B4.B:

model2 = lm(log(performance)~log(chmax+1))
summary(model2)
## 
## Call:
## lm(formula = log(performance) ~ log(chmax + 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22543 -0.59429  0.01065  0.59287  1.85995 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.47655    0.14152    17.5   <2e-16 ***
## log(chmax + 1)  0.64819    0.05401    12.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared:  0.4103, Adjusted R-squared:  0.4074 
## F-statistic:   144 on 1 and 207 DF,  p-value: < 2.2e-16
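A standard reading of this log-log fit (added here for context, not asked for by the question): the slope acts approximately as an elasticity, so a 1% increase in \(\text{chmax} + 1\) is associated with roughly a 0.65% increase in performance.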
  3. 2 pts Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?

Solution B4.C:

cat("R-Squared values Before Transformation (Model 1): ", summary(model1)$r.squared)
## R-Squared values Before Transformation (Model 1):  0.3662783
cat("R-Squared values After Transformation (Model 2): ", summary(model2)$r.squared)
## R-Squared values After Transformation (Model 2):  0.4102926

There is an improvement in the explanatory power of the model: R-squared increases from 0.366 to 0.410. This tells us that the transformation brought the data closer to satisfying the linearity assumption.

  4. 4 pts Similar to Question B3, assess and interpret all model assumptions of model2. A model is considered a good fit if all assumptions hold. Based on your interpretation of the model assumptions, is model2 a good fit?
plot(log(chmax+1),log(performance),ylab="log(Performance)",xlab="log(Max CPU Channels + 1)",
     main="CPU Channels vs CPU Performance (log scale)")
abline(lm(log(performance) ~ log(chmax+1), data = data), col = "blue")

plot(fitted(model2),resid(model2),main="Fitted vs Residuals",
     xlab="Fitted Values")
abline(0,0)

qqnorm(resid(model2),main="QQ-Plot of Residuals")
qqline(resid(model2))

hist(resid(model2),main="Histogram of Residuals")

Solution B4.D: Linearity Assumption: The transformed data are now distributed evenly above and below the fitted line, in a linear increasing fashion. This assumption holds.

Constant Variance Assumption: The residuals are scattered evenly and randomly above and below the zero line across the range of fitted values. This assumption holds.

Independence Assumption: There are no longer visible clusters in the residual plot, and the histogram shows no sign of clustering either, so the residuals are plausibly uncorrelated. This assumption holds.

Normality Assumption: The histogram is roughly symmetric about zero with no gaps, and the quantiles of the residuals line up well with the normal quantiles, departing from the line only slightly at the tails. This assumption holds.

Since all four assumptions hold, model2 is a good fit.

Question B5: Prediction - 3 pts

Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction using both model1 and model2 and provide the 95% prediction interval of each prediction on the original scale of the response, performance. What observations can you make about the result in the context of the problem?

chmax_128 = data.frame(chmax=128)

predict(model1,chmax_128,interval="prediction", level=0.95)
##        fit      lwr      upr
## 1 516.4685 252.2519 780.6851
exp(predict(model2,chmax_128,interval="prediction", level=0.95))
##       fit      lwr      upr
## 1 277.723 55.17907 1397.813

Solution B5: Model 1's prediction of CPU performance at chmax = 128 is 516.4685, with a 95% prediction interval of (252.2519, 780.6851). Model 2's prediction, back-transformed to the original scale, is 277.723, with a 95% prediction interval of (55.17907, 1397.813).

Each model's point prediction falls inside the other model's interval, i.e. 516.4685 is within (55.17907, 1397.813) and 277.723 is within (252.2519, 780.6851). Model 2's prediction interval on the original scale is much wider than model 1's because exponentiating the symmetric log-scale interval stretches its upper end.

Part C. ANOVA - 8 pts

We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.

# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)
  1. 2 pts Using data2, create a boxplot of performance and vendor, with performance on the vertical axis. Interpret the plots.
performance1 = data2$performance
vendor1 = data2$vendor
boxplot(performance1~vendor1, xlab = "Vendor", ylab = "CPU Performance")

Solution C1: NAS has the highest variability, with performance spread over a wider range and centered at a higher level than both HP and Honeywell. Honeywell's variability is slightly greater than HP's, but its performance is centered at a lower level than NAS's. HP has the lowest-performing CPUs of the three vendors and also the least variability in its performance data. (A numeric summary is sketched below.)
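A numeric companion to the boxplot, using the variables defined above:

# Per-vendor five-number summaries of CPU performance
tapply(performance1, vendor1, summary)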

  2. 3 pts Perform an ANOVA F-test on the means of the three vendors. Using an \(\alpha\)-level of 0.05, can we reject the null hypothesis that the means of the three vendors are equal? Please interpret.
model3 = aov(performance1 ~ vendor1)
summary(model3)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## vendor1      2 154494   77247   6.027 0.00553 **
## Residuals   36 461443   12818                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(model3, type="means")
## Tables of means
## Grand mean
##          
## 112.8718 
## 
##  vendor1 
##     honeywell    hp   nas
##         60.46 36.43 176.9
## rep     13.00  7.00  19.0

Solution C2: The p-value of the ANOVA F-test is 0.00553, which is less than our \(\alpha\)-level of 0.05, so we can reject the null hypothesis. Therefore at least two of the vendor means are not equal.

  3. 3 pts Perform a Tukey pairwise comparison between the three vendors. Using an \(\alpha\)-level of 0.05, which means are statistically significantly different from each other?
TukeyHSD(model3, "vendor1", conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = performance1 ~ vendor1)
## 
## $vendor1
##                    diff        lwr      upr     p adj
## hp-honeywell  -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320   16.82659 216.0398 0.0188830
## nas-hp        140.46617   18.11095 262.8214 0.0214092

Solution C3: The nas-honeywell and nas-hp comparisons have p-values (0.0189 and 0.0214) less than our \(\alpha\) of 0.05, which means that NAS CPU performance is statistically significantly different from both HP and Honeywell. On the other hand, hp-honeywell has a p-value of 0.8935, greater than 0.05, so HP and Honeywell are not statistically significantly different in mean CPU performance.
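As an optional visual companion, the Tukey intervals can be plotted; intervals that cross zero (here, hp-honeywell) correspond to non-significant pairwise differences:

# Plot the family-wise confidence intervals from TukeyHSD
plot(TukeyHSD(model3, "vendor1"))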