Part A. ANOVA

Additional Material: ANOVA tutorial

https://datascienceplus.com/one-way-anova-in-r/

Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust to the new time zone since the exposure of the shifted light schedule to their eyes can resets the internal circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial study reported that the human circadian clock can also be reset by only exposing the back of the knee to light, with some hailing this as a major discovery and others challenging aspects of the experimental design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to neither (control group). The effects of treatment to the circadian rhythm were measured two days later by the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while a positive number indicates an advance.

Raw data of phase shift, in hours, for the circadian rhythm experiment

Treatment	Phase Shift (hr)
Control	0.53, 0.36, 0.20, -0.37, -0.60, -0.64, -0.68, -1.27
Knees	0.73, 0.31, 0.03, -0.29, -0.56, -0.96, -1.61
Eyes	-0.78, -0.86, -1.35, -1.48, -1.52, -2.04, -2.83

Question A1 - 3 pts

Consider the following incomplete R output:

Fill in the missing values in the analysis of the variance table.Note: Missing values can be calculated using the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the ANOVA table using the aov() function. Either approach will be accepted.

#Clear the Environment
rm(list=ls())
setwd("C:/Users/krbm878/Documents/00 Graduate School/GT/ISYE 6414/Data")
set.seed(123)

#read in the Data

jetlag <- read.table("jetlag.txt", stringsAsFactors=FALSE, header=TRUE)

anovtable <- aov(PhaseShift ~ Treatment, data = jetlag, typ=2)

dF_Treat <- 3 - 1
dF_Error <- 22 - 3
dF_Total <- dF_Treat + dF_Error
MS_T <- 3.6122
SS_T <- MS_T * dF_Treat
SS_E <- 9.415
MS_E <- SS_E / dF_Error

SS_TOTAL <- SS_T + SS_E

summary(anovtable)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Treatment    2  7.224   3.612   7.289 0.00447 **
## Residuals   19  9.415   0.496                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat("\t ","\t | Df   |","Sum Sq","\nTOTAL","|",dF_Total, "\t|", SS_TOTAL)

##       | Df   | Sum Sq 
## TOTAL | 21   | 16.6394

Answer:

Source	Df	Sum of Squares	Mean Squares	F-statistics	p-value
Treatments	2	7.224	3.6122	7.29	0.004
Error	19	9.415	0.4955
TOTAL	21	16.639

Question A2 - 3 pts

Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based on the context of the topic above (i.e. explain what \(\mu_1\), \(\mu_2\), and \(\mu_3\) mean in words in the context of this problem). Find the estimates of these parameters.

# Specify data column
aggregate(x= jetlag$PhaseShift,     
            
         # Specify group indicator
         by = list(jetlag$Treatment),      
            
         # Specify function (i.e. mean)
         FUN = mean)

Answer: The mean phase shift estimates for each of the three categories are:

Control Group: \(\mu_1\) = \(\mu_C\) = -0.2962500
Knee Group: \(\mu_2\) = \(\mu_K\) = -0.3357143
Eyes Group: \(\mu_3\) = \(\mu_E\) = -1.5514286

Question A3 - 5 pts

Use the ANOVA table in Question A1 to answer the following questions:

1 pts Write the null hypothesis of the ANOVA \(F\)-test, \(H_0\):

Answer: The Null Hypothesis for the ANOVA Test is \(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) (all means are equivalent)
1 pts Write the alternative hypothesis of the ANOVA \(F\)-test, \(H_A\)

Answer: The Alternative Hypothesis for the ANOVA Test is \(H_1\): \(\mu_1\) \(\ne\) \(\mu_2\), \(\mu_1\) \(\ne\) \(\mu_3\), and/or \(\mu_2\) \(\ne\) \(\mu_3\)
1 pts Fill in the blanks for the degrees of freedom of the ANOVA \(F\)-test statistic:

Answer:\(F\)(2, 19)
1 pts What is the p-value of the ANOVA \(F\)-test?

Answer:The p-value of the \(F\)-test is 0.004, which is shown above.
1 pts According the the results of the ANOVA \(F\)-test, does light treatment affect phase shift? Use an \(\alpha\)-level of 0.05.

Answer:We reject the null hypothesis at 95% Confidence that all means are equal because the p-value is less than 0.05. The mean phase shift is not the same for each light treatment, so therefore, the light treatment used statistically affects the phase shift of the assessment.

Part B. Simple Linear Regression

We are going to use regression analysis to estimate the performance of CPUs based on the maximum number of channels in the CPU. This data set comes from the UCI Machine Learning Repository.

The data file includes the following columns:

vendor: vendor of the CPU
chmax: maximum channels in the CPU
performance: published relative performance of the CPU

The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().

#Read the data
setwd("C:/Users/krbm878/Documents/00 Graduate School/GT/ISYE 6414/Data")
rm(list=ls())

library(car)

data = read.csv("machine.csv", head = TRUE, sep = ",")
head(data)

Question B1: Exploratory Data Analysis - 9 pts

3 pts Use a scatter plot to describe the relationship between CPU performance and the maximum number of channels. Describe the general trend (direction and form). Include plots and R-code used.

#Read the data

#load library

library (ggfortify)
## Loading required package: ggplot2
library (ggplot2)
library (cowplot)

ggplot(data=data, aes(x=chmax, y=performance)) + geom_point() + stat_smooth(method = "lm", se = FALSE) +
   xlab('CPU Maximum Channels') + 
   ylab('CPU Performance') + 
   ggtitle('Comparison of CPU Performance versus Maximum Number of Channels')

ggplot(data=data, aes(x=chmax, y=performance, color=vendor)) + geom_point() + 
   stat_smooth(method = "lm", se = FALSE) +
   xlab('CPU Maximum Channels') + 
   ylab('CPU Performance') + 
   ggtitle('Comparison of CPU Performance versus Maximum Number of Channels')

Explaination: The general trend of CPU Performance versus CPU Maximum Channels appears to have a Positive increasing association. However, as the CPU Maximum Channels increases, so does the variance in CPU Performance. This form seems to showcase a megaphone effect, where likely the residuals will increase with increasing fitted values. This means that the constant variance assumption does not hold.

3 pts What is the value of the correlation coefficient between performance and chmax? Please interpret the strength of the correlation based on the correlation coefficient.

# Calculate the correlation coefficient
cor(data$chmax,data$performance)

## [1] 0.6052093

Explaination: The correlation coefficient of 0.6052 confirms there is a moderately positive, increasing association between CPU Maximum Channels and CPU Performance. The strength of this correlation is moderate as the correlation coefficient between 0.5 and 0.7. This aligns with the observations from Part a.

2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for the relationship?

Explaination: The assumptions of the simple linear regression are Linearity, constant variance, independence, and normality. Since constance variance does not hold, I would not recommend a simple linear regression model for the relationship. Additionally, the moderate strength of the correlation coefficient, it wouldn’t be reasonable to assume a simple linear regression model for this relationship would provide the desired insights.

1 pts Based on the analysis above, would you pursue a transformation of the data? Do not transform the data.

Explaination: Since the variance in the CPU Performance increases with increasing Number of Channels, a transformation would be appropriate to improve the linearity between the response variable. Since the constant variance assumption does not hold, a transformation that normalizes or variance-stablizes the response variable would help in this case.A power transformation of the response variable could be performed. If lambda is equal to 1, we do not transform. If lambda is equal to 0, we could use the normal logarithmic transformation. If lambda is equal to -1, we could use the inverse of y, known as a Box-Cox Transformation.

Question B2: Fitting the Simple Linear Regression Model - 11 pts

Fit a linear regression model, named model1, to evaluate the relationship between performance and the maximum number of channels. Do not transform the data. The function you should use in R is:

library(caret)

## Loading required package: lattice

model1 = lm(performance~ chmax, data=data)
summary(model1)

## 
## Call:
## lm(formula = performance ~ chmax, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486.47  -42.20  -22.20   20.31  867.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2252    10.8587   3.428 0.000733 ***
## chmax         3.7441     0.3423  10.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3632 
## F-statistic: 119.6 on 1 and 207 DF,  p-value: < 2.2e-16

#Coefficients (B0)
print(model1$coefficients)

## (Intercept)       chmax 
##   37.225220    3.744088

#Sigma
sigma <-summary(model1)$sigma
#Variance of error terms
sigmasq <- sigma^2

sigma

## [1] 128.3408

sigmasq

## [1] 16471.37

3 pts What are the model parameters and what are their estimates?

Explaination: The model parameters are provided above and the estimates are as follows: \(B_0\) (Intercept) and and estimate of \(\hat{B_0}\) is 37.225220. \(B_1\) (slope) and and estimate of \(\hat{B_1}\) is 3.744088. \(\sigma^2\) and and estimate of \(\hat{\sigma^2}\) is 16471.37.

2 pts Write down the estimated simple linear regression equation.

\(performance\) = 3.744088 x (\(chmax\)) + 37.225220
2 pts Interpret the estimated value of the \(\beta_1\) parameter in the context of the problem.

For every single unit increase in CPU maximum channels, the CPU performance increases by 3.744088 units.
2 pts Find a 95% confidence interval for the \(\beta_1\) parameter. Is \(\beta_1\) statistically significant at this level?

confint(model1,level=0.95)

##                 2.5 %    97.5 %
## (Intercept) 15.817392 58.633048
## chmax        3.069251  4.418926

t_test = summary(model1)$coefficients['chmax','t value']
df = model1$df.residual

p = pt(t_test,df,lower=F)
p

## [1] 1.423882e-22

The 95% confidence interval of \(\hat{B_1}\) is (3.069251, 4.418926), and \(\hat{B_1}\) is 3.744088. This is statistically significantly positive because the pvalue is 1.423882e-22 which is less than the \(\alpha\) level of 0.05

2 pts Is \(\beta_1\) statistically significantly positive at an \(\alpha\)-level of 0.01? What is the approximate p-value of this test?

\(\hat{B_1}\) is 3.744088, which is statistically significantly positive because the pvalue is 1.423882e-22 which is less than the \(\alpha\) level of 0.01

Question B3: Checking the Assumptions of the Model - 8 pts

Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.

2 pts Scatterplot of the data with chmax on the x-axis and performance on the y-axis

ggplot(data=data, aes(x=chmax, y=performance)) + geom_point() + stat_smooth(method = "lm", se = FALSE) +
   xlab('CPU Maximum Channels') + 
   ylab('CPU Performance') + 
   ggtitle('Comparison of CPU Performance versus Maximum Number of Channels') + 
   geom_line(data=data,aes(x=chmax,y=model1$fitted.values),color='blue')

## `geom_smooth()` using formula 'y ~ x'

Model Assumption(s) it checks: Linearity and constant variance

Interpretation: There are apparent departures from the assumptions of the linear regression model for linearity, as the strength of this correlation is moderate as the correlation coefficient between 0.5 and 0.7. After approximately 40 CPU Maximum Channels, the fitted linear model is no longer centered with the data, as there are more measurements of CPU performance below the fitted than and very few measurements above the fitted line. Additionally, as the CPU Maximum Channels increases, so does the variance in CPU Performance. This form seems to showcase a megaphone effect, where likely the residuals will increase with increasing fitted values. This means that the constant variance assumption does not hold.

3 pts Residual plot - a plot of the residuals, \(\hat\epsilon_i\), versus the fitted values, \(\hat{y}_i\)

ggplot(data=data, aes(x=model1$fitted.values, y=model1$residuals)) + geom_point() + 
   geom_hline(yintercept = 0, color='red') +
   xlab('Fitted Values') + 
   ylab('Residuals') + 
   ggtitle('Residual Plot of Model')

**Model Assumption(s) it checks: Constant variance and potentially Linearity as the relationship is not linear.

Interpretation:As the CPU Maximum Channels increases, so does the variance in CPU Performance. This form seems to showcase a megaphone effect, where likely the residuals will increase with increasing fitted values. This means that the constant variance assumption does not hold. Additionally, the widening of the relationship between residuals and the predicting variable showing their relationship is not linear and is a departure from the linearity assumption.

3 pts Histogram and q-q plot of the residuals

library(magrittr)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggplot(data = data, aes(x = model1$residuals)) +
    geom_histogram(bins = 15, fill = 'blue', color = 'black') +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

ggplot(data = data, aes(sample = model1$residuals)) +
    stat_qq(color='darkgreen') + stat_qq_line() +
    labs(title = 'Q-Q Plot of Residuals', x = 'Theoretical Quantiles', y = 'Residual Quantiles')

shapiro.test(model1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.75156, p-value < 2.2e-16

Model Assumption(s) it checks: Normality

Interpretation: Since the quantiles of the residuals do not line up with the normal quantiles, thereby departing from a straight line and forming tails at both ends, there is an indication of either a skewed distribution, or heavy-tail distribution. The histogram confirms that there heavy-tails in our distribution, thereby confirming the normality asssumption does not hold. The Shapiro-Wiki, checks for Normality and reported a p-value near 0, which is less than 0.05 (95% confidence level). The null hypothesis for the test is that the data is normally distributed is rejected by this this test, confirming Normality in the data does not hold.

Question B4: Improving the Fit - 10 pts

2 pts Use a Box-Cox transformation (boxCox()) in car() package or (boxcox()) in MASS() package to find the optimal \(\lambda\) value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?

library (car)
library (MASS)

box_cox_plot = boxCox(model1)

lambda <- box_cox_plot$x[which.max(box_cox_plot$y)]

cat("Optimal Lambda:", format(round(lambda/0.5)*0.5,nsmall=1),end="\n")

## Optimal Lambda: 0.0

Interpretation: Since \(\lambda\) is equal to 0, a normal logarithmic transformation could normalize or variance-stablize the response variable since the normality and constant variance assumption does not hold.

2 pts Create a linear regression model, named model2, that uses the log transformed performance as the response, and the log transformed chmax as the predictor. Note: The variable chmax has a couple of zero values which will cause problems when taking the natural log. Please add one to the predictor before taking the natural log of it

model2 = lm(log(performance)~ log(chmax+1), data=data)
summary(model2)

## 
## Call:
## lm(formula = log(performance) ~ log(chmax + 1), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22543 -0.59429  0.01065  0.59287  1.85995 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.47655    0.14152    17.5   <2e-16 ***
## log(chmax + 1)  0.64819    0.05401    12.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.807 on 207 degrees of freedom
## Multiple R-squared:  0.4103, Adjusted R-squared:  0.4074 
## F-statistic:   144 on 1 and 207 DF,  p-value: < 2.2e-16

2 pts Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?

cat("R-Squared values Before Transformation (Model 1): ", summary(model1)$r.squared)

## R-Squared values Before Transformation (Model 1):  0.3662783

cat("R-Squared values After Transformation (Model 2): ", summary(model2)$r.squared)

## R-Squared values After Transformation (Model 2):  0.4102926

Observation: After transforming our simple linear regression, we can see in improvement in our correlation coefficient, which indicates an improves the explanitory power and linearity assumption in the simple linear regression model.

4 pts Similar to Question B3, assess and interpret all model assumptions of model2. A model is considered a good fit if all assumptions hold. Based on your interpretation of the model assumptions, is model2 a good fit?

# Assess all Model Assumptions

scatter_plot_2 = ggplot(data=data, aes(x=log(chmax+1), y=log(performance))) + geom_point() + stat_smooth(method = "lm", se = FALSE) +
   xlab('CPU Maximum Channels') + 
   ylab('CPU Performance') + 
   ggtitle('CPU Performance vs Maximum # of Channels') + 
   geom_line(data=data,aes(x=log(chmax+1),y=model2$fitted.values),color='blue')

residual_2 = ggplot(data=data, aes(x=model2$fitted.values, y=model2$residuals)) + geom_point() + 
   geom_hline(yintercept = 0, color='red') +
   xlab('Fitted Values') + 
   ylab('Residuals') + 
   ggtitle('Residual Plot of Model 2') 

histo_2 = ggplot(data = data, aes(x = model2$residuals)) +
    geom_histogram(bins = 12, fill = 'blue', color = 'black') +
    labs(title = 'Histogram of Model 2 Residuals', x = 'Residuals', y = 'Frequency')

qqp_2 =ggplot(data = data, aes(sample = model2$residuals)) +
    stat_qq(color='darkgreen') + stat_qq_line() +
    labs(title = 'Q-Q Plot of Residuals', x = 'Theoretical Quantiles', y = 'Residual Quantiles')

shapiro.test(model2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.99287, p-value = 0.4086

plot_grid(scatter_plot_2,residual_2,histo_2,qqp_2,ncol=2,nrow=2)

## `geom_smooth()` using formula 'y ~ x'

Analysis:

Linearity/Mean Zero - The CPU Performance data appears to be evenly distributed above and below the model line across the full range of the CPU Maximum number of channels. So linearity holds true.
Constant Variance- The residuals are random and show equal variance around the zero line, confirming that the constant variance assumption holds true.
Independence - From the histogram, there does not appear to be any clustering effect, therefore, the residual errors are likely uncorrelated, so the Independence assumption holds.
Normality - The Histogram and the Q-Q Plot suggest that the normaility assumption holds true, as the data is equally distributed around the center and the quantiles of the residuals do line up with the normal quantiles. Additionally, The Shapiro-Wiki, checks for Normality and reported a p-value of 0.41, which is greater than 0.05 (95% confidence level). The null hypothesis for the test is that the data is normally distributed is accepted by this this test, confirming Normality in the data does hold.

Question B5: Prediction - 3 pts

Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction using both model1 and model2 and provide the 95% prediction interval of each prediction on the original scale of the response, performance. What observations can you make about the result in the context of the problem?

#Find predicted CPU Performance when Max Channels is 128 using linear regression models
CPU_Pred = data.frame(chmax=128)

cat("Model 1 Prediction:","\n")

## Model 1 Prediction:

predict(model1,CPU_Pred,interval="prediction", level=0.95)

##        fit      lwr      upr
## 1 516.4685 252.2519 780.6851

cat("Model 2 Prediction:","\n")

## Model 2 Prediction:

exp(predict(model2,CPU_Pred,interval="prediction", level=0.95))

##       fit      lwr      upr
## 1 277.723 55.17907 1397.813

Analysis: Model 1’s predicted CPU Performance is 516.5 with a 95% confidence interval of (252.3,780.7). Model 2’s predicted CPU Performance is lower at 277.7 with a 95% confidence interval of (55.2,1387.8). Model 2’s confidence interval is much larger because of the log transformation applied to the data. However, each models predicted values fall within the 95% confidence interval of the other model (model 1’s prediction falls in the CI of model 2, etc) and the predictions are within the range of the data set.

Part C. ANOVA - 8 pts

We are going to continue using the CPU data set to analyse various vendors in the data set. There are over 20 vendors in the data set. To simplify the task, we are going to limit our analysis to three vendors, specifically, honeywell, hp, and nas. The code to filter for those vendors is provided below.

# Filter for honeywell, hp, and nas
data2 = data[data$vendor %in% c("honeywell", "hp", "nas"), ]
data2$vendor = factor(data2$vendor)

head(data2)

2 pts Using data2, create a boxplot of performance and vendor, with performance on the vertical axis. Interpret the plots.

ggplot(data=data2, aes(x=vendor, y=performance, color=vendor)) + geom_boxplot() + 
   xlab('Vendor') + 
   ylab('CPU Performance') + 
   ggtitle('Box Plot of CPU Performance for Each Vendor')

Analysis: CPU performance does appear to differ depending on the vendor utilized. The Vendor NAS has the widest range of reported CPU Performance compared to hp and honeywell. However, there is much more variability in CPU performance in NAS, whereas HP has the lowest CPU Performance, but less variability in the CPU performance.

3 pts Perform an ANOVA F-test on the means of the three vendors. Using an \(\alpha\)-level of 0.05, can we reject the null hypothesis that the means of the three vendors are equal? Please interpret.

anovtableCPU <- aov(performance ~ vendor, data = data2)

summary(anovtableCPU)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## vendor       2 154494   77247   6.027 0.00553 **
## Residuals   36 461443   12818                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis: The null hypothesis for the \(F-test\) is means are equal for the CPU performance for all three vendors. Since the \(F-test\) \(Statistic\) is \(0.0055\), which is less than our \(\alpha\) of \(0.05\), we reject the null hypothesis and conclude that at least one of the vendors has a different mean CPU Performance.

3 pts Perform a Tukey pairwise comparison between the three vendors. Using an \(\alpha\)-level of 0.05, which means are statistically significantly different from each other?

tukey.test <- TukeyHSD(anovtableCPU, "vendor", conf.level = 0.95)
tukey.test

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = performance ~ vendor, data = data2)
## 
## $vendor
##                    diff        lwr      upr     p adj
## hp-honeywell  -24.03297 -153.76761 105.7017 0.8934786
## nas-honeywell 116.43320   16.82659 216.0398 0.0188830
## nas-hp        140.46617   18.11095 262.8214 0.0214092

plot(tukey.test)

Analysis: From the Tukey test for the pairwise comparison of the mean CPU Performance for each vendor, we can see the mean CPU performance between NAS and Honeywell/HP is significantly different and have p values less than our \(\alpha\) of \(0.05\). Therefore, the Mean CPU Performance for NAS is statistically significantly different from hp and honeywell. Only the mean CPU performance 95% confidence intervals between hp and honeywell contain 0 in the graph above, indicating we reject the hypothesis that these two vendor’s mean CPU performance is statistically different.

HW1 Assignment