Submission
This assignment is to be submitted as .pdf knitted file to Gradescope. If you have any questions about concepts, please contact Dr. Fehlberg during office hours or email at m.fehlberg@utah.edu. If you have any questions related R and the homework assignments, please contact the Class TA.
Here is information about R Markdown (http://rmarkdown.rstudio.com) and an R Markdown Cheatsheet (https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) with some handy information. When you execute code within the notebook, the results appear beneath the code. You can also write html and Latex syntax within this document.
The One-way Analysis of Variance (ANOVA)
The One-way ANOVA is a statistical technique that allows us to compare mean differences of one outcome (dependent) variable across two or more groups (levels) of one independent variable (factor). If there are only two levels (e.g. Male/Female) of the independent (predictor) variable the results are analogous to Student’s t test. It is also true that ANOVA is a special case of the GLM or regression models so as the number of levels increase it might make more sense to try one of those approaches. ANOVA also allows for comparisons of mean differences across multiple factors (Factorial or N-way ANOVA) which we will address later.
There are several ways to run ANOVA in R, so let’s start with the
simplest from the base R aov. While it’s possible to wrap
the command in a print statement I recommend you always save the results
out to an R object e.g., output.aov. It’s almost inevitable
that further analysis will ensue and the R object has a lot of useful
information. If you’re new to R a couple of quick notes: the dependent
variable goes to the left of the tilde (~) and our independent or
predictor variable(s) are placed on the right. aov is not
limited to Oneway ANOVA so adding additional factors is possible.
ANOVA is a specialized case of a general linear model (GLM) and
therefore the list object returned is actually of both the
aov and lm classes. The names
command will give you some sense of all the information contained in the
list object. The summary command gives us the key ANOVA data we need and
produces a classic ANOVA table. If you’re unfamiliar with them and want
to know more especially where the numbers come from I recommend
reviewing the Montgomery Text Ch3.
Problems:
Problem 1 (15 pts)
Imagine that you are interested in understanding whether knowing the brand of car tire can help you predict whether you will get more or less mileage before you need to replace them. We’ll draw what is hopefully a random sample of 60 tires from four different manufacturers and use the mean mileage by brand to help inform our thinking. While we expect variation across our sample we’re interested in whether the differences between the tire brands (groups) is significantly different than what we would expect in random variation within the groups.
Our research or testable hypothesis is then described \[\mu_{Firestone} \neq \mu_{Bridgestone} \neq \mu_{Milestar} \neq \mu_{Falken}\]
as at least one of the tire brand populations is different than the other three. Our null is basically “brand doesn’t matter in predicting tire mileage - all brands are the same”.
The following data set with 60 observations is available in tires.csv (file located on Canvas).
tires <- read.csv(file="tires.csv",header=TRUE)
str(tires)
## 'data.frame': 60 obs. of 2 variables:
## $ Brands : chr "Firestone" "Firestone" "Firestone" "Firestone" ...
## $ Mileage: num 33 36.4 32.8 37.6 36.3 ...
shapiro.test(tires$Mileage[tires$Brands == "Milestar"])
##
## Shapiro-Wilk normality test
##
## data: tires$Mileage[tires$Brands == "Milestar"]
## W = 0.95228, p-value = 0.561
shapiro.test(tires$Mileage[tires$Brands == "Falken"])
##
## Shapiro-Wilk normality test
##
## data: tires$Mileage[tires$Brands == "Falken"]
## W = 0.87754, p-value = 0.04361
shapiro.test(tires$Mileage[tires$Brands == "Firestone"])
##
## Shapiro-Wilk normality test
##
## data: tires$Mileage[tires$Brands == "Firestone"]
## W = 0.95353, p-value = 0.5817
shapiro.test(tires$Mileage[tires$Brands == "Bridgestone"])
##
## Shapiro-Wilk normality test
##
## data: tires$Mileage[tires$Brands == "Bridgestone"]
## W = 0.95509, p-value = 0.6078
tires$Brands <- ordered(tires$Brands, levels=c("Firestone","Bridgestone","Milestar","Falken"))
str(tires)
## 'data.frame': 60 obs. of 2 variables:
## $ Brands : Ord.factor w/ 4 levels "Firestone"<"Bridgestone"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Mileage: num 33 36.4 32.8 37.6 36.3 ...
xtabs( ~Brands, tires )
## Brands
## Firestone Bridgestone Milestar Falken
## 15 15 15 15
aggregate( Mileage ~ Brands, tires, mean )
outcome <- tires$Mileage
group <- tires$Brands
gp.means <- tapply(outcome,group,mean)
gp.means <- gp.means[group]
dev.from.gp.means <- outcome - gp.means
squared.devs <- dev.from.gp.means ^2
Y <- data.frame( group, outcome, gp.means,
dev.from.gp.means, squared.devs )
print(Y, digits = 2)
## group outcome gp.means dev.from.gp.means squared.devs
## 1 Firestone 33 35 -1.801 3.2e+00
## 2 Firestone 36 35 1.636 2.7e+00
## 3 Firestone 33 35 -2.022 4.1e+00
## 4 Firestone 38 35 2.838 8.1e+00
## 5 Firestone 36 35 1.505 2.3e+00
## 6 Firestone 36 35 1.116 1.2e+00
## 7 Firestone 35 35 -0.099 9.8e-03
## 8 Firestone 32 35 -2.420 5.9e+00
## 9 Firestone 34 35 -1.168 1.4e+00
## 10 Firestone 36 35 1.620 2.6e+00
## 11 Firestone 36 35 1.631 2.7e+00
## 12 Firestone 35 35 0.037 1.4e-03
## 13 Firestone 38 35 3.529 1.2e+01
## 14 Firestone 31 35 -4.176 1.7e+01
## 15 Firestone 33 35 -2.224 4.9e+00
## 16 Bridgestone 34 32 1.743 3.0e+00
## 17 Bridgestone 32 32 0.215 4.6e-02
## 18 Bridgestone 35 32 3.226 1.0e+01
## 19 Bridgestone 28 32 -3.901 1.5e+01
## 20 Bridgestone 31 32 -0.483 2.3e-01
## 21 Bridgestone 31 32 -0.718 5.2e-01
## 22 Bridgestone 35 32 3.058 9.4e+00
## 23 Bridgestone 34 32 2.196 4.8e+00
## 24 Bridgestone 33 32 0.772 6.0e-01
## 25 Bridgestone 31 32 -0.899 8.1e-01
## 26 Bridgestone 28 32 -3.636 1.3e+01
## 27 Bridgestone 29 32 -2.596 6.7e+00
## 28 Bridgestone 33 32 1.295 1.7e+00
## 29 Bridgestone 32 32 0.585 3.4e-01
## 30 Bridgestone 31 32 -0.855 7.3e-01
## 31 Milestar 34 35 -0.316 1.0e-01
## 32 Milestar 33 35 -1.955 3.8e+00
## 33 Milestar 33 35 -1.346 1.8e+00
## 34 Milestar 37 35 2.100 4.4e+00
## 35 Milestar 37 35 2.212 4.9e+00
## 36 Milestar 35 35 0.320 1.0e-01
## 37 Milestar 35 35 0.193 3.7e-02
## 38 Milestar 33 35 -1.286 1.7e+00
## 39 Milestar 30 35 -4.334 1.9e+01
## 40 Milestar 36 35 1.373 1.9e+00
## 41 Milestar 35 35 0.022 4.9e-04
## 42 Milestar 36 35 1.356 1.8e+00
## 43 Milestar 41 35 6.289 4.0e+01
## 44 Milestar 32 35 -2.593 6.7e+00
## 45 Milestar 33 35 -2.035 4.1e+00
## 46 Falken 40 38 1.596 2.5e+00
## 47 Falken 39 38 0.937 8.8e-01
## 48 Falken 38 38 0.124 1.5e-02
## 49 Falken 38 38 -0.305 9.3e-02
## 50 Falken 37 38 -1.414 2.0e+00
## 51 Falken 37 38 -1.033 1.1e+00
## 52 Falken 37 38 -1.270 1.6e+00
## 53 Falken 37 38 -1.000 1.0e+00
## 54 Falken 40 38 2.199 4.8e+00
## 55 Falken 37 38 -0.618 3.8e-01
## 56 Falken 41 38 2.663 7.1e+00
## 57 Falken 37 38 -0.905 8.2e-01
## 58 Falken 38 38 0.005 2.5e-05
## 59 Falken 38 38 -0.244 6.0e-02
## 60 Falken 37 38 -0.735 5.4e-01
SSw <- sum( squared.devs )
print( SSw )
## [1] 249.3583
gp.means <- tapply(outcome,group,mean)
grand.mean <- mean(outcome)
dev.from.grand.mean <- gp.means - grand.mean
squared.devs <- dev.from.grand.mean ^2
gp.sizes <- tapply(outcome,group,length)
wt.squared.devs <- gp.sizes * squared.devs
Y <- data.frame( gp.means, grand.mean, dev.from.grand.mean,
squared.devs, gp.sizes, wt.squared.devs )
print(Y, digits = 2)
## gp.means grand.mean dev.from.grand.mean squared.devs gp.sizes
## Firestone 35 35 -0.036 0.0013 15
## Bridgestone 32 35 -3.055 9.3329 15
## Milestar 35 35 -0.074 0.0055 15
## Falken 38 35 3.165 10.0165 15
## wt.squared.devs
## Firestone 0.019
## Bridgestone 139.994
## Milestar 0.082
## Falken 150.247
SSb <- sum( wt.squared.devs )
print( SSb )
## [1] 290.3425
# four groups and N = 60 total observations
pf( 18.6, df1 = 3, df2 = 56, lower.tail = FALSE)
## [1] 1.70257e-08
aov( formula = Mileage ~ Brands, data = tires )
## Call:
## aov(formula = Mileage ~ Brands, data = tires)
##
## Terms:
## Brands Residuals
## Sum of Squares 290.3425 249.3583
## Deg. of Freedom 3 56
##
## Residual standard error: 2.110172
## Estimated effects are balanced
aov( tires$Mileage~ tires$Brands )
## Call:
## aov(formula = tires$Mileage ~ tires$Brands)
##
## Terms:
## tires$Brands Residuals
## Sum of Squares 290.3425 249.3583
## Deg. of Freedom 3 56
##
## Residual standard error: 2.110172
## Estimated effects are balanced
aov( Mileage~ Brands, tires )
## Call:
## aov(formula = Mileage ~ Brands, data = tires)
##
## Terms:
## Brands Residuals
## Sum of Squares 290.3425 249.3583
## Deg. of Freedom 3 56
##
## Residual standard error: 2.110172
## Estimated effects are balanced
my.anova <- aov( Mileage~ Brands, tires )
class( my.anova )
## [1] "aov" "lm"
print( my.anova )
## Call:
## aov(formula = Mileage ~ Brands, data = tires)
##
## Terms:
## Brands Residuals
## Sum of Squares 290.3425 249.3583
## Deg. of Freedom 3 56
##
## Residual standard error: 2.110172
## Estimated effects are balanced
summary( my.anova )
## Df Sum Sq Mean Sq F value Pr(>F)
## Brands 3 290.3 96.78 21.73 1.84e-09 ***
## Residuals 56 249.4 4.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tires$Brands <- ordered(tires$Brands, levels=c("Firestone","Bridgestone","Milestar","Falken"))
tire_aov <- aov(Mileage ~ Brands, data = tires)
summary(tire_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brands 3 290.3 96.78 21.73 1.84e-09 ***
## Residuals 56 249.4 4.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
str(tires)
## 'data.frame': 60 obs. of 2 variables:
## $ Brands : Ord.factor w/ 4 levels "Firestone"<"Bridgestone"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Mileage: num 33 36.4 32.8 37.6 36.3 ...
tire_aov <- aov(Mileage ~ Brands, data = tires)
summary(tire_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brands 3 290.3 96.78 21.73 1.84e-09 ***
## Residuals 56 249.4 4.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_result2 <- aov(Mileage ~ Brands, data = tires)
summary(anova_result2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brands 3 290.3 96.78 21.73 1.84e-09 ***
## Residuals 56 249.4 4.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#summary(anova_result)
anova_result <- aov(tires$Mileage ~ tires$Brands)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## tires$Brands 3 290.3 96.78 21.73 1.84e-09 ***
## Residuals 56 249.4 4.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(anova_result)
# Run three seperate t test
#t.test( Firestone, Bridgestone, var.equal = TRUE ) # Student t-test
#Firestone <- with(tires, Mileage[Brand == "Firestone"]) # Mileage change due to Firestone
# Bridgestone <- with(tires, Mileage[Brand == "Bridgestone"]) # mood change due to placebo
pairwise.t.test( x = tires$Mileage, # outcome variable
g = tires$Brands, # grouping variable
p.adjust.method = "none" # which correction to use?
)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: tires$Mileage and tires$Brands
##
## Firestone Bridgestone Milestar
## Bridgestone 0.00025 - -
## Milestar 0.96092 0.00029 -
## Falken 0.00011 5.9e-11 9.6e-05
##
## P value adjustment method: none
tukey_result <- TukeyHSD(anova_result)
plot(tukey_result)
| Columns | Contains | Type | |:-|:-|:-| |Brands|What brand of tire
|factor| |Mileage|Tire life in thousands |num|
State and check the assumptions associated with ANOVA.
samples are independent, equal variance, normal since p=0.518 for Firestone, p= 0.04361 Falken, and p= 0.6078 Bridgerstone
Visualize your data.
See above
Conduct ANOVA and provide an interpretation.
F((3,56) = 21.73, p< 0.001) One-way ANOVA showed a large difference between the mileage among brands.in other words there is a significant main effect of brands in mileage
Perform a proper pairwise comparison test and control for multiple comparisons.
** **
Interpret your results and clearly state your conclusions.
the pairwise comparison indicates that Milestar-Firestone has a p value of 0.96092 which indicates a non significant difference between the mileage of tire brands. Milestar-Bridgestone had a p value of 0.00029 which indicates a significant difference between the mileage of tire brands. Falken-Firestone had a p value of 0.00011 which indicates a significant difference between the mileage of the tire brands. Falken-Bridgestone had a p value of 5.9e-11 which also indicates a significant difference between the mileage of the tire brands. as Falken tires brand has more miles. Firestone-Bridgestone had a p value of 0.00025 which is significantly different.
Problem 2 (15 pts)
(15pt) An experiment was run to determine whether four specific firing temperatures affect the density of a certain type of brick. The experiment led to the following data.
library(knitr)
Density <- c(21.8, 21.9, 21.7, 21.6, 21.7, 21.7, 21.4, 21.5, 21.4, 21.9, 21.8, 21.8, 21.6, 21.5, 21.9, 21.7, 21.8, 21.4)
Temperature <- c(100, 100, 100, 100, 100, 125, 125, 125, 125, 150, 150, 150, 150, 150, 175, 175, 175, 175)
(brick <- data.frame(Temperature, Density))
brick$Temperature <- ordered(brick$Temperature, levels=c("100", "125", "150", "175"))
Does the firing temperature affect the denisty of the bricks? Use \(\alpha = 0.05\). temperatures affect the density of a certain type of brick. The experiment led to the following data.
brick_aov <- aov(Density ~ Temperature, data = brick)
summary(brick_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Temperature 3 0.1561 0.05204 2.024 0.157
## Residuals 14 0.3600 0.02571
There is no difference between the firing temperature and the density of the bricks (F(3,14) = 2.02, p =.157).
Yes it is satisfied given the results of shapiro test
p>0.07151 and box plot. there is no statistical difference between
the brick datasets and a normal distribution
par(mfrow=c(2,2))
plot(brick_aov)
shapiro.test(brick$Density)
##
## Shapiro-Wilk normality test
##
## data: brick$Density
## W = 0.90544, p-value = 0.07151
#ggplot(data=brick,mapping=aes(x=Temperature, y=Density,fill=Temperature))+geom_boxplot()
boxplot(brick$Density ~ brick$Temperature)
No, it might seem that 125 influence brick density but its difficult to tell a difference given the amount of samples
Problem 3 (15 pts)
(7pt) Four different designs for a digital computer circuit are being studied to compare the amount of noise present. The following data have been obtained:
CircuitDesign <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
Noise <- c(19, 20, 19, 30, 8, 80, 61, 73, 56, 80, 47, 26, 25, 35, 50, 95, 46, 83, 78, 97)
(digital_computer_circuits <- data.frame(CircuitDesign, Noise))
digital_computer_circuits$CircuitDesign <-ordered(digital_computer_circuits$CircuitDesign, levels = c("1", "2", "3", "4"))
#ggplot(data = digital_computer_circuits, mapping = aes(x = CircuitDesign, y = Noise, fill = CircuitDesign))+ geom_boxplot()
Is the amount of noise present the same for all four designs? Use \(\alpha = 0.05\).
(circuit_design_aov <- aov(Noise ~CircuitDesign, data = digital_computer_circuits))
## Call:
## aov(formula = Noise ~ CircuitDesign, data = digital_computer_circuits)
##
## Terms:
## CircuitDesign Residuals
## Sum of Squares 12042.0 2948.8
## Deg. of Freedom 3 16
##
## Residual standard error: 13.57571
## Estimated effects are balanced
summary(circuit_design_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## CircuitDesign 3 12042 4014 21.78 6.8e-06 ***
## Residuals 16 2949 184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#ggplot(data = digital_computer_circuits, mapping = aes(x = CircuitDesign, y = Noise, fill = CircuitDesign))+ geom_boxplot()
shapiro.test(digital_computer_circuits$Noise)
##
## Shapiro-Wilk normality test
##
## data: digital_computer_circuits$Noise
## W = 0.9343, p-value = 0.1867
digital_computer_circuits$CircuitDesign <- ordered(digital_computer_circuits$CircuitDesign, levels=c("1", "2", "3", "4"))
anova_result <- aov(digital_computer_circuits$Noise ~ digital_computer_circuits$CircuitDesign)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## digital_computer_circuits$CircuitDesign 3 12042 4014 21.78 6.8e-06 ***
## Residuals 16 2949 184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit <- lm(digital_computer_circuits$Noise ~ digital_computer_circuits$CircuitDesign)
plot(fit)
par(mfrow=c(2,2))
plot(circuit_design_aov)
Yes, normality and residual assumptions were met but design 4 could be an outlier
Design 1