Exploratory Data Analysis:

I. Descriptive Statistics

## Datus$Batch: 1
##      Batch     Calcontent   
##  Min.   :1   Min.   :23.39  
##  1st Qu.:1   1st Qu.:23.40  
##  Median :1   Median :23.46  
##  Mean   :1   Mean   :23.46  
##  3rd Qu.:1   3rd Qu.:23.48  
##  Max.   :1   Max.   :23.56  
## ------------------------------------------------------------ 
## Datus$Batch: 2
##      Batch     Calcontent   
##  Min.   :2   Min.   :23.42  
##  1st Qu.:2   1st Qu.:23.46  
##  Median :2   Median :23.49  
##  Mean   :2   Mean   :23.49  
##  3rd Qu.:2   3rd Qu.:23.50  
##  Max.   :2   Max.   :23.59  
## ------------------------------------------------------------ 
## Datus$Batch: 3
##      Batch     Calcontent   
##  Min.   :3   Min.   :23.46  
##  1st Qu.:3   1st Qu.:23.49  
##  Median :3   Median :23.51  
##  Mean   :3   Mean   :23.52  
##  3rd Qu.:3   3rd Qu.:23.52  
##  Max.   :3   Max.   :23.64  
## ------------------------------------------------------------ 
## Datus$Batch: 4
##      Batch     Calcontent   
##  Min.   :4   Min.   :23.28  
##  1st Qu.:4   1st Qu.:23.37  
##  Median :4   Median :23.39  
##  Mean   :4   Mean   :23.38  
##  3rd Qu.:4   3rd Qu.:23.40  
##  Max.   :4   Max.   :23.46  
## ------------------------------------------------------------ 
## Datus$Batch: 5
##      Batch     Calcontent   
##  Min.   :5   Min.   :23.29  
##  1st Qu.:5   1st Qu.:23.32  
##  Median :5   Median :23.37  
##  Mean   :5   Mean   :23.36  
##  3rd Qu.:5   3rd Qu.:23.38  
##  Max.   :5   Max.   :23.46
## # A tibble: 5 × 3
##   Batch  mean     sd
##   <dbl> <dbl>  <dbl>
## 1     1  23.5 0.0687
## 2     2  23.5 0.0630
## 3     3  23.5 0.0688
## 4     4  23.4 0.0652
## 5     5  23.4 0.0650
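Interpretation: The output above gives descriptive statistics for calcium content by batch, including the mean and standard deviation. The mean is lowest for Batch 5 (23.36) and highest for Batch 3 (23.52). The standard deviations are similar across batches, ranging from about 0.063 to 0.069.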
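The tibble above rounds the means to three significant digits, so it does not show which batch is lowest or highest. The sketch below recomputes the group means and standard deviations in base R. The raw data are not listed in this report, so the data frame is an assumption: it is reconstructed from the printed summaries, using the fact that with five observations per batch R's default quantiles (Min, 1st Qu., Median, 3rd Qu., Max) coincide with the sorted observations.

```r
# Reconstructed data (assumption): values read off the printed five-number
# summaries; the order of observations within a batch is unknown.
Datus <- data.frame(
  Batch = factor(rep(1:5, each = 5)),
  Calcontent = c(23.39, 23.40, 23.46, 23.48, 23.56,   # Batch 1
                 23.42, 23.46, 23.49, 23.50, 23.59,   # Batch 2
                 23.46, 23.49, 23.51, 23.52, 23.64,   # Batch 3
                 23.28, 23.37, 23.39, 23.40, 23.46,   # Batch 4
                 23.29, 23.32, 23.37, 23.38, 23.46)   # Batch 5
)

# Group means and standard deviations, kept at full precision
group_means <- tapply(Datus$Calcontent, Datus$Batch, mean)
group_sds   <- tapply(Datus$Calcontent, Datus$Batch, sd)
print(round(cbind(mean = group_means, sd = group_sds), 4))

# Batches with the lowest and highest mean
lowest  <- names(which.min(group_means))
highest <- names(which.max(group_means))
cat("Lowest mean: Batch", lowest, "\nHighest mean: Batch", highest, "\n")
```

Printing four decimals is enough here to separate batches that the rounded tibble shows as identical.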

II. Graph (Box-plot)

Figure 1: Box-plots of calcium content for the five batches of raw material furnished by the supplier.

Interpretation: The boxplots above show that, at least for our sample, Batch 3 seems to have the highest calcium content and Batch 5 the lowest. There are a few outliers, shown by the points outside the whiskers for Batches 2, 3 and 4.
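The points drawn outside the whiskers can also be listed numerically. The sketch below applies boxplot.stats(), the base R function that implements the rule boxplot() uses by default (points more than 1.5 times the box length beyond a hinge), to each batch. The data frame is an assumption, reconstructed from the printed summaries because the raw data are not listed in this report.

```r
# Reconstructed data (assumption): with five observations per batch, the
# printed Min, 1st Qu., Median, 3rd Qu., Max are the sorted observations.
Datus <- data.frame(
  Batch = factor(rep(1:5, each = 5)),
  Calcontent = c(23.39, 23.40, 23.46, 23.48, 23.56,   # Batch 1
                 23.42, 23.46, 23.49, 23.50, 23.59,   # Batch 2
                 23.46, 23.49, 23.51, 23.52, 23.64,   # Batch 3
                 23.28, 23.37, 23.39, 23.40, 23.46,   # Batch 4
                 23.29, 23.32, 23.37, 23.38, 23.46)   # Batch 5
)

# For each batch, return the observations flagged as outliers
outliers <- lapply(split(Datus$Calcontent, Datus$Batch),
                   function(x) boxplot.stats(x)$out)
print(outliers)
```

With only five observations per batch the boxes are narrow, so this rule flags points readily. The flagged values are better read as a prompt to inspect the data than as evidence of measurement error.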

Experimental Question: Is there a significant variation in calcium content from batch to batch? Use α=5%.

We want to test the hypotheses:
H0: The mean calcium content is the same for all batches of raw material.
Ha: At least one batch of raw material has a mean calcium content that differs from the others.
We use significance level α = 0.05.

ANOVA Table

## Analysis of Variance Table
## 
## Response: Calcontent
##           Df   Sum Sq  Mean Sq F value   Pr(>F)   
## Batch      4 0.096976 0.024244  5.5352 0.003626 **
## Residuals 20 0.087600 0.004380                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  One-way analysis of means
## 
## data:  Calcontent and Batch
## F = 5.5352, num df = 4, denom df = 20, p-value = 0.003626
Interpretation: Since the p-value (0.003626) is smaller than 0.05, we reject the null hypothesis that the mean calcium content is the same for all batches of raw material. Thus, there is significant variation in calcium content from batch to batch at the 5% level of significance.
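The entries of the ANOVA table can be reproduced by hand, which shows where the F value comes from. The sketch below computes the between-batch and within-batch sums of squares, the F ratio, and its p-value from the F distribution with 4 and 20 degrees of freedom. The data frame is an assumption, reconstructed from the printed summaries; all variable names are illustrative.

```r
# Reconstructed data (assumption): values read off the printed summaries
Datus <- data.frame(
  Batch = factor(rep(1:5, each = 5)),
  Calcontent = c(23.39, 23.40, 23.46, 23.48, 23.56,   # Batch 1
                 23.42, 23.46, 23.49, 23.50, 23.59,   # Batch 2
                 23.46, 23.49, 23.51, 23.52, 23.64,   # Batch 3
                 23.28, 23.37, 23.39, 23.40, 23.46,   # Batch 4
                 23.29, 23.32, 23.37, 23.38, 23.46)   # Batch 5
)

k <- nlevels(Datus$Batch)                          # number of batches
N <- nrow(Datus)                                   # total observations
grand_mean <- mean(Datus$Calcontent)
fitted_mean <- ave(Datus$Calcontent, Datus$Batch)  # batch mean per observation

# Sums of squares: batch means around the grand mean, data around batch means
ss_between <- sum((fitted_mean - grand_mean)^2)
ss_within  <- sum((Datus$Calcontent - fitted_mean)^2)

# Mean squares, F ratio, and upper-tail p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (N - k)
f_value <- ms_between / ms_within
p_value <- pf(f_value, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

print(c(SS_between = ss_between, SS_within = ss_within,
        F = f_value, p = p_value))
```

These quantities should agree with the Sum Sq, F value and Pr(>F) columns of the table above up to rounding.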

Post-hoc Test

The ANOVA only tells us that at least one batch differs; it does not say which ones. To see which batches differ from the others, we perform a post-hoc test that compares the group means pairwise. With 5 batches there are 10 pairwise comparisons, as follows:

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = Calcontent ~ Batch, data = Datus)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)   
## 2 - 1 == 0  0.03400    0.04186   0.812  0.92378   
## 3 - 1 == 0  0.06600    0.04186   1.577  0.52807   
## 4 - 1 == 0 -0.07800    0.04186  -1.863  0.36756   
## 5 - 1 == 0 -0.09400    0.04186  -2.246  0.20393   
## 3 - 2 == 0  0.03200    0.04186   0.765  0.93783   
## 4 - 2 == 0 -0.11200    0.04186  -2.676  0.09363 . 
## 5 - 2 == 0 -0.12800    0.04186  -3.058  0.04361 * 
## 4 - 3 == 0 -0.14400    0.04186  -3.440  0.01939 * 
## 5 - 3 == 0 -0.16000    0.04186  -3.823  0.00842 **
## 5 - 4 == 0 -0.01600    0.04186  -0.382  0.99509   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Interpretation: In the output of the Tukey test, the first column lists the pairwise comparisons and the last column gives the adjusted p-value for each. Three adjusted p-values are smaller than 0.05: only the 5-2, 4-3 and 5-3 differences are statistically significant at a family-wise error rate of 0.05. The estimated mean differences for these comparisons are -0.128, -0.144 and -0.160, respectively.
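The standard error and an adjusted p-value in the table can be cross-checked by hand. The sketch below does this for the 5-3 comparison, using the residual mean square and the studentized range distribution (ptukey() in base R), which is the calculation behind Tukey's HSD for equal group sizes. The data frame is an assumption, reconstructed from the printed summaries. The glht() p-values are computed numerically from a multivariate t distribution, so they may differ from this value in the last digits.

```r
# Reconstructed data (assumption): values read off the printed summaries
Datus <- data.frame(
  Batch = factor(rep(1:5, each = 5)),
  Calcontent = c(23.39, 23.40, 23.46, 23.48, 23.56,   # Batch 1
                 23.42, 23.46, 23.49, 23.50, 23.59,   # Batch 2
                 23.46, 23.49, 23.51, 23.52, 23.64,   # Batch 3
                 23.28, 23.37, 23.39, 23.40, 23.46,   # Batch 4
                 23.29, 23.32, 23.37, 23.38, 23.46)   # Batch 5
)

fit <- aov(Calcontent ~ Batch, data = Datus)
mse <- deviance(fit) / df.residual(fit)   # residual mean square
n_per_batch <- 5

# Standard error of a difference between two batch means
se_diff <- sqrt(mse * (2 / n_per_batch))

# t statistic for the 5-3 comparison
means <- sapply(split(Datus$Calcontent, Datus$Batch), mean)
t_5_3 <- (means[["5"]] - means[["3"]]) / se_diff

# Tukey-adjusted p-value: |t| * sqrt(2) follows the studentized range
p_5_3 <- ptukey(abs(t_5_3) * sqrt(2), nmeans = 5,
                df = df.residual(fit), lower.tail = FALSE)

print(c(se = se_diff, t = t_5_3, p_adj = p_5_3))
```

Because every batch has the same size, the standard error is identical for all ten comparisons, which is why the Std. Error column is constant.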
To visualize the Tukey comparisons, we plot the simultaneous confidence intervals for the pairwise differences. Unlike the adjusted p-values alone, the plot also shows the size and direction of each difference.

Interpretation: In the graph above, zero corresponds to equal group means. If an interval does not contain zero, the corresponding means are significantly different. In the chart, only the 4-3, 5-2 and 5-3 differences are significant. These confidence intervals agree with the hypothesis tests in the previous table.
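The reading of the plot can be made explicit by listing the intervals that exclude zero. The sketch below uses TukeyHSD() from base R rather than the glht() object that was plotted; for this balanced design the two give essentially the same intervals. The data frame is an assumption, reconstructed from the printed summaries.

```r
# Reconstructed data (assumption): values read off the printed summaries
Datus <- data.frame(
  Batch = factor(rep(1:5, each = 5)),
  Calcontent = c(23.39, 23.40, 23.46, 23.48, 23.56,   # Batch 1
                 23.42, 23.46, 23.49, 23.50, 23.59,   # Batch 2
                 23.46, 23.49, 23.51, 23.52, 23.64,   # Batch 3
                 23.28, 23.37, 23.39, 23.40, 23.46,   # Batch 4
                 23.29, 23.32, 23.37, 23.38, 23.46)   # Batch 5
)

fit <- aov(Calcontent ~ Batch, data = Datus)

# 95% family-wise confidence intervals for all pairwise differences
tukey <- as.data.frame(TukeyHSD(fit, "Batch", conf.level = 0.95)$Batch)

# An interval excludes zero when both limits lie on the same side of zero
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
significant <- rownames(tukey)[tukey$excludes_zero]

print(tukey[tukey$excludes_zero, c("diff", "lwr", "upr")])
print(significant)   # "5-2" "4-3" "5-3"
```

The 5-2 interval only just excludes zero, so that comparison is the most sensitive to the choice of significance level.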

Details of the syntax used in this output:

Below is the code used to compute basic descriptive statistics for our dataset.
by(Datus, Datus$Batch, summary)
## Datus$Batch: 1
##  Batch   Calcontent   
##  1:5   Min.   :23.39  
##  2:0   1st Qu.:23.40  
##  3:0   Median :23.46  
##  4:0   Mean   :23.46  
##  5:0   3rd Qu.:23.48  
##        Max.   :23.56  
## ------------------------------------------------------------ 
## Datus$Batch: 2
##  Batch   Calcontent   
##  1:0   Min.   :23.42  
##  2:5   1st Qu.:23.46  
##  3:0   Median :23.49  
##  4:0   Mean   :23.49  
##  5:0   3rd Qu.:23.50  
##        Max.   :23.59  
## ------------------------------------------------------------ 
## Datus$Batch: 3
##  Batch   Calcontent   
##  1:0   Min.   :23.46  
##  2:0   1st Qu.:23.49  
##  3:5   Median :23.51  
##  4:0   Mean   :23.52  
##  5:0   3rd Qu.:23.52  
##        Max.   :23.64  
## ------------------------------------------------------------ 
## Datus$Batch: 4
##  Batch   Calcontent   
##  1:0   Min.   :23.28  
##  2:0   1st Qu.:23.37  
##  3:0   Median :23.39  
##  4:5   Mean   :23.38  
##  5:0   3rd Qu.:23.40  
##        Max.   :23.46  
## ------------------------------------------------------------ 
## Datus$Batch: 5
##  Batch   Calcontent   
##  1:0   Min.   :23.29  
##  2:0   1st Qu.:23.32  
##  3:0   Median :23.37  
##  4:0   Mean   :23.36  
##  5:5   3rd Qu.:23.38  
##        Max.   :23.46
library(dplyr)
group_by(Datus, Batch) %>%
  summarise(mean = mean(Calcontent, na.rm = FALSE),
            sd = sd(Calcontent, na.rm = FALSE))
## # A tibble: 5 × 3
##   Batch  mean     sd
##   <fct> <dbl>  <dbl>
## 1 1      23.5 0.0687
## 2 2      23.5 0.0630
## 3 3      23.5 0.0688
## 4 4      23.4 0.0652
## 5 5      23.4 0.0650
Before performing the ANOVA in R, it is good practice to visualize the data in relation to the research question. This can be done with the boxplot() function in base R.
boxplot(Calcontent~Batch, data = Datus)

The ANOVA itself answers the initial research question. ANOVA in R can be done in several ways; we used both the lm() function (followed by anova()) and the oneway.test() function with var.equal = TRUE. Under the equal-variance assumption the two approaches give exactly the same F statistic and p-value, so the results and conclusions are unchanged.
Datus$Batch <- factor(Datus$Batch)
fit.lm <- lm(Calcontent ~ Batch, data = Datus)
anova(fit.lm)
## Analysis of Variance Table
## 
## Response: Calcontent
##           Df   Sum Sq  Mean Sq F value   Pr(>F)   
## Batch      4 0.096976 0.024244  5.5352 0.003626 **
## Residuals 20 0.087600 0.004380                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
oneway.test(Calcontent~Batch, data = Datus,
            var.equal = TRUE)
## 
##  One-way analysis of means
## 
## data:  Calcontent and Batch
## F = 5.5352, num df = 4, denom df = 20, p-value = 0.003626
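The agreement between the two approaches can be checked programmatically rather than by comparing printouts. The sketch below extracts the F statistic from both fits and, for contrast, also runs the Welch version of oneway.test() (var.equal = FALSE), which does not assume equal variances. The data frame is an assumption, reconstructed from the printed summaries; the Welch comparison is an illustration and not part of the original analysis.

```r
# Reconstructed data (assumption): values read off the printed summaries
Datus <- data.frame(
  Batch = factor(rep(1:5, each = 5)),
  Calcontent = c(23.39, 23.40, 23.46, 23.48, 23.56,   # Batch 1
                 23.42, 23.46, 23.49, 23.50, 23.59,   # Batch 2
                 23.46, 23.49, 23.51, 23.52, 23.64,   # Batch 3
                 23.28, 23.37, 23.39, 23.40, 23.46,   # Batch 4
                 23.29, 23.32, 23.37, 23.38, 23.46)   # Batch 5
)

# F statistic from the linear-model route
fit.lm <- lm(Calcontent ~ Batch, data = Datus)
f_lm <- anova(fit.lm)[["F value"]][1]

# F statistic from oneway.test() under equal variances
classic <- oneway.test(Calcontent ~ Batch, data = Datus, var.equal = TRUE)
f_classic <- unname(classic$statistic)

# Welch version, which drops the equal-variance assumption
welch <- oneway.test(Calcontent ~ Batch, data = Datus, var.equal = FALSE)

same_F <- isTRUE(all.equal(f_lm, f_classic))
print(c(F_lm = f_lm, F_oneway = f_classic, same = same_F))
print(welch)
```

If the Welch result led to a different conclusion, that would suggest the equal-variance assumption matters for these data; the similar batch standard deviations reported earlier make this unlikely.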
For the multiple comparisons we used Tukey's HSD as the post-hoc test, which compares every pair of groups. We first fit the model with aov() and inspected it with summary(), because the fitted object (res_aov) is reused for the post-hoc tests: first with TukeyHSD(), then with glht() from the multcomp package, using mcp() to specify the Tukey contrasts.
res_aov <- aov(Calcontent~Batch, data = Datus)
summary(res_aov)
##             Df  Sum Sq Mean Sq F value  Pr(>F)   
## Batch        4 0.09698 0.02424   5.535 0.00363 **
## Residuals   20 0.08760 0.00438                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(res_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Calcontent ~ Batch, data = Datus)
## 
## $Batch
##       diff         lwr         upr     p adj
## 2-1  0.034 -0.09125152  0.15925152 0.9237686
## 3-1  0.066 -0.05925152  0.19125152 0.5280285
## 4-1 -0.078 -0.20325152  0.04725152 0.3675516
## 5-1 -0.094 -0.21925152  0.03125152 0.2038678
## 3-2  0.032 -0.09325152  0.15725152 0.9378237
## 4-2 -0.112 -0.23725152  0.01325152 0.0937535
## 5-2 -0.128 -0.25325152 -0.00274848 0.0436833
## 4-3 -0.144 -0.26925152 -0.01874848 0.0194205
## 5-3 -0.160 -0.28525152 -0.03474848 0.0083781
## 5-4 -0.016 -0.14125152  0.10925152 0.9950930
library(multcomp)
post_test<- glht(res_aov,
    linfct = mcp(Batch="Tukey"))
summary(post_test)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = Calcontent ~ Batch, data = Datus)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)   
## 2 - 1 == 0  0.03400    0.04186   0.812  0.92377   
## 3 - 1 == 0  0.06600    0.04186   1.577  0.52802   
## 4 - 1 == 0 -0.07800    0.04186  -1.863  0.36759   
## 5 - 1 == 0 -0.09400    0.04186  -2.246  0.20385   
## 3 - 2 == 0  0.03200    0.04186   0.765  0.93783   
## 4 - 2 == 0 -0.11200    0.04186  -2.676  0.09370 . 
## 5 - 2 == 0 -0.12800    0.04186  -3.058  0.04377 * 
## 4 - 3 == 0 -0.14400    0.04186  -3.440  0.01934 * 
## 5 - 3 == 0 -0.16000    0.04186  -3.823  0.00834 **
## 5 - 4 == 0 -0.01600    0.04186  -0.382  0.99509   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
plot(post_test)