# Read the data into a dataframe called df2 using the read.csv function.
df2 <- read.csv("WIBS.csv", header=TRUE)
# Correct the column names for creating graphs
colnames(df2) <- c("Group", "Mean wt 150 μl (g)", "Mean wt 1 ml (g)")
# Plot the histogram for the Mean wt 150 μl (g) column.
hist(df2$"Mean wt 150 μl (g)", xlab="Mean wt 150 μl (g)", main="Histogram of Mean wt 150 μl (g)")

The Biochemistry Class Data
I have the data for a class which did a very similar experiment. The only difference was the specific volumes that they were asked to measure, because the protocol for the experiment has changed slightly over the years. The data is available as a csv file called WIBS.csv (WIBS stands for Working in Bioscience, the module that preceded the current Biochemistry module).
The data gives the average weight for each of the groups carrying out the experiment in one laboratory session. Note that each group's value is the average of 5 measurements and that no standard deviations were recorded.
We can read this data into R and carry out an analysis.
The reason for plotting the histogram is to see the shape of the data. In this case, for the 150 \(\mu l\) data, there is an extreme outlier at 0.4118 g. This is nearly 3 times the expected value. If the students were using the P200 this is an impossible value, as that pipette cannot measure volumes above 200 \(\mu l\).
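A quick sanity check in R (a minimal sketch, relying on the fact that 1 \(\mu l\) of water weighs approximately 1 mg) is to flag any mean weights that exceed the physical maximum of the P200:
# Any mean weight above 0.200 g cannot have come from a correctly used
# P200, since 200 μl of water weighs about 0.200 g.
which(df2[,2] > 0.200)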
Now I am going to do something scientifically bad, but something done by almost all scientists at some time in their career: I am going to create a story to explain this bad value. The technical name for this is posterior reasoning. It means creating the explanation after the experiment, once you have seen the data, which we should never do. I am going to suggest that this outlier occurred because the students forgot to TARE the scale, and so they kept measuring the weight of the water plus the boat used to contain the sample for weighing.
This value needs to be removed as it affects the rest of the dataset. You can either remove both measurements from this group or just the single measurement. Their second measurement looks alright and so I will just set the first measurement to a missing value or NA.
# Replace the outlier with NA
df2[17,2] <- NA
# Display a histogram for the new data
hist(df2$"Mean wt 150 μl (g)", xlab="Mean wt 150 μl (g)", main="Histogram of Mean wt 150μl (g)")Again the value 0.215 is above the maximum for the P200 by 15 units or 7.5%. This is unlikely. Looking at this groups measurements for both pipettes they are at the high end and I suspect that they used the wrong stop to fill and empty the pipette and this is an issue of user error.
You could keep their results in or remove them. This is where statistics becomes SUBJECTIVE. There is no right or wrong answer. The only wrong thing you can do is to remove them and hide the fact that you removed them and edited the data. As long as you are transparent and give reasons for your changes, they will most often be acceptable. If you want to be particularly cautious you can compare the results with and without the changes and assess the sensitivity of your results to them.
Here I first removed the value in a new dataframe and checked the new histogram.
I have then summarised the data and generated a boxplot with and without the second value removed.
# Copy the dataframe to a new dataframe
df3 <- df2
# Remove the outlier and replace it with NA in the new dataframe
df3[11,2] <- NA
# Create a histogram for the new data
hist(df3$"Mean wt 150 μl (g)", xlab="Mean wt 150 μl (g)", main="Histogram of Mean wt 150μl (g)")# Create a five figure summary for the data with the extreme outlier removed
summary(df2[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.1293  0.1474  0.1490  0.1512  0.1505  0.2150       1
sd(df2[,2], na.rm=TRUE)
[1] 0.01251104
# Create a five figure summary for the data with both outliers removed
summary(df3[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.1293  0.1474  0.1490  0.1494  0.1505  0.1773       2
sd(df3[,2], na.rm=TRUE)
[1] 0.00644409
# Create a boxplot for the data with the extreme outlier removed
boxplot(df2[,2], xlab="Mean wt 150 μl (g)")
# Create a boxplot for the data with both outliers removed.
boxplot(df3[,2], xlab="Mean wt 150 μl (g)")

From my personal perspective I think the removal of the two points is reasonable.
It produces a symmetrical error distribution and range, which is what I would expect. There is a slight change to the value of the mean, but it is very small, and the mean is very close to the expected mean of 0.150 g. The error is only in the final digit, or 1 \(\mu l\), which is a high degree of precision.
Removal of that additional outlier has a huge effect on the standard deviation between the groups, which is halved. Given what you can see in the boxplot this seems a more reasonable summary for the data which has a very narrow inter-quartile range.
This once again shows that data analysis requires input and decisions about how to proceed that are not completely objective. There is no absolute rule that says this is the correct analysis. I think that I have done the best analysis that I could, but others might take a different view and analyse the data slightly differently.
Reflection
This has been quite a technical analysis of just the first column of data. We have covered dealing with outliers and using graphical methods of exploratory data analysis.
There is something else important that we have done here. The distribution we have is the distribution of the measured mean weights from the different groups. Each group collected a sample, so this is a sampling distribution of the mean.
The variability of the sampling distribution depends on the sample sizes that contribute to it. In this case the sample size is 5. The standard deviation of the sampling distribution therefore depends on two things: the data and the sample size.
This standard deviation of sample means is called the standard error of the mean.
It is VERY IMPORTANT to understand that there is a difference between a standard deviation and a standard error.
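To make the distinction concrete, here is a minimal sketch using the cleaned df3 from above, and assuming each group averaged exactly n = 5 measurements: the spread of the group means estimates the standard error of the mean, and multiplying by the square root of n recovers the implied standard deviation of a single measurement.
# The SD of the group means estimates the standard error of the mean,
# i.e. sigma / sqrt(n) for groups of n = 5 measurements.
n <- 5
se_mean <- sd(df3[,2], na.rm=TRUE)  # standard error of the mean
sigma_hat <- se_mean * sqrt(n)      # implied SD of a single measurement
se_mean
sigma_hat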
I have to confess that I have been using this data for many years and until this year I had not realised that there was an issue with the second weight measurement, for 1 ml. I should have seen it clearly from the graphs, but I did not realise its significance.
The second measurement was on the P1000 measuring 1 ml. This is at the limit of the range of volumes that you can measure with the P1000; if you go beyond this value you will break the micro-pipette. This creates an interesting issue for the errors: they cannot be symmetrical. There should be only a very limited number of results above 1 g, and those should show very small variations.
However, this is also a sampling distribution and not the underlying distribution of the measurements. According to the central limit theorem, as the sample size tends towards infinity the sampling distribution becomes normally distributed.
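A small simulation (illustrative values only, not the class data) shows the central limit theorem at work: means of larger samples drawn from a skewed distribution look increasingly normal.
# Means of samples from a skewed distribution become more normal
# as the sample size grows.
set.seed(1)
skewed <- function(n) 1 - rexp(n, rate=20)  # skewed towards low values
means5  <- replicate(5000, mean(skewed(5)))
means50 <- replicate(5000, mean(skewed(50)))
par(mfrow=c(1,2))
hist(means5,  main="Sample size 5",  xlab="Sample mean")
hist(means50, main="Sample size 50", xlab="Sample mean")
par(mfrow=c(1,1))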
So let’s look at the data again in R.
# I have read in the data from the csv file again and removed the outliers from the first column.
df4 <- read.csv("WIBS.csv", header=TRUE)
colnames(df4)=c("Group","Mean wt 150 μl (g)", "Mean wt 1 ml (g)")
df4[17,2] <- NA
df4[11,2] <- NA
# Plot the histogram for the Mean wt 1 ml (g) column.
hist(df4$"Mean wt 1 ml (g)", xlab="Mean wt 1 ml (g)", main="Histogram of Mean wt 1 ml (g)")

This is what I would have expected to see, given the asymmetry of errors and the small sample size. It is asymmetrical, with a longer tail towards lower weights than expected and only small variations above 1 g.
This is not normally distributed data and the mean and standard deviation should not really be calculated using the usual formula.
The median is a more robust measure of the location of the distribution than the mean where there are deviations from normality. The inter-quartile range is also a better estimate of variability than the standard deviation.
If the data were normally distributed you could estimate the standard deviation from the inter-quartile range:
\[ s=\dfrac{IQR}{1.34898} \]
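The constant 1.34898 is simply the inter-quartile range of the standard normal distribution, which you can confirm in R:
# IQR of the standard normal: the distance between the 25th and 75th percentiles.
qnorm(0.75) - qnorm(0.25)
[1] 1.34898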
# A five figure summary of the data
summary(df4[,3])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.8243  0.9727  0.9889  0.9803  0.9999  1.0360
# The standard deviation of the data
sd(df4[,3])
[1] 0.03879345
# The interquartile range for the data
IQR(df4[,3])
[1] 0.0272
# Estimate of sd from IQR assuming normality
s <- IQR(df4[,3])/1.34898
s
[1] 0.02016338
These results suggest that the true mean is above 0.980 g and that the standard deviation lies somewhere between 0.020 g and 0.039 g.
As this data does not look normal, you can try fitting alternative distributions to it.
You can use R to fit a different distribution to the data using the MASS library. You have to pick the family of distributions to fit; I have chosen the Weibull distribution.
I have plotted the histogram of the data along with a kernel density estimate (blue line) and the fitted model (red line).
library(MASS)
fit <- fitdistr(df4[,3], "weibull")
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
print(fit$estimate)
     shape      scale
40.0216230  0.9950008
p <- fit$estimate
k <- as.numeric(p[1])
l <- as.numeric(p[2])
hist(df4$"Mean wt 1 ml (g)", xlab="Mean wt 1 ml (g)", main="Histogram of Mean wt 1 ml (g)", freq=FALSE, ylim=c(0,19))
curve(dweibull(x, shape=k, scale=l), from = min(df4[,3]), to = max(df4[,3]), add=TRUE, col = "red")
lines(density(df4[,3]), col="blue")

This distribution is a reasonable fit to the data. You can then use it to calculate the mean and standard deviation, where:
\[ \mu = \lambda \Gamma \bigg(1+\dfrac{1}{k} \bigg) \]
\[ s^{2}= \lambda^{2} \bigg[ \Gamma \bigg(1+\dfrac{2}{k}\bigg)- \bigg(\Gamma \bigg(1+\dfrac{1}{k} \bigg) \bigg)^2 \bigg] \]
You need to add these formulae to R as they are not pre-programmed.
mean_weibull <- l * gamma(1 + 1/k)
variance_weibull <- l^2 * (gamma(1 + 2/k) - (gamma(1 + 1/k))^2)
mean_weibull
[1] 0.981251
sqrt(variance_weibull)
[1] 0.03089584
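As a sanity check on these formulae (a sketch, simulating from the fitted parameters k and l above), the mean and standard deviation of a large random sample from the fitted Weibull distribution should agree closely with the calculated values:
# Simulate from the fitted Weibull and compare with the formulae above.
set.seed(42)
sim <- rweibull(1e6, shape=k, scale=l)
mean(sim)  # should be close to mean_weibull
sd(sim)    # should be close to sqrt(variance_weibull)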
Reflection
From my perspective, the curve-fitting and the calculation of the new mean and standard deviation was an interesting diversion. It added some understanding of how to deal with skewed, non-normally distributed data, but in the end it showed that, had we simply treated the data as normally distributed, we would have obtained the same value for the mean and a similar standard deviation.
Was it worth all of the effort of curve fitting and assuming non-normality? In a word, NO.
This is important, as this was definitely skewed data and most people would probably have rejected normality. However, using the normality assumption gives perfectly reasonable results. If I had reported it as not normal and given the median and IQR, then I would have underestimated the variability and overestimated the accuracy.
The final result suggests that the P1000 underestimates the volume by between 10 \(\mu l\) (from the median) and 20 \(\mu l\) (from the mean) when it is set to its upper limit. This is a systematic error in a single direction, not due to the users.
This is 10-20 times larger than the symmetrical error for the P200, but remember this is at the limit of the volumes that the P1000 can measure. It may be more accurate at other values.
If this error were still the same for measurements of 200 \(\mu l\), then it could amount to up to a 10% error (20 \(\mu l\) out of 200 \(\mu l\))!
A Bayesian Approach
Alternatively, instead of fitting a Weibull curve to the data, I could build a model assuming a skew-normal distribution and apply Bayesian methods to fit it using Stan via the brms (Bayesian Regression Models using Stan) library.
Then I can check the model summary for the mean and the standard deviation.
library("brms")
set.seed(1234)
# Fit a brms model with a skew normal likelihood.
# The column is wrapped in a data frame, as brm() expects one.
data <- data.frame(data = df4[,3])
model_skew <- brm(data ~ 1, family = skew_normal(), data = data)
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.5 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.226 seconds (Warm-up)
Chain 1: 0.095 seconds (Sampling)
Chain 1: 0.321 seconds (Total)
Chain 1:
(near-identical sampling output for Chains 2, 3 and 4 omitted)
summary(model_skew)
 Family: skew_normal
  Links: mu = identity
Formula: data ~ 1
   Data: data (Number of observations: 38)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.98      0.01     0.96     0.99 1.00     2099     2033

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.04      0.00     0.03     0.05 1.00     2085     2011
alpha    -4.62      1.70    -8.55    -1.97 1.00     1953     1444

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
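Before reading off the estimates, a posterior predictive check is a quick way to see whether the fitted skew-normal model can reproduce the shape of the data. This is a sketch using the pp_check function from brms; ndraws sets how many posterior draws to overlay.
# Overlay densities simulated from the posterior on the observed data.
pp_check(model_skew, ndraws=50)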
In this case the mean is 0.98 with a 95% credible interval of 0.96 to 0.99, and the standard deviation is 0.04 with a 95% credible interval of 0.03 to 0.05.
This makes it even more apparent that if we had assumed it was normal in the first place we could have saved a lot of computational effort and ended up with the same answer!