When dealing with data, boxplots are a very useful tool for visualising patterns.
First of all, we will generate our datasets:
set.seed(1)
dataset1<-rnorm(50)
dataset2<-rnorm(50)+1
Then, we will plot them as boxplots:
data <- c(dataset1, dataset2)
labels <- c(rep("Dataset 1", 50), rep("Dataset 2", 50))
data_labels <- data.frame(data, labels)
boxplot(data ~ labels, data = data_labels, col = c("lightgrey", "darkgrey"))
When we wish to find out whether the difference between the two means is statistically significant, we carry out a t-test; the resulting p-value can then be compared to a chosen threshold or alpha value (the significance level).
require(graphics)
result<-t.test(dataset1,dataset2)
str(result$p.value)
## num 1.78e-07
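For instance, taking the conventional 0.05 significance level (an illustrative choice here, not fixed by the analysis), we can compare the p-value to it directly:
alpha <- 0.05
result$p.value < alpha  # TRUE: the difference is significant at the 5% level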
If we want to perform a one-tailed t-test, we use the parameter "alternative", setting it to either "less" or "greater": "less" gives a left-tailed test and "greater" a right-tailed test on the difference in means (first sample minus second). This is particularly useful when it makes no sense for the effect to go in one of the two directions. For example, if we are studying the effect of paracetamol on lowering body temperature, it makes no sense to consider that it would make body temperature rise, so a one-tailed t-test is more adequate. Something similar happens with sleeping pills: they are meant to increase the amount of sleep a patient gets, not lower it, so again a one-tailed t-test is in order.
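As an illustration with the datasets generated above, where dataset2 was shifted upwards, we could test specifically whether dataset1's mean is lower (the object name one_tailed is just for illustration):
# alternative = "less": the alternative hypothesis is mean(dataset1) - mean(dataset2) < 0
one_tailed <- t.test(dataset1, dataset2, alternative = "less")
str(one_tailed$p.value)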
Sometimes there can be a high level of variability between our test subjects: two people may metabolise a drug in completely different ways depending on their age, sex or weight. If we were to test our hypothesis just as before, giving a placebo to one group of people and the drug to another, the results could be unreliable. When there is a risk of high variability between subjects, a paired t-test is more appropriate. The principle is simple: the same person is measured at different time points. In our example, we measure a person's vitals while on a placebo, and later measure the same person's vitals while they take the drug whose efficacy we want to assess, so we end up with two data points for each patient. After collecting the data, we take the difference between each patient's two data points and treat the resulting differences as a single random variable.
We get that \(\frac{\overline{d}}{\sqrt{\frac{S_d^2}{n}}}\) follows a Student's t distribution with \(n-1\) degrees of freedom, where \(n\) is the number of differences, \(d\) is the dataset made up of said differences, \(\overline{d}\) is their mean and \(S_d^2\) their sample variance.
In order to carry out a paired t-test in R, we need to load the data correctly and then set the 'paired' parameter of the t.test function to TRUE. To keep the data organised, we can build a data frame with the patient ID and the two measurements associated with each patient, and then pass the two columns to t.test. The example below deals with hours of sleep before and after taking a sleeping pill.
patientid<-c("Patient 1", "Patient 2", "Patient 3", "Patient 4")
beforedrug<-c(5,4,6,5)
afterdrug<-c(7,6,8,8)
sleepdata <- data.frame(patientid, beforedrug, afterdrug)
result1 <- t.test(sleepdata$beforedrug, sleepdata$afterdrug, paired = TRUE)
str(result1$p.value)
## num 0.0029
# We can compare this to a regular (unpaired) t-test:
result2<-t.test(beforedrug,afterdrug)
str(result2$p.value)
## num 0.0122
As can be seen, at a significance level of 0.01 the paired t-test detects the difference while the regular t-test does not. The regular t-test is less reliable here because the variability in sleep patterns from one individual to another inflates the standard error; pairing removes that between-patient variability. Relying on the unpaired test could therefore lead us to conclude that the sleeping pill does not affect the average hours of sleep, when in reality it does.
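To connect this result back to the formula above, we can reproduce the paired statistic by hand from the per-patient differences (a quick check, not part of the original code):
# Differences taken as before minus after, matching t.test(beforedrug, afterdrug, paired = TRUE)
d <- sleepdata$beforedrug - sleepdata$afterdrug
tstat <- mean(d) / (sd(d) / sqrt(length(d)))
tstat                                    # -9, the statistic reported in result1
2 * pt(-abs(tstat), df = length(d) - 1)  # ~0.0029, matching result1$p.value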
Sometimes our data contain outliers, and in those cases the t-test loses accuracy, much as it does in the situations where a paired t-test should be used. Here we will use a Wilcoxon (rank-sum) test, which is robust to outliers.
We will first create some data with outliers:
set.seed(123)
#generate 50 random normal numbers from N(0,1)
Data1 <- rnorm(50)
#generate 50 random normal numbers from N(1,1)
Data2 <- rnorm(50) +1
Data2outliers<-c(Data2,100)
boxplot(Data1, Data2outliers, col = c("red", "yellow"))
Then, we will carry out the Wilcoxon test and compare its result to that of the t-test:
result_ttest<-t.test(Data1, Data2outliers)
result_wilcoxon<-wilcox.test(Data1,Data2outliers)
str(result_ttest$p.value)
## num 0.123
str(result_wilcoxon$p.value)
## num 1.03e-07
Once again, we can see that the t-test fails to detect a difference between the groups, whereas the Wilcoxon test shows that the shift between the two samples is statistically significant.
A useful way to decide between the two tests is the Shapiro-Wilk normality test (shapiro.test). This test always has the same hypotheses: the null hypothesis states that the sample we are analysing comes from a normal distribution, and the alternative hypothesis states that it does not.
If we find a p-value below our significance level, we reject the null hypothesis of normality; in that case a t-test may not be appropriate for the data. This makes the Shapiro-Wilk test useful when deciding which test to carry out.
shapiroresult1<-shapiro.test(Data1)
shapiroresult2<-shapiro.test(Data2outliers)
str(shapiroresult1$p.value)
## num 0.928
str(shapiroresult2$p.value)
## num 1.92e-15
As we can see, Data1 is consistent with a normal distribution, whereas Data2 with outliers is not, which confirms that the Wilcoxon test was the appropriate choice above.
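As a minimal sketch of how this decision could be wired together (the choose_test helper and the 0.05 cutoff are illustrative assumptions, not part of the original analysis):
# Illustrative helper (assumption): use a t-test if both samples look normal,
# otherwise fall back to the Wilcoxon rank-sum test.
choose_test <- function(x, y, alpha = 0.05) {
  normal_x <- shapiro.test(x)$p.value > alpha
  normal_y <- shapiro.test(y)$p.value > alpha
  if (normal_x && normal_y) t.test(x, y) else wilcox.test(x, y)
}
choose_test(Data1, Data2outliers)  # falls back to wilcox.test here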