What does xkcd have to say?

What is my question?

The first thing to have when starting a research project is a question. What is it that you are trying to find out?

What is your question?

What kind of data will I collect?

In deciding how to answer your question, it matters what sort of data you collect.

There are three types:

Categorical: eg species, breed, gender, type of fruit etc. There is no natural ordering of these data.

Ordinal: eg score on a Likert scale, or a pain score, where the data have a definite order but are not really numerical. A score of 8 definitely indicates more pain than a score of 4, but not necessarily twice as much.

Numerical or constant difference: eg temperature, weight, height, time. These are data where the difference between, say, a weight of 10 kg and one of 11 kg is the same as the difference between one of 14 kg and one of 15 kg.
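As a rough illustration of why the type of data matters, here is a minimal Python sketch showing which summary is usually sensible for each type. All of the values are invented for illustration, and the variable names are just assumptions made for this example.

```python
from statistics import mean, median, mode

# Hypothetical values, invented purely for illustration.
breeds = ["labrador", "collie", "labrador", "pug"]  # categorical
pain_scores = [2, 4, 4, 8, 3]                       # ordinal (eg a 0-10 pain scale)
weights_kg = [10.0, 11.0, 14.0, 15.0]               # numerical

# Categorical: there is no ordering, so we can only count.
# The most common category (the mode) is a natural summary.
most_common_breed = mode(breeds)

# Ordinal: the order is meaningful, so the median makes sense.
# A mean is harder to interpret, because the gaps between scores
# are not necessarily equal.
typical_pain = median(pain_scores)

# Numerical: differences are meaningful, so a mean is reasonable.
average_weight = mean(weights_kg)

print(most_common_breed, typical_pain, average_weight)
```

With these made-up values, the summaries are labrador, 4 and 12.5 respectively.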

What kind of data are you collecting?

Trend or difference?

Roughly speaking, research questions fall into two categories.

Trend questions are those that look to see whether there is a correlation between the value of one variable and the value of another. Does knowing something about one variable tell you anything about the value of the other? Typically, the data used here are numerical: you could plot one variable against the other on a scatter plot.

For example, these researchers (Cugmas et al., 2020) looked to see if the rectal temperature of a dog, as measured using a contact thermometer, was correlated with its surface temperature, as measured using an infrared thermometer.

Or we might look at the growth of female puppies:

Do you think the two types of measurement are correlated in these cases?
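One common way to put a number on a trend between two numerical variables is the Pearson correlation coefficient, which lies between -1 and 1. The sketch below computes it by hand in Python. The function name pearson_r and the paired readings are assumptions made for this example; the readings are invented and are not the values reported by Cugmas et al. (2020).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired temperature readings (degrees C) for five dogs.
rectal = [38.2, 38.5, 38.9, 39.1, 39.4]
surface = [34.0, 34.6, 34.9, 35.5, 35.8]

r = pearson_r(rectal, surface)
print(round(r, 2))
```

A value of r close to 1 or -1 suggests a strong straight-line trend in the sample, and a value close to 0 suggests little or none. It does not by itself tell you whether the trend holds for the population.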

Difference Questions

Very often, we have two groups that have been exposed to different treatments, and we want to know whether the treatment makes a difference to some outcome. This is a difference question.

Difference questions usually involve one or more categorical variables (eg analgesic), each of which may have two or more ‘treatments’ (eg brand A and brand B, but there could also be C, D etc), together with a response variable that is often either ordinal (eg pain score) or numerical (eg temperature).

For example:

Are pain scores lower after a certain time in cats if analgesic A is used rather than analgesic B for a particular type of operation?

Is there a seasonal impact on sales of Simparica and Prinovox?

Do nurses and owners give different BCS scores?

Are you aware of the following dental products and do you use them?

For each of the above, decide what kind of data is being gathered, for each axis of the plots. Are the data categorical, ordinal or numerical?

Your own project

For your own project, decide whether you are asking a trend or a difference question.

Hypotheses

Can you frame your project question as a pair of hypotheses?

For every question that your project addresses, we can normally formulate a null hypothesis and an alternate hypothesis.

The null hypothesis is typically the ‘nothing going on’ or ‘no effect’ scenario, while the alternate is the ‘something is going on, there is an effect’ scenario.

For example, if your question were:

Does the choice of analgesic affect pain scores in felines two days after an operation?

The null hypothesis for this would be:

H0: Choice of analgesic has no effect on pain score

The alternate hypothesis would be:

H1: Choice of analgesic does affect pain score

For your own research question, write a null and an alternate hypothesis.

Answering Questions

Having framed our questions as a pair of hypotheses, we gather data in order to answer the question as best we can.

We have to decide how to gather data, and how much we will need in order to get an answer in which we can have sufficient confidence.

Remember that your question is typically about a population (eg all female cats) but we will only have data from a sample of that population (eg the female cats that came to your practice during a three month period). We have to infer from whatever difference, lack of difference, trend or lack of trend we see for that sample, whether there is a difference or trend for the population as a whole.
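To see why a sample gives only an imperfect picture of the population, here is a small simulation sketch in Python. Everything in it is made up: the ‘population’ is 10,000 imaginary cat weights generated by the computer, and the sample size of 30 is an arbitrary choice for this example.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: weights (kg) of 10,000 imaginary cats.
population = [rng.gauss(4.0, 0.8) for _ in range(10_000)]
population_mean = mean(population)

# Five separate samples of 30 cats each, as five different practices
# might collect over a three month period.
sample_means = [mean(rng.sample(population, 30)) for _ in range(5)]

print("population mean:", round(population_mean, 2))
print("sample means:   ", [round(m, 2) for m in sample_means])
```

Each sample mean lands near the population mean but not exactly on it, and no two samples agree exactly. This sample-to-sample variation is the reason we need a statistical test before drawing conclusions about the population.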

We do that by deciding whether or not the data allow us to confidently reject the null hypothesis.

The null hypothesis is the most important

In the end, when we have our data, we will ask how likely it is that we could have got those data if the null hypothesis were true. So we start with the presumption that it is true.

It is very like the way it works in a court. The presumption is that the defendant is innocent - that is the null hypothesis. The judge and jury ask themselves how likely it is that there could be the evidence they have been shown if that were so (ie if the null hypothesis were true).

On this basis, the judge decides whether to acquit (fail to reject the null hypothesis) or convict (reject the null hypothesis). If a defendant’s fingerprints were found at the crime scene, stolen goods were found at his house and video cameras captured him leaving and entering the building, then the judge would likely decide that it would be very unlikely that all this evidence could have been gathered if the defendant were innocent, and so would probably decide to reject the null hypothesis (innocence) and convict.

If only very weak evidence were presented to the court, then the judge would likely decide that there was insufficient evidence to convict and would acquit the defendant. She would have failed to reject the null hypothesis. Note that this does not mean that the judge is saying that the defendant is innocent! The defendant might well be, but, equally, they might not be. The court can never know for sure. It can only act on the evidence presented to it.

In your project, on the basis of your data, you will either reject the null hypothesis or fail to reject it. If you fail to reject it, you are not saying it is necessarily true, just that, from your data, you do not have sufficient evidence to reject it.

In a trial, it must be galling for detectives if a noted mobster, whom they know to be behind a number of crimes, is acquitted just because insufficient evidence was brought to the court. In the same way, it would be a shame in your project if there really were a trend or difference in what you were investigating, but you were unable to reject your null hypothesis (which says that there is no effect) just because you had not gathered enough data, or the right kind of data.

In the end, to decide whether you have evidence to reject the null hypothesis, you carry out the appropriate statistical test. Which test you select largely depends on the nature of the data.

The test will tell you how likely it is that you could have got the data you got, or data at least as extreme, if the null hypothesis were true. This probability is called a p value.
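One way to see where such a probability can come from is a permutation test, sketched below in Python. This is just one kind of test, chosen here because its logic is easy to follow, and it is not necessarily the test you will use in your project. The function name permutation_p_value and the recovery times are assumptions invented for this example. The idea is that if the null hypothesis were true, the group labels would not matter, so we shuffle them many times and see how often chance alone gives a difference at least as big as the one we observed.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels are interchangeable,
        # so reassign them at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Fraction of shuffles giving a difference as big as the real one.
    return at_least_as_extreme / n_shuffles

# Hypothetical recovery times (minutes) under two analgesics.
analgesic_a = [32, 35, 30, 38, 33, 36]
analgesic_b = [45, 41, 48, 43, 47, 44]

p = permutation_p_value(analgesic_a, analgesic_b)
print(p)
```

Because the two made-up groups do not overlap at all, very few shuffles reproduce a difference that large, so the estimated p value is very small.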

Sometimes we can estimate a p-value from a plot of the data

But sometimes, even before we carry out a test, its outcome is evident from a plot:

For example, suppose you saw these results, which compare the distribution of post-operative pain scores following ovariohysterectomy in a sample of 127 bitches for two analgesics. 63 of the bitches were treated with one analgesic, 64 with the other:

What do you think any fancy statistical test is going to tell us?

How likely do you think it is that we would have got two groups of scores like this if <insert suitable null hypothesis> were true?

Do you think that the test would tell us to reject the null hypothesis or to fail to reject it?

The p-value

The p value, then, is just a number that we get from a statistical test after we give it our data.

It tells us the probability that we would have got the data we got, or data at least as extreme, if the null hypothesis were true.

So it is a probability - it will always be a number between 0 and 1.

A small p value (eg 0.0016) tells us that we are unlikely to have got our data if the null hypothesis were true. We might then decide to reject the null hypothesis. In doing this we are deciding that the difference or trend we see for our sample is a real reflection of what is true for the population and is not down to chance.

A large p value (eg 0.4) tells us that there is a fair chance that we might have got the data we got if the null hypothesis were true. On this basis we would decide not to reject the null hypothesis. Just like in the court, that does not mean we are saying that the null is true, just that our data do not give us sufficient evidence to reject it.

Sometimes we have a good idea of roughly what the p value will be even before we do the test, just by looking at plots of our data. That is why we should always plot our data before we do any fancy test on it. A clear trend or clear difference for our sample normally suggests that there is an effect, and so that our test, if we carry it out, will give a low p-value and we will reject our null hypothesis. If we don’t see a difference or we don’t see a trend, then probably our test will end up giving us a high p-value and we will not reject our null.

Sometimes it is not so clear, however, so we really do need to do the test to find out what the p-value is and so whether to reject the null.

Look at the figures above and decide for which of them the p-value would be small, for which it would be large, and for which it is hard to tell.

How small does the p value need to be before we reject the null hypothesis?

According to XKCD…

It’s like asking, how much evidence do we need before it is reasonable to convict a defendant? Really, there is no firm answer to this, but the threshold should be small. The most common threshold that researchers take is 0.05. So if you did that too, you would reject your null hypothesis if the test you carried out gave you a p-value of less than 0.05, and you would fail to reject it if it gave a p value of 0.05 or more.
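Written as a rule, the decision might look like the short Python sketch below. The function name decide and the parameter name alpha are assumptions made for this example, and treating a p value of exactly 0.05 as ‘fail to reject’ is a convention assumed here, not a law.

```python
def decide(p_value, alpha=0.05):
    """Apply a significance threshold (alpha) to a p value."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p value must lie between 0 and 1")
    # Only a p value strictly below the threshold leads to rejection.
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.0016))  # the small p value used as an example earlier
print(decide(0.4))     # the large p value used as an example earlier
```

Note that ‘fail to reject H0’ is the only alternative to rejection. The rule never returns ‘accept H0’, for the same reason that a court acquits the defendant without declaring them innocent.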

References

Cugmas, B. et al. (2020) ‘Comparison between rectal and body surface temperature in dogs by the calibrated infrared thermometer’, Veterinary and Animal Science, 9, p. 100120. doi: https://doi.org/10.1016/j.vas.2020.100120.