Carrying out a statistical analysis

1 Assessment
2 R scripts (again)
- 2.1 A template for Rscripts
- 2.2 The first line of every R script (?)
3 Can we use embryonic stem cells to treat heart attack (in sheep)
- 3.1 Plot the data
- 3.2 Does the data fit the assumptions of your analysis
4 In mammals, does body weight correlate with how long pregnancy lasts
5 Which drug is better?

1 Assessment

This session is assessed using MCQs (questions highlighted below). The actual MCQs can be found on the BS1070/MB1080 Blackboard site under Assessments and Feedback/Data analysis MCQs. The deadline is listed there and on the front page of the BS1070/MB1080 blackboard site. This assessment contributes 5% of module marks. You will receive feedback on this assessment after the submission deadline.

Something new from this assessment on. I would like you to upload to blackboard the script with which you carried out this session. As well as every command you need, I would also like you to add comments briefly explaining what this line does (see section 5 for an example). This is not marked, but will be the plagiarism test for this assessment. It is very likely the commands will be very similar, but if the comments are identical, I will become suspicious. You may need to upload this as a docx file to blackboard. If so just cut and paste your script into MS Word. Don’t worry about formatting etc. The upload can be found on the BS1070/MB1080 Blackboard site under Assessments and Feedback/Data analysis scripts.

2 R scripts (again)

In Rstudio go to File and then New File and then R script. An empty sheet appears. I tend to write all my commands in there. Then I highlight them and press Run to test them. Save this script often. Use this handy tip, it will save you a load of headaches. If something crashes the first thing the demonstrators will ask is “where is your script”.

2.1 A template for Rscripts

Its probably worth getting into the habit of creating a standard template for your scripts. This webpage gives some nice ideas. The first two ideas, especially deserve some reading time at home.

2.2 The first line of every R script (?)

It’s usually a good idea to clear R’s memory at the beginning of a session. Think about if you were working on a script and it created a variable druguse. Later on, you start working on a different analysis looking at a different drug, but you also called the variable druguse (because you don’t have that much imagination). Could you see, it could be very easy to think you are working on one thing, but you are actually working on another? Therefore, clear the decks at the start of every script

# clear the decks
rm(list = ls())

3 Can we use embryonic stem cells to treat heart attack (in sheep)

Does treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack? Each sheep in the study was randomly assigned to the ESC or control group, and the change in their hearts’ pumping capacity was measured. A positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery. This question can be analysed using a two-mean test (t-test or wilcoxon test).

3.1 Plot the data

The first thing to do is use the dfSummary command (package summarytools) to just have a quick look at the stem.cell data. The data is available in the openintro package, so you’ll need to install and load this. We first did this in session 1. We also used dfSummary back in session 1.

Blackboard MCQ: Per your dfSummary, what are the two treatments (trmt) called in this study?

I think the easiest way to see if there is a change in pumping capacity is to create a new variable which calculates the difference between after and before. Use mutate (session 3) to create a new dataframe called mystem with the additional variable of the difference between after and before.

Next you want to have a visual check to see if treatment had an effect on the difference in pumping capacity. Plot this difference (y-axis) using a boxplot with trmt as the x-axis. You encountered boxplots back in session 2.

3.2 Does the data fit the assumptions of your analysis

A t-test has a number of assumptions. In the lecture, we touched upon the need for normality. Without even checking, I would suggest the small sample size will mean the data probably is not normal. So for today use the wilcoxon test (code given in the lecture) to see if there was a significant effect of ESC on the ability to recover from a heart attack.

Blackboard MCQ: From your boxplot and wilcoxon test, does treatment with embryonic stem cells improve heart function following a heart attack.

4 In mammals, does body weight correlate with how long pregnancy lasts

Naively, we might assume that bigger mammals take longer in the womb to develop. In the openintro package, we have a data set available to answer this question. Here, we are going to write a script to carry out a full analysis to answer this question.

4.1 Plot the data

The first thing to do is use the dfSummary command (package summarytools) to just have a quick look at the mammals data. The data is available in the openintro package, so you’ll need to install and load this. We first did this in session 1. We also used dfSummary back in session 1.

Blackboard MCQ: What is the maximum amount of sleep (TotalSleep) for a given species?

Next you want to have a visual check to see if bodyweight and gestation period are correlated. Draw a scatterplot to help you understand. We first created a scatterplot in session 2.

Looks pretty odd because there are a couple of species that are so much larger than the rest. This is where a log scale can come in handy. Use mutate (session 3) to create a new dataframe called mymammals with the additional variable of the log to the base 10 (log10) of bodyweight. Make a new scatterplot with this

Blackboard MCQ: From this graph, what happens to gestation period as bodyweight increases?

Extra challenge: Can you figure out how to add data labels to this figure so that you can identify each animal?

4.2 Does the data fit the assumptions of your analysis

Remember to do a pearson’s correlation, data must be;

both variables should be normally distributed
linearity (straight line relationship between each of the two variables)

No need to check normality here, as its clear from the graph that the relationship is not linear. Remember for Spearman’s, the relationship only needs to be monotonic.

4.3 Carry out the correlation

So the Spearman rank correlation looks like the right test here. Instructions for it can be found here. Important You should use the raw data here as it makes the analysis simpler and easier to explain

Correlation coefficient is comprised between -1 and 1:

-1 indicates a strong negative correlation : this means that every time x increases, y decreases
0 means that there is no association between the two variables (x and y)
1 indicates a strong positive correlation : this means that y increases with x

Blackboard MCQ What rho value did you find for your Spearman rank correlation between bodyweight and gestation?

5 Which drug is better?

Rosiglitazone and pioglitazone are drugs used to treat type 2 diabetes. There is some idea that they may increase rates of acute myocardial infarction (AMI). In this study, they looked at the long term occurrence of AMI in groups of patient given either drug. You can find the data inside the package openintro, dataset avandia.

This is exactly the sort of data we can use a chi-squared test for. When you collect small datasets you’ll often put the table in directly as I did in the lecture. This dataset has 227571 observations. So the first thing we need to do is make a table out of this raw data.

library("openintro")  # Making sure the package is 'on'
medicine <- xtabs(~cardiovascular_problems + treatment, data = avandia)  # Makes the table medicine
medicine  # prints it to the console to let you have a look

                       treatment
cardiovascular_problems Pioglitazone Rosiglitazone
                    no        154592         65000
                    yes         5386          2593

Just a quick look at this data suggests that there is not much difference, do you agree? Unfortunately, the numbers are so big that its hard to really get a handle on them. This is where figures often come in handy.

barplot(medicine, legend = T, beside = T, main = "AMIs by drug treament")

No, that doesn’t really help either. Lets try calculating percentages.

library("fastR")  #This package contains the col.perc function. Have you installed it?
col.perc(medicine)

                       treatment
cardiovascular_problems Pioglitazone Rosiglitazone
                    no    0.96633287    0.96163804
                    yes   0.03366713    0.03836196

So about 3.4% of Piolitazone patients develop AMI, whereas about 3.8% of the Rosiglitazone do. The difference is clearly very small, but here we are interested in if it is significant.

chisq.test(medicine)


    Pearson's Chi-squared test with Yates' continuity correction

data:  medicine
X-squared = 30.818, df = 1, p-value = 2.834e-08

Well there you go. Remember how we would report this: "Rosigliatazone patients have a slightly higher rate of AMIs compared to Piolitazone patients (3.8% vs 3.4%, Chi-squared = 30.818, df = 1, p-value = \(2.834 \times 10^{-8}\))

This is a nice place to finish with considering the limits of statistical analysis. Above, we found a statistically significant result. However, biologically it seems quite small. What if I told you Rosigliatazone cost £1 a day for treatment whereas Piolitazone causes £150 per day. Is the effect still “important”?