1 Assessment

This session is assessed using MCQs (questions highlighted below). The actual MCQs can be found on the BS1040 Blackboard site under Assessments and Feedback/Data analysis MCQs. The deadline is listed there and on the front page of the BS1040 Blackboard site. This assessment contributes 2.5% of the module marks. You will receive feedback on this assessment after the submission deadline.

2 Data manipulation with dplyr

In the last session we used ggplot2 rather than base R to draw graphs. This week, again, we are going to use a package to do something that base R can already do. dplyr is part of the tidyverse, just like ggplot2, and again the main advantage of using it is consistency of commands. It also introduces a rather cool idea called a pipe (more on that later). In the first half of today's practical we are going to learn how to manipulate our data in R: things like subsetting. But why do that in R, when you probably already know how to do most of this in Excel? Why not do it in Excel and then just use R for the data analysis? I'll tell you why.

2.1 Raw data is precious, keep it under lock and key.

Imagine the situation: young Eamonn has spent the large part of a year collecting data on how ants measure the size of their nests. All of this is written in hard copy in his paper lab notebook. He spends a week transcribing it all into Excel. Let the analysis begin. He re-sorts the columns by how long each ant visit was. Of course he saves his data. He imports it into SPSS and discovers his data is rubbish. Panicking (a year wasted!), he quickly realises that the data in SPSS doesn't match his hard copy. What went wrong? It turned out that when he re-sorted his data he didn't include all the columns, which broke the links within his data. So, for example, the ant on row 5 of his visit-duration column wasn't the same ant as on row 5 of his visit-start-time column. EXPLETIVE DELETED, he says to himself, and goes off and spends another week retyping all the data back in. I (for it was indeed me) learned a valuable lesson: raw data is precious. Get it onto your computer and never, ever change the original version. Happy ending here.

When you are working with real data, store it as a csv file and never change that file again. But, you say, okay, I'll keep that safe copy, so why can't I mess around with a copy in Excel? Fair enough, but every time you change something you need to make a note somewhere of exactly what you did. If you do your data manipulation in an R script, the action and the record of the action are the same thing.

Previously, whenever an undergrad project student gave me data it was usually a complete mess (because, unlike you, they didn't have this course). Below is an example of me taking a dataset in and cleaning it up in R (you shouldn't run this; you don't have the data). Everything is recorded. I don't expect you to understand every line, but rather to appreciate that those few lines of code represent a lot of equivalent Excel work, and its record.

# Required libraries
library("readr")    # for read_csv()
library("magrittr") # for the %>% pipe
library("Amelia")   # for missmap()
library("tidyr")    # for gather()

# Getting the data in
geotaxis <- read_csv("~/Dropbox/Projects/undergrad_fly_neo/data/geotaxis.csv")

# Tidying the column names: everything lowercase, and renaming
# one awkwardly named (and misspelled) column
colnames(geotaxis) <- colnames(geotaxis) %>% tolower()
names(geotaxis)[names(geotaxis) == 'insecticde concentration'] <- 'conc'


# Getting the data into the right shape
geotaxis$rep <- seq(1, 36, by = 1)   # number the 36 replicates
geotaxis$run <- NULL                 # drop a column we don't need
geotaxis$rep <- factor(geotaxis$rep)
# Stack the six (misspelled) "vile" columns into long format
geotaxis <- gather(geotaxis, vial, number, `vile 1`:`vile 6`, factor_key = TRUE)


# Checking for missing values
missmap(geotaxis, main = "Missing values vs observed")
mytable <- xtabs(~strain + conc + vial, data = geotaxis)
ftable(mytable) # print the table

# Making sure each column is the right type
geotaxis$number <- as.numeric(geotaxis$number)
geotaxis$conc <- as.factor(geotaxis$conc)
geotaxis <- as.data.frame(geotaxis)

2.2 Tidy data

Most statistical programs, including R, like data to be tidy. Also called long format, this is where each observation is in a single row and each column tells us the same thing about each observation. So for a dataset about a weight-loss drug, column 1 might be the replicate number, column 2 might be the weight loss, with column 3 telling me whether the patient was given the drug or a control. The opposite of this is wide format, or "no, I won't analyse your data" format, where you have one column of drug-induced weight loss and one column of control weight loss.

Blackboard MCQ Of the two datasets below, which is in long format and which is in wide format? (Tip: control, cond1 and cond2 are the bodyweights resulting from the three different treatments.)

[1] "Alpha"
  subject sex control cond1 cond2
1       1   M     7.9  12.3  10.7
2       2   F     6.3  10.6  11.1
3       3   F     9.5  13.1  13.8
4       4   M    11.5  13.4  12.9
[1] "Beta"
   subject sex condition bodyweight
1        1   M   control        7.9
2        1   M     cond1       12.3
3        1   M     cond2       10.7
4        2   F   control        6.3
5        2   F     cond1       10.6
6        2   F     cond2       11.1
7        3   F   control        9.5
8        3   F     cond1       13.1
9        3   F     cond2       13.8
10       4   M   control       11.5
11       4   M     cond1       13.4
12       4   M     cond2       12.9
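By the way, if you ever need to convert a wide dataset into long format yourself, tidyr will do it. Below is a minimal sketch using a small made-up dataframe (drug_data is a name I've invented, as are its numbers) and the gather() function you saw in my clean-up script above. Work out the MCQ for yourself first, though.

library("tidyr")

# A hypothetical wide-format dataset: one column per treatment
drug_data <- data.frame(subject = 1:3,
                        control = c(8.1, 6.4, 9.2),
                        drug    = c(12.0, 10.8, 13.5))

# gather() stacks the treatment columns into one key column (treatment)
# and one value column (bodyweight): one row per observation
drug_long <- gather(drug_data, treatment, bodyweight, control:drug)
drug_long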

2.2.1 Subsetting data

You’ll often want to do this with your own data. There are three important dplyr verbs to learn. With dplyr (just like ggplot2), the dataset comes first.

2.2.1.1 select()

select() grabs columns. Let's have a look at the babies dataset inside the package openintro. Remember in the last session we used built-in datasets? This week we are also using built-in datasets, but they live inside a package, openintro. That means you have to install and attach the package first. If you can't remember how to do that, look here.

library("openintro") #You will have to install this package. General instructions on how to handle packages https://rpubs.com/eamonnmallon/644715
head(babies) #This just shows you the first few rows of babies, to show you, you hoave access now.
# A tibble: 6 x 8
   case   bwt gestation parity   age height weight smoke
  <int> <int>     <int>  <int> <int>  <int>  <int> <int>
1     1   120       284      0    27     62    100     0
2     2   113       282      0    33     64    135     0
3     3   128       279      0    28     64    115     1
4     4   123        NA      0    36     69    190     0
5     5   108       282      0    23     67    125     1
6     6   136       286      0    25     62     93     0

So let's say you just want to grab the column that tells you whether the mother smoked or not:

library("dplyr")#You will have to install this package
mothers_smoking<-select(babies, smoke)

You’ve just created a dataset with the data you selected. If you want to explore more, have a look at some of the cool stuff you can do with select() here.

2.2.1.2 slice()

slice() captures rows by their number. These can be a single row, a continuous run, or a discontinuous group.

library("dplyr")#You will have to install this package
#Row 5
row5<-slice(babies, 5)

#6 continous rows
continous<-slice(babies, 7:12)

#Discontinous rows
dis<- slice(babies, c(5,8,23))

You used another function in the last example: c(). It creates a vector (a one-dimensional array), basically a group of numbers.
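If c() is new to you, here is a minimal sketch of it on its own (rows_i_want is just a name I've made up):

rows_i_want <- c(5, 8, 23)         # a vector of three numbers
dis2 <- slice(babies, rows_i_want) # the same as the last example above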

I rarely use slice(), but there is a row verb I use all the time.

2.2.1.3 filter()

Let's say you wanted to look at babies born to mothers aged thirty-five or older. filter() is how you would get this data out:

thirtyfive_mum <- filter(babies, age >= 35)
summary(thirtyfive_mum)
      case             bwt          gestation         parity       
 Min.   :   4.0   Min.   : 55.0   Min.   :204.0   Min.   :0.00000  
 1st Qu.: 354.5   1st Qu.:108.0   1st Qu.:270.0   1st Qu.:0.00000  
 Median : 690.5   Median :120.0   Median :279.0   Median :0.00000  
 Mean   : 667.9   Mean   :120.5   Mean   :277.9   Mean   :0.06707  
 3rd Qu.: 995.2   3rd Qu.:134.0   3rd Qu.:287.2   3rd Qu.:0.00000  
 Max.   :1236.0   Max.   :174.0   Max.   :315.0   Max.   :1.00000  
                                  NA's   :4                        
      age            height          weight          smoke       
 Min.   :35.00   Min.   :53.00   Min.   : 98.0   Min.   :0.0000  
 1st Qu.:36.00   1st Qu.:62.00   1st Qu.:120.0   1st Qu.:0.0000  
 Median :37.00   Median :64.00   Median :134.0   Median :0.0000  
 Mean   :37.85   Mean   :64.04   Mean   :135.4   Mean   :0.3272  
 3rd Qu.:39.00   3rd Qu.:66.00   3rd Qu.:145.0   3rd Qu.:1.0000  
 Max.   :45.00   Max.   :70.00   Max.   :250.0   Max.   :1.0000  
                                 NA's   :1       NA's   :2       

Can you see what you asked for in the age filter? The >= is a relational (comparison) operator. Brush up if you're not sure about them.

Task Write the code required to get the mean body weight (bwt) of babies born to smoking and to non-smoking mothers. Blackboard MCQ Based solely on the means you have calculated, which group of mothers has heavier birth-weight babies?
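I'm not going to write the task for you, but here is the same filter()-then-mean() pattern applied to a different question (gestation length for the two parity groups, which are coded 0 and 1 in this dataset); adapt it to bwt and smoke. Note na.rm = TRUE, which tells mean() to ignore missing values; gestation has a few. The object names below are my own.

parity0 <- filter(babies, parity == 0)
parity1 <- filter(babies, parity == 1)

mean(parity0$gestation, na.rm = TRUE) # na.rm = TRUE drops the NAs first
mean(parity1$gestation, na.rm = TRUE)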

3 Data transformation

Another dplyr verb: mutate(). In the code below, we make a new dataframe containing the original babies data plus a new column, lnweight, which is the log of weight. You don't need to keep making new dataframes; it was just a bit simpler to show here.

babies1 <- mutate(babies, lnweight = log(weight))

Task Using the above as a template, can you add a column to babies1 called sheight, which is the square root of height?

Blackboard MCQ What is the total of this column? You're going to need two extra things: 1) how to sum a column (google it or guess), and 2) what to do when a column has missing values (NAs); look up na.rm.
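As a nudge, here are those two ingredients used on a column you already have. weight contains missing values (see the summary output earlier), so a plain sum() returns NA until you tell it to drop them; the same trick works on your new sheight column.

sum(babies$weight)               # returns NA, because the column has missing values
sum(babies$weight, na.rm = TRUE) # drops the NAs first, then sums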

One thing to notice: dplyr verbs tend to do only one thing, but they do it very efficiently. They are a set of chef's knives, not a Swiss Army knife.

4 Residuals

I mentioned in the lecture that it is the residuals that matter for normality, not the raw data. In this practical, I'm going to ignore that and just use the raw data, for the simple reason that I have nothing to calculate residuals from yet: you need a model (a statistical test) to calculate residuals. We won't meet our first statistical test until next week, so bear with me.

5 Is my data normal?

So how do you check if a dataset is normal? In the lecture, I said draw a histogram and a qqplot. The code for both is below. Note that I got a bit swanky and put the two graphs in the same figure. Completely unnecessary, but cool, no?

# Need the below to get both figures in a grid
# install.packages("gridExtra")
library("gridExtra")

library("ggplot2")
p1 <- ggplot(babies, aes(bwt)) + geom_histogram()

p2 <- ggplot(babies, aes(sample = bwt)) +
  stat_qq() + stat_qq_line()

grid.arrange(p1, p2, nrow = 1) # Arranges p1 and p2 into a grid

So, I hope you agree that the data looks pretty normal?

Task Produce a histogram and qqplot for mothers' age from this dataset.

Blackboard MCQ Is mothers' age normally distributed, based on your histogram and qqplot? Be strict here: it must look at least as good as the figures above.

Okay, I’m just showing off in the next section, but it does show the power of ggplot.

# Need the below to get both figures in a grid
# install.packages("gridExtra")
library("gridExtra")

library("ggplot2")
p1 <- ggplot(babies, aes(bwt)) + geom_histogram() + facet_grid(. ~ smoke)

p2 <- ggplot(babies, aes(sample = bwt)) +
  stat_qq() + stat_qq_line() + facet_grid(. ~ smoke)

grid.arrange(p1, p2, nrow = 1) # Arranges p1 and p2 into a grid

So I divided the data up into smokers and non-smokers on the fly. Can you see how it's done?

6 Pipe

The last thing we are going to cover today is pipes. This is perhaps a little advanced, but it is so awesome I can't resist. Pipes are built into a package called magrittr, which installs automatically when you install anything from the tidyverse. It is named after everyone's favourite Belgian surrealist, René Magritte. The nights just fly when you're out in the pub with a statistician.

Imagine you wanted to get all the data for the birth weights of babies from smoking mothers. That requires a select() and a filter(). Here's the normal way of doing that.

smoweight <- select(filter(babies, smoke==1), bwt)

Nothing wrong with that, but you have to read it from the inside out, which makes it a bit difficult to understand. You filter the dataset babies down to only the smokers, and then you select the column of birth weights (bwt).

A pipe, %>%, says "put the answer of the left-hand side command into the function on the right". Written with a pipe, the above command would look like this:

smoweight1 <- babies %>% filter(smoke==1) %>% select(bwt)

Confirm for yourself that these two dataframes are the same thing. I think pipes make code much easier to read. The link above gives another advantage: they are much less computationally intensive, which becomes important when your analyses get more complicated.
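Once you're comfortable, pipes will chain as many verbs as you like. A sketch (still just the babies data; lnbwt and mean_lnbwt are names I've made up):

babies %>%
  filter(parity == 1) %>%                           # keep one parity group
  mutate(lnbwt = log(bwt)) %>%                      # add a log birth weight column
  summarise(mean_lnbwt = mean(lnbwt, na.rm = TRUE)) # boil it all down to one number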

All perhaps a little too advanced, but too wonderful not to mention.