This is a brief walkthrough of using log and square root transformations. We’ll first plot some data that are non-normal, transform them, then look at the effects of each transformation on the plots and on the p-values from t-tests.

Functions used





Make the data

First, let’s put some data into R objects

diet.richness<- c(13, 22, 29, 22, 18, 21, 14, 14, 16, 12,25, 5,  9,  15, 10, 13, 11, 13, 21, 16,
6,  6,  15, 14, 13, 17, 15, 11, 11, 13,
11, 13, 12, 16, 11, 17, 18, 15, 13, 17,
15, 18, 15, 13, 12, 14, 12, 16, 15, 10,
21, 17, 15, 18)

# 1 POWD sample, then 10 LAUREL samples, then 43 more POWD samples (54 total)
stream <- c("POWD", rep("LAUREL", 10), rep("POWD", 43))





Now, let’s package up these data into a data frame using the **data.frame()** command.

df <- data.frame(stream = stream,
                 diet.richness = diet.richness,
                 stringsAsFactors = TRUE) # store stream as a factor (not the default in R >= 4.0)





Look at the data using summary()

summary(df)
##     stream   diet.richness 
##  LAUREL:10   Min.   : 5.0  
##  POWD  :44   1st Qu.:12.0  
##              Median :14.5  
##              Mean   :14.7  
##              3rd Qu.:17.0  
##              Max.   :29.0





Look at the data using the str() “structure” command

Another way to get a look at your data is using the str() command

str(df)
## 'data.frame':    54 obs. of  2 variables:
##  $ stream       : Factor w/ 2 levels "LAUREL","POWD": 2 1 1 1 1 1 1 1 1 1 ...
##  $ diet.richness: num  13 22 29 22 18 21 14 14 16 12 ...


This provides some info about the data frame: the number of observations, and the name and type of each column. I don’t use this very often with data frames, but the str() command comes in handy at other times too.





Change Margins

This isn’t code you have to know, but I think it makes things easier for plotting. Here, I use the par() command to tweak the margins (mar = ) to be a little smaller. The values are given in the order bottom, left, top, right. The defaults are c(5, 4, 4, 2) + 0.1, which results in some excessive white space around the entire graph (in my opinion).

par(mar = c(4,4,1.5,0.5))





Make a histogram
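
The plotting code is not shown for this step. A minimal sketch, assuming the df data frame built above, might look like this (the axis label and title are just illustrative choices):

```r
# Histogram of the raw (untransformed) diet richness values.
# Assumes `df` was created with data.frame() earlier in this walkthrough.
hist(df$diet.richness,
     xlab = "Diet Richness",
     main = "raw data")
```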

What do you notice about this histogram?



Make boxplots by stream
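
The code for this plot is also not shown. One way to draw it, sketched here under the assumption that df has the stream and diet.richness columns created above, is the formula interface of boxplot():

```r
# One box per stream; the formula reads "diet.richness split by stream".
# Assumes `df` from earlier in this walkthrough; axis labels are illustrative.
boxplot(diet.richness ~ stream,
        data = df,
        xlab = "Stream",
        ylab = "Diet Richness")
```

This is the same formula notation used for the t-tests later on, so the plot and the test describe the same comparison.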



Transform the data

Log transformation

Use the **log()** command and the assignment operator **<-** to make a new column. Note that log() in R is the natural log (base e) by default.

df$log.richness <- log(df$diet.richness)



Square root transformation

Use the **sqrt()** command to make a new column

df$sqrt.richness <- sqrt(df$diet.richness)



Make histogram of transformed data

Log transformation
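
The code for this plot is not shown here. A minimal sketch, assuming the log.richness column created above:

```r
# Histogram of the log-transformed richness values.
# Assumes df$log.richness was created with log() above.
hist(df$log.richness,
     xlab = "log(Diet Richness)",
     main = "log transform")
```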

What do you notice about this histogram?



Square root transformation
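
Again the plotting code is not shown. A minimal sketch, assuming the sqrt.richness column created above:

```r
# Histogram of the square-root-transformed richness values.
# Assumes df$sqrt.richness was created with sqrt() above.
hist(df$sqrt.richness,
     xlab = "sqrt(Diet Richness)",
     main = "sqrt transform")
```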

What do you notice about this histogram?



Compare log and sqrt transformation with 2 histograms

This requires the command **par(mfrow = c(1,2))** so that we get 2 plots side by side.

par(mfrow = c(1,2))
hist(df$log.richness, 
     xlab = "log(Diet Richness)",
     main = "log transform")

hist(df$sqrt.richness, 
     xlab = "sqrt(Diet Richness)",
     main = "sqrt transform")




Compare log, sqrt transformation AND original data with 3 histograms

This requires the command **par(mfrow = c(1,3))**. Note the c(1,3), which sets up the plotting area with 1 row and 3 columns of plots.

par(mfrow = c(1,3))
hist(df$diet.richness, 
     xlab = "Diet Richness",
     main = "raw data")
hist(df$log.richness, 
     xlab = "log(Diet Richness)",
     main = "log transform")

hist(df$sqrt.richness, 
     xlab = "sqrt(Diet Richness)",
     main = "sqrt transform")







Make boxplots by stream of logged data
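
The code for this plot is not shown. A possible version, assuming the log.richness column created earlier, is sketched below. The par(mfrow = c(1,1)) line is my addition: it returns to one plot per figure, because the 1 x 3 layout set for the histograms would otherwise still be in effect.

```r
# Return to a single plot per figure (undoes the earlier par(mfrow = c(1,3)))
par(mfrow = c(1, 1))

# Boxplots of log-transformed richness, one box per stream.
# Assumes df$log.richness was created with log() above.
boxplot(log.richness ~ stream,
        data = df,
        xlab = "Stream",
        ylab = "log(Diet Richness)")
```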





Make boxplots by stream of square root data
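
The code for this plot is not shown either. A possible version, assuming the sqrt.richness column created earlier (the par() line is my addition and simply makes sure we draw one plot per figure):

```r
# Make sure we draw one plot per figure
par(mfrow = c(1, 1))

# Boxplots of square-root-transformed richness, one box per stream.
# Assumes df$sqrt.richness was created with sqrt() above.
boxplot(sqrt.richness ~ stream,
        data = df,
        xlab = "Stream",
        ylab = "sqrt(Diet Richness)")
```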





Compare t-tests on raw and transformed data

First t.test: raw data

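NOTE: for the sake of this exercise, set var.equal = TRUE. Normally we would NOT do this; the default, var.equal = FALSE, gives Welch’s t-test, which does not assume equal variances.

t.test.1 <- t.test(diet.richness ~ stream,
                   data = df,
                   var.equal = TRUE) # NOTE: normally we use the default of var.equal = FALSE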
t.test.1
## 
##  Two Sample t-test
## 
## data:  diet.richness by stream
## t = 4.1207, df = 52, p-value = 0.000136
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.893987 8.387831
## sample estimates:
## mean in group LAUREL   mean in group POWD 
##             19.30000             13.65909



2nd t.test: logged data

What code carries this out?
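
One possible answer, inferred from the "data:" line of the output (log(diet.richness) by stream) and from the t.test.2 name used later on: apply log() inside the formula and keep everything else the same as the first test.

```r
# Same test as t.test.1, but on log-transformed richness.
# Assumes `df` from earlier; var.equal = TRUE only to match this exercise.
t.test.2 <- t.test(log(diet.richness) ~ stream,
                   data = df,
                   var.equal = TRUE)
t.test.2
```

Using the log.richness column (log.richness ~ stream) gives the same numbers; only the "data:" label in the printout differs.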

## 
##  Two Sample t-test
## 
## data:  log(diet.richness) by stream
## t = 3.3317, df = 52, p-value = 0.001594
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1386412 0.5585456
## sample estimates:
## mean in group LAUREL   mean in group POWD 
##             2.923876             2.575283



3rd t.test: sqrt data

What code carries this out?
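
One possible answer, inferred from the "data:" line of the output (sqrt(diet.richness) by stream) and from the t.test.3 name used later on:

```r
# Same test as t.test.1, but on square-root-transformed richness.
# Assumes `df` from earlier; var.equal = TRUE only to match this exercise.
t.test.3 <- t.test(sqrt(diet.richness) ~ stream,
                   data = df,
                   var.equal = TRUE)
t.test.3
```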

## 
##  Two Sample t-test
## 
## data:  sqrt(diet.richness) by stream
## t = 3.7579, df = 52, p-value = 0.0004348
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3223795 1.0611590
## sample estimates:
## mean in group LAUREL   mean in group POWD 
##             4.353863             3.662094



Examine results

Let’s extract the p values.

A useful advanced command in R is str(). It shows you the underlying structure and organization of R objects.

We can use it to see everything that is packaged up in our t-test results.

str(t.test.1)
## List of 9
##  $ statistic  : Named num 4.12
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 52
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.000136
##  $ conf.int   : atomic [1:2] 2.89 8.39
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 19.3 13.7
##   ..- attr(*, "names")= chr [1:2] "mean in group LAUREL" "mean in group POWD"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr " Two Sample t-test"
##  $ data.name  : chr "diet.richness by stream"
##  - attr(*, "class")= chr "htest"


This is all the raw info that R formats into the printed output of your t-test. We can access individual parts of the output with the $ operator, like this

t.test.1$p.value
## [1] 0.0001359972


If we wanted the t-statistic we could do this

t.test.1$statistic
##        t 
## 4.120726



Let’s organize our 3 p values into a table to compare them.

ps <- data.frame(p= c(t.test.1$p.value,
                      t.test.2$p.value,
                      t.test.3$p.value),
                 t= c(t.test.1$statistic,
                      t.test.2$statistic,
                      t.test.3$statistic),
                 test = c("raw","log","sqrt"))



Now look at the output

ps
##              p        t test
## 1 0.0001359972 4.120726  raw
## 2 0.0015941985 3.331729  log
## 3 0.0004348352 3.757919 sqrt


We don’t really need this many decimal places, so we can use the **round()** command to make things easier to read.

ps$p <- round(ps$p,5)
ps$t <- round(ps$t,2)