This is a brief walkthrough of using log and square root transformations. We’ll first plot some data that are non-normal, transform them, then look at the effects of transformation on the plots and on the p-values from t-tests.
First, let’s put some data into R objects:
diet.richness <- c(13, 22, 29, 22, 18, 21, 14, 14, 16, 12, 25, 5, 9, 15, 10, 13, 11, 13, 21, 16,
6, 6, 15, 14, 13, 17, 15, 11, 11, 13,
11, 13, 12, 16, 11, 17, 18, 15, 13, 17,
15, 18, 15, 13, 12, 14, 12, 16, 15, 10,
21, 17, 15, 18)
stream <- c("POWD" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"LAUREL" ,"POWD","POWD","POWD" ,"POWD" ,"POWD" ,"POWD","POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD" ,"POWD")
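Typing 54 labels by hand is easy to get wrong. As an optional shortcut, the same vector can be sketched with rep(), assuming the order used above (one POWD, ten LAUREL, then 43 more POWD). The name stream.alt is only for illustration and is not used elsewhere in this walkthrough.

```r
# Build the stream labels with rep() instead of typing each one.
# Assumes the order used above: 1 POWD, 10 LAUREL, then 43 POWD.
stream.alt <- c("POWD", rep("LAUREL", 10), rep("POWD", 43))

# Check that there is one label per diet.richness value (54 in total)
length(stream.alt)

# Count how many observations come from each stream
table(stream.alt)
```

A quick length() and table() check like this is a cheap way to catch a missing or extra label before building the dataframe.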
Now, let’s package up these data into a dataframe using the ** data.frame() ** command.
df <- data.frame(stream = stream,
diet.richness = diet.richness)
A quick way to get a look at your data is the summary() command.
summary(df)
##     stream   diet.richness
##  LAUREL:10   Min.   : 5.0
##  POWD  :44   1st Qu.:12.0
##              Median :14.5
##              Mean   :14.7
##              3rd Qu.:17.0
##              Max.   :29.0
Another way to get a look at your data is using the str() command
str(df)
## 'data.frame': 54 obs. of 2 variables:
## $ stream : Factor w/ 2 levels "LAUREL","POWD": 2 1 1 1 1 1 1 1 1 1 ...
## $ diet.richness: num 13 22 29 22 18 21 14 14 16 12 ...
This shows the type of each column and its first few values. (In R 4.0 and later, data.frame() keeps text as character by default, so stream will show as chr rather than Factor unless you add stringsAsFactors = TRUE.) I don’t use str() very often with dataframes, but it comes in handy other times too.
This isn’t code you have to know, but I think it makes things easier for plotting. Here, I use the par() command to make the margins (mar = ) a little smaller. The default is c(5, 4, 4, 2) + 0.1, which results in some excessive white space around the entire graph (in my opinion).
par(mar = c(4,4,1.5,0.5))
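The code for the raw-data histogram is not shown here. A minimal sketch, assuming the same axis label and title used for the raw data later in this walkthrough, might look like this (the data are repeated so the chunk runs on its own):

```r
# Same values as diet.richness above, repeated so this chunk is self-contained
diet.richness <- c(13, 22, 29, 22, 18, 21, 14, 14, 16, 12, 25, 5, 9, 15, 10, 13, 11, 13, 21, 16,
                   6, 6, 15, 14, 13, 17, 15, 11, 11, 13,
                   11, 13, 12, 16, 11, 17, 18, 15, 13, 17,
                   15, 18, 15, 13, 12, 14, 12, 16, 15, 10,
                   21, 17, 15, 18)

# Shrink the margins, then draw the histogram of the untransformed data
par(mar = c(4, 4, 1.5, 0.5))
hist(diet.richness,
     xlab = "Diet Richness",
     main = "raw data")
```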
What do you notice about this histogram?
Use the ** log() ** command and the assignment operator ** <- ** to make a new column
df$log.richness <- log(df$diet.richness)
Use the ** sqrt() ** command to make a new column
df$sqrt.richness <- sqrt(df$diet.richness)
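To see why these transformations pull in a long right tail, it can help to try them on a few made-up numbers first. The vector x below is hypothetical and is not part of the diet data. Note that log() in R is the natural log unless you ask for a different base.

```r
# Made-up values spanning two orders of magnitude (not the diet data)
x <- c(1, 10, 100)

# Natural log: each 10-fold jump becomes the same additive step (about 2.3)
log(x)

# Square root: large values are pulled in, but less strongly than with log
sqrt(x)

# Both transformations can be undone to get back to the original scale
exp(log(x))
sqrt(x)^2
```

The log is the stronger of the two: on the log scale the gap between 10 and 100 is the same size as the gap between 1 and 10, while on the square root scale it is still about three times as large.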
What do you notice about the histogram of the log-transformed data?
What do you notice about the histogram of the square root-transformed data?
To plot the two transformations side by side, use the command ** par(mfrow = c(1,2)) **, which sets up the plotting area with 1 row and 2 columns of plots.
par(mfrow = c(1,2))
hist(df$log.richness,
xlab = "log(Diet Richness)",
main = "log transform")
hist(df$sqrt.richness,
xlab = "sqrt(Diet Richness)",
main = "sqrt transform")
To plot the raw data and both transformations together, use the command ** par(mfrow = c(1,3)) **. Note the c(1,3), which sets up the plotting area with 1 row and 3 columns of plots.
par(mfrow = c(1,3))
hist(df$diet.richness,
xlab = "Diet Richness",
main = "raw data")
hist(df$log.richness,
xlab = "log(Diet Richness)",
main = "log transform")
hist(df$sqrt.richness,
xlab = "sqrt(Diet Richness)",
main = "sqrt transform")
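One thing to keep in mind is that par() settings stay in effect for every later plot. A common pattern, sketched here with simulated data (the vector sim and the name old.par are made up for illustration), is to save the old settings and restore them when you are done.

```r
set.seed(1)                      # make the simulated data reproducible
sim <- rexp(100)                 # right-skewed made-up data, not the diet data

old.par <- par(mfrow = c(1, 2))  # par() returns the settings it replaced
hist(sim, xlab = "sim", main = "raw data")
hist(sqrt(sim), xlab = "sqrt(sim)", main = "sqrt transform")
par(old.par)                     # back to one plot per figure
```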
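Now run a t-test on the raw data. NOTE: for the sake of this exercise, set var.equal = TRUE. Normally we would NOT do this; the default, var.equal = FALSE, gives Welch’s t-test, which does not assume the two groups have equal variances.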
t.test.1 <- t.test(diet.richness ~ stream,
data = df,
var.equal = TRUE) # NOTE: normally we use the default, var.equal = FALSE
t.test.1
##
## Two Sample t-test
##
## data: diet.richness by stream
## t = 4.1207, df = 52, p-value = 0.000136
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.893987 8.387831
## sample estimates:
## mean in group LAUREL mean in group POWD
## 19.30000 13.65909
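To make the var.equal note concrete, here is a small sketch comparing the pooled-variance test with the default Welch test. The dataframe toy is made up for illustration (unequal group sizes and spreads) and is not the diet data.

```r
# Made-up data with unequal group sizes and spreads (not the diet data)
toy <- data.frame(
  y     = c(5, 7, 9, 11, 10, 20, 30, 40, 50),
  group = c(rep("A", 4), rep("B", 5))
)

# Pooled-variance t-test, as used in this exercise
pooled <- t.test(y ~ group, data = toy, var.equal = TRUE)

# Welch's t-test, the default in R
welch <- t.test(y ~ group, data = toy)

# The pooled test has n1 + n2 - 2 degrees of freedom;
# Welch's test uses fewer when the group variances differ
pooled$parameter
welch$parameter
```

When the spreads of the two groups are very different, as in this toy example, the two tests can give noticeably different degrees of freedom and p-values, which is why the Welch version is the usual default.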
Next, run the same t-test on the log-transformed data. What code carries this out?
##
## Two Sample t-test
##
## data: log(diet.richness) by stream
## t = 3.3317, df = 52, p-value = 0.001594
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1386412 0.5585456
## sample estimates:
## mean in group LAUREL mean in group POWD
## 2.923876 2.575283
Finally, run the t-test on the square root-transformed data. What code carries this out?
##
## Two Sample t-test
##
## data: sqrt(diet.richness) by stream
## t = 3.7579, df = 52, p-value = 0.0004348
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3223795 1.0611590
## sample estimates:
## mean in group LAUREL mean in group POWD
## 4.353863 3.662094
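The output above labels the data as "log(diet.richness) by stream" and "sqrt(diet.richness) by stream", so one way to reproduce these two tests is to transform inside the formula. The sketch below rebuilds the data so it runs on its own. The names t.test.2 and t.test.3 match the objects used later in this walkthrough, but the exact calls are an assumption, since the original code is not shown.

```r
# Rebuild the data so this chunk runs on its own
diet.richness <- c(13, 22, 29, 22, 18, 21, 14, 14, 16, 12, 25, 5, 9, 15, 10, 13, 11, 13, 21, 16,
                   6, 6, 15, 14, 13, 17, 15, 11, 11, 13,
                   11, 13, 12, 16, 11, 17, 18, 15, 13, 17,
                   15, 18, 15, 13, 12, 14, 12, 16, 15, 10,
                   21, 17, 15, 18)
stream <- c("POWD", rep("LAUREL", 10), rep("POWD", 43))
df <- data.frame(stream = stream, diet.richness = diet.richness)

# t-test on log-transformed richness (var.equal = TRUE only for this exercise)
t.test.2 <- t.test(log(diet.richness) ~ stream,
                   data = df,
                   var.equal = TRUE)

# t-test on square root-transformed richness
t.test.3 <- t.test(sqrt(diet.richness) ~ stream,
                   data = df,
                   var.equal = TRUE)

t.test.2
t.test.3
```

Running the tests on the columns created earlier (df$log.richness and df$sqrt.richness) should give the same numbers; only the "data:" label in the printed output would differ.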
Let’s extract the p-values.
A useful advanced command in R is str(). It shows you the underlying structure and organization of R objects.
We can use it to see everything that is packaged up in our t-test results.
str(t.test.1)
## List of 9
## $ statistic : Named num 4.12
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 52
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.000136
## $ conf.int : atomic [1:2] 2.89 8.39
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 19.3 13.7
## ..- attr(*, "names")= chr [1:2] "mean in group LAUREL" "mean in group POWD"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr " Two Sample t-test"
## $ data.name : chr "diet.richness by stream"
## - attr(*, "class")= chr "htest"
This is all the raw info that R formats into the printed summary of your t-test. We can access individual parts of the output with the $ operator, like this
t.test.1$p.value
## [1] 0.0001359972
If we wanted the t-statistic we could do this
t.test.1$statistic
## t
## 4.120726
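The same $ syntax works for any element listed by str(). As a small sketch, here it is with R's built-in sleep dataset standing in for the diet data (the object name tt is just for illustration).

```r
# Built-in example data: extra hours of sleep under two drugs (10 values each)
tt <- t.test(extra ~ group, data = sleep, var.equal = TRUE)

names(tt)      # every element that can be pulled out with $
tt$estimate    # the two group means
tt$conf.int    # 95% confidence interval for the difference in means
tt$parameter   # degrees of freedom
```

Calling names() first is a quick alternative to str() when you only need to know what the pieces are called.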
Let’s organize our 3 p-values and t-statistics into a table to compare them.
ps <- data.frame(p= c(t.test.1$p.value,
t.test.2$p.value,
t.test.3$p.value),
t= c(t.test.1$statistic,
t.test.2$statistic,
t.test.3$statistic),
test = c("raw","log","sqrt"))
Now look at the output
ps
## p t test
## 1 0.0001359972 4.120726 raw
## 2 0.0015941985 3.331729 log
## 3 0.0004348352 3.757919 sqrt
We don’t really need this many decimal places, so we can use the ** round() ** command to make things easier to read.
ps$p <- round(ps$p,5)
ps$t <- round(ps$t,2)
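To see what the rounding does, here is a sketch that rebuilds the table from the values printed above (typed in by hand so the chunk runs on its own) and applies the same round() calls. The signif() line is an optional alternative and is not part of the original walkthrough.

```r
# Values copied from the printed table above
ps <- data.frame(p = c(0.0001359972, 0.0015941985, 0.0004348352),
                 t = c(4.120726, 3.331729, 3.757919),
                 test = c("raw", "log", "sqrt"))

ps$p <- round(ps$p, 5)   # keep 5 decimal places
ps$t <- round(ps$t, 2)   # keep 2 decimal places
ps

# Alternative for very small p-values: keep 2 significant digits instead
signif(c(0.0001359972, 0.0015941985, 0.0004348352), 2)
```

Whichever way the numbers are rounded, the ordering stays the same: the raw data give the smallest p-value and the log-transformed data give the largest.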