Data exploration with ggplot
- Set working directory
- Load cleaned data
Add the Australia information again
Subset data for further analyses
Data exploration in ggplot2
Hard code transformation
Colors
Shape
Colors and shape
Facets
Save data

Data exploration with ggplot

Set working directory

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/1_STAT_Duq_2017/HOMEWORK/Duq_Biostats_Homework_4_Milk_regression/homework4_RMD_vs2")

Load cleaned data

Load data cleaned in previous tutorial

dat <- read.csv(file = "Skibiel_clean2_milk.csv")

Add the Australia information again

In the PCA tutorial we determined that mammals from Australia are outliers in some respects. We’ll make a new column to indicate which species these are.

i.weirdos <- which(dat$order %in% c("Monotremata"  #monotremes
,"Diprotodontia"  #marsupials
,"Didelphimorphia"#opossums
,"Dasyuromorphia"#tasmanian devil)
,"Peramelemorphia" #bandicoots
)) 

dat$Australia <- "other"

dat$Australia[i.weirdos] <- "Australia"

Subset data for further analyses

We won’t need all of the columns for our regression analysis, so I’ll subset it using “column indexing”
I’ll make an “index” object (i.columns) to hold the names of the columns I want to use
I’ll use that index in a set of square brackets to subset the data
- “dat[ , i.columns]”

Define columns of interest

i.columns <- c("fat",
               "protein.num",
               "sugar.num",
               "energy.num",
               "mass.female",
               "gestation.month",
               "lacatation.months", 
               "mass.litter",
               "repro.output",
               "dev.stage.at.birth",
               "diet","arid",
               "biome",
               "Australia",
               "order",
               "family",
               "spp")

Subset data using column names

dat2 <- dat[ , i.columns]

Look at subset

dim(dat2)

## [1] 130  17

summary(dat2)

##       fat         protein.num       sugar.num       energy.num   
##  Min.   : 0.20   Min.   : 1.100   Min.   : 0.02   Min.   :0.360  
##  1st Qu.: 4.65   1st Qu.: 4.125   1st Qu.: 3.00   1st Qu.:0.965  
##  Median : 8.55   Median : 6.750   Median : 4.70   Median :1.365  
##  Mean   :13.99   Mean   : 6.673   Mean   : 4.94   Mean   :1.680  
##  3rd Qu.:16.82   3rd Qu.: 9.200   3rd Qu.: 6.60   3rd Qu.:2.045  
##  Max.   :61.10   Max.   :15.800   Max.   :14.00   Max.   :5.890  
##                                   NA's   :16      NA's   :16     
##   mass.female        gestation.month  lacatation.months
##  Min.   :        8   Min.   : 0.400   Min.   : 0.300   
##  1st Qu.:      857   1st Qu.: 1.405   1st Qu.: 1.625   
##  Median :     5716   Median : 5.000   Median : 4.500   
##  Mean   :  2229475   Mean   : 5.624   Mean   : 6.092   
##  3rd Qu.:   107500   3rd Qu.: 8.365   3rd Qu.: 8.225   
##  Max.   :170000000   Max.   :21.460   Max.   :42.000   
##                                                        
##   mass.litter         repro.output     dev.stage.at.birth        diet   
##  Min.   :      0.3   Min.   :0.00003   Min.   :0.000      carnivore:32  
##  1st Qu.:     42.0   1st Qu.:0.04000   1st Qu.:1.000      herbivore:61  
##  Median :    423.5   Median :0.08000   Median :2.000      omnivore :37  
##  Mean   :  52563.8   Mean   :0.10374   Mean   :1.831                    
##  3rd Qu.:   7038.2   3rd Qu.:0.13750   3rd Qu.:3.000                    
##  Max.   :2272500.0   Max.   :0.50000   Max.   :4.000                    
##                                                                         
##   arid            biome      Australia                   order   
##  no :91   aquatic    : 22   Length:130         Artiodactyla :23  
##  yes:39   terrestrial:108   Class :character   Carnivora    :23  
##                             Mode  :character   Primates     :22  
##                                                Rodentia     :17  
##                                                Chiroptera   :10  
##                                                Diprotodontia:10  
##                                                (Other)      :25  
##              family                        spp     
##  Bovidae        :13   Acomys cahirinus       :  1  
##  Cercopithecidae: 8   Alces alces            :  1  
##  Cervidae       : 7   Aloutta palliata       :  1  
##  Muridae        : 7   Aloutta seniculus      :  1  
##  Otariidae      : 7   Arctocephalus australis:  1  
##  Phocidae       : 7   Arctocephalus gazella  :  1  
##  (Other)        :81   (Other)                :124

Data exploration in ggplot2

Our focal response (y) variable will “fat”, the percentage of fat in each species of mammal’s milk
We’ll use female body size to take a look at the data

Plot raw data in base graphics

#plot(fat ~ mass.female, data = dat2)

Plot raw data in ggplot

We’ll use the ggplot function qplot()
Note that ggplot does NOT use R’s formula notation

library(ggplot2)
qplot(y = fat,
      x = mass.female, 
      data = dat2)

Log transform x variable

The whales really throw off the x-axis!

qplot(y = fat,
      x = log10(mass.female), 
      data = dat2)

Log transform x and y variable

Authors state that “All continuous variables were log10-transformed prior to analysis.”
Log10 transformation is common in biology in general, and especialy when things like body size are being considered.

qplot(y = log10(fat),
      x = log10(mass.female), 
      data = dat2)

Log transform x, logit transform y variable

The data are percentages and therefore the logit is a more appropriate transformation
The car package has a logit() command
- car::logit() can accept proportions ([0,1]) or percentages ([0,100])
- the fat data are percentages so we use the arguement “percents = TRUE”"

library(car)

qplot(y = logit(fat, percents = T),
      x = log10(mass.female), 
      data = dat2)

Hard code transformation

To make code cleaner make new columns with the transformed data

#x variable fat
## The authors' transformation of the
dat2$fat.log10 <- log10(dat2$fat)

## The logit transformation
dat2$fat.logit <-logit(dat2$fat, percents = T)
  
#y variable female mass  
dat2$mass.log10 <- log10(dat2$mass.female)

Colors

library(car)

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      color = Australia)

Shape

library(car)

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      shape = diet)

Colors and shape

library(car)

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      color = Australia,
      shape = diet)

Facets

Plot different groups on sep. panels.

This allows you to explore structure of data. This reveals something interesting - aquatic carnivores seem to have a positive fat~mass relationship, while terrestrial carnivores seem to have a negative one. This is interesting, since the authors found no overal significant relationship between body size and milk composition, but they did not explore any interactions. This would be difficult to model with the given sample size though b/c it imlies a massdietaquatic interaction. Since all of the aquatic animals in the dataset are carnivores, the carnivore diet category could be split into terrestrial and aquatic groups.

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      color = Australia,
      shape = diet,
      facets = biome ~ diet)

Note the outliers the hand off the bottom of two of the three lower panels. I’ll use colors to see what they are

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      color = order,
      shape = diet,
      facets = biome ~ diet)

In the herbivore-terrestrial panel it appears like Perrissodactyla all form a line. These are odd-toed ungulates like rhinos and horses. The really low triangle I believe is a rhino. IN the Omnivore-terrestrial panel it looks like clump of primates are outliers. I’m going to guess these are lemurs.

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      color = family,
      shape = diet,
      facets = biome ~ diet)

Make an index for lemurs

i.lemur <- which(dat2$family == "Lemuridae")

dat2$lemurs <- "not.a.lemur"
dat2$lemurs[i.lemur] <- "lemur"

Yup, its the lemurs. This indicates that for their size they have really lean milk (low fat)

qplot(y = fat.log10,
      x = mass.log10, 
      data = dat2,
      color = lemurs,
      shape = diet,
      facets = biome ~ diet)

Save data

Save our data subset

write.csv(dat2,"milk_subset.csv")

Duq Biostats Week 6 | Regression Tutorial 3: Data exploration with ggplot

brouwern@gmail.com

February 12, 2017