Introduction

In this regression tutorial we’ll

Important functions used

  • setwd
  • read.csv
  • grep
  • gsub
  • as.numeric
  • which
  • write.csv()

Original Data

Skibiel et al 2013. Journal of Animal Ecology. The evolution of the nutrient composition of mammalian milks. 82: 1254-1264.

Load data

Set working directory

  • “working directory” = The location of where I have the data.
  • Not the same as the “workspace”
  • Use RStudio to set your wd
    • Click on “Session”
    • “Set working directory”
    • “Choose directory”
  • Code to get the working directory will appear in the console
  • Copy and paste this into a code chunk for easy future access

For my computer, this is the full “path” to my current R working directory.

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/1_STAT_Duq_2017/HOMEWORK/Duq_Biostats_Homework_4_Milk_regression/homework4_RMD_vs2")

Load data from csv

I have cleaned the data and set it up in a csv file saved into the working directory.

dat <- read.csv(file = "Skibiel_clean_milk.csv")

Look at data

##          order  family                            spp mass.female
## 1 Artiodactyla Bovidae                  Bos frontalis      800000
## 2 Artiodactyla Bovidae                     Capra ibex       53000
## 3 Artiodactyla Bovidae Connocheates taurinus taurinus      170500
## 4 Artiodactyla Bovidae              Connocheates gnou      200000
## 5 Artiodactyla Bovidae  Damaliscus pygargus phillipsi       61000
## 6 Artiodactyla Bovidae                 Gazella dorcas       20600
##   gestation.month lacatation.months mass.litter repro.output
## 1            9.02               4.5       26949         0.03
## 2            5.60               7.5        3489         0.07
## 3            8.32               8.0       17717         0.10
## 4            8.50               7.5       11110         0.06
## 5            8.00               4.0        6500         0.11
## 6            4.74               2.8        1771         0.09
##   dev.stage.at.birth      diet arid       biome  N lactation.stage.orig
## 1                  3 herbivore   no terrestrial 4+                 <NA>
## 2                  3 herbivore   no terrestrial 24                30-60
## 3                  3 herbivore  yes terrestrial  5                  150
## 4                  3 herbivore  yes terrestrial  3                  150
## 5                  3 herbivore  yes terrestrial  4                  150
## 6                  3 herbivore  yes terrestrial 16                30-60
##   dry.matter  fat protein sugar energy                            ref
## 1       20.0  7.0     6.3   5.2   1.21       Oftedal & Iverson (1995)
## 2       23.3 12.4     5.7  <NA>   <NA>       Oftedal & Iverson (1995)
## 3       13.4  7.5     4.1   5.3   1.13 Osthoff, Hugo & de Wit (2009a)
## 4       12.0  5.5     4.3   4.1   0.91 Osthoff, Hugo & de Wit (2009a)
## 5       16.0  8.6     5.6   4.9   1.31 Osthoff, Hugo & de Wit (2009a)
## 6       24.1  8.8     8.8  <NA>   <NA>       Oftedal & Iverson (1995)
##   gest.month.NUM lacat.mo.NUM
## 1           9.02          4.5
## 2           5.60          7.5
## 3           8.32          8.0
## 4           8.50          7.5
## 5           8.00          4.0
## 6           4.74          2.8
##              order       family                     spp mass.female
## 125       Rodentia      Muridae     Pseudomys australis          65
## 126       Rodentia      Muridae       Rattus norvegicus         253
## 127       Rodentia Octodontidae           Octodon degus         235
## 128       Rodentia    Scuiridae          Tamias amoenus          53
## 129       Rodentia    Scuiridae Urocitellus columbianus         406
## 130 Soricomorpha11    Soricidae       Crocidura russula          14
##     gestation.month lacatation.months mass.litter repro.output
## 125            1.02               0.9          13         0.20
## 126            0.71               0.8          51         0.20
## 127            2.96               1.2          74         0.31
## 128            0.98               1.5          14         0.26
## 129            0.84               1.0          32         0.08
## 130            0.97               0.8           4         0.29
##     dev.stage.at.birth      diet arid       biome      N
## 125                  0 herbivore  yes terrestrial  7-Jun
## 126                  0  omnivore   no terrestrial 18-Mar
## 127                  3 herbivore  yes terrestrial      7
## 128                  0  omnivore   no terrestrial     11
## 129                  0 herbivore   no terrestrial     26
## 130                  0 carnivore   no terrestrial      3
##     lactation.stage.orig dry.matter  fat protein sugar energy
## 125               12-Jul       25.4 12.1     6.4   3.6   1.62
## 126               17-Aug       22.1  8.8    8.1*   3.8   1.43
## 127                15-21       30.5 20.1     4.4   2.7    2.2
## 128                15-20       36.7 21.7     8.1   4.3   2.62
## 129                   19       29.9  9.2    10.7   3.4    1.6
## 130               12-Aug       51.0 30.0     9.4     3    3.4
##                               ref gest.month.NUM lacat.mo.NUM
## 125      Oftedal & Iverson (1995)           1.02          0.9
## 126      Oftedal & Iverson (1995)           0.71          0.8
## 127        Veloso & Kenagy (2005)           2.96          1.2
## 128 Veloso, Place & Kenagy (2003)           0.98          1.5
## 129     Skibiel & Hood (in press)           0.84          1.0
## 130      Oftedal & Iverson (1995)           0.97          0.8
## [1] 130  22
##            order                family                        spp     
##  Artiodactyla :23   Bovidae        :13   Acomys cahirinus       :  1  
##  Carnivora    :23   Cercopithecidae: 8   Alces alces            :  1  
##  Primates     :22   Cervidae       : 7   Aloutta palliata       :  1  
##  Rodentia     :17   Muridae        : 7   Aloutta seniculus      :  1  
##  Chiroptera   :10   Otariidae      : 7   Arctocephalus australis:  1  
##  Diprotodontia:10   Phocidae       : 7   Arctocephalus gazella  :  1  
##  (Other)      :25   (Other)        :81   (Other)                :124  
##   mass.female        gestation.month  lacatation.months
##  Min.   :        8   Min.   : 0.400   Min.   : 0.300   
##  1st Qu.:      857   1st Qu.: 1.405   1st Qu.: 1.625   
##  Median :     5716   Median : 5.000   Median : 4.500   
##  Mean   :  2229475   Mean   : 5.624   Mean   : 6.092   
##  3rd Qu.:   107500   3rd Qu.: 8.365   3rd Qu.: 8.225   
##  Max.   :170000000   Max.   :21.460   Max.   :42.000   
##                                                        
##   mass.litter         repro.output     dev.stage.at.birth        diet   
##  Min.   :      0.3   Min.   :0.00003   Min.   :0.000      carnivore:32  
##  1st Qu.:     42.0   1st Qu.:0.04000   1st Qu.:1.000      herbivore:61  
##  Median :    423.5   Median :0.08000   Median :2.000      omnivore :37  
##  Mean   :  52563.8   Mean   :0.10374   Mean   :1.831                    
##  3rd Qu.:   7038.2   3rd Qu.:0.13750   3rd Qu.:3.000                    
##  Max.   :2272500.0   Max.   :0.50000   Max.   :4.000                    
##                                                                         
##   arid            biome           N      lactation.stage.orig
##  no :91   aquatic    : 22   4      :13   14-Aug :  3         
##  yes:39   terrestrial:108   3      :11   150    :  3         
##                             6      :10   21-63  :  3         
##                             7      : 9   30-60  :  3         
##                             5      : 8   i      :  3         
##                             24     : 5   (Other):113         
##                             (Other):74   NA's   :  2         
##    dry.matter         fat            protein        sugar        energy   
##  Min.   : 8.80   Min.   : 0.200   1.6*   :  4   3      : 5   1.44   :  4  
##  1st Qu.:16.27   1st Qu.: 4.575   1.5    :  3   4.5    : 5   0.81   :  3  
##  Median :22.75   Median : 8.550   10.3   :  3   5.3    : 5   1.13   :  3  
##  Mean   :27.06   Mean   :14.068   10.7   :  3   5.2    : 4   0.49   :  2  
##  3rd Qu.:32.05   3rd Qu.:17.575   6.3    :  3   6.6    : 4   0.5    :  2  
##  Max.   :71.10   Max.   :61.100   7      :  3   (Other):92   (Other):101  
##  NA's   :6       NA's   :2        (Other):111   NA's   :15   NA's   : 15  
##                                    ref     gest.month.NUM  
##  Oftedal & Iverson (1995)            :99   Min.   : 0.400  
##  Hood et al. (2001)                  : 4   1st Qu.: 1.405  
##  Osthoff, Hugo & de Wit (2009a)      : 3   Median : 5.000  
##  Derrickson, Jerrard & Oftedal (1996): 2   Mean   : 5.624  
##  Arnould & Boyd (1995)               : 1   3rd Qu.: 8.365  
##  Arnould & Hindell (1999)            : 1   Max.   :21.460  
##  (Other)                             :20                   
##   lacat.mo.NUM   
##  Min.   : 0.300  
##  1st Qu.: 1.625  
##  Median : 4.500  
##  Mean   : 6.092  
##  3rd Qu.: 8.225  
##  Max.   :42.000  
## 


Fix NAs

After working with the data a bit I realized I had 2 NA values in my “fat” column that should’t be there. I use the following code to

  • Identify the location of the NAs with is.na()
  • Identify the specific row of data by using which() to ID the specific species with an NA for fat
  • Overwrite the NA with the correct data using row indexing
    • ie “dat[i.H.niger,“fat”] <- 5.0”
  • Note that is.na() returns TRUE/FALSE values; “TRUE” acts like specific index when you use it for row indexing
    • ie, even though “i.NA” is a vector of TRUES and FALSE, dat$spp[i.NA]" returns just the two species that match the “TRUES” Normally I would make these changes in my raw data file, but its useful to see how these changes can be made within R.

Fix NAs by hand

# Look at fat column
summary(dat$fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.200   4.575   8.550  14.070  17.580  61.100       2
# Find NAS
i.NA <- is.na(dat$fat)

#Identify the species w/ NAs
dat$spp[i.NA]
## [1] Hippotragus niger         Cervus elaphus hispanicus
## 130 Levels: Acomys cahirinus Alces alces ... Zalophus californianus
# Extract the indexes of the appropriate species
i.H.niger <- which(dat$spp == "Hippotragus niger")
i.C.elaphus <- which(dat$spp == "Cervus elaphus hispanicus")

#Overwrite the bad values
dat[i.H.niger,"fat"] <- 5.0
dat[i.C.elaphus,"fat"] <- 12.6

Clean variables

Clean Protein column

There are letters, commas, and asterisks that were annotations in the original datafile. Easy to remove with find-replace, but also very easy with gsub(), with not chance of changing anything else in your spreadsheet you dont want

Look at all of the protein column

summary(dat$protein)
##  1.1*  1.3*   1.4  1.4*   1.5  1.5*   1.6  1.6*  1.7*   1.9    10 10.2, 
##     1     2     1     1     3     1     2     4     1     1     2     1 
##  10.3  10.5  10.7    11  11.1 11.2*  11.8  11.9  12.1  12.2  12.3  12.4 
##     3     2     3     2     1     1     1     1     1     1     1     1 
##  12.5  14.6  15.8   2.1   2.2   2.5   2.6   2.7     3   3.1   3.4   3.6 
##     2     1     1     2     1     1     1     2     1     1     1     2 
##  3.9*     4   4.1  4.2*   4.3  4.3*   4.4   4.5   4.6   4.8  4.8*   4.9 
##     1     2     1     1     1     1     2     2     1     1     1     1 
##   5.1  5.2*   5.4   5.5   5.6  5.6*   5.7   5.9     6   6.2   6.3  6.3* 
##     1     1     1     1     2     1     2     1     1     1     3     1 
##   6.4   6.5   6.6   6.7   6.8   6.9     7  7.1*   7.2   7.3   7.4   7.7 
##     1     2     1     1     1     1     3     1     2     2     1     2 
##   7.8  7.8S     8   8.1  8.1*   8.2   8.4  8.4*   8.5  8.6*  8.7*   8.8 
##     1     1     2     1     1     2     1     1     1     1     1     1 
##   8.9     9   9.1   9.2  9.2*   9.4   9.5   9.7  9.9* 
##     1     2     1     1     1     2     3     1     1

grep()

  • Use grep() to find the index of things
  • “[a-zA-Z]” = “find all letter, lower and uppercase”
i.letters <- grep("[a-zA-Z]",dat$protein)

dat$protein[i.letters]
## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

There’s jsut an “S”. I coudl find it like this with grep too

#these are equivalent for the simple case of a single letter
i.S <- grep("S",dat$protein)  #no brackets
i.S <- grep("[S]",dat$protein)#brackets

dat$protein[i.S]
## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

Use the gsub() command to replace letter

dat$protein <- gsub("[a-zA-Z]", #pattern
                    "",         #replace
                    dat$protein)#data

Check

grep("[a-zA-Z]",dat$protein)
## integer(0)

Use the gsub() command to replace asterisks *

  • Asterisks are a “special character” for regular expressions *These can be tricky
  • Have to put \ in front of it
dat$protein <- gsub("\\*","",dat$protein)

Check

grep("\\*",dat$protein)
## integer(0)

Use the gsub() command to replace comma

Commas are also species characters

dat$protein <- gsub("\\,","",dat$protein)

Check

grep("\\,",dat$protein)
## integer(0)

Check for space

These are frequent typos in data

grep(" ", dat$protein)
## integer(0)

None

Convert character data to numeric data

  • Because there were non-numeric characters (S, a comamn, *) in the column, are loaded “protein” as character data; basicly it treated it as words/symbols but not numbers.
  • We now need to convert it back to numbers
  • This is done with the command as.numeric()
  • Converting back and forth can somtimes cause problems, so its good to put the converted info in a new column to check against the old
dat$protein.num <- as.numeric(dat$protein)

Compare old and new columns

head(dat[,c("protein.num","protein")])
##   protein.num protein
## 1         6.3     6.3
## 2         5.7     5.7
## 3         4.1     4.1
## 4         4.3     4.3
## 5         5.6     5.6
## 6         8.8     8.8
tail(dat[,c("protein.num","protein")])
##     protein.num protein
## 125         6.4     6.4
## 126         8.1     8.1
## 127         4.4     4.4
## 128         8.1     8.1
## 129        10.7    10.7
## 130         9.4     9.4

Clean other columns

Check other columns

summary(dat[,c("sugar","energy")])
##      sugar        energy   
##  3      : 5   1.44   :  4  
##  4.5    : 5   0.81   :  3  
##  5.3    : 5   1.13   :  3  
##  5.2    : 4   0.49   :  2  
##  6.6    : 4   0.5    :  2  
##  (Other):92   (Other):101  
##  NA's   :15   NA's   : 15

Sugar

  • The sugar column has “<” and letters.
  • I can put anything I want removed inside the brackets
    • “[a-zA-Z<]”" wiil get rid of the letters and “<”
dat$sugar <- gsub("[a-zA-Z<]","",dat$sugar)

Convert and check

dat$sugar.num <- as.numeric(dat$sugar)
## Warning: NAs introduced by coercion
head(dat[,c("sugar.num","sugar")])
##   sugar.num sugar
## 1       5.2   5.2
## 2        NA  <NA>
## 3       5.3   5.3
## 4       4.1   4.1
## 5       4.9   4.9
## 6        NA  <NA>

Energy

#The followign are equivalent
summary(dat[,c("energy")])
##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
##     1     1     1     1     2     2     2     1     2     1     1     1 
##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
##     1     1     3     1     1     1     2     1     1     1     1     2 
##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
##     2     1     1     1     1     3     2     2     1     1     2     1 
##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
##     1     1     1     1     1     2     2     1     1     1     4     1 
##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
##     1     2     1     1     1     1     1     1     1     2     1     1 
##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
##     1     1     1     1     1     1     2     1     1     1     1     1 
##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
##     1     1     1     1     1     1     1     1    15
summary(dat$energy)
##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
##     1     1     1     1     2     2     2     1     2     1     1     1 
##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
##     1     1     3     1     1     1     2     1     1     1     1     2 
##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
##     2     1     1     1     1     3     2     2     1     1     2     1 
##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
##     1     1     1     1     1     2     2     1     1     1     4     1 
##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
##     1     2     1     1     1     1     1     1     1     2     1     1 
##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
##     1     1     1     1     1     1     2     1     1     1     1     1 
##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
##     1     1     1     1     1     1     1     1    15
#clean
dat$energy <- gsub("[a-zA-Z<]","",dat$energy)

#convert and check
dat$energy.num <- as.numeric(dat$energy)
## Warning: NAs introduced by coercion
head(dat[,c("energy.num","energy")])
##   energy.num energy
## 1       1.21   1.21
## 2         NA   <NA>
## 3       1.13   1.13
## 4       0.91   0.91
## 5       1.31   1.31
## 6         NA   <NA>

Save the cleaned data

write.csv(dat, "Skibiel_clean2_milk.csv")