Introduction
- Important functions used
- Original Data
Load data
Clean variables
Save the cleaned data

Introduction

In this regression tutorial we’ll

load data from Sibiel et al 2013 in .csv format
fix some typos
clean up some columns of numeric data that contain charcter data
save the data as a .csv for future use

Important functions used

setwd
read.csv
grep
gsub
as.numeric
which
write.csv()

Original Data

Skibiel et al 2013. Journal of Animal Ecology. The evolution of the nutrient composition of mammalian milks. 82: 1254-1264.

Load data

Set working directory

“working directory” = The location of where I have the data.
Not the same as the “workspace”
Use RStudio to set your wd
- Click on “Session”
- “Set working directory”
- “Choose directory”
Code to get the working directory will appear in the console
Copy and paste this into a code chunk for easy future access

For my computer, this is the full “path” to my current R working directory.

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/1_STAT_Duq_2017/HOMEWORK/Duq_Biostats_Homework_4_Milk_regression/homework4_RMD_vs2")

Load data from csv

I have cleaned the data and set it up in a csv file saved into the working directory.

dat <- read.csv(file = "Skibiel_clean_milk.csv")

Look at data

##          order  family                            spp mass.female
## 1 Artiodactyla Bovidae                  Bos frontalis      800000
## 2 Artiodactyla Bovidae                     Capra ibex       53000
## 3 Artiodactyla Bovidae Connocheates taurinus taurinus      170500
## 4 Artiodactyla Bovidae              Connocheates gnou      200000
## 5 Artiodactyla Bovidae  Damaliscus pygargus phillipsi       61000
## 6 Artiodactyla Bovidae                 Gazella dorcas       20600
##   gestation.month lacatation.months mass.litter repro.output
## 1            9.02               4.5       26949         0.03
## 2            5.60               7.5        3489         0.07
## 3            8.32               8.0       17717         0.10
## 4            8.50               7.5       11110         0.06
## 5            8.00               4.0        6500         0.11
## 6            4.74               2.8        1771         0.09
##   dev.stage.at.birth      diet arid       biome  N lactation.stage.orig
## 1                  3 herbivore   no terrestrial 4+                 <NA>
## 2                  3 herbivore   no terrestrial 24                30-60
## 3                  3 herbivore  yes terrestrial  5                  150
## 4                  3 herbivore  yes terrestrial  3                  150
## 5                  3 herbivore  yes terrestrial  4                  150
## 6                  3 herbivore  yes terrestrial 16                30-60
##   dry.matter  fat protein sugar energy                            ref
## 1       20.0  7.0     6.3   5.2   1.21       Oftedal & Iverson (1995)
## 2       23.3 12.4     5.7  <NA>   <NA>       Oftedal & Iverson (1995)
## 3       13.4  7.5     4.1   5.3   1.13 Osthoff, Hugo & de Wit (2009a)
## 4       12.0  5.5     4.3   4.1   0.91 Osthoff, Hugo & de Wit (2009a)
## 5       16.0  8.6     5.6   4.9   1.31 Osthoff, Hugo & de Wit (2009a)
## 6       24.1  8.8     8.8  <NA>   <NA>       Oftedal & Iverson (1995)
##   gest.month.NUM lacat.mo.NUM
## 1           9.02          4.5
## 2           5.60          7.5
## 3           8.32          8.0
## 4           8.50          7.5
## 5           8.00          4.0
## 6           4.74          2.8

##              order       family                     spp mass.female
## 125       Rodentia      Muridae     Pseudomys australis          65
## 126       Rodentia      Muridae       Rattus norvegicus         253
## 127       Rodentia Octodontidae           Octodon degus         235
## 128       Rodentia    Scuiridae          Tamias amoenus          53
## 129       Rodentia    Scuiridae Urocitellus columbianus         406
## 130 Soricomorpha11    Soricidae       Crocidura russula          14
##     gestation.month lacatation.months mass.litter repro.output
## 125            1.02               0.9          13         0.20
## 126            0.71               0.8          51         0.20
## 127            2.96               1.2          74         0.31
## 128            0.98               1.5          14         0.26
## 129            0.84               1.0          32         0.08
## 130            0.97               0.8           4         0.29
##     dev.stage.at.birth      diet arid       biome      N
## 125                  0 herbivore  yes terrestrial  7-Jun
## 126                  0  omnivore   no terrestrial 18-Mar
## 127                  3 herbivore  yes terrestrial      7
## 128                  0  omnivore   no terrestrial     11
## 129                  0 herbivore   no terrestrial     26
## 130                  0 carnivore   no terrestrial      3
##     lactation.stage.orig dry.matter  fat protein sugar energy
## 125               12-Jul       25.4 12.1     6.4   3.6   1.62
## 126               17-Aug       22.1  8.8    8.1*   3.8   1.43
## 127                15-21       30.5 20.1     4.4   2.7    2.2
## 128                15-20       36.7 21.7     8.1   4.3   2.62
## 129                   19       29.9  9.2    10.7   3.4    1.6
## 130               12-Aug       51.0 30.0     9.4     3    3.4
##                               ref gest.month.NUM lacat.mo.NUM
## 125      Oftedal & Iverson (1995)           1.02          0.9
## 126      Oftedal & Iverson (1995)           0.71          0.8
## 127        Veloso & Kenagy (2005)           2.96          1.2
## 128 Veloso, Place & Kenagy (2003)           0.98          1.5
## 129     Skibiel & Hood (in press)           0.84          1.0
## 130      Oftedal & Iverson (1995)           0.97          0.8

## [1] 130  22

##            order                family                        spp     
##  Artiodactyla :23   Bovidae        :13   Acomys cahirinus       :  1  
##  Carnivora    :23   Cercopithecidae: 8   Alces alces            :  1  
##  Primates     :22   Cervidae       : 7   Aloutta palliata       :  1  
##  Rodentia     :17   Muridae        : 7   Aloutta seniculus      :  1  
##  Chiroptera   :10   Otariidae      : 7   Arctocephalus australis:  1  
##  Diprotodontia:10   Phocidae       : 7   Arctocephalus gazella  :  1  
##  (Other)      :25   (Other)        :81   (Other)                :124  
##   mass.female        gestation.month  lacatation.months
##  Min.   :        8   Min.   : 0.400   Min.   : 0.300   
##  1st Qu.:      857   1st Qu.: 1.405   1st Qu.: 1.625   
##  Median :     5716   Median : 5.000   Median : 4.500   
##  Mean   :  2229475   Mean   : 5.624   Mean   : 6.092   
##  3rd Qu.:   107500   3rd Qu.: 8.365   3rd Qu.: 8.225   
##  Max.   :170000000   Max.   :21.460   Max.   :42.000   
##                                                        
##   mass.litter         repro.output     dev.stage.at.birth        diet   
##  Min.   :      0.3   Min.   :0.00003   Min.   :0.000      carnivore:32  
##  1st Qu.:     42.0   1st Qu.:0.04000   1st Qu.:1.000      herbivore:61  
##  Median :    423.5   Median :0.08000   Median :2.000      omnivore :37  
##  Mean   :  52563.8   Mean   :0.10374   Mean   :1.831                    
##  3rd Qu.:   7038.2   3rd Qu.:0.13750   3rd Qu.:3.000                    
##  Max.   :2272500.0   Max.   :0.50000   Max.   :4.000                    
##                                                                         
##   arid            biome           N      lactation.stage.orig
##  no :91   aquatic    : 22   4      :13   14-Aug :  3         
##  yes:39   terrestrial:108   3      :11   150    :  3         
##                             6      :10   21-63  :  3         
##                             7      : 9   30-60  :  3         
##                             5      : 8   i      :  3         
##                             24     : 5   (Other):113         
##                             (Other):74   NA's   :  2         
##    dry.matter         fat            protein        sugar        energy   
##  Min.   : 8.80   Min.   : 0.200   1.6*   :  4   3      : 5   1.44   :  4  
##  1st Qu.:16.27   1st Qu.: 4.575   1.5    :  3   4.5    : 5   0.81   :  3  
##  Median :22.75   Median : 8.550   10.3   :  3   5.3    : 5   1.13   :  3  
##  Mean   :27.06   Mean   :14.068   10.7   :  3   5.2    : 4   0.49   :  2  
##  3rd Qu.:32.05   3rd Qu.:17.575   6.3    :  3   6.6    : 4   0.5    :  2  
##  Max.   :71.10   Max.   :61.100   7      :  3   (Other):92   (Other):101  
##  NA's   :6       NA's   :2        (Other):111   NA's   :15   NA's   : 15  
##                                    ref     gest.month.NUM  
##  Oftedal & Iverson (1995)            :99   Min.   : 0.400  
##  Hood et al. (2001)                  : 4   1st Qu.: 1.405  
##  Osthoff, Hugo & de Wit (2009a)      : 3   Median : 5.000  
##  Derrickson, Jerrard & Oftedal (1996): 2   Mean   : 5.624  
##  Arnould & Boyd (1995)               : 1   3rd Qu.: 8.365  
##  Arnould & Hindell (1999)            : 1   Max.   :21.460  
##  (Other)                             :20                   
##   lacat.mo.NUM   
##  Min.   : 0.300  
##  1st Qu.: 1.625  
##  Median : 4.500  
##  Mean   : 6.092  
##  3rd Qu.: 8.225  
##  Max.   :42.000  
##

Fix NAs

After working with the data a bit I realized I had 2 NA values in my “fat” column that should’t be there. I use the following code to

Identify the location of the NAs with is.na()
Identify the specific row of data by using which() to ID the specific species with an NA for fat
Overwrite the NA with the correct data using row indexing
- ie “dat[i.H.niger,“fat”] <- 5.0”
Note that is.na() returns TRUE/FALSE values; “TRUE” acts like specific index when you use it for row indexing
- ie, even though “i.NA” is a vector of TRUES and FALSE, dat$spp[i.NA]" returns just the two species that match the “TRUES” Normally I would make these changes in my raw data file, but its useful to see how these changes can be made within R.

Fix NAs by hand

# Look at fat column
summary(dat$fat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.200   4.575   8.550  14.070  17.580  61.100       2

# Find NAS
i.NA <- is.na(dat$fat)

#Identify the species w/ NAs
dat$spp[i.NA]

## [1] Hippotragus niger         Cervus elaphus hispanicus
## 130 Levels: Acomys cahirinus Alces alces ... Zalophus californianus

# Extract the indexes of the appropriate species
i.H.niger <- which(dat$spp == "Hippotragus niger")
i.C.elaphus <- which(dat$spp == "Cervus elaphus hispanicus")

#Overwrite the bad values
dat[i.H.niger,"fat"] <- 5.0
dat[i.C.elaphus,"fat"] <- 12.6

Clean variables

There are some annotations from the raw data still hanging out in the dataframe
I will use “regular expressions” to clean these up
Regular expression take practice to use but there are many resources online for learning about them
- cleaning up asterisks (*), as below, is even a little tricker.
What follows is just a brief snapshot into this feature of R
- R is known, however, for not having stellar basic regular expression features.

Clean Protein column

There are letters, commas, and asterisks that were annotations in the original datafile. Easy to remove with find-replace, but also very easy with gsub(), with not chance of changing anything else in your spreadsheet you dont want

Look at all of the protein column

summary(dat$protein)

##  1.1*  1.3*   1.4  1.4*   1.5  1.5*   1.6  1.6*  1.7*   1.9    10 10.2, 
##     1     2     1     1     3     1     2     4     1     1     2     1 
##  10.3  10.5  10.7    11  11.1 11.2*  11.8  11.9  12.1  12.2  12.3  12.4 
##     3     2     3     2     1     1     1     1     1     1     1     1 
##  12.5  14.6  15.8   2.1   2.2   2.5   2.6   2.7     3   3.1   3.4   3.6 
##     2     1     1     2     1     1     1     2     1     1     1     2 
##  3.9*     4   4.1  4.2*   4.3  4.3*   4.4   4.5   4.6   4.8  4.8*   4.9 
##     1     2     1     1     1     1     2     2     1     1     1     1 
##   5.1  5.2*   5.4   5.5   5.6  5.6*   5.7   5.9     6   6.2   6.3  6.3* 
##     1     1     1     1     2     1     2     1     1     1     3     1 
##   6.4   6.5   6.6   6.7   6.8   6.9     7  7.1*   7.2   7.3   7.4   7.7 
##     1     2     1     1     1     1     3     1     2     2     1     2 
##   7.8  7.8S     8   8.1  8.1*   8.2   8.4  8.4*   8.5  8.6*  8.7*   8.8 
##     1     1     2     1     1     2     1     1     1     1     1     1 
##   8.9     9   9.1   9.2  9.2*   9.4   9.5   9.7  9.9* 
##     1     2     1     1     1     2     3     1     1

grep()

Use grep() to find the index of things
“[a-zA-Z]” = “find all letter, lower and uppercase”

i.letters <- grep("[a-zA-Z]",dat$protein)

dat$protein[i.letters]

## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

There’s jsut an “S”. I coudl find it like this with grep too

#these are equivalent for the simple case of a single letter
i.S <- grep("S",dat$protein)  #no brackets
i.S <- grep("[S]",dat$protein)#brackets

dat$protein[i.S]

## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

Use the gsub() command to replace letter

dat$protein <- gsub("[a-zA-Z]", #pattern
                    "",         #replace
                    dat$protein)#data

Check

grep("[a-zA-Z]",dat$protein)

## integer(0)

Use the gsub() command to replace asterisks *

Asterisks are a “special character” for regular expressions *These can be tricky
Have to put \ in front of it

dat$protein <- gsub("\\*","",dat$protein)

Check

grep("\\*",dat$protein)

## integer(0)

Use the gsub() command to replace comma

Commas are also species characters

dat$protein <- gsub("\\,","",dat$protein)

Check

grep("\\,",dat$protein)

## integer(0)

Check for space

These are frequent typos in data

grep(" ", dat$protein)

## integer(0)

None

Convert character data to numeric data

Because there were non-numeric characters (S, a comamn, *) in the column, are loaded “protein” as character data; basicly it treated it as words/symbols but not numbers.
We now need to convert it back to numbers
This is done with the command as.numeric()
Converting back and forth can somtimes cause problems, so its good to put the converted info in a new column to check against the old

dat$protein.num <- as.numeric(dat$protein)

Compare old and new columns

head(dat[,c("protein.num","protein")])

##   protein.num protein
## 1         6.3     6.3
## 2         5.7     5.7
## 3         4.1     4.1
## 4         4.3     4.3
## 5         5.6     5.6
## 6         8.8     8.8

tail(dat[,c("protein.num","protein")])

##     protein.num protein
## 125         6.4     6.4
## 126         8.1     8.1
## 127         4.4     4.4
## 128         8.1     8.1
## 129        10.7    10.7
## 130         9.4     9.4

Clean other columns

Check other columns

summary(dat[,c("sugar","energy")])

##      sugar        energy   
##  3      : 5   1.44   :  4  
##  4.5    : 5   0.81   :  3  
##  5.3    : 5   1.13   :  3  
##  5.2    : 4   0.49   :  2  
##  6.6    : 4   0.5    :  2  
##  (Other):92   (Other):101  
##  NA's   :15   NA's   : 15

Sugar

The sugar column has “<” and letters.
I can put anything I want removed inside the brackets
- “[a-zA-Z<]”" wiil get rid of the letters and “<”

dat$sugar <- gsub("[a-zA-Z<]","",dat$sugar)

Convert and check

dat$sugar.num <- as.numeric(dat$sugar)

## Warning: NAs introduced by coercion

head(dat[,c("sugar.num","sugar")])

##   sugar.num sugar
## 1       5.2   5.2
## 2        NA  <NA>
## 3       5.3   5.3
## 4       4.1   4.1
## 5       4.9   4.9
## 6        NA  <NA>

Energy

#The followign are equivalent
summary(dat[,c("energy")])

##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
##     1     1     1     1     2     2     2     1     2     1     1     1 
##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
##     1     1     3     1     1     1     2     1     1     1     1     2 
##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
##     2     1     1     1     1     3     2     2     1     1     2     1 
##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
##     1     1     1     1     1     2     2     1     1     1     4     1 
##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
##     1     2     1     1     1     1     1     1     1     2     1     1 
##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
##     1     1     1     1     1     1     2     1     1     1     1     1 
##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
##     1     1     1     1     1     1     1     1    15

summary(dat$energy)

##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
##     1     1     1     1     2     2     2     1     2     1     1     1 
##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
##     1     1     3     1     1     1     2     1     1     1     1     2 
##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
##     2     1     1     1     1     3     2     2     1     1     2     1 
##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
##     1     1     1     1     1     2     2     1     1     1     4     1 
##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
##     1     2     1     1     1     1     1     1     1     2     1     1 
##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
##     1     1     1     1     1     1     2     1     1     1     1     1 
##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
##     1     1     1     1     1     1     1     1    15

#clean
dat$energy <- gsub("[a-zA-Z<]","",dat$energy)

#convert and check
dat$energy.num <- as.numeric(dat$energy)

## Warning: NAs introduced by coercion

head(dat[,c("energy.num","energy")])

##   energy.num energy
## 1       1.21   1.21
## 2         NA   <NA>
## 3       1.13   1.13
## 4       0.91   0.91
## 5       1.31   1.31
## 6         NA   <NA>

Save the cleaned data

write.csv(dat, "Skibiel_clean2_milk.csv")

HW 4: Regression Case Study Tutorial 1: Data Cleaning

brouwern@gmail.com

February 12, 2017

Introduction

Important functions used

Original Data

Load data

Set working directory

Load data from csv

Look at data

Fix NAs

Fix NAs by hand

Clean variables

Clean Protein column

grep()

Use the gsub() command to replace letter

Use the gsub() command to replace asterisks *

Use the gsub() command to replace comma

Check for space

Convert character data to numeric data

Clean other columns

Sugar

Energy

Save the cleaned data