DataM: Inclass Exercise 0330

The data set concerns species and weight of animals caught in plots in a study area in Arizona over time.

Each row holds information for a single animal, and the columns represent:

record_id: Unique id for the observation
month: month of observation
day: day of observation
year: year of observation
plot_id: ID of a particular plot
species_id: 2-letter code
sex: sex of animal ("M", "F")
hindfoot_length: length of the hindfoot in mm
weight: weight of the animal in grams
genus: genus of animal
species: species of animal
taxa: e.g. Rodent, Reptile, Bird, Rabbit
plot_type: type of plot

Chunk 1

pacman::p_load(tidyverse)

Load the tidyverse with p_load{pacman}, which installs the package first if it is not already available.

Chunk 2

dta <- read_csv("http://kbroman.org/datacarp/portal_data_joined.csv")

## Parsed with column specification:
## cols(
##   record_id = col_double(),
##   month = col_double(),
##   day = col_double(),
##   year = col_double(),
##   plot_id = col_double(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_double(),
##   weight = col_double(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

Read the comma-delimited data set from the URL with read_csv{readr} and name it dta.

Chunk 3

glimpse(dta)

## Rows: 34,786
## Columns: 13
## $ record_id       <dbl> 1, 72, 224, 266, 349, 363, 435, 506, 588, 661, …
## $ month           <dbl> 7, 8, 9, 10, 11, 11, 12, 1, 2, 3, 4, 5, 6, 8, 9…
## $ day             <dbl> 16, 19, 13, 16, 12, 12, 10, 8, 18, 11, 8, 6, 9,…
## $ year            <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1978,…
## $ plot_id         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ species_id      <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL",…
## $ sex             <chr> "M", "M", NA, NA, NA, NA, NA, NA, "M", NA, NA, …
## $ hindfoot_length <dbl> 32, 31, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32,…
## $ weight          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 218, NA, NA, 20…
## $ genus           <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Ne…
## $ species         <chr> "albigula", "albigula", "albigula", "albigula",…
## $ taxa            <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent…
## $ plot_type       <chr> "Control", "Control", "Control", "Control", "Co…

Get the basic info of dta, including its dimensions and the names and types of its variables.

Chunk 4

dim(dta)

## [1] 34786    13

Get the dimensions of dta (no. of rows and no. of columns).

Chunk 5

dplyr::select(dta, plot_id, species_id, weight) %>% head()

Use select{dplyr} to pick some variables in dta: plot_id, species_id, and weight. Then use head to display the first 6 rows.

Chunk 6

dplyr::select(dta, -record_id, -species_id) %>% head()

Use select{dplyr} to keep all variables in dta except record_id and species_id. Then use head to display the first 6 rows.

Chunk 7

dplyr::filter(dta, year == 1995) %>% head()

Use filter{dplyr} to keep the rows that satisfy the specified condition (here, year equal to 1995). Then use head to display the first 6 rows.

Chunk 8

head(dplyr::select(dplyr::filter(dta, weight <= 5), species_id, sex, weight))

Use filter{dplyr} to keep the rows whose weight is not larger than 5.
Use select{dplyr} to pick the variables species_id, sex, and weight.
Use head to display the first 6 rows.

Chunk 9

dta %>% 
  dplyr::filter(weight <= 5) %>% 
  dplyr::select(species_id, sex, weight) %>% 
  head

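Use filter{dplyr} to keep the rows whose weight is not larger than 5.
Use select{dplyr} to pick the variables species_id, sex, and weight.
Use head to display the first 6 rows. The result is only printed, not saved. This is the same operation as Chunk 8, written with the pipe operator %>% instead of nested calls.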
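A minimal sketch comparing the nested and the piped way of writing the same filter-then-select step. The tibble toy and its values are made up for illustration and are not part of the Portal data.

```r
library(dplyr)

# Toy data invented for illustration; not the Portal data.
toy <- tibble(
  species_id = c("PF", "DM", "PF", "RM"),
  sex        = c("F", "M", "M", "F"),
  weight     = c(4, 40, 5, 11)
)

# Nested form: read from the inside out.
nested <- select(filter(toy, weight <= 5), species_id, sex, weight)

# Piped form: read from top to bottom.
piped <- toy %>%
  filter(weight <= 5) %>%
  select(species_id, sex, weight)

# Check that both forms return the same rows and columns.
all.equal(nested, piped)
```

Both forms do the same work; the piped form is usually easier to read once there are more than two steps.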

Chunk 10

dta %>% 
  mutate(weight_kg = weight / 1000,
         weight_lb = weight_kg * 2.2) %>% 
  head()

Use mutate to create two new variables: (a) weight_kg: the existing variable weight divided by 1000. (b) weight_lb: the newly created variable weight_kg multiplied by 2.2. (In other words, this step converts grams to kilograms and then, approximately, to pounds.) Then use head to display the first 6 rows.
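A small sketch of the same unit conversion on made-up weights, showing that a variable created inside mutate can be used by a later expression in the same call, and that missing weights stay missing. The tibble toy is an assumption for illustration, and the factor 2.20462 is a more precise kilogram-to-pound factor than the 2.2 used in the exercise.

```r
library(dplyr)

# Made-up weights in grams, including one missing value.
toy <- tibble(weight = c(40, 218, NA))

converted <- toy %>%
  mutate(weight_kg = weight / 1000,        # grams to kilograms
         weight_lb = weight_kg * 2.20462)  # kilograms to pounds, uses the new column

converted
```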

Chunk 11

dta %>% 
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  arrange(desc(mean_weight)) %>% 
  head()

Use filter{dplyr} to keep the rows whose weight is not a missing value, so that mean(weight) does not return NA.
Use group_by to group the data by the variables sex and species_id. Only combinations that occur in the data form a group, so there are at most (# classes in sex) * (# classes in species_id) groups; a missing sex (NA) counts as its own class.
Use summarize to compute the mean weight of each group.
Use arrange and desc to sort the groups in descending order of mean weight.
Use head to display the first 6 rows.
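A short sketch of how group_by forms groups from two variables. The tibble toy is made up for illustration; it is built so that not every combination of sex and species_id occurs.

```r
library(dplyr)

# Made-up data: no female "PF" animal, and one animal with unknown sex.
toy <- tibble(
  sex        = c("F", "F", "M", "M", NA),
  species_id = c("DM", "DM", "DM", "PF", "PF"),
  weight     = c(40, 44, 48, 8, 7)
)

grouped <- toy %>% group_by(sex, species_id)

# Number of groups actually formed (observed combinations only).
n_groups(grouped)

# Product of the class counts, which is only an upper bound.
n_distinct(toy$sex) * n_distinct(toy$species_id)

# Mean weight per observed group, sorted from heaviest to lightest.
grouped %>%
  summarize(mean_weight = mean(weight)) %>%
  arrange(desc(mean_weight))
```

Here four groups are formed, although the product of the class counts is six, because NA is counted as a class of sex and two combinations never occur.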

Chunk 12

dta %>%
  group_by(sex) %>%
  tally

Group the data by the variable sex and count the observations in each group. That is, count the observations in each class of sex (missing values form their own class).

Chunk 13

dta %>%
  count(sex)

Count observations in each class of sex.

Chunk 14

dta %>%
  group_by(sex) %>%
  summarize(count = n())

Group the data by the variable sex and create a new variable, count, holding the number of observations in each group. That is, count the observations in each class of sex.

Chunk 15

dta %>%
  group_by(sex) %>%
  summarize(count = sum(!is.na(year)))

Group the data by the variable sex and create a new variable, count, holding the number of non-missing values of year in each group.
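A minimal sketch, on made-up data, of the counting approaches used in Chunks 12 to 15. The tibble toy and the result names are assumptions for illustration. The first three give the same row counts; the last one differs as soon as year contains a missing value.

```r
library(dplyr)

# Made-up data with a missing sex and a missing year.
toy <- tibble(
  sex  = c("F", "M", "M", NA, "F"),
  year = c(1977, 1977, NA, 1978, 1978)
)

# Three ways to count rows per class of sex.
by_tally     <- toy %>% group_by(sex) %>% tally()
by_count     <- toy %>% count(sex)
by_summarize <- toy %>% group_by(sex) %>% summarize(n = n())

# Count only the rows where year is not missing.
non_missing <- toy %>%
  group_by(sex) %>%
  summarize(count = sum(!is.na(year)))

by_count
non_missing
```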

Chunk 16

dta_gw <- dta %>% 
  filter(!is.na(weight)) %>%
  group_by(genus, plot_id) %>%
  summarize(mean_weight = mean(weight))

Keep the rows without a missing value in weight.
Group the data by the variables genus and plot_id.
Compute the mean weight for each group.
Save the result and name it dta_gw.

Chunk 17

glimpse(dta_gw)

## Rows: 196
## Columns: 3
## Groups: genus [10]
## $ genus       <chr> "Baiomys", "Baiomys", "Baiomys", "Baiomys", "Baiomy…
## $ plot_id     <dbl> 1, 2, 3, 5, 18, 19, 20, 21, 1, 2, 3, 4, 5, 6, 7, 8,…
## $ mean_weight <dbl> 7.000000, 6.000000, 8.611111, 7.750000, 9.500000, 9…

Get the basic info of dta_gw, including its dimensions, its remaining grouping (by genus), and the names of its variables.

Chunk 18

dta_w <- dta_gw %>%
  spread(key = genus, value = mean_weight)

Spread the data by the variable genus to reshape it from long to wide format: each genus becomes a column holding its mean_weight values, with one row per plot_id. Combinations of plot_id and genus that do not occur are filled with NA.
Save the data and name it dta_w.

Chunk 19

glimpse(dta_w)

## Rows: 24
## Columns: 11
## $ plot_id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ Baiomys         <dbl> 7.000000, 6.000000, 8.611111, NA, 7.750000, NA,…
## $ Chaetodipus     <dbl> 22.19939, 25.11014, 24.63636, 23.02381, 17.9827…
## $ Dipodomys       <dbl> 60.23214, 55.68259, 52.04688, 57.52454, 51.1135…
## $ Neotoma         <dbl> 156.2222, 169.1436, 158.2414, 164.1667, 190.037…
## $ Onychomys       <dbl> 27.67550, 26.87302, 26.03241, 28.09375, 27.0169…
## $ Perognathus     <dbl> 9.625000, 6.947368, 7.507812, 7.824427, 8.65853…
## $ Peromyscus      <dbl> 22.22222, 22.26966, 21.37037, 22.60000, 21.2317…
## $ Reithrodontomys <dbl> 11.375000, 10.680556, 10.516588, 10.263158, 11.…
## $ Sigmodon        <dbl> NA, 70.85714, 65.61404, 82.00000, 82.66667, 68.…
## $ Spermophilus    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Get the basic info of dta_w, including its dimensions and the names of its variables.

Chunk 20

dta_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  head()

Spread the data by the variable genus into wide format as in Chunk 18: one column per genus holding the values of mean_weight.
Fill the missing combinations with 0 instead of NA.
Display the first 6 rows.
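In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A minimal sketch of the equivalent reshaping, assuming such a tidyr version is installed; the tibble long and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up long data: Neotoma was never weighed in plot 2.
long <- tibble(
  genus       = c("Baiomys", "Baiomys", "Neotoma"),
  plot_id     = c(1, 2, 1),
  mean_weight = c(7, 6, 156)
)

# Like spread(key = genus, value = mean_weight): absent cells become NA.
wide_na <- long %>%
  pivot_wider(names_from = genus, values_from = mean_weight)

# Like spread(..., fill = 0): absent cells become 0.
wide_zero <- long %>%
  pivot_wider(names_from = genus, values_from = mean_weight,
              values_fill = list(mean_weight = 0))

wide_na
wide_zero
```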

Chunk 21

dta_l <- dta_w %>%
  gather(key = genus, value = mean_weight, -plot_id)

Stack (gather) the genus columns of dta_w to get the long data format: a single column genus holding the former column names and a single column mean_weight holding the values.
The argument -plot_id excludes plot_id from the gathering, so it is kept as an identifier column rather than dropped.
Save the data and name it dta_l.

Chunk 22

glimpse(dta_l)

## Rows: 240
## Columns: 3
## $ plot_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ genus       <chr> "Baiomys", "Baiomys", "Baiomys", "Baiomys", "Baiomy…
## $ mean_weight <dbl> 7.000000, 6.000000, 8.611111, NA, 7.750000, NA, NA,…

Get the basic info of dta_l, including its dimensions and the names of its variables.

Chunk 23

dta_w %>%
  gather(key = genus, value = mean_weight, Baiomys:Spermophilus) %>%
  head()

Select the columns from Baiomys to Spermophilus in dta_w by range, which here is equivalent to excluding plot_id.
Stack (gather) these columns to get the long data format: a single column genus holding the former column names and a single column mean_weight holding the values.
Display the first 6 rows.
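In tidyr 1.0.0 and later, gather is superseded by pivot_longer. A minimal sketch of the equivalent reshaping, assuming such a tidyr version is installed; the tibble wide and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per plot, one column per genus.
wide <- tibble(
  plot_id = c(1, 2),
  Baiomys = c(7, 6),
  Neotoma = c(156, NA)
)

# cols = -plot_id stacks every column except plot_id,
# so plot_id stays in the result as an identifier column.
long <- wide %>%
  pivot_longer(cols = -plot_id,
               names_to = "genus",
               values_to = "mean_weight")

long
```

The result has the same rows as gather would give, but in a different order: pivot_longer goes through the data row by row, whereas gather stacks one whole column after another.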

Chunk 24

dta_complete <- dta %>%
  filter(!is.na(weight),           
         !is.na(hindfoot_length),  
         !is.na(sex))

Keep the rows in dta that have no missing value in any of the columns weight, hindfoot_length, and sex.
Save the data and name it dta_complete.

Chunk 25

species_counts <- dta_complete %>%
    count(species_id) %>% 
    filter(n >= 50)

Count the observations of each species_id in the complete data (dta_complete).
Keep the species with at least 50 observations.
Save the result and name it species_counts.

Chunk 26

dta_complete <- dta_complete %>%
  filter(species_id %in% species_counts$species_id)

Revise dta_complete: keep only the rows whose species_id appears in species_counts. That is, drop the observations of species with fewer than 50 observations.
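A minimal sketch of this "drop rare species" step on made-up data, next to a one-step alternative that uses a grouped filter. The tibble toy, the result names, and the threshold min_n are assumptions for illustration; the exercise itself uses a threshold of 50.

```r
library(dplyr)

# Made-up data: "PF" occurs only once.
toy <- tibble(
  species_id = c("DM", "DM", "DM", "PF", "RM", "RM"),
  weight     = c(40, 44, 48, 8, 11, 10)
)

min_n <- 2  # hypothetical threshold for the toy data

# Two steps, as in Chunks 25 and 26: count, then filter with %in%.
keep <- toy %>%
  count(species_id) %>%
  filter(n >= min_n)
two_step <- toy %>%
  filter(species_id %in% keep$species_id)

# One step: filter within groups on the group size n().
one_step <- toy %>%
  group_by(species_id) %>%
  filter(n() >= min_n) %>%
  ungroup()

two_step
one_step
```

The two-step version keeps the table of counts around for later inspection; the grouped filter is shorter when the counts themselves are not needed.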
