The data set concerns species and weight of animals caught in plots in a study area in Arizona over time.
Each row holds information for a single animal, and the columns represent:
pacman conveniently wraps library- and package-related functions and names them in an intuitive and consistent fashion. Here we use it to load the tidyverse.
pacman::p_load(tidyverse)
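For readers without pacman, the load-or-install idea behind p_load() can be approximated in base R. The helper name load_or_install below is hypothetical, and this is only a rough sketch of the idea, not pacman's actual implementation.

```r
# Hypothetical helper: attach a package, installing it first if it is missing.
# This only approximates the behaviour of pacman::p_load().
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so nothing is installed in this call
load_or_install("stats")
```

Calling load_or_install("tidyverse") would then play the same role as the p_load() call above.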
Read the data.
dta <- read_csv("http://kbroman.org/datacarp/portal_data_joined.csv")
## Parsed with column specification:
## cols(
## record_id = col_double(),
## month = col_double(),
## day = col_double(),
## year = col_double(),
## plot_id = col_double(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_double(),
## weight = col_double(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character(),
## plot_type = col_character()
## )
Get a glimpse of the data. There are 13 variables and 34,786 observations.
glimpse(dta)
## Observations: 34,786
## Variables: 13
## $ record_id <dbl> 1, 72, 224, 266, 349, 363, 435, 506, 588, 661, 748,...
## $ month <dbl> 7, 8, 9, 10, 11, 11, 12, 1, 2, 3, 4, 5, 6, 8, 9, 10...
## $ day <dbl> 16, 19, 13, 16, 12, 12, 10, 8, 18, 11, 8, 6, 9, 5, ...
## $ year <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1978, 197...
## $ plot_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ species_id <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL...
## $ sex <chr> "M", "M", NA, NA, NA, NA, NA, NA, "M", NA, NA, "M",...
## $ hindfoot_length <dbl> 32, 31, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, NA,...
## $ weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 218, NA, NA, 204, 2...
## $ genus <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Neotom...
## $ species <chr> "albigula", "albigula", "albigula", "albigula", "al...
## $ taxa <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "...
## $ plot_type <chr> "Control", "Control", "Control", "Control", "Contro...
dim() confirms that there are 34,786 rows and 13 columns.
dim(dta)
## [1] 34786 13
We select three variables (plot_id, species_id, weight) and look at the first 6 rows.
dplyr::select(dta, plot_id, species_id, weight) %>% head()
We select all variables except record_id and species_id and look at the first 6 rows.
dplyr::select(dta, -record_id, -species_id) %>% head()
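The two forms of select() used above can be compared on a small example. The toy data frame below is made up for illustration and is not taken from the portal data.

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(
  record_id  = 1:3,
  plot_id    = c(2, 2, 3),
  species_id = c("NL", "DM", "DM"),
  weight     = c(218, 40, NA)
)

kept    <- select(toy, plot_id, species_id, weight)  # keep only the named columns
dropped <- select(toy, -record_id, -species_id)      # keep everything except the named columns

names(kept)
names(dropped)
```

Naming columns keeps them in the order given, while a minus sign removes columns and leaves the rest in their original order.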
We keep only the rows where year is 1995 and look at the first 6 rows.
dplyr::filter(dta, year == 1995) %>% head()
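One detail of filter() matters for columns with missing values such as weight: rows where the condition evaluates to NA are dropped, not kept. A minimal sketch on invented toy data:

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(
  year   = c(1995, 1995, 1996),
  weight = c(4, NA, 5)
)

only_1995  <- filter(toy, year == 1995)                 # rows where the condition is TRUE
light      <- filter(toy, weight <= 5)                  # the NA weight row is dropped
light_or_na <- filter(toy, is.na(weight) | weight <= 5) # keep missing weights explicitly

nrow(only_1995)
nrow(light)
nrow(light_or_na)
```

If rows with missing values should be retained, the condition has to say so explicitly, as in the last call.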
We keep the rows where weight is at most 5, select three variables (species_id, sex, weight), and look at the first 6 rows. Written as nested calls, the steps read inside out:
head(dplyr::select(dplyr::filter(dta, weight <= 5), species_id, sex, weight))
Written with the pipe, the same steps read top to bottom:
dta %>%
  dplyr::filter(weight <= 5) %>%
  dplyr::select(species_id, sex, weight) %>%
  head()
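To check that the nested and piped forms give the same result, they can be run side by side on a small data frame. The toy data below is invented for illustration.

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(
  plot_id    = c(1, 2, 3, 4),
  species_id = c("PF", "PF", "DM", "RM"),
  sex        = c("F", "M", "M", NA),
  weight     = c(4, 5, 40, NA)
)

# Nested calls: read from the innermost call outwards
nested <- head(select(filter(toy, weight <= 5), species_id, sex, weight))

# Piped calls: each step receives the result of the previous one
piped <- toy %>%
  filter(weight <= 5) %>%
  select(species_id, sex, weight) %>%
  head()

identical(nested, piped)
```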
We create two new variables while keeping the existing ones: weight_kg = weight / 1000 and weight_lb = weight_kg * 2.2. Then we look at the first 6 rows.
dta %>%
  mutate(weight_kg = weight / 1000,
         weight_lb = weight_kg * 2.2) %>%
  head()
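mutate() evaluates its arguments in order, which is why weight_lb can be computed from the weight_kg column created just before it in the same call. A small sketch on invented toy data (note that 2.2 is a rounded conversion factor):

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(weight = c(218, 40, NA))

res <- toy %>%
  mutate(weight_kg = weight / 1000,    # created first
         weight_lb = weight_kg * 2.2)  # uses the column created on the line above

res
```

Missing weights stay missing in both new columns, since arithmetic on NA returns NA.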
We keep the rows where weight is not missing, group them by sex and species_id, and compute the mean weight of each group, naming it mean_weight. We then sort by mean_weight in descending order and look at the first 6 rows.
dta %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  arrange(desc(mean_weight)) %>%
  head()
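Filtering out missing weights before summarizing is not quite the same as passing na.rm = TRUE to mean(): a group whose weights are all missing disappears in the first case but appears with NaN in the second. A sketch on invented toy data:

```r
library(dplyr)

# Toy data (invented for illustration only): the male group has no recorded weights
toy <- tibble(
  sex        = c("F", "F", "M", "M"),
  species_id = c("DM", "DM", "DM", "DM"),
  weight     = c(40, 44, NA, NA)
)

# Approach used in the text: drop missing weights first
filtered <- toy %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))

# Alternative: keep all rows and let mean() skip the NAs
unfiltered <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))

filtered
unfiltered
```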
We group by sex and count the rows in each group with tally().
dta %>%
  group_by(sex) %>%
  tally()
count() is similar to tally(), but count() calls group_by() before and ungroup() after.
dta %>%
  count(sex)
Another way to count the rows in each sex group:
dta %>%
  group_by(sex) %>%
  summarize(count = n())
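The three counting idioms above can be checked against each other on a small example. The toy data is invented for illustration; note that missing values of sex form their own group.

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(sex = c("F", "M", "M", NA))

by_tally     <- toy %>% group_by(sex) %>% tally()
by_count     <- toy %>% count(sex)
by_summarize <- toy %>% group_by(sex) %>% summarize(count = n())

by_tally
by_count
by_summarize
```

tally() and count() name the result column n, while the summarize() version uses whatever name is given (here count).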
We group by sex and count, within each group, the rows where year is not missing.
dta %>%
  group_by(sex) %>%
  summarize(count = sum(!is.na(year)))
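The expression sum(!is.na(year)) works because !is.na() returns TRUE or FALSE for each row and sum() treats TRUE as 1. A sketch on invented toy data, next to n() for comparison:

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(
  sex  = c("F", "F", "M"),
  year = c(1995, NA, 1996)
)

res <- toy %>%
  group_by(sex) %>%
  summarize(rows        = n(),                # all rows in the group
            non_missing = sum(!is.na(year)))  # rows with a recorded year

res
```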
We create a new data frame, dta_gw, from dta. We drop the rows with missing weight, group by genus and plot_id, and compute the mean weight of each group, naming it mean_weight.
dta_gw <- dta %>%
  filter(!is.na(weight)) %>%
  group_by(genus, plot_id) %>%
  summarize(mean_weight = mean(weight))
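After summarizing a data frame grouped by two variables, the result is still grouped by the first one, because summarize() removes only the last grouping level. A sketch on invented toy data, using group_vars() to inspect the grouping:

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(
  genus   = c("Baiomys", "Baiomys", "Neotoma"),
  plot_id = c(1, 1, 2),
  weight  = c(6, 8, 160)
)

res <- toy %>%
  group_by(genus, plot_id) %>%
  summarize(mean_weight = mean(weight))

group_vars(res)           # grouping that remains after summarize()
group_vars(ungroup(res))  # no grouping left after ungroup()
```

Calling ungroup() on the result is a simple way to avoid surprises in later steps that should not operate per genus.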
We take a glimpse of dta_gw.
glimpse(dta_gw)
## Observations: 196
## Variables: 3
## Groups: genus [10]
## $ genus <chr> "Baiomys", "Baiomys", "Baiomys", "Baiomys", "Baiomys", ...
## $ plot_id <dbl> 1, 2, 3, 5, 18, 19, 20, 21, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ mean_weight <dbl> 7.000000, 6.000000, 8.611111, 7.750000, 9.500000, 9.533...
We create a new data frame, dta_w, from dta_gw. spread() turns it into wide format: each genus becomes a column holding its mean_weight values, with one row per plot_id.
dta_w <- dta_gw %>%
  spread(key = genus, value = mean_weight)
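In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). Assuming such a version is installed, the same reshaping can be sketched with both functions on invented toy data:

```r
library(dplyr)
library(tidyr)

# Toy data (invented for illustration only)
long <- tibble(
  plot_id     = c(1, 1, 2),
  genus       = c("Baiomys", "Neotoma", "Neotoma"),
  mean_weight = c(7, 156, 169)
)

wide_spread <- spread(long, key = genus, value = mean_weight)
wide_pivot  <- pivot_wider(long, names_from = genus, values_from = mean_weight)

wide_spread
wide_pivot
```

Plot 2 has no Baiomys row in the long data, so its Baiomys cell is NA in both wide versions.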
We now take a glimpse of dta_w.
glimpse(dta_w)
## Observations: 24
## Variables: 11
## $ plot_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ Baiomys <dbl> 7.000000, 6.000000, 8.611111, NA, 7.750000, NA, NA,...
## $ Chaetodipus <dbl> 22.19939, 25.11014, 24.63636, 23.02381, 17.98276, 2...
## $ Dipodomys <dbl> 60.23214, 55.68259, 52.04688, 57.52454, 51.11356, 5...
## $ Neotoma <dbl> 156.2222, 169.1436, 158.2414, 164.1667, 190.0370, 1...
## $ Onychomys <dbl> 27.67550, 26.87302, 26.03241, 28.09375, 27.01695, 2...
## $ Perognathus <dbl> 9.625000, 6.947368, 7.507812, 7.824427, 8.658537, 7...
## $ Peromyscus <dbl> 22.22222, 22.26966, 21.37037, 22.60000, 21.23171, 2...
## $ Reithrodontomys <dbl> 11.375000, 10.680556, 10.516588, 10.263158, 11.1545...
## $ Sigmodon <dbl> NA, 70.85714, 65.61404, 82.00000, 82.66667, 68.7777...
## $ Spermophilus <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
We spread again with key = genus and value = mean_weight, this time filling the missing values with 0 (fill = 0), and look at the first 6 rows.
dta_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  head()
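Filling with 0 changes what later computations see: a filled 0 is treated as a measured value, whereas NA can be skipped. A small sketch on invented toy data illustrates the difference.

```r
library(dplyr)
library(tidyr)

# Toy data (invented for illustration only): no Baiomys was weighed in plot 2
long <- tibble(
  plot_id     = c(1, 1, 2),
  genus       = c("Baiomys", "Neotoma", "Neotoma"),
  mean_weight = c(7, 156, 169)
)

wide_na   <- spread(long, genus, mean_weight)            # missing cell stays NA
wide_zero <- spread(long, genus, mean_weight, fill = 0)  # missing cell becomes 0

mean_skipping_na <- mean(wide_na$Baiomys, na.rm = TRUE)  # uses plot 1 only
mean_with_zero   <- mean(wide_zero$Baiomys)              # averages the filled 0 as well
```

Whether 0 is an appropriate fill depends on what the table is used for afterwards.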
We create a new data frame, dta_l, from dta_w. gather() is the opposite of spread(). We create a long format by gathering the genus names into one column and the mean_weight associated with each genus into a second column, keeping the corresponding plot_id.
dta_l <- dta_w %>%
  gather(key = genus, value = mean_weight, -plot_id)
We take a glimpse of dta_l.
glimpse(dta_l)
We can also name the columns to gather, which are collapsed into genus-mean_weight pairs to make the long format. "Baiomys:Spermophilus" means we gather the genus columns from Baiomys to Spermophilus.
dta_w %>%
  gather(key = genus, value = mean_weight, Baiomys:Spermophilus) %>%
  head()
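In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). Assuming such a version is installed, the two can be compared on invented toy data; they return the same rows, though in a different order, so the sketch sorts both results before comparing.

```r
library(dplyr)
library(tidyr)

# Toy data (invented for illustration only)
wide <- tibble(
  plot_id = c(1, 2),
  Baiomys = c(7, NA),
  Neotoma = c(156, 169)
)

long_gather <- gather(wide, key = genus, value = mean_weight, -plot_id)
long_pivot  <- pivot_longer(wide, cols = -plot_id,
                            names_to = "genus", values_to = "mean_weight")

# Sort both results the same way so they can be compared row by row
sorted_gather <- arrange(long_gather, genus, plot_id)
sorted_pivot  <- arrange(long_pivot, genus, plot_id)
```

Both functions keep the NA cell as a row of its own unless asked to drop it.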
We create a data frame, dta_complete, from dta, keeping only the rows where weight, hindfoot_length, and sex are all non-missing.
dta_complete <- dta %>%
  filter(!is.na(weight),
         !is.na(hindfoot_length),
         !is.na(sex))
We create the data frame species_counts from dta_complete: we count the observations for each species_id and keep the species with n >= 50.
species_counts <- dta_complete %>%
  count(species_id) %>%
  filter(n >= 50)
We then update dta_complete, keeping only the rows whose species_id appears in species_counts.
dta_complete <- dta_complete %>%
  filter(species_id %in% species_counts$species_id)
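The three steps above can be traced on a small example. The toy data is invented for illustration, and the threshold is lowered from 50 to 2 so that it fits in a few rows. The final semi_join() call is an alternative to %in%, shown here only for comparison.

```r
library(dplyr)

# Toy data (invented for illustration only)
toy <- tibble(
  species_id      = c("DM", "DM", "DM", "NL", "PF"),
  weight          = c(40, 44, NA, 218, 8),
  hindfoot_length = c(35, 36, 37, 32, NA),
  sex             = c("M", "F", "F", "M", "F")
)

# Step 1: keep rows with no missing value in the three variables
toy_complete <- toy %>%
  filter(!is.na(weight), !is.na(hindfoot_length), !is.na(sex))

# Step 2: count per species and keep the common ones (threshold 2 is an assumption for this toy data)
toy_counts <- toy_complete %>%
  count(species_id) %>%
  filter(n >= 2)

# Step 3: keep only rows of the common species
common_in   <- toy_complete %>% filter(species_id %in% toy_counts$species_id)
common_join <- toy_complete %>% semi_join(toy_counts, by = "species_id")
```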