DataM: In-class Exercise 0427: Function

library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)

In-class exercise 1.

Split the ChickWeight{datasets} data by individual chicks to extract separate slope estimates of regressing weight onto Time for each chick.

[Solutions and answers]

Load in the dataset and check its structure

head(ChickWeight)
str(ChickWeight)
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula'  language ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"
summary(ChickWeight)
     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506          
slopes <- ChickWeight %>% with(., split(., Chick)) %>%
  sapply(., function(x) coefficients(lm(x$weight ~ x$Time))[2])
names(slopes) <- substr(names(slopes), 1, nchar(names(slopes)) - 7)

Display slopes for each chick in the descending order.

data.frame(sort(slopes, decreasing = TRUE))

Data visualization (still in the descending order of the slopes)

ChickWeight %>% 
  lattice::xyplot(weight ~ Time | Chick, data = ., type = c('p','r','g'),
                  index.cond = function(x, y) coefficients(lm(y ~ x))[2])


In-class exercise 2.

Explain what does this statement do:

lapply(lapply(search(), ls), length)

[Solutions and answers]

Split above code into several steps:

  1. Find packages loading in the current environment. It returns a vector.
search()
 [1] ".GlobalEnv"        "package:forcats"   "package:stringr"  
 [4] "package:purrr"     "package:readr"     "package:tibble"   
 [7] "package:ggplot2"   "package:tidyverse" "package:tidyr"    
[10] "package:dplyr"     "package:stats"     "package:graphics" 
[13] "package:grDevices" "package:utils"     "package:datasets" 
[16] "package:methods"   "Autoloads"         "package:base"     
is.vector(search())
[1] TRUE
  1. Conduct ls (show the list) command for each package loading in the current environment. That is, show the lists of commands, functions, datasets, …, etc, in each of package. Since the output is too long, I do not show the output here.
lapply(search(), ls)
  1. Then, we can figure out what the given script does. It computes the length of the list (the number of commands, functions, datasets, …, etc) for each package. But the output still contains too many lines to be easily displayed. Thus, I turn the outter lapply into sapply to make the output be a vector instead of a list.
lapply(lapply(search(), ls), length)
sapply(lapply(search(), ls), length)
 [1]    1   33   52  178  109   42  515    5   77  267  448   87  109  215  104
[16]  218    0 1229

We also can use the data.frame and turn the outter lapply into sapply to display the name and the length of the list together for each package:

data.frame(package.name = search(),
           no.element = sapply(lapply(search(), ls), length))


In-class exercise 3.

The following R script uses Cushings{MASS} to demonstrates several ways to achieve the same objective in R. Explain the advantages or disadvantages of each method.

#
# Cushings example
#
library(MASS)

Create a dataset with missing values to test.

dta_test <- Cushings
dta_test[10,3] <- NA

Method 1

aggregate( . ~ Type, data = Cushings, mean)
aggregate( . ~ Type, data = dta_test, mean)
  • Advantages:
    1. The command is straightforward.
    2. It can deal with missing values automatically.
    3. The output is a data frame, which can be easily operated with dplyr and tidyr and be displayed with knitr table tiddly.


Method 2

# The original code
sapply(split(Cushings[,-3], Cushings$Type), function(x) apply(x, 2, mean))
                           a    b     c        u
Tetrahydrocortisone 2.966667 8.18 19.72 14.01667
Pregnanetriol       2.440000 1.12  5.50  1.20000
# Test with missing values
sapply(split(dta_test[,-3], dta_test$Type), function(x) apply(x, 2, mean)) 
                           a        b     c        u
Tetrahydrocortisone 2.966667 8.222222 19.72 14.01667
Pregnanetriol       2.440000 1.111111  5.50  1.20000
# Other comments
lapply(split(Cushings[,-3], Cushings$Type), function(x) apply(x, 2, mean)) %>%
  sapply(., sd)
         a          b          c          u 
 0.3724096  4.9921739 10.0550584  9.0627519 
  • Advantages:
    1. It can deal with missing values automatically.
    2. The output is a matrix, which can be operated with matrix operation.
  • Disadvantage:
    1. Compare to method 1, this command is relatively not straightforward. We need to specify the index vaiable with its dataset name instead of using its own name only (e.g., Cushings$Type vs Type). We also have to exclude the index variable when splitting the dataset.
    2. Command function(x) apply(x, 2, mean) can be turned into colMeans to be simpler.
  • Other comments
    1. If we use lapply instead of sapply, we can do more list operation since the output is a list.


Method 3

do.call("rbind", as.list(  # bind the list by the rows
  by(Cushings, list(Cushings$Type), function(x) {
    y <- subset(x, select = -Type) # create subset without the index variable
    apply(y, 2, mean)  # compute the colunm means
  }
)))
  Tetrahydrocortisone Pregnanetriol
a            2.966667          2.44
b            8.180000          1.12
c           19.720000          5.50
u           14.016667          1.20
# Test for missing value
do.call("rbind", as.list(  # bind the list by the rows
  by(dta_test, list(dta_test$Type), function(x) {
    y <- subset(x, select = -Type) # create subset without the index variable
    apply(y, 2, mean)  # compute the colunm means
  }
)))
  Tetrahydrocortisone Pregnanetriol
a            2.966667      2.440000
b            8.222222      1.111111
c           19.720000      5.500000
u           14.016667      1.200000
  • Advantage:
    1. It can deal with missing values automatically.
    2. The output is a matrix, which can be operated with matrix operation.
  • Disadvantage:
    1. This command is not straightforward.


Method 4

Cushings %>%
 group_by(Type) %>%
 summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))
# Test for missing values
dta_test %>%
 group_by(Type) %>%
 summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))
  • Advantages:
    1. The code with dplyr commands is very straightforward.
    2. The names of the new varibles can be specified when being created.
    3. It can deal with missing values automatically.
    4. The output is a data table, which can be easily operated with dplyr and tidyr and be displayed with knitr table tiddly.

Method 5

Cushings %>%
 nest(-Type) %>%
 mutate(avg = map(data, ~ apply(., 2, mean)), 
        res_1 = map_dbl(avg, "Tetrahydrocortisone"), 
        res_2 = map_dbl(avg, "Pregnanetriol")) 
# Test for missing values
dta_test %>%
  nest(-Type) %>%
  mutate(avg = map(data, ~ apply(., 2, mean)),
         res_1 = map_dbl(avg, "Tetrahydrocortisone"), 
         res_2 = map_dbl(avg, "Pregnanetriol")) 
  • Advantages:
    1. The code with dplyr and tidyr commands is very straightforward.
    2. The names of the new varibles can be specified when being created.
    3. The output is a data table, which can be easily operated with dplyr and tidyr and be displayed with knitr table tiddly.
    4. Compared to method 4, the output of method 5 extraly show the data type of each group of the index variable.
  • Disadvantage:
    1. The code is not as straightforward as method 4.
    2. It is possible that conflicts take place when using both dplyr and tidyr commands.


In-class exercise 4.

Go through the script in the NZ schools example and provide comments to each code chunk indicated by '##'. Give alternative code to perform the same calculation where appropriate.

The New Zealand Ministry of Education provides basic information for all primary and secondary schools in the country.

Source: Ministry of education - New Zealand

  • Column 1: School ID
  • Column 2: School name
  • Column 3: City where the school is located
  • Column 4: The authority of the school
  • Column 5: A socio-economic status of the families of the students of the school
  • Column 6: The number of students enrolled at the school as of July 2007

Load in the data set.

dta <- read.csv("../data/nzSchools.csv", as.is=2)  
# Keep the school names with white spaces

Check the structure and the dimension of the data set.

str(dta)
'data.frame':   2571 obs. of  6 variables:
 $ ID  : int  1015 1052 1062 1092 1130 1018 1029 1030 1588 1154 ...
 $ Name: chr  "Hora Hora School" "Morningside School" "Onerahi School" "Raurimu Avenue School" ...
 $ City: Factor w/ 541 levels "Ahaura","Ahipara",..: 533 533 533 533 533 533 533 533 533 533 ...
 $ Auth: Factor w/ 4 levels "Other","Private",..: 3 3 3 3 3 3 3 3 4 3 ...
 $ Dec : int  2 3 4 2 4 8 5 5 6 1 ...
 $ Roll: int  318 200 455 86 577 329 637 395 438 201 ...
dim(dta)
[1] 2571    6

Binning

Binning - 1

Group Roll into 2 groups (create a binary variable to label this).

dta$Size <- ifelse(dta$Roll > median(dta$Roll), "Large", "Small")
dta$Size <- NULL  # clean this variable
head(dta) # check the top 6 rows

Binning - 2

Group Roll into 3 groups. Find the freq. table of the grouping variable

dta$Size <- cut(dta$Roll, 3, labels=c("Small", "Mediam", "Large"))
table(dta$Size) 

 Small Mediam  Large 
  2555     15      1 

Sorting

1. Create a new variable RollOrd containing the decreasing order of Roll.

dta$RollOrd <- order(dta$Roll, decreasing=T)

2. Show the data with 6 largest Roll.

head(dta[dta$RollOrd, ])

3. Show the data with 6 smallest Roll.

tail(dta[dta$RollOrd, ])

4. Show the top 6 rows of the dataset in the decreasing order of City and Roll.

City is a factorial variable with character type, whose first letter would be used in the alphabet orber when sorting.

head(dta[order(dta$City, dta$Roll, decreasing=T), ])

5. Show the last 6 rows of the dataset in the decreasing order of City and Roll.

tail(dta[order(dta$City, dta$Roll, decreasing=T), ])

Counting

Find the freq. table of variable Auth.

table(dta$Auth)

           Other          Private            State State Integrated 
               1               99             2144              327 

Save the freq. table of variable and show it.

authtbl <- table(dta$Auth); authtbl

           Other          Private            State State Integrated 
               1               99             2144              327 

Find the type of authtbl.

class(authtbl)
[1] "table"

Show the dataset whose Auth is 'Other'.

dta[dta$Auth == "Other", ]

Show the contingency table of Auth and Dec.

xtabs(~ Auth + Dec, data=dta)
                  Dec
Auth                 1   2   3   4   5   6   7   8   9  10
  Other              1   0   0   0   0   0   0   0   0   0
  Private            0   0   2   6   2   2   6  11  12  38
  State            259 230 208 219 214 215 188 200 205 205
  State Integrated  12  22  35  28  38  34  45  45  37  31

Aggregating

Compute the mean of Roll{dta}.

mean(dta$Roll)
[1] 295.4737

Compute the Roll mean of rows whose Auth is 'Private'.

mean(dta$Roll[dta$Auth == "Private"])
[1] 308.798

Compute Roll mean for each group of Auth.

aggregate(dta["Roll"], by=list(dta$Auth), FUN=mean)

Group Dec into two groups. Create a binary variable Rich to label this.

[Note] Since the original command dta$Rich would show too many lines, I turn it into table(dta$Rich) for a tidier output.

dta$Rich <- dta$Dec > 5; table(dta$Rich)

FALSE  TRUE 
 1276  1274 

Compute Roll mean for each cross group of Auth and Rich. The output is a data frame with the long format.

aggregate(dta["Roll"], by=list(dta$Auth, dta$Rich), FUN=mean)

Find the range of Roll (e.g., MIN and MAX) for each group of Auth. The output is a by format.

by(dta["Roll"], INDICES=list(dta$Auth), FUN=range)
: Other
[1] 51 51
------------------------------------------------------------ 
: Private
[1]    7 1663
------------------------------------------------------------ 
: State
[1]    5 5546
------------------------------------------------------------ 
: State Integrated
[1]   18 1475


In-class exercise 5.

Go through the script in the NCEA 2007 example and provide comments to each code chunk indicated by '##'. Give alternative code to perform the same calculation where appropriate.

Students’ learning in secondary schools are measured by the National Certificates of Educational Achievement (NCEA) in New Zealand. Students usually try to attain NCEA Level 1 in their third year of secondary schooling, Level 2 in their fourth year, and Level 3 in their fifth and final year of secondary school. The percentage of students who achieved each NCEA level is reported annually for all New Zealand secondary schools. The data set contains NCEA achievement percentages for 2007.

  • Column 1: School name
  • Column 2: Achievement percentages for Level 1
  • Column 3: Achievement percentages for Level 2
  • Column 4: Achievement percentages for Level 3

Load and check

Check the structure and the dimension of the data set. Show the first 6 rows of the dataset.

dta2 <- read.table("../data/NCEA2007.txt", sep=":", quote="", h=T, as.is=T)
dim(dta2)
[1] 88  4
str(dta2)
'data.frame':   88 obs. of  4 variables:
 $ Name  : chr  "Al-Madinah School" "Alfriston College" "Ambury Park Centre for Riding Therapy" "Aorere College" ...
 $ Level1: num  61.5 53.9 33.3 39.5 71.2 22.1 50.8 57.3 89.3 59.8 ...
 $ Level2: num  75 44.1 20 50.2 78.9 30.8 34.8 49.8 89.7 65.7 ...
 $ Level3: num  0 0 0 30.6 55.5 26.3 48.9 44.6 88.6 50.4 ...
head(dta2)

Apply

Apply for one dimension of an array.

Compute the column means (group means). The output is a named vector.

apply(dta2[, -1], MARGIN=2, FUN=mean)
  Level1   Level2   Level3 
62.26705 61.06818 47.97614 

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a matrix.

apply(dta2[, -1], MARGIN=2, FUN=range)
     Level1 Level2 Level3
[1,]    2.8    0.0    0.0
[2,]   97.4   95.7   95.7

List apply

Compute the group means. The output is a list.

lapply(dta2[, -1], FUN=mean)
$Level1
[1] 62.26705

$Level2
[1] 61.06818

$Level3
[1] 47.97614

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a list.

lapply(dta2[, -1], FUN=range)
$Level1
[1]  2.8 97.4

$Level2
[1]  0.0 95.7

$Level3
[1]  0.0 95.7

Simplify the list apply

Compute the group means. The output is a named vector.

sapply(dta2[, -1], FUN=mean)
  Level1   Level2   Level3 
62.26705 61.06818 47.97614 

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a matrix.

sapply(dta2[, -1], FUN=range)
     Level1 Level2 Level3
[1,]    2.8    0.0    0.0
[2,]   97.4   95.7   95.7

Splitting

Split the variable dta$Roll by the index variable dta$Auth.

The splited variable is a integer vector. The index variable have 4 levels. The splitting output is a list with 4 elements, which are integer vectors.

rollsByAuth <- split(dta$Roll, dta$Auth)
class(rollsByAuth)  # Find the type of the output
[1] "list"
str(rollsByAuth)    # Find the structure of the output
List of 4
 $ Other           : int 51
 $ Private         : int [1:99] 255 39 154 73 83 25 95 85 94 729 ...
 $ State           : int [1:2144] 318 200 455 86 577 329 637 395 201 267 ...
 $ State Integrated: int [1:327] 438 26 191 560 151 114 126 171 211 57 ...

Compute Roll mean for each group (level) of Auth. The output is a list with 4 elements.

lapply(rollsByAuth , mean)
$Other
[1] 51

$Private
[1] 308.798

$State
[1] 300.6301

$`State Integrated`
[1] 258.3792

Compute Roll mean for each group (level) of Auth. The output is a named vector with 4 elements.

sapply(rollsByAuth, mean)
           Other          Private            State State Integrated 
         51.0000         308.7980         300.6301         258.3792 

Jay Liao

2020-05-09