DataM: In-class Exercise 0427: Function

In-class exercise 1.
In-class exercise 2.
In-class exercise 3.
- Method 1
- Method 2
- Method 3
- Method 4
- Method 5
In-class exercise 4.
- Binning
- Sorting
- Counting
- Aggregating
In-class exercise 5.

library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)

In-class exercise 1.

Split the ChickWeight{datasets} data by individual chicks to extract separate slope estimates of regressing weight onto Time for each chick.

[Solutions and answers]

Load in the dataset and check its structure

head(ChickWeight)

str(ChickWeight)

Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula'  language ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"

summary(ChickWeight)

     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506

slopes <- ChickWeight %>% with(., split(., Chick)) %>%
  sapply(., function(x) coefficients(lm(x$weight ~ x$Time))[2])
names(slopes) <- substr(names(slopes), 1, nchar(names(slopes)) - 7)

Display slopes for each chick in the descending order.

data.frame(sort(slopes, decreasing = TRUE))

Data visualization (still in the descending order of the slopes)

ChickWeight %>% 
  lattice::xyplot(weight ~ Time | Chick, data = ., type = c('p','r','g'),
                  index.cond = function(x, y) coefficients(lm(y ~ x))[2])

In-class exercise 2.

Explain what does this statement do:

lapply(lapply(search(), ls), length)

[Solutions and answers]

Split above code into several steps:

Find packages loading in the current environment. It returns a vector.

search()

 [1] ".GlobalEnv"        "package:forcats"   "package:stringr"  
 [4] "package:purrr"     "package:readr"     "package:tibble"   
 [7] "package:ggplot2"   "package:tidyverse" "package:tidyr"    
[10] "package:dplyr"     "package:stats"     "package:graphics" 
[13] "package:grDevices" "package:utils"     "package:datasets" 
[16] "package:methods"   "Autoloads"         "package:base"

is.vector(search())

[1] TRUE

Conduct ls (show the list) command for each package loading in the current environment. That is, show the lists of commands, functions, datasets, …, etc, in each of package. Since the output is too long, I do not show the output here.

lapply(search(), ls)

Then, we can figure out what the given script does. It computes the length of the list (the number of commands, functions, datasets, …, etc) for each package. But the output still contains too many lines to be easily displayed. Thus, I turn the outter lapply into sapply to make the output be a vector instead of a list.

lapply(lapply(search(), ls), length)

sapply(lapply(search(), ls), length)

 [1]    1   33   52  178  109   42  515    5   77  267  448   87  109  215  104
[16]  218    0 1229

We also can use the data.frame and turn the outter lapply into sapply to display the name and the length of the list together for each package:

data.frame(package.name = search(),
           no.element = sapply(lapply(search(), ls), length))

In-class exercise 3.

The following R script uses Cushings{MASS} to demonstrates several ways to achieve the same objective in R. Explain the advantages or disadvantages of each method.

#
# Cushings example
#
library(MASS)

Create a dataset with missing values to test.

dta_test <- Cushings
dta_test[10,3] <- NA

Method 1

aggregate( . ~ Type, data = Cushings, mean)

aggregate( . ~ Type, data = dta_test, mean)

Advantages:
1. The command is straightforward.
2. It can deal with missing values automatically.
3. The output is a data frame, which can be easily operated with dplyr and tidyr and be displayed with knitr table tiddly.

Method 2

# The original code
sapply(split(Cushings[,-3], Cushings$Type), function(x) apply(x, 2, mean))

                           a    b     c        u
Tetrahydrocortisone 2.966667 8.18 19.72 14.01667
Pregnanetriol       2.440000 1.12  5.50  1.20000

# Test with missing values
sapply(split(dta_test[,-3], dta_test$Type), function(x) apply(x, 2, mean))

                           a        b     c        u
Tetrahydrocortisone 2.966667 8.222222 19.72 14.01667
Pregnanetriol       2.440000 1.111111  5.50  1.20000

# Other comments
lapply(split(Cushings[,-3], Cushings$Type), function(x) apply(x, 2, mean)) %>%
  sapply(., sd)

         a          b          c          u 
 0.3724096  4.9921739 10.0550584  9.0627519

Advantages:
1. It can deal with missing values automatically.
2. The output is a matrix, which can be operated with matrix operation.
Disadvantage:
1. Compare to method 1, this command is relatively not straightforward. We need to specify the index vaiable with its dataset name instead of using its own name only (e.g., Cushings$Type vs Type). We also have to exclude the index variable when splitting the dataset.
2. Command function(x) apply(x, 2, mean) can be turned into colMeans to be simpler.
Other comments
1. If we use lapply instead of sapply, we can do more list operation since the output is a list.

Method 3

do.call("rbind", as.list(  # bind the list by the rows
  by(Cushings, list(Cushings$Type), function(x) {
    y <- subset(x, select = -Type) # create subset without the index variable
    apply(y, 2, mean)  # compute the colunm means
  }
)))

  Tetrahydrocortisone Pregnanetriol
a            2.966667          2.44
b            8.180000          1.12
c           19.720000          5.50
u           14.016667          1.20

# Test for missing value
do.call("rbind", as.list(  # bind the list by the rows
  by(dta_test, list(dta_test$Type), function(x) {
    y <- subset(x, select = -Type) # create subset without the index variable
    apply(y, 2, mean)  # compute the colunm means
  }
)))

  Tetrahydrocortisone Pregnanetriol
a            2.966667      2.440000
b            8.222222      1.111111
c           19.720000      5.500000
u           14.016667      1.200000

Advantage:
1. It can deal with missing values automatically.
2. The output is a matrix, which can be operated with matrix operation.
Disadvantage:
1. This command is not straightforward.

Method 4

Cushings %>%
 group_by(Type) %>%
 summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))

# Test for missing values
dta_test %>%
 group_by(Type) %>%
 summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))

Advantages:
1. The code with dplyr commands is very straightforward.
2. The names of the new varibles can be specified when being created.
3. It can deal with missing values automatically.
4. The output is a data table, which can be easily operated with dplyr and tidyr and be displayed with knitr table tiddly.

Method 5

Cushings %>%
 nest(-Type) %>%
 mutate(avg = map(data, ~ apply(., 2, mean)), 
        res_1 = map_dbl(avg, "Tetrahydrocortisone"), 
        res_2 = map_dbl(avg, "Pregnanetriol"))

# Test for missing values
dta_test %>%
  nest(-Type) %>%
  mutate(avg = map(data, ~ apply(., 2, mean)),
         res_1 = map_dbl(avg, "Tetrahydrocortisone"), 
         res_2 = map_dbl(avg, "Pregnanetriol"))

Advantages:
1. The code with dplyr and tidyr commands is very straightforward.
2. The names of the new varibles can be specified when being created.
3. The output is a data table, which can be easily operated with dplyr and tidyr and be displayed with knitr table tiddly.
4. Compared to method 4, the output of method 5 extraly show the data type of each group of the index variable.
Disadvantage:
1. The code is not as straightforward as method 4.
2. It is possible that conflicts take place when using both dplyr and tidyr commands.

In-class exercise 4.

Go through the script in the NZ schools example and provide comments to each code chunk indicated by '##'. Give alternative code to perform the same calculation where appropriate.

The New Zealand Ministry of Education provides basic information for all primary and secondary schools in the country.

Source: Ministry of education - New Zealand

Column 1: School ID
Column 2: School name
Column 3: City where the school is located
Column 4: The authority of the school
Column 5: A socio-economic status of the families of the students of the school
Column 6: The number of students enrolled at the school as of July 2007

Load in the data set.

dta <- read.csv("../data/nzSchools.csv", as.is=2)  
# Keep the school names with white spaces

Check the structure and the dimension of the data set.

str(dta)

'data.frame':   2571 obs. of  6 variables:
 $ ID  : int  1015 1052 1062 1092 1130 1018 1029 1030 1588 1154 ...
 $ Name: chr  "Hora Hora School" "Morningside School" "Onerahi School" "Raurimu Avenue School" ...
 $ City: Factor w/ 541 levels "Ahaura","Ahipara",..: 533 533 533 533 533 533 533 533 533 533 ...
 $ Auth: Factor w/ 4 levels "Other","Private",..: 3 3 3 3 3 3 3 3 4 3 ...
 $ Dec : int  2 3 4 2 4 8 5 5 6 1 ...
 $ Roll: int  318 200 455 86 577 329 637 395 438 201 ...

dim(dta)

[1] 2571    6

Binning

Binning - 1

Group Roll into 2 groups (create a binary variable to label this).

dta$Size <- ifelse(dta$Roll > median(dta$Roll), "Large", "Small")
dta$Size <- NULL  # clean this variable
head(dta) # check the top 6 rows

Binning - 2

Group Roll into 3 groups. Find the freq. table of the grouping variable

dta$Size <- cut(dta$Roll, 3, labels=c("Small", "Mediam", "Large"))
table(dta$Size)


 Small Mediam  Large 
  2555     15      1

Sorting

1. Create a new variable `RollOrd` containing the decreasing order of `Roll`.

dta$RollOrd <- order(dta$Roll, decreasing=T)

2. Show the data with 6 largest `Roll`.

head(dta[dta$RollOrd, ])

3. Show the data with 6 smallest `Roll`.

tail(dta[dta$RollOrd, ])

4. Show the top 6 rows of the dataset in the decreasing order of `City` and `Roll`.

City is a factorial variable with character type, whose first letter would be used in the alphabet orber when sorting.

head(dta[order(dta$City, dta$Roll, decreasing=T), ])

5. Show the last 6 rows of the dataset in the decreasing order of `City` and `Roll`.

tail(dta[order(dta$City, dta$Roll, decreasing=T), ])

Counting

Find the freq. table of variable `Auth`.

table(dta$Auth)


           Other          Private            State State Integrated 
               1               99             2144              327

Save the freq. table of variable and show it.

authtbl <- table(dta$Auth); authtbl


           Other          Private            State State Integrated 
               1               99             2144              327

Find the type of `authtbl`.

class(authtbl)

[1] "table"

Show the dataset whose `Auth` is 'Other'.

dta[dta$Auth == "Other", ]

Show the contingency table of `Auth` and `Dec`.

xtabs(~ Auth + Dec, data=dta)

                  Dec
Auth                 1   2   3   4   5   6   7   8   9  10
  Other              1   0   0   0   0   0   0   0   0   0
  Private            0   0   2   6   2   2   6  11  12  38
  State            259 230 208 219 214 215 188 200 205 205
  State Integrated  12  22  35  28  38  34  45  45  37  31

Aggregating

Compute the mean of `Roll{dta}`.

mean(dta$Roll)

[1] 295.4737

Compute the `Roll` mean of rows whose `Auth` is 'Private'.

mean(dta$Roll[dta$Auth == "Private"])

[1] 308.798

Compute `Roll` mean for each group of `Auth`.

aggregate(dta["Roll"], by=list(dta$Auth), FUN=mean)

Group `Dec` into two groups. Create a binary variable `Rich` to label this.

[Note] Since the original command dta$Rich would show too many lines, I turn it into table(dta$Rich) for a tidier output.

dta$Rich <- dta$Dec > 5; table(dta$Rich)


FALSE  TRUE 
 1276  1274

Compute `Roll` mean for each cross group of `Auth` and `Rich`. The output is a data frame with the long format.

aggregate(dta["Roll"], by=list(dta$Auth, dta$Rich), FUN=mean)

Find the range of `Roll` (e.g., MIN and MAX) for each group of `Auth`. The output is a by format.

by(dta["Roll"], INDICES=list(dta$Auth), FUN=range)

: Other
[1] 51 51
------------------------------------------------------------ 
: Private
[1]    7 1663
------------------------------------------------------------ 
: State
[1]    5 5546
------------------------------------------------------------ 
: State Integrated
[1]   18 1475

In-class exercise 5.

Go through the script in the NCEA 2007 example and provide comments to each code chunk indicated by '##'. Give alternative code to perform the same calculation where appropriate.

Students’ learning in secondary schools are measured by the National Certificates of Educational Achievement (NCEA) in New Zealand. Students usually try to attain NCEA Level 1 in their third year of secondary schooling, Level 2 in their fourth year, and Level 3 in their fifth and final year of secondary school. The percentage of students who achieved each NCEA level is reported annually for all New Zealand secondary schools. The data set contains NCEA achievement percentages for 2007.

Column 1: School name
Column 2: Achievement percentages for Level 1
Column 3: Achievement percentages for Level 2
Column 4: Achievement percentages for Level 3

Load and check

Check the structure and the dimension of the data set. Show the first 6 rows of the dataset.

dta2 <- read.table("../data/NCEA2007.txt", sep=":", quote="", h=T, as.is=T)
dim(dta2)

[1] 88  4

str(dta2)

'data.frame':   88 obs. of  4 variables:
 $ Name  : chr  "Al-Madinah School" "Alfriston College" "Ambury Park Centre for Riding Therapy" "Aorere College" ...
 $ Level1: num  61.5 53.9 33.3 39.5 71.2 22.1 50.8 57.3 89.3 59.8 ...
 $ Level2: num  75 44.1 20 50.2 78.9 30.8 34.8 49.8 89.7 65.7 ...
 $ Level3: num  0 0 0 30.6 55.5 26.3 48.9 44.6 88.6 50.4 ...

head(dta2)

Apply

Apply for one dimension of an array.

Compute the column means (group means). The output is a named vector.

apply(dta2[, -1], MARGIN=2, FUN=mean)

  Level1   Level2   Level3 
62.26705 61.06818 47.97614

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a matrix.

apply(dta2[, -1], MARGIN=2, FUN=range)

     Level1 Level2 Level3
[1,]    2.8    0.0    0.0
[2,]   97.4   95.7   95.7

List apply

Compute the group means. The output is a list.

lapply(dta2[, -1], FUN=mean)

$Level1
[1] 62.26705

$Level2
[1] 61.06818

$Level3
[1] 47.97614

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a list.

lapply(dta2[, -1], FUN=range)

$Level1
[1]  2.8 97.4

$Level2
[1]  0.0 95.7

$Level3
[1]  0.0 95.7

Simplify the list apply

Compute the group means. The output is a named vector.

sapply(dta2[, -1], FUN=mean)

  Level1   Level2   Level3 
62.26705 61.06818 47.97614

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a matrix.

sapply(dta2[, -1], FUN=range)

     Level1 Level2 Level3
[1,]    2.8    0.0    0.0
[2,]   97.4   95.7   95.7

Splitting

Split the variable `dta$Roll` by the index variable `dta$Auth`.

The splited variable is a integer vector. The index variable have 4 levels. The splitting output is a list with 4 elements, which are integer vectors.

rollsByAuth <- split(dta$Roll, dta$Auth)
class(rollsByAuth)  # Find the type of the output

[1] "list"

str(rollsByAuth)    # Find the structure of the output

List of 4
 $ Other           : int 51
 $ Private         : int [1:99] 255 39 154 73 83 25 95 85 94 729 ...
 $ State           : int [1:2144] 318 200 455 86 577 329 637 395 201 267 ...
 $ State Integrated: int [1:327] 438 26 191 560 151 114 126 171 211 57 ...

Compute `Roll` mean for each group (level) of `Auth`. The output is a list with 4 elements.

lapply(rollsByAuth , mean)

$Other
[1] 51

$Private
[1] 308.798

$State
[1] 300.6301

$`State Integrated`
[1] 258.3792

Compute `Roll` mean for each group (level) of `Auth`. The output is a named vector with 4 elements.

sapply(rollsByAuth, mean)

           Other          Private            State State Integrated 
         51.0000         308.7980         300.6301         258.3792

DataM: In-class Exercise 0427: Function

DataM: In-class Exercise 0427: Function

In-class exercise 1.

Load in the dataset and check its structure

Display slopes for each chick in the descending order.

Data visualization (still in the descending order of the slopes)

In-class exercise 2.

In-class exercise 3.

Method 1

Method 2

Method 3

Method 4

Method 5

In-class exercise 4.

Load in the data set.

Check the structure and the dimension of the data set.

Binning

Binning - 1

Binning - 2

Sorting

1. Create a new variable RollOrd containing the decreasing order of Roll.

2. Show the data with 6 largest Roll.

3. Show the data with 6 smallest Roll.

4. Show the top 6 rows of the dataset in the decreasing order of City and Roll.

5. Show the last 6 rows of the dataset in the decreasing order of City and Roll.

Counting

Find the freq. table of variable Auth.

Save the freq. table of variable and show it.

Find the type of authtbl.

Show the dataset whose Auth is 'Other'.

Show the contingency table of Auth and Dec.

Aggregating

Compute the mean of Roll{dta}.

Compute the Roll mean of rows whose Auth is 'Private'.

Compute Roll mean for each group of Auth.

Group Dec into two groups. Create a binary variable Rich to label this.

Compute Roll mean for each cross group of Auth and Rich. The output is a data frame with the long format.

Find the range of Roll (e.g., MIN and MAX) for each group of Auth. The output is a by format.

In-class exercise 5.

Load and check

Apply

Apply for one dimension of an array.

Compute the column means (group means). The output is a named vector.

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a matrix.

List apply

Compute the group means. The output is a list.

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a list.

Simplify the list apply

Compute the group means. The output is a named vector.

Find the range for each group (e.g., the output contains two values, MIN and MAX). The output is a matrix.

Splitting

Split the variable dta$Roll by the index variable dta$Auth.

Compute Roll mean for each group (level) of Auth. The output is a list with 4 elements.

Compute Roll mean for each group (level) of Auth. The output is a named vector with 4 elements.

1. Create a new variable `RollOrd` containing the decreasing order of `Roll`.

2. Show the data with 6 largest `Roll`.

3. Show the data with 6 smallest `Roll`.

4. Show the top 6 rows of the dataset in the decreasing order of `City` and `Roll`.

5. Show the last 6 rows of the dataset in the decreasing order of `City` and `Roll`.

Find the freq. table of variable `Auth`.

Find the type of `authtbl`.

Show the dataset whose `Auth` is 'Other'.

Show the contingency table of `Auth` and `Dec`.

Compute the mean of `Roll{dta}`.

Compute the `Roll` mean of rows whose `Auth` is 'Private'.

Compute `Roll` mean for each group of `Auth`.

Group `Dec` into two groups. Create a binary variable `Rich` to label this.

Compute `Roll` mean for each cross group of `Auth` and `Rich`. The output is a data frame with the long format.

Find the range of `Roll` (e.g., MIN and MAX) for each group of `Auth`. The output is a by format.

Split the variable `dta$Roll` by the index variable `dta$Auth`.

Compute `Roll` mean for each group (level) of `Auth`. The output is a list with 4 elements.

Compute `Roll` mean for each group (level) of `Auth`. The output is a named vector with 4 elements.