DataM: In-class Exercise 0427: Function
In-class exercise 1.
Split the ChickWeight{datasets} data by individual chicks to extract separate slope estimates of regressing weight onto Time for each chick.
[Solutions and answers]
Load in the dataset and check its structure
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
$ weight: num 42 51 59 64 76 93 106 125 149 171 ...
$ Time : num 0 2 4 6 8 10 12 14 16 18 ...
$ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
$ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "outer")=Class 'formula' language ~Diet
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Time"
..$ y: chr "Body weight"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(gm)"
     weight           Time           Chick     Diet
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120
 Median :103.0   Median :10.00   20     : 12   3:120
 Mean   :121.8   Mean   :10.72   10     : 12   4:118
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12
 Max.   :373.0   Max.   :21.00   19     : 12
                                 (Other):506
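The slope extraction asked for in the exercise can be sketched as follows. This is a minimal sketch, assuming a simple linear model `weight ~ Time` per chick; the object names `by_chick` and `slopes` are illustrative and not from the original script.

```r
# ChickWeight ships with the datasets package, which is attached by default.
# Split the data frame into a list with one data frame per chick.
by_chick <- split(ChickWeight, ChickWeight$Chick)

# Fit weight ~ Time separately for each chick and keep only the slope.
slopes <- sapply(by_chick, function(d) coef(lm(weight ~ Time, data = d))[["Time"]])

head(slopes)
```

Because `Chick` is a factor with 50 levels, `slopes` is a named numeric vector with one slope per chick.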
In-class exercise 2.
Explain what this statement does:
[Solutions and answers]
Split the statement into several steps:
- Find the packages attached in the current session. The result is a character vector.
[1] ".GlobalEnv" "package:forcats" "package:stringr"
[4] "package:purrr" "package:readr" "package:tibble"
[7] "package:ggplot2" "package:tidyverse" "package:tidyr"
[10] "package:dplyr" "package:stats" "package:graphics"
[13] "package:grDevices" "package:utils" "package:datasets"
[16] "package:methods" "Autoloads" "package:base"
[1] TRUE
- Run the ls (list objects) command on each package attached in the current session. That is, list the functions, datasets, and other objects in each package. The output is too long to show here.
- Now we can see what the given statement does. It computes the length of each list, that is, the number of functions, datasets, and other objects in each package. The output still has too many lines to display easily, so I replace the outer lapply with sapply to get a vector instead of a list.
[1] 1 33 52 178 109 42 515 5 77 267 448 87 109 215 104
[16] 218 0 1229
We can also build a data.frame, again with the outer lapply replaced by sapply, to display the name of each package together with the length of its list:
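The steps above can be sketched as follows. The statement under discussion is not reproduced in this text, so the sketch assumes it has the form `lapply(lapply(search(), ls), length)`; the object names are illustrative.

```r
# Step 1: names of the attached environments and packages on the search path.
attached <- search()
is.vector(attached)

# Step 2: list the objects in each attached environment (a list of character vectors).
objects_per_env <- lapply(attached, ls)

# Step 3: count the objects; sapply() returns a vector where lapply() returns a list.
n_objects <- sapply(objects_per_env, length)

# Show each name next to its count.
data.frame(package = attached, n_objects = n_objects)
```

The counts depend on which packages are attached and on their versions, so they will differ between sessions.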
In-class exercise 3.
The following R script uses Cushings{MASS} to demonstrate several ways to achieve the same objective in R. Explain the advantages or disadvantages of each method.
Create a dataset with missing values to test.
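How `dta_test` was built is not shown here. One possible construction, given purely as an assumption, is to copy `Cushings` and blank out the group label of a single observation:

```r
library(MASS)  # provides the Cushings data

dta_test <- Cushings

# Hypothetical choice: set the Type of the fourth observation in group "b" to NA.
dta_test$Type[which(Cushings$Type == "b")[4]] <- NA

summary(dta_test$Type)
```

A missing group label is dropped silently by `split()` and `by()`. A missing measurement behaves differently: `mean()` returns `NA` for that group unless `na.rm = TRUE` is passed.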
Method 1
- Advantages:
- The command is straightforward.
- It can deal with missing values automatically.
- The output is a data frame, which is easy to manipulate with dplyr and tidyr and to display tidily as a knitr table.
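The Method 1 command itself does not appear in this text. A minimal sketch, assuming it uses the formula interface of `aggregate()`:

```r
library(MASS)  # provides the Cushings data

# Mean of every remaining column within each level of Type.
res <- aggregate(. ~ Type, data = Cushings, FUN = mean)
res
```

With the formula interface, rows containing `NA` are removed before aggregation because the default `na.action` is `na.omit`.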
Method 2
                           a    b     c        u
Tetrahydrocortisone 2.966667 8.18 19.72 14.01667
Pregnanetriol       2.440000 1.12  5.50  1.20000

# Test with missing values
sapply(split(dta_test[,-3], dta_test$Type), function(x) apply(x, 2, mean))
                           a        b     c        u
Tetrahydrocortisone 2.966667 8.222222 19.72 14.01667
Pregnanetriol       2.440000 1.111111  5.50  1.20000

# Other comments
lapply(split(Cushings[,-3], Cushings$Type), function(x) apply(x, 2, mean)) %>%
  sapply(., sd)
         a          b          c          u
 0.3724096  4.9921739 10.0550584  9.0627519
- Advantages:
- It can deal with missing values automatically.
- The output is a matrix, so matrix operations can be applied to it directly.
- Disadvantages:
- Compared to method 1, this command is less straightforward. We need to refer to the index variable together with its dataset name instead of by its own name only (e.g., Cushings$Type vs. Type). We also have to exclude the index variable when splitting the dataset.
- The function(x) apply(x, 2, mean) part can be simplified to colMeans.
- Other comments:
- If we use lapply instead of sapply, the output is a list, so further list operations are possible.
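The `colMeans` simplification and the `lapply` variant mentioned above can be sketched as follows; the object names are illustrative.

```r
library(MASS)  # provides the Cushings data

# Split the two measurement columns by Type (column 3 is the index variable).
grp <- split(Cushings[, -3], Cushings$Type)

# Original form and the shorter colMeans form.
m_apply    <- sapply(grp, function(x) apply(x, 2, mean))
m_colmeans <- sapply(grp, colMeans)

# The two matrices agree up to floating-point tolerance.
all.equal(m_apply, m_colmeans)

# With lapply the result stays a list, one named vector per group.
means_list <- lapply(grp, colMeans)
str(means_list)
```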
Method 3
do.call("rbind", as.list( # bind the list by rows
  by(Cushings, list(Cushings$Type), function(x) {
    y <- subset(x, select = -Type) # create a subset without the index variable
    apply(y, 2, mean) # compute the column means
  }
)))
  Tetrahydrocortisone Pregnanetriol
a            2.966667          2.44
b            8.180000          1.12
c           19.720000          5.50
u           14.016667          1.20
# Test for missing values
do.call("rbind", as.list( # bind the list by rows
  by(dta_test, list(dta_test$Type), function(x) {
    y <- subset(x, select = -Type) # create a subset without the index variable
    apply(y, 2, mean) # compute the column means
  }
)))
  Tetrahydrocortisone Pregnanetriol
a            2.966667      2.440000
b            8.222222      1.111111
c           19.720000      5.500000
u           14.016667      1.200000
- Advantages:
- It can deal with missing values automatically.
- The output is a matrix, so matrix operations can be applied to it directly.
- Disadvantage:
- This command is not straightforward: it nests three function calls and needs an anonymous function to drop the index variable.
Method 4
Cushings %>%
  group_by(Type) %>%
  summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))
# Test for missing values
dta_test %>%
  group_by(Type) %>%
  summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))
- Advantages:
- The code with dplyr commands is very straightforward.
- The names of the new variables can be specified when they are created.
- It can deal with missing values automatically.
- The output is a data table, which is easy to manipulate with dplyr and tidyr and to display tidily as a knitr table.
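As a variation on Method 4, the two means can be computed without naming each column. This is a minimal sketch, assuming dplyr 1.0.0 or later, where `across()` is available; it is not part of the original script.

```r
library(MASS)   # provides the Cushings data
library(dplyr)  # attached after MASS so that dplyr's verbs are not masked

# Apply mean() to every non-grouping column within each Type.
res <- Cushings %>%
  group_by(Type) %>%
  summarise(across(everything(), mean))
res
```

The trade-off is that the result columns keep their original names instead of custom ones such as t_m and p_m.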
Method 5
Cushings %>%
  nest(-Type) %>%
  mutate(avg = map(data, ~ apply(., 2, mean)),
         res_1 = map_dbl(avg, "Tetrahydrocortisone"),
         res_2 = map_dbl(avg, "Pregnanetriol"))
# Test for missing values
dta_test %>%
  nest(-Type) %>%
  mutate(avg = map(data, ~ apply(., 2, mean)),
         res_1 = map_dbl(avg, "Tetrahydrocortisone"),
         res_2 = map_dbl(avg, "Pregnanetriol"))
- Advantages:
- The code with dplyr and tidyr commands is very straightforward.
- The names of the new variables can be specified when they are created.
- The output is a data table, which is easy to manipulate with dplyr and tidyr and to display tidily as a knitr table.
- Compared to method 4, the output of method 5 additionally shows the data type of each group of the index variable.
- Disadvantages:
- The code is not as straightforward as method 4.
- Conflicts may occur when dplyr and tidyr commands are used together.
In-class exercise 4.
Go through the script in the NZ schools example and comment on each code chunk marked by '##'. Give alternative code that performs the same calculation where appropriate.
The New Zealand Ministry of Education provides basic information for all primary and secondary schools in the country.
Source: Ministry of education - New Zealand
- Column 1: School ID
- Column 2: School name
- Column 3: City where the school is located
- Column 4: The authority of the school
- Column 5: The socio-economic status of the families of the school's students
- Column 6: The number of students enrolled at the school as of July 2007
Load in the data set.
Check the structure and the dimension of the data set.
'data.frame': 2571 obs. of 6 variables:
$ ID : int 1015 1052 1062 1092 1130 1018 1029 1030 1588 1154 ...
$ Name: chr "Hora Hora School" "Morningside School" "Onerahi School" "Raurimu Avenue School" ...
$ City: Factor w/ 541 levels "Ahaura","Ahipara",..: 533 533 533 533 533 533 533 533 533 533 ...
$ Auth: Factor w/ 4 levels "Other","Private",..: 3 3 3 3 3 3 3 3 4 3 ...
$ Dec : int 2 3 4 2 4 8 5 5 6 1 ...
$ Roll: int 318 200 455 86 577 329 637 395 438 201 ...
[1] 2571 6
Binning
Sorting
1. Create a new variable RollOrd containing the decreasing order of Roll.
4. Show the top 6 rows of the dataset in the decreasing order of City and Roll.
City is a factor with character labels, so sorting by City follows the alphabetical order of those labels.
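The two sorting steps described above can be sketched with `order()`. The schools file is not available here, so the sketch uses a small toy data frame with made-up values; only the column names `City`, `Roll`, and `RollOrd` come from the text.

```r
# Toy stand-in for the schools data (values are made up for illustration).
dta <- data.frame(
  City = factor(c("Nelson", "Auckland", "Wellington", "Auckland", "Nelson")),
  Roll = c(577L, 318L, 86L, 455L, 64L)
)

# order() returns the row indices that sort Roll from largest to smallest.
dta$RollOrd <- order(dta$Roll, decreasing = TRUE)

# Sort by City, then by Roll within City, both in decreasing order.
sorted <- dta[order(dta$City, dta$Roll, decreasing = TRUE), ]
head(sorted)
```

Note that `RollOrd` holds row positions, not ranks: its first element is the row number of the largest school.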
Counting
Compute the frequency table of the variable Auth.
           Other          Private            State State Integrated
               1               99             2144              327
Save the frequency table of Auth as an object and display it.
           Other          Private            State State Integrated
               1               99             2144              327
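Counting and saving the table can be sketched as below, again on a made-up toy data frame; `auth_tab` is an illustrative name.

```r
# Toy stand-in for the schools data (values are made up for illustration).
dta <- data.frame(
  Auth = factor(c("State", "Private", "State", "State Integrated", "State"))
)

# Print the frequency table directly.
table(dta$Auth)

# Save it first, then print it; the saved table can be indexed by level name.
auth_tab <- table(dta$Auth)
auth_tab
auth_tab["State"]

# Alternative with the formula interface: xtabs(~ Auth, data = dta)
```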
Aggregating
Compute the mean of Roll for the rows whose Auth is 'Private'.
[1] 308.798
Split Dec into two groups and create a binary variable Rich to label them.
[Note] Printing dta$Rich as in the original command would take too many lines, so I use table(dta$Rich) for a tidier output.
FALSE TRUE
1276 1274
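These two steps can be sketched on toy data. The cut-off used to define `Rich` is not stated in the text, so `Dec > 5` below is an assumption, and all values are made up.

```r
# Toy stand-in for the schools data (values are made up for illustration).
dta <- data.frame(
  Auth = factor(c("State", "Private", "State", "Private", "State Integrated", "State")),
  Dec  = c(2L, 9L, 5L, 10L, 7L, 4L),
  Roll = c(318L, 120L, 455L, 64L, 86L, 577L)
)

# Mean Roll of the private schools via logical subsetting.
private_mean <- mean(dta$Roll[dta$Auth == "Private"])
# Alternative: with(dta, mean(Roll[Auth == "Private"]))

# Binary label; the threshold 5 is an assumed cut-off, not taken from the source.
dta$Rich <- dta$Dec > 5
table(dta$Rich)
```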
Compute the mean of Roll for each combination of Auth and Rich. The output is a data frame in long format.
Find the range of Roll (i.e., the minimum and maximum) for each group of Auth. The output is an object of class by.
: Other
[1] 51 51
------------------------------------------------------------
: Private
[1] 7 1663
------------------------------------------------------------
: State
[1] 5 5546
------------------------------------------------------------
: State Integrated
[1] 18 1475
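Both aggregations can be sketched with `aggregate()` and `by()`. The data frame is a made-up toy example, the `Rich` cut-off is an assumption, and the original script may call `aggregate()` with a `by =` list rather than a formula.

```r
# Toy stand-in for the schools data (values are made up for illustration).
dta <- data.frame(
  Auth = factor(c("State", "Private", "State", "Private", "State Integrated", "State")),
  Dec  = c(2L, 9L, 5L, 10L, 7L, 4L),
  Roll = c(318L, 120L, 455L, 64L, 86L, 577L)
)
dta$Rich <- dta$Dec > 5  # assumed cut-off

# One row per observed Auth x Rich combination (long format).
roll_means <- aggregate(Roll ~ Auth + Rich, data = dta, FUN = mean)
roll_means

# Minimum and maximum of Roll within each level of Auth; returns a "by" object.
roll_range <- by(dta$Roll, dta$Auth, range)
roll_range

# Alternative that returns a plain list-like array: tapply(dta$Roll, dta$Auth, range)
```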
In-class exercise 5.
Go through the script in the NCEA 2007 example and comment on each code chunk marked by '##'. Give alternative code that performs the same calculation where appropriate.
Students’ learning in secondary schools is measured by the National Certificates of Educational Achievement (NCEA) in New Zealand. Students usually try to attain NCEA Level 1 in their third year of secondary schooling, Level 2 in their fourth year, and Level 3 in their fifth and final year of secondary school. The percentage of students who achieved each NCEA level is reported annually for all New Zealand secondary schools. The data set contains NCEA achievement percentages for 2007.
- Column 1: School name
- Column 2: Achievement percentages for Level 1
- Column 3: Achievement percentages for Level 2
- Column 4: Achievement percentages for Level 3
Load and check
Check the structure and the dimension of the data set. Show the first 6 rows of the dataset.
[1] 88 4
'data.frame': 88 obs. of 4 variables:
$ Name : chr "Al-Madinah School" "Alfriston College" "Ambury Park Centre for Riding Therapy" "Aorere College" ...
$ Level1: num 61.5 53.9 33.3 39.5 71.2 22.1 50.8 57.3 89.3 59.8 ...
$ Level2: num 75 44.1 20 50.2 78.9 30.8 34.8 49.8 89.7 65.7 ...
$ Level3: num 0 0 0 30.6 55.5 26.3 48.9 44.6 88.6 50.4 ...
Apply
Apply a function over one dimension of an array.
Compute the column means (group means). The output is a named vector.
Level1 Level2 Level3
62.26705 61.06818 47.97614
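A minimal sketch of this step, using a made-up three-school data frame named `ncea` (a hypothetical name; the real data set is not available here):

```r
# Toy stand-in for the NCEA data (values are made up for illustration).
ncea <- data.frame(
  Name   = c("School A", "School B", "School C"),
  Level1 = c(60, 50, 70),
  Level2 = c(75, 45, 60),
  Level3 = c(0, 30, 60)
)

# MARGIN = 2 applies mean() to each column; the Name column is excluded.
level_means <- apply(ncea[, 2:4], 2, mean)
level_means

# Alternative for this particular case: colMeans(ncea[, 2:4])
```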
List apply
Compute the group means. The output is a list.
$Level1
[1] 62.26705
$Level2
[1] 61.06818
$Level3
[1] 47.97614
Simplify the list apply
Compute the group means. The output is a named vector.
Level1 Level2 Level3
62.26705 61.06818 47.97614
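The difference between the list and the simplified result can be sketched as follows, on a made-up data frame with the hypothetical name `ncea`:

```r
# Toy stand-in for the NCEA data (values are made up for illustration).
ncea <- data.frame(
  Name   = c("School A", "School B", "School C"),
  Level1 = c(60, 50, 70),
  Level2 = c(75, 45, 60),
  Level3 = c(0, 30, 60)
)

# A data frame is a list of columns, so lapply() visits one column at a time.
means_list <- lapply(ncea[2:4], mean)  # list with one element per level
means_vec  <- sapply(ncea[2:4], mean)  # simplified to a named numeric vector

str(means_list)
means_vec

# Stricter alternative that fixes the result type: vapply(ncea[2:4], mean, numeric(1))
```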
Splitting
Split the variable dta$Roll by the index variable dta$Auth.
The split variable is an integer vector and the index variable has 4 levels, so the output is a list of 4 elements, each an integer vector.
[1] "list"
List of 4
$ Other : int 51
$ Private : int [1:99] 255 39 154 73 83 25 95 85 94 729 ...
$ State : int [1:2144] 318 200 455 86 577 329 637 395 201 267 ...
$ State Integrated: int [1:327] 438 26 191 560 151 114 126 171 211 57 ...
Compute the mean of Roll for each group (level) of Auth. The output is a list with 4 elements.
$Other
[1] 51
$Private
[1] 308.798
$State
[1] 300.6301
$`State Integrated`
[1] 258.3792
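The split-then-apply pattern above can be sketched on a made-up toy data frame; `rolls` and `roll_means` are illustrative names.

```r
# Toy stand-in for the schools data (values are made up for illustration).
dta <- data.frame(
  Auth = factor(c("State", "Private", "State", "Private", "State Integrated", "State")),
  Roll = c(318L, 120L, 455L, 64L, 86L, 577L)
)

# One integer vector of Roll values per level of Auth.
rolls <- split(dta$Roll, dta$Auth)
class(rolls)
str(rolls)

# Mean of each piece; lapply() keeps the list structure.
roll_means <- lapply(rolls, mean)
roll_means

# Alternatives: sapply(rolls, mean) for a named vector,
# or tapply(dta$Roll, dta$Auth, mean) to split and apply in one call.
```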