Functions exercise 1-5
Exercise 1
Split the ChickWeight{datasets} data by individual chicks to extract separate slope estimates of regressing weight onto Time for each chick.
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"
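# Note: lm(x$Time ~ x$weight) regresses Time on weight; weight on Time would be lm(weight ~ Time, data = x).
df1.r <- df1 %>% split(., .$Chick) %>%
  lapply(., function(x) coef(lm(x$Time ~ x$weight))) %>%
  do.call(rbind, .) # bind the list by row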
head(df1.r)
## (Intercept) x$weight
## 18 19.500000 -0.5000000
## 16 -27.825371 0.6803954
## 15 -18.408015 0.4225865
## 13 -17.631950 0.4208641
## 9 -14.574242 0.3140564
## 20 -9.888271 0.2653127
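The exercise asks for weight regressed on Time, whereas the call above fits Time on weight. A minimal sketch of the requested fit, assuming only the built-in ChickWeight data and base R (the names cw, by_chick and slopes are introduced here for illustration and are not part of the original script):

```r
# Slope of weight ~ Time for each chick, using base R only.
# cw, by_chick and slopes are hypothetical names.
cw <- as.data.frame(ChickWeight)   # drop the groupedData classes
by_chick <- split(cw, cw$Chick)    # one data frame per chick
slopes <- sapply(by_chick, function(d) coef(lm(weight ~ Time, data = d))[["Time"]])
head(slopes)
```

Each slope is then the estimated weight gain in grams per day for one chick.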
Exercise 2
Explain what this statement does: “lapply(lapply(search(), ls), length)”
## [[1]]
## [1] 2
##
## [[2]]
## [1] 165
##
## [[3]]
## [1] 267
##
## [[4]]
## [1] 448
##
## [[5]]
## [1] 87
##
## [[6]]
## [1] 110
##
## [[7]]
## [1] 245
##
## [[8]]
## [1] 104
##
## [[9]]
## [1] 218
##
## [[10]]
## [1] 0
##
## [[11]]
## [1] 1232
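search(): gives a character vector naming the attached packages (see library) and attached R objects, usually data frames.
ls: lists the names of the objects in an environment.
lapply(lapply(search(), ls), length): the inner lapply runs ls on each entry of the search path and the outer lapply takes the length of each result, so the statement returns a list with the number of objects in each attached package or R object.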
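The list output is hard to read because its elements are unnamed. A small sketch, assuming nothing beyond base R (counts and named_counts are illustrative names), that keeps the original statement and adds a named variant:

```r
# Number of objects visible in each entry of the search path.
counts <- lapply(lapply(search(), ls), length)

# The same idea as a named integer vector, one entry per search() element.
named_counts <- sapply(search(), function(env_name) length(ls(env_name)))
named_counts
```

The names make it clear which count belongs to which package; the exact numbers depend on what is attached in the session.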
Exercise 3
The following R script uses Cushings{MASS} to demonstrate several ways to achieve the same objective in R. Explain the advantages or disadvantages of each method. The following syntax aims to get the mean of Tetrahydrocortisone and Pregnanetriol for each Type in the Cushings dataset.
Method 1. using aggregate.
Base R; easy to understand and learn. It returns the means of the two variables in Cushings for each Type, one row per type, and the result is already a data frame, which makes it convenient for further exploratory analysis with ggplot. Result: 4 rows, 3 columns.
## 'data.frame': 4 obs. of 3 variables:
## $ Type : Factor w/ 4 levels "a","b","c","u": 1 2 3 4
## $ Tetrahydrocortisone: num 2.97 8.18 19.72 14.02
## $ Pregnanetriol : num 2.44 1.12 5.5 1.2
## [1] 4 3
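The call that produced this output is not shown. One base R call consistent with the 4 x 3 data frame above, offered as an assumption rather than as the original code (m1 is a hypothetical name):

```r
library(MASS)  # provides the Cushings data

# Formula interface: '.' stands for every column other than Type.
m1 <- aggregate(. ~ Type, data = Cushings, FUN = mean)
str(m1)
dim(m1)
```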
Method 2. using split and the apply family of functions.
Split the data by Type, apply the mean to each column of every piece, and let sapply simplify the result to a matrix. It is easy to understand, and any user-defined function can be passed to sapply. Result: 2 rows, 4 columns.
## num [1:2, 1:4] 2.97 2.44 8.18 1.12 19.72 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2] "Tetrahydrocortisone" "Pregnanetriol"
## ..$ : chr [1:4] "a" "b" "c" "u"
## [1] 2 4
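The call behind this output is also not shown. A sketch that yields a 2 x 4 matrix with the same dimnames, given as an assumption about what the script may have looked like (m2 is a hypothetical name):

```r
library(MASS)  # provides the Cushings data

# Split the two numeric columns by Type, then take column means within each piece.
num_cols <- Cushings[, c("Tetrahydrocortisone", "Pregnanetriol")]
m2 <- sapply(split(num_cols, Cushings$Type), function(x) apply(x, 2, mean))
str(m2)
dim(m2)
```

Transposing with t(m2) would give the same 4 x 2 layout as Method 3.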
Method 3. using by, subset and rbind functions.
by applies the function to each Type subset, and do.call with rbind binds the resulting list by row. It is similar to Method 2 but considerably more complex. Result: 4 rows, 2 columns.
# method 3
do.call("rbind", as.list(
  by(Cushings, list(Cushings$Type), function(x) {
    y <- subset(x, select = -Type)
    apply(y, 2, mean)
  }
)))
## Tetrahydrocortisone Pregnanetriol
## a 2.966667 2.44
## b 8.180000 1.12
## c 19.720000 5.50
## u 14.016667 1.20
Method 4. using forward pipe and dplyr package
First group_by Type, then summarize the means of the two other variables as t_m and p_m.
This method is more readable than Methods 2, 3 and 5. The result is a tibble (classes tbl_df, tbl and data.frame). Result: 4 rows, 3 columns.
# method 4
Cushings %>% group_by(Type) %>%
  summarize(t_m = mean(Tetrahydrocortisone), p_m = mean(Pregnanetriol))
## # A tibble: 4 x 3
## Type t_m p_m
## <fct> <dbl> <dbl>
## 1 a 2.97 2.44
## 2 b 8.18 1.12
## 3 c 19.7 5.5
## 4 u 14.0 1.2
Method 5. using forward pipe and nest, map function
nest creates a data frame with a list-column that holds the nested variables for each Type. map applies a function to each element and returns a list of the same length as its input; map_dbl returns a double vector instead. The result has classes ‘tbl_df’, ‘tbl’ and ‘data.frame’, with 4 rows and 5 columns.
*avg is a list holding the means of both variables for each group. res_1 is the mean of Tetrahydrocortisone in each group and res_2 is the mean of Pregnanetriol in each group, each stored as a vector.
# method 5
Cushings %>% nest(-Type) %>%
  mutate(avg = map(data, ~ apply(., 2, mean)),
         res_1 = map_dbl(avg, "Tetrahydrocortisone"),
         res_2 = map_dbl(avg, "Pregnanetriol"))
## Warning: All elements of `...` must be named.
## Did you want `data = c(Tetrahydrocortisone, Pregnanetriol)`?
## # A tibble: 4 x 5
## Type data avg res_1 res_2
## <fct> <list> <list> <dbl> <dbl>
## 1 a <tibble [6 x 2]> <dbl [2]> 2.97 2.44
## 2 b <tibble [10 x 2]> <dbl [2]> 8.18 1.12
## 3 c <tibble [5 x 2]> <dbl [2]> 19.7 5.5
## 4 u <tibble [6 x 2]> <dbl [2]> 14.0 1.2
Conclusion
Readability: Method 1 = Method 4 > Method 2 > Method 3 = Method 5.
Exercise 4.
Go through the script in the NZ schools example and provide comments to each code chunk indicated by ‘##’. Give alternative code to perform the same calculation where appropriate.
#
# a case study
#
## read the data; as.is=2 keeps column 2 (Name) as character instead of converting it to a factor
dta <- read.csv("C:/Users/USER/Desktop/R_data management/0427/nzSchools.csv", as.is=2)
## check data structure
str(dta)
## 'data.frame': 2571 obs. of 6 variables:
## $ ID : int 1015 1052 1062 1092 1130 1018 1029 1030 1588 1154 ...
## $ Name: chr "Hora Hora School" "Morningside School" "Onerahi School" "Raurimu Avenue School" ...
## $ City: Factor w/ 541 levels "Ahaura","Ahipara",..: 533 533 533 533 533 533 533 533 533 533 ...
## $ Auth: Factor w/ 4 levels "Other","Private",..: 3 3 3 3 3 3 3 3 4 3 ...
## $ Dec : int 2 3 4 2 4 8 5 5 6 1 ...
## $ Roll: int 318 200 455 86 577 329 637 395 438 201 ...
## [1] 2571 6
## binning
## create Size: "Large" if Roll is above the median of Roll, otherwise "Small"
dta$Size <- ifelse(dta$Roll > median(dta$Roll), "Large", "Small")
## remove the Size column again by assigning NULL
dta$Size <- NULL
## show the first 6 rows of dta
head(dta)
## ID Name City Auth Dec Roll
## 1 1015 Hora Hora School Whangarei State 2 318
## 2 1052 Morningside School Whangarei State 3 200
## 3 1062 Onerahi School Whangarei State 4 455
## 4 1092 Raurimu Avenue School Whangarei State 2 86
## 5 1130 Whangarei School Whangarei State 4 577
## 6 1018 Hurupaki School Whangarei State 8 329
## use cut to divide the range of Roll into 3 equal-width intervals and label them
dta$Size <- cut(dta$Roll, 3, labels=c("Small", "Mediam", "Large"))
## count the schools in each category of Size
table(dta$Size) #----(1)
##
## Small Mediam Large
## 2555 15 1
## # A tibble: 3 x 2
## Size n
## <fct> <int>
## 1 Small 2555
## 2 Mediam 15
## 3 Large 1
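The very uneven counts follow from how cut works: given a single number of breaks, it divides the range of Roll into equal-width intervals, so a few very large schools stretch the range. A small sketch on made-up data (the vector roll and the quantile-based alternative are illustrative assumptions, not part of the original script):

```r
# Made-up, right-skewed enrolment figures.
roll <- c(5, 20, 35, 50, 80, 120, 200, 300, 450, 3000)

# Equal-width bins over the range 5..3000: nearly everything lands in "Small".
equal_width <- cut(roll, 3, labels = c("Small", "Medium", "Large"))
table(equal_width)

# Quantile-based breaks give bins with roughly equal counts instead.
brks <- quantile(roll, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(roll, breaks = brks, labels = c("Small", "Medium", "Large"),
                   include.lowest = TRUE)
table(equal_count)
```

Which binning is appropriate depends on whether "Large" should mean large in absolute terms or large relative to the other schools.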
## sorting
## create RollOrd: the row indices that sort Roll in decreasing order (the result of order, not each school's rank)
dta$RollOrd <- order(dta$Roll, decreasing=T)
## show the 6 schools with the largest Roll
head(dta[dta$RollOrd, ])
## ID Name City Auth Dec Roll Size RollOrd
## 1726 498 Correspondence School Wellington State NA 5546 Large 753
## 301 28 Rangitoto College Auckland State 10 3022 Mediam 353
## 376 78 Avondale College Auckland State 4 2613 Mediam 712
## 2307 319 Burnside High School Christchurch State 8 2588 Mediam 709
## 615 41 Macleans College Auckland State 10 2476 Mediam 1915
## 199 43 Massey High School Auckland State 5 2452 Mediam 1683
## ID Name City Auth Dec Roll Size
## 2401 1641 Amana Christian School Dunedin Private 9 7 Small
## 1590 2461 Tangimoana School Manawatu State 4 6 Small
## 1996 3598 Woodbank School Kaikoura State 4 6 Small
## 2112 3386 Jacobs River School Jacobs River State 5 6 Small
## 1514 2407 Ngamatapouri School Sth Taranaki District State 9 5 Small
## 1575 2420 Papanui Junction School Taihape State 5 5 Small
## RollOrd
## 2401 2562
## 1590 266
## 1996 2478
## 2112 1501
## 1514 2377
## 1575 1542
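The RollOrd values printed next to each school are not that school's rank: order returns a permutation of row indices, which is only meaningful as an index into dta. A small sketch on a made-up vector (x, ord and rnk are illustrative names) showing the difference between order and rank:

```r
x <- c(30, 10, 50, 20)

ord <- order(x, decreasing = TRUE)  # positions of the largest, 2nd largest, ... values
x[ord]                              # the values sorted in decreasing order

rnk <- rank(-x)                     # rank of each element, 1 = largest
```

Here ord is 3 1 4 2 (the largest value sits in position 3), while rnk is 2 4 1 3 (the first element is the second largest).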
## sort by City, then by Roll within City, both in decreasing order, and show the first 6 rows
head(dta[order(dta$City, dta$Roll, decreasing=T), ])
## ID Name City Auth Dec Roll Size RollOrd
## 2548 401 Menzies College Wyndham State 4 356 Small 859
## 2549 4054 Wyndham School Wyndham State 5 94 Small 1163
## 1611 2742 Woodville School Woodville State 3 147 Small 726
## 1630 2640 Papatawa School Woodville State 7 27 Small 2273
## 2041 3600 Woodend School Woodend State 9 375 Small 1401
## 1601 399 Central Southland College Winton State 7 549 Small 450
## show the last 6 rows of the same ordering
tail(dta[order(dta$City, dta$Roll, decreasing=T), ])
## ID Name City Auth Dec Roll Size RollOrd
## 2169 3273 Albury School Albury State 8 30 Small 1010
## 2018 350 Akaroa Area School Akaroa State 8 125 Small 1051
## 2023 3332 Duvauchelle School Akaroa State 9 41 Small 749
## 335 1200 Ahuroa School Ahuroa State 7 22 Small 193
## 99 1000 Ahipara School Ahipara State 3 241 Small 1963
## 2117 2105 Awahono School - Grey Valley Ahaura State 4 119 Small 364
##
## Other Private State State Integrated
## 1 99 2144 327
## # A tibble: 4 x 2
## Auth n
## <fct> <int>
## 1 Other 1
## 2 Private 99
## 3 State 2144
## 4 State Integrated 327
##
## Other Private State State Integrated
## 1 99 2144 327
## [1] "table"
## ID Name City Auth Dec Roll Size RollOrd
## 2315 518 Kingslea School Christchurch Other 1 51 Small 1579
## Dec
## Auth 1 2 3 4 5 6 7 8 9 10
## Other 1 0 0 0 0 0 0 0 0 0
## Private 0 0 2 6 2 2 6 11 12 38
## State 259 230 208 219 214 215 188 200 205 205
## State Integrated 12 22 35 28 38 34 45 45 37 31
##
## 1 2 3 4 5 6 7 8 9 10
## Other 1 0 0 0 0 0 0 0 0 0
## Private 0 0 2 6 2 2 6 11 12 38
## State 259 230 208 219 214 215 188 200 205 205
## State Integrated 12 22 35 28 38 34 45 45 37 31
## [1] 295.4737
## [1] 295.4737
## [1] 308.798
## mean
## 1 308.798
# average of roll in each category of Auth
aggregate(dta["Roll"], by=list(dta$Auth), FUN=mean) #--(6)
## Group.1 Roll
## 1 Other 51.0000
## 2 Private 308.7980
## 3 State 300.6301
## 4 State Integrated 258.3792
## # A tibble: 4 x 2
## Auth mu
## <fct> <dbl>
## 1 Other 51
## 2 Private 309.
## 3 State 301.
## 4 State Integrated 258.
## create Rich: TRUE if Dec > 5, otherwise FALSE (NA where Dec is missing)
dta$Rich <- dta$Dec > 5
# dta$Rich
head(dta$Rich)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
## mean of Roll within each combination of Auth and Rich
aggregate(dta["Roll"], by=list(dta$Auth, dta$Rich), FUN=mean) #--(7)
## Group.1 Group.2 Roll
## 1 Other FALSE 51.0000
## 2 Private FALSE 151.4000
## 3 State FALSE 261.7487
## 4 State Integrated FALSE 183.2370
## 5 Private TRUE 402.5362
## 6 State TRUE 338.8243
## 7 State Integrated TRUE 311.2135
## Rich
## Auth FALSE TRUE
## Other 51.0000
## Private 151.4000 402.5362
## State 261.7487 338.8243
## State Integrated 183.2370 311.2135
## find the minimum and maximum of Roll (its range) within each category of Auth
by(dta["Roll"], INDICES=list(dta$Auth), FUN=range) #--(8)
## : Other
## [1] 51 51
## ------------------------------------------------------------
## : Private
## [1] 7 1663
## ------------------------------------------------------------
## : State
## [1] 5 5546
## ------------------------------------------------------------
## : State Integrated
## [1] 18 1475
## Group.1 Roll.1 Roll.2
## 1 Other 51 51
## 2 Private 7 1663
## 3 State 5 5546
## 4 State Integrated 18 1475
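range returns the minimum and maximum, which is why each group shows two numbers. A sketch on made-up data (the data frame toy is hypothetical) comparing two base R ways of getting the same group-wise range:

```r
toy <- data.frame(Auth = c("Private", "State", "State", "Private", "State"),
                  Roll = c(7, 5, 5546, 1663, 300))

# tapply returns a list-like array with one length-2 vector per group.
r1 <- tapply(toy$Roll, toy$Auth, range)
r1[["State"]]

# aggregate returns a data frame whose Roll column is a two-column matrix.
r2 <- aggregate(Roll ~ Auth, data = toy, FUN = range)
r2
```

The matrix column is why the aggregate output prints as Roll.1 and Roll.2.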
Exercise 5
Go through the script in the NCEA 2007 example and provide comments to each code chunk indicated by ‘##’. Give alternative code to perform the same calculation where appropriate.
#
# a case study - II
#
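## read the colon-separated file with a header row; quote="" turns off quote handling and as.is=T keeps strings as character
dta2 <- read.table("C:/Users/USER/Desktop/R_data management/0427/NCEA2007.txt", sep=":", quote="", h=T, as.is=T)
## check data dimension
dim(dta2)
## [1] 88 4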
## 'data.frame': 88 obs. of 4 variables:
## $ Name : chr "Al-Madinah School" "Alfriston College" "Ambury Park Centre for Riding Therapy" "Aorere College" ...
## $ Level1: num 61.5 53.9 33.3 39.5 71.2 22.1 50.8 57.3 89.3 59.8 ...
## $ Level2: num 75 44.1 20 50.2 78.9 30.8 34.8 49.8 89.7 65.7 ...
## $ Level3: num 0 0 0 30.6 55.5 26.3 48.9 44.6 88.6 50.4 ...
## Name Level1 Level2 Level3
## 1 Al-Madinah School 61.5 75.0 0.0
## 2 Alfriston College 53.9 44.1 0.0
## 3 Ambury Park Centre for Riding Therapy 33.3 20.0 0.0
## 4 Aorere College 39.5 50.2 30.6
## 5 Auckland Girls' Grammar School 71.2 78.9 55.5
## 6 Auckland Grammar 22.1 30.8 26.3
## Level1 Level2 Level3
## 62.26705 61.06818 47.97614
## $Level1
## [1] 62.26705
##
## $Level2
## [1] 61.06818
##
## $Level3
## [1] 47.97614
## Level1 Level2 Level3
## 62.26705 61.06818 47.97614
## Level1 Level2 Level3
## [1,] 2.8 0.0 0.0
## [2,] 97.4 95.7 95.7
## $Level1
## [1] 2.8 97.4
##
## $Level2
## [1] 0.0 95.7
##
## $Level3
## [1] 0.0 95.7
## Level1 Level2 Level3
## [1,] 2.8 0.0 0.0
## [2,] 97.4 95.7 95.7
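The blocks of output above repeat the same numbers in different containers, which is consistent with calling apply, lapply and sapply in turn. A sketch with a made-up data frame (scores is hypothetical) showing how the three functions differ in what they return:

```r
scores <- data.frame(Level1 = c(60, 50, 40),
                     Level2 = c(70, 40, 20),
                     Level3 = c(0, 0, 30))

a <- apply(scores, 2, mean)   # named numeric vector (data frame coerced to a matrix)
l <- lapply(scores, mean)     # list with one element per column
s <- sapply(scores, mean)     # lapply result simplified to a named numeric vector

# With a function that returns two values, sapply simplifies to a 2-row matrix.
r <- sapply(scores, range)
```

lapply always returns a list, so it is the safest choice when the function may return results of different lengths.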
## splitting
## split Roll by Auth
rollsByAuth <- split(dta$Roll, dta$Auth)
## check data structure
str(rollsByAuth)
## List of 4
## $ Other : int 51
## $ Private : int [1:99] 255 39 154 73 83 25 95 85 94 729 ...
## $ State : int [1:2144] 318 200 455 86 577 329 637 395 201 267 ...
## $ State Integrated: int [1:327] 438 26 191 560 151 114 126 171 211 57 ...
## [1] "list"
## mean of Roll in each category of Auth, returned as a list
lapply(split(dta$Roll, dta$Auth), mean) #--(1)
## $Other
## [1] 51
##
## $Private
## [1] 308.798
##
## $State
## [1] 300.6301
##
## $`State Integrated`
## [1] 258.3792
## Group.1 Roll
## 1 Other 51.0000
## 2 Private 308.7980
## 3 State 300.6301
## 4 State Integrated 258.3792
## Other Private State State Integrated
## 51.0000 308.7980 300.6301 258.3792
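The last two blocks show the same group means as a data frame and as a named vector. A sketch on made-up data (toy and the as_* names are hypothetical) of several base R routes to these forms:

```r
toy <- data.frame(Auth = c("Private", "State", "State", "Private"),
                  Roll = c(100, 200, 400, 300))

as_list   <- lapply(split(toy$Roll, toy$Auth), mean)          # list
as_vector <- sapply(split(toy$Roll, toy$Auth), mean)          # named numeric vector
as_array  <- tapply(toy$Roll, toy$Auth, mean)                 # named 1-d array
as_frame  <- aggregate(Roll ~ Auth, data = toy, FUN = mean)   # data frame
```

The data frame form is the easiest to merge or plot, while the named vector is the most compact for a quick look.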