DataM: In-class Exercise 0511: Programming
In-class exercise 1.
What does the following R script do? Revise the code so that it does the same job without resorting to the use of a nested ‘for’ loop.
The original script
ci_norm <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05) {
ci <- NULL
z0 = qnorm(1 - (1 - level)/2)
for ( j in 1:nRun ) {
zbar <- colMeans(replicate(nci, rnorm(n)))
zl <- zbar - z0 * 1/sqrt(n); zu <- zbar + z0 * 1/sqrt(n)
plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu)),
type = "n", xlab = "Sample ID", ylab = "Average")
abline(h = 0, lty = 2)
for( i in 1:nci ) {
arrows(i, zl[i], i, zu[i], length = 0.05, angle = 90,
code = 3,
col = ifelse(0 > zl[i] & 0 < zu[i], "gray", "red"))
points(i, zbar[i], pch = 19, col = "gray")
Sys.sleep(pause)
}
Sys.sleep(3)
}
}
ci_norm(n = 36, nci = 51, nRun = 2) # n=sample size, nci=no. of CI, nRun=no. of runThis script uses the animation to demonstrate the process of finding the average value that is significantly different from zero with defaulted \(\alpha=.05\). That is, the 95% confidence interval does not cover zero. The target CI bars are marked with obvious color. Since the output in the markdown would be lots of plots, I do not show the output here. I turn output plots into an animation gif to show the process clearly in one of the following modified scripts.
The modified scripts
- Use several user-defined functions (modularization) and
lapplyinstead of the nestedforloop structure. - Create a new argument
seedsto let function user set seed if she/he wants to keep the same result. - Create a new argument
show_anito let function user set if an animation is generated. Ifshow_ani==TRUE, output plots would be turned into an animation gif to show the process clearly.
Method 1
In this method, commands of package animation are used to save the animation as a gif file and load it in the markdown file out of the R code chunk. This method can show the whole changing process.
Create the functions
ci_norm_single_bar <- function(i, zbar, zl, zu, pause) {
arrows(i, zl[i], i, zu[i], length = 0.05, angle = 90, code = 3,
col = ifelse(0 > zl[i] & 0 < zu[i], "gray", "red"))
points(i, zbar[i], pch = 19, col = "gray")
Sys.sleep(pause)
ani.record(replay.cur=TRUE)
}
ci_norm_bars <- function(r, n, nci, pause, z0) {
zbar <- colMeans(replicate(nci, rnorm(n)))
zl <- zbar - z0 * 1/sqrt(n)
zu <- zbar + z0 * 1/sqrt(n)
plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu) + .1), type = "n",
xlab = "Sample ID", ylab = "Average")
text(5, max(zu) + .05, paste0('Running iteration = ', r))
abline(h = 0, lty = 2)
lapply(1:nci, function(lst) ci_norm_single_bar(i=lst, zbar=zbar, zl=zl, zu=zu, pause=pause))
}
ci_norm <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05, seeds=sample(1:100, 1), show_ani=TRUE, file_name='test') {
set.seed(seeds)
library(animation)
ani.record(reset = TRUE)
ci <- NULL
z0 <- qnorm(1 - (1 - level)/2)
if (nRun >= 2 & !show_ani) {par(mfrow = c((nRun + 1) %/% 2, 2))}
lapply(1:nRun, function(r) ci_norm_bars(r=r, n=n, nci=nci, z0=z0, pause=pause))
#replicate(nRun, ci_norm_bars(n=n, nci=nci, pause=pause, z0=z0, show_ani=show_ani))
if (show_ani) {
file_name <- paste0('animation_', file_name, '.gif')
saveGIF(ani.replay(), movie.name = file_name)
}
}Method 2
In this method, ffmpeg is used to show every plot in the given R code chunk in an animation format. Saving the animation as a gif file and loading it back is not necessary in this method. However, this method can show the changing process of diffenent running iterations instead of the whole changing process. Note that ffmpeg is not an R package. If you are on a macOS, you can install ffmpeg through Homebrew using the bash command brew install ffmpeg in the terminal. For more details: https://blogdown-demo.rbind.io/2018/01/31/gif-animations/.
Create the functions
(Not used)
ci_norm2 <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05, seeds=sample(1:100, 1), file_name='test') {
set.seed(seeds)
ci <- NULL
z0 <- qnorm(1 - (1 - level)/2)
lapply(1:nRun, function(r) {
zbar <- colMeans(replicate(nci, rnorm(n)))
zl <- zbar - z0 * 1/sqrt(n)
zu <- zbar + z0 * 1/sqrt(n)
lapply(1:nci, function(k) {
plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu) + .1), type = "n",
xlab = "Sample ID", ylab = "Average")
text(5, max(zu) + .05, paste0('Running iteration = ', r))
abline(h = 0, lty = 2)
lapply(1:k, function(i) ci_norm_single_bar(i=i, zbar=zbar, zl=zl, zu=zu, pause=pause))
})
})
}Use this function currently for saving running time
ci_norm2 <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05, seeds=sample(1:100, 1), file_name='test') {
set.seed(seeds)
ci <- NULL
z0 <- qnorm(1 - (1 - level)/2)
lapply(1:nRun, function(r) {
zbar <- colMeans(replicate(nci, rnorm(n)))
zl <- zbar - z0 * 1/sqrt(n)
zu <- zbar + z0 * 1/sqrt(n)
plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu) + .1), type = "n",
xlab = "Sample ID", ylab = "Average")
text(5, max(zu) + .05, paste0('Running iteration = ', r))
abline(h = 0, lty = 2)
lapply(1:nci, function(i) ci_norm_single_bar(i=i, zbar=zbar, zl=zl, zu=zu, pause=pause))
})
}In-class exercise 2.
Use the read and math variables from the high schools data example for this problem. First firgure out what the following R script does and then eliminate the for loop in the code segment.
The data set consists of measurements on 11 variables for a sample of 200 students. Source: UCLA Academic Technology Service: R class note.
- Column 1: Student ID
- Column 2: Gender, M = Male, F = Female
- Column 3: Race
- Column 4: Socioeconomic status
- Column 5: School type
- Column 6: Program type
- Column 7: Reading score
- Column 8: Writing score
- Column 9: Math score
- Column 10: Science score
- Column 11: Social science studies score
Load and check
'data.frame': 200 obs. of 11 variables:
$ id : int 70 121 86 141 172 113 50 11 84 48 ...
$ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
$ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
$ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
$ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
$ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
$ read : int 57 68 44 63 47 44 50 34 63 57 ...
$ write : int 52 59 33 44 52 52 59 46 57 55 ...
$ math : int 41 53 54 47 57 51 42 45 54 52 ...
$ science: int 47 63 58 53 53 63 53 39 58 NA ...
$ socst : int 57 61 31 56 61 61 61 36 51 51 ...
The original script
Get the subset of Asian participants
Method 1 of conducting correlation test
First, compute the correlation coefficient of two variables math and socst in dta.asian. Name it r0.
Then, use the following scipt to count how many times that the correlation coefficient of shuffled read is not less than r0. Name the count no. cnt. Compute the proportion of cnt out of nIter times.
cnt <- 0
nIter <- 1001
for (i in 1:nIter) {
new <- sample(dta.asian$read)
r <- cor(new, dta.asian$math)
if ( r0 <= r ) cnt <- cnt+1
}
cnt / nIter[1] 0.04295704
[1] TRUE
Method 2 of conducting correlation test
Shuffle dta.asian$read for nIter times and turn the output into a data frame called newread.
Compute the correlation coefficient of dta.asian$math and each column in newread. Since the output of the original command lapply contains too many to display easily. I do not show it here.
with(newread, lapply(names(newread),
function(x) cor(dta.asian$math, eval(substitute(tmp, list(tmp=as.name(x)))))))I turn lapply into sapply to make the output turned into a vector and show the first 10 elements for easy.
output <- with(newread, sapply(names(newread),
function(x) cor(dta.asian$math, eval(substitute(tmp, list(tmp=as.name(x)))))))
head(output, 10) X1 X2 X3 X4 X5 X6
-0.21891609 0.09966721 -0.05124067 -0.41109792 -0.03834256 -0.09251462
X7 X8 X9 X10
-0.68582764 -0.05639991 -0.04608143 -0.07058783
Compute the proportion that iterated correlation coefficient reach r0.
[1] 0.04295704
[1] TRUE
Method 3 of conducting correlation test (Use built-in cor.test function)
Pearson's product-moment correlation
data: dta[dta$race == "asian", "math"] and dta[dta$race == "asian", "socst"]
t = 1.9887, df = 9, p-value = 0.07796
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07083501 0.86552255
sample estimates:
cor
0.5525177
The modified script for method 1
Define the function
Use the function
[1] 0.036
[1] TRUE
In-class exercise 3.
An example of obtaining (nonparametric) bootstrap estimates of coefficients for a multiple linear regression model for the anorexia{MASS} data set is presented in the following script. Figure out how it works and improve the code.
The original script
Load and check the data set
Treat Prewt Postwt
CBT :29 Min. :70.00 Min. : 71.30
Cont:26 1st Qu.:79.60 1st Qu.: 79.33
FT :17 Median :82.30 Median : 84.05
Mean :82.41 Mean : 85.17
3rd Qu.:86.00 3rd Qu.: 91.55
Max. :94.90 Max. :103.60
[1] 72 3
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
[15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
[29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
[43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
[71] "71" "72"
Linear regression analysis
Call:
lm(formula = Postwt ~ Prewt + Treat, data = dta)
Residuals:
Min 1Q Median 3Q Max
-14.1083 -4.2773 -0.5484 5.4838 15.2922
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.7711 13.3910 3.717 0.00041 ***
Prewt 0.4345 0.1612 2.695 0.00885 **
TreatCont -4.0971 1.8935 -2.164 0.03400 *
TreatFT 4.5631 2.1333 2.139 0.03604 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.978 on 68 degrees of freedom
Multiple R-squared: 0.2777, Adjusted R-squared: 0.2458
F-statistic: 8.713 on 3 and 68 DF, p-value: 5.719e-05
(Intercept) Prewt TreatCont TreatFT
49.7711090 0.4344612 -4.0970655 4.5630627
Use the use-defined function to conduct bootstrap
Create the function
bootstrap_function <- function(string_formula, nc, model_data, ndraws) {
coeff_mtx <- matrix(0, nrow = ndraws, ncol = nc)
for (i in 1:ndraws) {
bootstrap_ids <- sample(seq(nrow(model_data)), nrow(model_data), replace = TRUE)
bootstrap_data <- model_data[bootstrap_ids,]
bootstrap_model <- lm(as.formula(string_formula), bootstrap_data)
coeff_mtx[i,] <- bootstrap_model$coefficients
if (i == 1) {print(bootstrap_model$coefficients)}
}
return(coeff_mtx)
}Use the function to compute the estimate and CI of the linear model coefficients via bootstrap method
(Intercept) Prewt TreatCont TreatFT
37.2243800 0.5579126 0.9651200 5.3044083
bootstrap_CIs <- matrix(0, nrow = 4, ncol = 2)
for (i in 1:4) {
bootstrap_CIs[i,1] <- quantile(bootstrap_estimates[,i], 0.025)
bootstrap_CIs[i,2] <- quantile(bootstrap_estimates[,i], 0.975)
}
print(bootstrap_CIs) [,1] [,2]
[1,] 17.20076434 79.0250832
[2,] 0.07060444 0.8292259
[3,] -7.11445046 -0.2725961
[4,] -0.09495921 8.9724349
2.5 % 97.5 %
(Intercept) 23.0498681 76.4923499
Prewt 0.1128268 0.7560955
TreatCont -7.8754712 -0.3186599
TreatFT 0.3060571 8.8200682
The results of parametric linear regression and bootstrap are not very similar.
Plot the distribution of four linear model coefficients
par(mfrow=c(2,2))
hist(bootstrap_estimates[,1])
hist(bootstrap_estimates[,2])
hist(bootstrap_estimates[,3])
hist(bootstrap_estimates[,4])The modified script
Function modularization: Create the functions
# Function for conducting bootstrap method once
bootstrap_single <- function(string_formula, model_data) {
bootstrap_ids <- sample(seq(nrow(model_data)), nrow(model_data), replace = TRUE)
bootstrap_data <- model_data[bootstrap_ids,]
bootstrap_model <- lm(as.formula(string_formula), bootstrap_data)
return(bootstrap_model$coefficients)
}
# Function for computing bootstrap confidence interval with
# (1) the given bootstrap output and (2) specified confident level
bootstrap_CI <- function(coeff_mtx, level=.95) {
out <- data.frame(apply(coeff_mtx, 2, function(x) quantile(x, (1 - level)/2)),
apply(coeff_mtx, 2, function(x) quantile(x, 1 - (1 - level)/2)))
colnames(out) <- paste0(c('Lower', 'Upper'), ' bound of ', level*100, '% C.I.')
rownames(out) <- colnames(coeff_mtx)
return(out)
}
# Function for conducting bootstrap for given times (default=1000)
bootstrap_function2 <- function(string_formula, model_data, ndraws=1000, CI=FALSE, level=.95) {
coeff_mtx <- t(sapply(1:ndraws, function(i) bootstrap_single(string_formula, model_data)))
if (CI) {return(bootstrap_CI(coeff_mtx=coeff_mtx, level=level))}
else {return(coeff_mtx)}
}
# Function for plotting the distribution of coefficients from the bootstrap output
plot_coeff_distri <- function(coeff_mtx) {
nc <- ncol(coeff_mtx)
par(mfrow=c((nc + 1) %/% 2, 2))
lapply(1:nc, function(k) hist(coeff_mtx[,k], freq = FALSE, # use 'density' instead of 'frequency' as y
main = paste0('Histogram of coeff ', colnames(bootstrap_estimates)[k]),
xlab = colnames(bootstrap_estimates)[k]))
}Use the functions
- Conduct bootstrap method. The output is a matrix.
set.seed(1234)
bootstrap_estimates <- bootstrap_function2(string_formula="Postwt ~ Prewt + Treat",
model_data=dta, ndraws=500)
dim(bootstrap_estimates)[1] 500 4
(Intercept) Prewt TreatCont TreatFT
[1,] 48.51441 0.4464814 -2.7719579 5.646139
[2,] 34.15512 0.5902138 -2.7889033 7.975395
[3,] 12.42604 0.8788593 -0.7156288 1.850311
[4,] 54.72025 0.3719938 -1.8728123 8.995408
[5,] 71.62591 0.1737466 -3.7647641 6.775371
[6,] 39.65098 0.5454511 -4.1275988 5.411284
- Plot the distribution of coefficients from the bootstrap output.
- Use
bootstrap_function2to obtain 95% bootstrap CI of different bootstrap times.
2.5 % 97.5 %
(Intercept) 23.0498681 76.4923499
Prewt 0.1128268 0.7560955
TreatCont -7.8754712 -0.3186599
TreatFT 0.3060571 8.8200682
set.seed(1234)
lst_result <- lapply(c(500, 1000, 3000, 5000, 10000),
function(n) bootstrap_function2(string_formula="Postwt ~ Prewt + Treat",
model_data=dta, ndraws=n, CI=TRUE))
names(lst_result) <- c(500, 1000, 3000, 5000, 10000)
lst_result$anrx_lm <- data.frame(confint(anrx_lm))
df_result <- lapply(lst_result, function(df) {
colnames(df) <- c('lb', 'ub')
df %>% mutate(estimate = rowMeans(df), var = rownames(df))}) %>%
data.table::rbindlist() %>%
mutate(ndraws = factor(rep(names(lst_result), each = 4),
levels = c(500, 1000, 3000, 5000, 10000, 'anrx_lm')))
df_result- Plot to compare the results of different bootstrap times.
df_result %>% dplyr::filter(var == '(Intercept)') %>%
ggplot(mapping = aes(x = ndraws, y = estimate), data = .) +
geom_point(aes(color = ndraws)) +
geom_errorbar(aes(ymax = ub, ymin = lb), width=.3, size=.2) + theme_bw() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45, margin = margin(t=8)))df_result %>% dplyr::filter(var != '(Intercept)') %>%
ggplot(mapping = aes(x = ndraws, y = estimate), data = .) +
facet_wrap(. ~ var) +
geom_point(aes(color = ndraws)) +
geom_errorbar(aes(ymax = ub, ymin = lb), width=.3, size=.2) + theme_bw() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45, margin = margin(t=8)))There are no obvious differences of coefficients among different bootstrap times (ndraws) and the parametric linear model. However, as ndraws is large enough, the CIs act as what CI of the parametric linear model does – do not cover zero.
In-class exercise 4.
What does the following R script do? Replace the while loop in the code with a different iteration control structure.
The original script
Brownian <- function(n = 11, pause = 0.05, nIter = 512, ...) {
x = rnorm(n)
y = rnorm(n)
i = 1
while (i <= nIter) {
plot(x, y, ...)
text(x, y, cex = 0.5)
x = x + rnorm(n)
y = y + rnorm(n)
Sys.sleep(pause)
i = i + 1
}
}
###
## test it
Brownian(xlim = c(-20, 20), ylim = c(-20, 20), nIter = 10,
pch = 21, cex = 2, col = "cyan", bg = "lavender") This script aims to demonstrate Brownian motion. Give the sample size n and the interation number nIter, the function Brownian would plot n points with text labels and random position following the standard normal distribution. More detailed arguments of plot can also be assigned in Brownian. Just like inclass exercise 1, the output in the markdown would be lots of plots. Thus, I do not show the output here. I turn output plots into an animation using ffmpeg to show the point-moving process clearly in one of the following modified scripts.
The modified script
Create the function
Brownian <- function(n=11, nIter=20, pause=0.05, ...) {
xy <- replicate(2, apply(replicate(nIter, rnorm(n)), 1, cumsum))
b <- 20 + ((nIter - 200)*(nIter > 200)) %/% 10
invisible(lapply(1:nIter, function(i) {
x <- xy[i,,1]; y <- xy[i,,2]
par(pty = 's', mar = c(4, 2, 1, 1))
plot(x, y, xlim = c(-b, b), ylim = c(-b, b), ...)
text(x, y, cex = .6)
text(-b + 10, b - 3, paste0('Iteration: ', i))
Sys.sleep(pause)
}))
} In-class exercise 5.
Here is an example of simulation using R. Figure out how it works and improve the code.
This example shows how to sample, at random, from a population of 50.5% men and 49.5% women, in which height is normally distributed with a mean of 170 cm and a standard deviation of 7 cm for men, with corresponding values 160 and 5 for women. The output is a data frame, with the first column showing gender (M for male, F for female) and the second column showing height.
The original script
simSexHeight <- function(n) {
m <- matrix(c(runif(n), rnorm(n)), ncol=2)
draw <- function(mr) { # mr = one row of m
if (mr[1] < 0.505) {
sex <- 1
cm <- 170 + 7*mr[2]
} else {
sex <- 0
cm <- 160 + 5*mr[2]
}
return(c(sex, cm))
}
person <- t(apply(m, 1, draw))
person <- as.data.frame(person)
person[, 1] <- ifelse(person[, 1]==1, "M", "F")
names(person) <- c("Gender", "Height")
return(person)
}
simSexHeight(5) [,1] [,2] [,3] [,4]
F 0.47 0.508 0.499 0.4986
M 0.53 0.492 0.501 0.5014
This script creates a function which can generate a sythetic data set of gender and height. The random variable gender follows the uniform distribution and height follows the standard normal distribution. The proportion of male is set around \(50.5\%\).
The modified script
- Create an index vector
gender_idxfor gender for setting gender lable (i.e., F or M) and height formula. - Use the normal distribution with different parameters (i.e., \(\mu\) and \(\sigma\)) to replace if-else structure.
- Create a new argument
male_proportionto let the user assign the proportion of male, instead of fixing at \(50.5\%\).
Create the function
simSexHeight <- function(n, male_proportion = .505, round_digits = 2) {
gender_idx <- runif(n) < male_proportion
df <- data.frame(Gender = c('F', 'M')[gender_idx*1 + 1])
df$Height[gender_idx] <- round(rnorm(sum(gender_idx), 170, 7), round_digits)
df$Height[!gender_idx] <- round(rnorm(sum(!gender_idx), 160, 5), round_digits)
return(df)
}Use the function
Check the distribution of height
Check the proportion of males
lst <- lapply(c(.1, .6, .8), function(p) {
m <- sapply(c(100, 500, 1000, 5000),
function(n) table(simSexHeight(n, p)$Gender) / n)
colnames(m) <- paste0('n = ', c(100, 500, 1000, 5000))
return(m)
})
names(lst) <- paste0('p = ', c(0.1, 0.6, 0.8))
lst$`p = 0.1`
n = 100 n = 500 n = 1000 n = 5000
F 0.83 0.912 0.902 0.9056
M 0.17 0.088 0.098 0.0944
$`p = 0.6`
n = 100 n = 500 n = 1000 n = 5000
F 0.42 0.342 0.371 0.3986
M 0.58 0.658 0.629 0.6014
$`p = 0.8`
n = 100 n = 500 n = 1000 n = 5000
F 0.15 0.226 0.197 0.2008
M 0.85 0.774 0.803 0.7992