library(dplyr)
library(ggplot2)

In-class exercise 1.

What does the following R script do? Revise the code so that it does the same job without resorting to the use of a nested ‘for’ loop.

The original script

ci_norm <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05) {
  ci <- NULL
  z0 = qnorm(1 - (1 - level)/2)
  for ( j in 1:nRun ) {
      zbar <- colMeans(replicate(nci, rnorm(n)))     
      zl <- zbar - z0 * 1/sqrt(n); zu <- zbar + z0 * 1/sqrt(n)
      plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu)), 
           type = "n", xlab = "Sample ID", ylab = "Average") 
      abline(h = 0, lty = 2)
      for( i in 1:nci ) {
          arrows(i, zl[i], i, zu[i], length = 0.05, angle = 90, 
               code = 3,
               col = ifelse(0 > zl[i] & 0 < zu[i], "gray", "red"))
          points(i, zbar[i], pch = 19, col = "gray")
          Sys.sleep(pause)
      }
      Sys.sleep(3)
   }
}

ci_norm(n = 36, nci = 51, nRun = 2) # n=sample size, nci=no. of CI, nRun=no. of run

This script uses the animation to demonstrate the process of finding the average value that is significantly different from zero with defaulted $\alpha=.05$. That is, the 95% confidence interval does not cover zero. The target CI bars are marked with obvious color. Since the output in the markdown would be lots of plots, I do not show the output here. I turn output plots into an animation gif to show the process clearly in one of the following modified scripts.

The modified scripts

Use several user-defined functions (modularization) and lapply instead of the nested for loop structure.
Create a new argument seeds to let function user set seed if she/he wants to keep the same result.
Create a new argument show_ani to let function user set if an animation is generated. If show_ani==TRUE, output plots would be turned into an animation gif to show the process clearly.

Method 1

In this method, commands of package animation are used to save the animation as a gif file and load it in the markdown file out of the R code chunk. This method can show the whole changing process.

Create the functions

ci_norm_single_bar <- function(i, zbar, zl, zu, pause) {
  arrows(i, zl[i], i, zu[i], length = 0.05, angle = 90, code = 3,
         col = ifelse(0 > zl[i] & 0 < zu[i], "gray", "red"))
  points(i, zbar[i], pch = 19, col = "gray")
  Sys.sleep(pause)
  ani.record(replay.cur=TRUE)
}

ci_norm_bars <- function(r, n, nci, pause, z0) {
  zbar <- colMeans(replicate(nci, rnorm(n)))     
  zl <- zbar - z0 * 1/sqrt(n)
  zu <- zbar + z0 * 1/sqrt(n)
  plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu) + .1), type = "n",
         xlab = "Sample ID", ylab = "Average")
  text(5, max(zu) + .05, paste0('Running iteration = ', r))
  abline(h = 0, lty = 2)
  lapply(1:nci, function(lst) ci_norm_single_bar(i=lst, zbar=zbar, zl=zl, zu=zu, pause=pause))
}

ci_norm <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05, seeds=sample(1:100, 1), show_ani=TRUE, file_name='test') {
  set.seed(seeds)
  library(animation)
  ani.record(reset = TRUE)
  ci <- NULL
  z0 <- qnorm(1 - (1 - level)/2)
  if (nRun >= 2 & !show_ani) {par(mfrow = c((nRun + 1) %/% 2, 2))}
  lapply(1:nRun, function(r) ci_norm_bars(r=r, n=n, nci=nci, z0=z0, pause=pause))
  #replicate(nRun, ci_norm_bars(n=n, nci=nci, pause=pause, z0=z0, show_ani=show_ani))
  if (show_ani) {
    file_name <- paste0('animation_', file_name, '.gif')
    saveGIF(ani.replay(), movie.name = file_name)
  }
}

Use the function

ci_norm(n = 50, nci = 21, nRun = 4, file_name = 'test1')

ci_norm(n = 50, nci = 31, nRun = 8, pause = .01, file_name = 'test2')

ci_norm(n = 50, nci = 21, nRun = 4, show_ani=FALSE)

Method 2

In this method, ffmpeg is used to show every plot in the given R code chunk in an animation format. Saving the animation as a gif file and loading it back is not necessary in this method. However, this method can show the changing process of diffenent running iterations instead of the whole changing process. Note that ffmpeg is not an R package. If you are on a macOS, you can install ffmpeg through Homebrew using the bash command brew install ffmpeg in the terminal. For more details: https://blogdown-demo.rbind.io/2018/01/31/gif-animations/.

Create the functions

(Not used)

ci_norm2 <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05, seeds=sample(1:100, 1), file_name='test') {
  set.seed(seeds)
  ci <- NULL
  z0 <- qnorm(1 - (1 - level)/2)
  lapply(1:nRun, function(r) {
    zbar <- colMeans(replicate(nci, rnorm(n)))     
    zl <- zbar - z0 * 1/sqrt(n)
    zu <- zbar + z0 * 1/sqrt(n)
    lapply(1:nci, function(k) {
      plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu) + .1), type = "n",
           xlab = "Sample ID", ylab = "Average")
      text(5, max(zu) + .05, paste0('Running iteration = ', r))
      abline(h = 0, lty = 2)
      lapply(1:k, function(i) ci_norm_single_bar(i=i, zbar=zbar, zl=zl, zu=zu, pause=pause))
    })
  })
}

Use this function currently for saving running time

ci_norm2 <- function(n=100, nci=50, nRun=3, level=0.95, pause=0.05, seeds=sample(1:100, 1), file_name='test') {
  set.seed(seeds)
  ci <- NULL
  z0 <- qnorm(1 - (1 - level)/2)
  lapply(1:nRun, function(r) {
    zbar <- colMeans(replicate(nci, rnorm(n)))     
    zl <- zbar - z0 * 1/sqrt(n)
    zu <- zbar + z0 * 1/sqrt(n)
    plot(1, xlim = c(0.5, nci + 0.5), ylim = c(min(zl), max(zu) + .1), type = "n",
         xlab = "Sample ID", ylab = "Average")
    text(5, max(zu) + .05, paste0('Running iteration = ', r))
    abline(h = 0, lty = 2)
    lapply(1:nci, function(i) ci_norm_single_bar(i=i, zbar=zbar, zl=zl, zu=zu, pause=pause))
  })
}

Use the function

ci_norm2(n = 50, nci = 11, nRun = 2)

ci_norm2(n = 50, nci = 41, nRun = 5)

In-class exercise 2.

Use the read and math variables from the high schools data example for this problem. First firgure out what the following R script does and then eliminate the for loop in the code segment.

The data set consists of measurements on 11 variables for a sample of 200 students. Source: UCLA Academic Technology Service: R class note.

Column 1: Student ID
Column 2: Gender, M = Male, F = Female
Column 3: Race
Column 4: Socioeconomic status
Column 5: School type
Column 6: Program type
Column 7: Reading score
Column 8: Writing score
Column 9: Math score
Column 10: Science score
Column 11: Social science studies score

Load and check

dta <- read.table('../data/high_schools.txt', header = TRUE)
str(dta)

'data.frame':   200 obs. of  11 variables:
 $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
 $ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
 $ race   : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
 $ ses    : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
 $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
 $ prog   : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
 $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
 $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
 $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
 $ science: int  47 63 58 53 53 63 53 39 58 NA ...
 $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

head(dta)

The original script

Get the subset of Asian participants

dta.asian <- subset(dta, race=="asian")

Method 1 of conducting correlation test

First, compute the correlation coefficient of two variables math and socst in dta.asian. Name it r0.

r0 <- cor(dta.asian$math, dta.asian$socst)

Then, use the following scipt to count how many times that the correlation coefficient of shuffled read is not less than r0. Name the count no. cnt. Compute the proportion of cnt out of nIter times.

cnt <- 0
nIter <- 1001
for (i in 1:nIter) {
    new <- sample(dta.asian$read)
    r <- cor(new, dta.asian$math)
    if ( r0 <= r ) cnt <- cnt+1
}
cnt / nIter

[1] 0.04295704

cnt / nIter < .05

[1] TRUE

Method 2 of conducting correlation test

Shuffle dta.asian$read for nIter times and turn the output into a data frame called newread.

newread <- replicate(nIter, sample(dta.asian$read))
newread <- data.frame(unlist(newread))

Compute the correlation coefficient of dta.asian$math and each column in newread. Since the output of the original command lapply contains too many to display easily. I do not show it here.

with(newread, lapply(names(newread), 
                     function(x) cor(dta.asian$math, eval(substitute(tmp, list(tmp=as.name(x)))))))

I turn lapply into sapply to make the output turned into a vector and show the first 10 elements for easy.

output <- with(newread, sapply(names(newread), 
                               function(x) cor(dta.asian$math, eval(substitute(tmp, list(tmp=as.name(x)))))))
head(output, 10)

         X1          X2          X3          X4          X5          X6 
-0.21891609  0.09966721 -0.05124067 -0.41109792 -0.03834256 -0.09251462 
         X7          X8          X9         X10 
-0.68582764 -0.05639991 -0.04608143 -0.07058783

Compute the proportion that iterated correlation coefficient reach r0.

mean(output >= r0)

[1] 0.04295704

mean(output >= r0) < .05

[1] TRUE

Method 3 of conducting correlation test (Use built-in `cor.test` function)

cor.test(dta[dta$race=="asian", "math"], dta[dta$race=="asian", "socst"])


    Pearson's product-moment correlation

data:  dta[dta$race == "asian", "math"] and dta[dta$race == "asian", "socst"]
t = 1.9887, df = 9, p-value = 0.07796
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07083501  0.86552255
sample estimates:
      cor 
0.5525177

The modified script for method 1

Define the function

my_cor_test <- function(var1, var2, nIter=1000) {
  cnt <- 0
  r0 <- cor(var1,var2)
  sapply(1:nIter, function(i) {
    new <- sample(var2)
    r <- cor(new, var1)
    return(r >= r0)
  })
}

Use the function

mean(my_cor_test(var1=dta.asian$math, dta.asian$read))

[1] 0.036

mean(my_cor_test(var1=dta.asian$math, dta.asian$read)) < .05

[1] TRUE

In-class exercise 3.

An example of obtaining (nonparametric) bootstrap estimates of coefficients for a multiple linear regression model for the anorexia{MASS} data set is presented in the following script. Figure out how it works and improve the code.

The original script

Load and check the data set

data(anorexia, package="MASS")
dta <- anorexia
summary(dta)

  Treat        Prewt           Postwt      
 CBT :29   Min.   :70.00   Min.   : 71.30  
 Cont:26   1st Qu.:79.60   1st Qu.: 79.33  
 FT  :17   Median :82.30   Median : 84.05  
           Mean   :82.41   Mean   : 85.17  
           3rd Qu.:86.00   3rd Qu.: 91.55  
           Max.   :94.90   Max.   :103.60

dim(dta)

[1] 72  3

row.names(dta)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
[29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
[43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
[71] "71" "72"

Linear regression analysis

anrx_lm <- lm(Postwt ~ Prewt + Treat, data=dta)
summary(anrx_lm)


Call:
lm(formula = Postwt ~ Prewt + Treat, data = dta)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.1083  -4.2773  -0.5484   5.4838  15.2922 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  49.7711    13.3910   3.717  0.00041 ***
Prewt         0.4345     0.1612   2.695  0.00885 ** 
TreatCont    -4.0971     1.8935  -2.164  0.03400 *  
TreatFT       4.5631     2.1333   2.139  0.03604 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.978 on 68 degrees of freedom
Multiple R-squared:  0.2777,    Adjusted R-squared:  0.2458 
F-statistic: 8.713 on 3 and 68 DF,  p-value: 5.719e-05

coef(anrx_lm)

(Intercept)       Prewt   TreatCont     TreatFT 
 49.7711090   0.4344612  -4.0970655   4.5630627

Use the use-defined function to conduct bootstrap

Create the function

bootstrap_function  <- function(string_formula, nc, model_data, ndraws) {
  coeff_mtx <- matrix(0, nrow = ndraws,  ncol = nc)
  for (i in 1:ndraws) {
    bootstrap_ids   <- sample(seq(nrow(model_data)), nrow(model_data), replace = TRUE)
    bootstrap_data <- model_data[bootstrap_ids,]
    bootstrap_model <- lm(as.formula(string_formula), bootstrap_data)
    coeff_mtx[i,]   <- bootstrap_model$coefficients
    if (i == 1) {print(bootstrap_model$coefficients)}
  }     
  return(coeff_mtx)
}

Use the function to compute the estimate and CI of the linear model coefficients via bootstrap method

bootstrap_estimates <- bootstrap_function("Postwt ~ Prewt + Treat", nc=4, dta, 500)

(Intercept)       Prewt   TreatCont     TreatFT 
 37.2243800   0.5579126   0.9651200   5.3044083

bootstrap_CIs <- matrix(0, nrow = 4, ncol = 2)
for (i in 1:4) {
  bootstrap_CIs[i,1]  <- quantile(bootstrap_estimates[,i], 0.025)
  bootstrap_CIs[i,2]  <- quantile(bootstrap_estimates[,i], 0.975)
}
print(bootstrap_CIs)

            [,1]       [,2]
[1,] 17.20076434 79.0250832
[2,]  0.07060444  0.8292259
[3,] -7.11445046 -0.2725961
[4,] -0.09495921  8.9724349

confint(anrx_lm)

                 2.5 %     97.5 %
(Intercept) 23.0498681 76.4923499
Prewt        0.1128268  0.7560955
TreatCont   -7.8754712 -0.3186599
TreatFT      0.3060571  8.8200682

The results of parametric linear regression and bootstrap are not very similar.

Plot the distribution of four linear model coefficients

par(mfrow=c(2,2))
hist(bootstrap_estimates[,1])
hist(bootstrap_estimates[,2])
hist(bootstrap_estimates[,3])
hist(bootstrap_estimates[,4])

The modified script

Function modularization: Create the functions

# Function for conducting bootstrap method once
bootstrap_single <- function(string_formula, model_data) {
  bootstrap_ids <- sample(seq(nrow(model_data)), nrow(model_data), replace = TRUE)
  bootstrap_data <- model_data[bootstrap_ids,]
  bootstrap_model <- lm(as.formula(string_formula), bootstrap_data)
  return(bootstrap_model$coefficients)
}

# Function for computing bootstrap confidence interval with 
# (1) the given bootstrap output and (2) specified confident level
bootstrap_CI <- function(coeff_mtx, level=.95) {
  out <- data.frame(apply(coeff_mtx, 2, function(x) quantile(x, (1 - level)/2)),
                    apply(coeff_mtx, 2, function(x) quantile(x, 1 - (1 - level)/2)))
  colnames(out) <- paste0(c('Lower', 'Upper'), ' bound of ', level*100, '% C.I.')
  rownames(out) <- colnames(coeff_mtx)
  return(out)
}

# Function for conducting bootstrap for given times (default=1000)
bootstrap_function2  <- function(string_formula, model_data, ndraws=1000, CI=FALSE, level=.95) {
  coeff_mtx <- t(sapply(1:ndraws, function(i) bootstrap_single(string_formula, model_data)))
  if (CI) {return(bootstrap_CI(coeff_mtx=coeff_mtx, level=level))}
  else {return(coeff_mtx)}
}

# Function for plotting the distribution of coefficients from the bootstrap output
plot_coeff_distri <- function(coeff_mtx) {
  nc <- ncol(coeff_mtx)
  par(mfrow=c((nc + 1) %/% 2, 2))
  lapply(1:nc, function(k) hist(coeff_mtx[,k], freq = FALSE, # use 'density' instead of 'frequency' as y
                                main = paste0('Histogram of coeff ', colnames(bootstrap_estimates)[k]),
                                xlab = colnames(bootstrap_estimates)[k]))
}

Use the functions

Conduct bootstrap method. The output is a matrix.

set.seed(1234)
bootstrap_estimates <- bootstrap_function2(string_formula="Postwt ~ Prewt + Treat",
                                           model_data=dta, ndraws=500)
dim(bootstrap_estimates)

[1] 500   4

head(bootstrap_estimates)

     (Intercept)     Prewt  TreatCont  TreatFT
[1,]    48.51441 0.4464814 -2.7719579 5.646139
[2,]    34.15512 0.5902138 -2.7889033 7.975395
[3,]    12.42604 0.8788593 -0.7156288 1.850311
[4,]    54.72025 0.3719938 -1.8728123 8.995408
[5,]    71.62591 0.1737466 -3.7647641 6.775371
[6,]    39.65098 0.5454511 -4.1275988 5.411284

Plot the distribution of coefficients from the bootstrap output.

plot_coeff_distri(bootstrap_estimates)

Use bootstrap_function2 to obtain 95% bootstrap CI of different bootstrap times.

confint(anrx_lm)

                 2.5 %     97.5 %
(Intercept) 23.0498681 76.4923499
Prewt        0.1128268  0.7560955
TreatCont   -7.8754712 -0.3186599
TreatFT      0.3060571  8.8200682

set.seed(1234)
lst_result <- lapply(c(500, 1000, 3000, 5000, 10000),
                     function(n) bootstrap_function2(string_formula="Postwt ~ Prewt + Treat", 
                                                     model_data=dta, ndraws=n, CI=TRUE))
names(lst_result) <- c(500, 1000, 3000, 5000, 10000)
lst_result$anrx_lm <- data.frame(confint(anrx_lm))

df_result <- lapply(lst_result, function(df) {
  colnames(df) <- c('lb', 'ub')
  df %>% mutate(estimate = rowMeans(df), var = rownames(df))}) %>% 
  data.table::rbindlist() %>%
  mutate(ndraws = factor(rep(names(lst_result), each = 4),
                         levels = c(500, 1000, 3000, 5000, 10000, 'anrx_lm')))
df_result

Plot to compare the results of different bootstrap times.

df_result %>% dplyr::filter(var == '(Intercept)') %>%
  ggplot(mapping = aes(x = ndraws, y = estimate), data = .) +
  geom_point(aes(color = ndraws)) +
  geom_errorbar(aes(ymax = ub, ymin = lb), width=.3, size=.2) + theme_bw() +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45, margin = margin(t=8)))

df_result %>% dplyr::filter(var != '(Intercept)') %>%
  ggplot(mapping = aes(x = ndraws, y = estimate), data = .) +
  facet_wrap(. ~ var) +
  geom_point(aes(color = ndraws)) +
  geom_errorbar(aes(ymax = ub, ymin = lb), width=.3, size=.2) + theme_bw() +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45, margin = margin(t=8)))

There are no obvious differences of coefficients among different bootstrap times (ndraws) and the parametric linear model. However, as ndraws is large enough, the CIs act as what CI of the parametric linear model does – do not cover zero.

In-class exercise 4.

What does the following R script do? Replace the while loop in the code with a different iteration control structure.

The original script

Brownian <- function(n = 11, pause = 0.05, nIter = 512, ...) {
    x = rnorm(n)
    y = rnorm(n)
    i = 1
    while (i <= nIter) {
        plot(x, y, ...)
        text(x, y, cex = 0.5)
        x = x + rnorm(n)
        y = y + rnorm(n)
        Sys.sleep(pause)
        i = i + 1
    }
} 
###

## test it
Brownian(xlim = c(-20, 20), ylim = c(-20, 20), nIter = 10,
    pch = 21, cex = 2, col = "cyan", bg = "lavender")

This script aims to demonstrate Brownian motion. Give the sample size n and the interation number nIter, the function Brownian would plot n points with text labels and random position following the standard normal distribution. More detailed arguments of plot can also be assigned in Brownian. Just like inclass exercise 1, the output in the markdown would be lots of plots. Thus, I do not show the output here. I turn output plots into an animation using ffmpeg to show the point-moving process clearly in one of the following modified scripts.

The modified script

Create the function

Brownian <- function(n=11, nIter=20, pause=0.05, ...) {
  xy <- replicate(2, apply(replicate(nIter, rnorm(n)), 1, cumsum))
  b <- 20 + ((nIter - 200)*(nIter > 200)) %/% 10
  invisible(lapply(1:nIter, function(i) {
    x <- xy[i,,1]; y <- xy[i,,2]
    par(pty = 's', mar = c(4, 2, 1, 1))
    plot(x, y, xlim = c(-b, b), ylim = c(-b, b), ...)
    text(x, y, cex = .6)
    text(-b + 10, b - 3, paste0('Iteration: ', i))
    Sys.sleep(pause)
  }))
}

Use the function

Brownian(n=10, nIter=30, pch = 21, col = "cyan", bg = "lavender")

Brownian(n=10, nIter=100, pch = 21, col = "cyan", bg = "lavender")

Brownian(n=30, nIter=500, pch = 21, col = "cyan", bg = "lavender")

In-class exercise 5.

Here is an example of simulation using R. Figure out how it works and improve the code.

This example shows how to sample, at random, from a population of 50.5% men and 49.5% women, in which height is normally distributed with a mean of 170 cm and a standard deviation of 7 cm for men, with corresponding values 160 and 5 for women. The output is a data frame, with the first column showing gender (M for male, F for female) and the second column showing height.

The original script

simSexHeight <- function(n)  {
   m <- matrix(c(runif(n), rnorm(n)), ncol=2)
   draw <- function(mr) {  # mr = one row of m
      if (mr[1] < 0.505) {
         sex <- 1
         cm <- 170 + 7*mr[2]
      } else {
         sex <- 0
         cm <- 160 + 5*mr[2] 
      }
      return(c(sex, cm))
   }
   person <- t(apply(m, 1, draw))
   person <- as.data.frame(person)
   person[, 1] <- ifelse(person[, 1]==1, "M", "F")
   names(person) <- c("Gender", "Height")
   return(person)
}

simSexHeight(5)

simSexHeight(20)

sapply(c(100, 500, 1000, 5000),
       function(n) table(simSexHeight(n)$Gender) / n)

  [,1]  [,2]  [,3]   [,4]
F 0.47 0.508 0.499 0.4986
M 0.53 0.492 0.501 0.5014

This script creates a function which can generate a sythetic data set of gender and height. The random variable gender follows the uniform distribution and height follows the standard normal distribution. The proportion of male is set around $50.5\%$.

The modified script

Create an index vector gender_idx for gender for setting gender lable (i.e., F or M) and height formula.
Use the normal distribution with different parameters (i.e., $\mu$ and $\sigma$) to replace if-else structure.
Create a new argument male_proportion to let the user assign the proportion of male, instead of fixing at $50.5\%$.

Create the function

simSexHeight <- function(n, male_proportion = .505, round_digits = 2)  {
  gender_idx <- runif(n) < male_proportion
  df <- data.frame(Gender = c('F', 'M')[gender_idx*1 + 1])
  df$Height[gender_idx] <- round(rnorm(sum(gender_idx), 170, 7), round_digits)
  df$Height[!gender_idx] <- round(rnorm(sum(!gender_idx), 160, 5), round_digits)
  return(df)
}

Use the function

simSexHeight(5)

Check the distribution of height

dta <- simSexHeight(1000, round_digits = 4)
dta[1:20,]

dta %>% group_by(Gender) %>% summarise(M = mean(Height), SD = sd(Height))

qplot(x = Height, data = dta, geom = 'density', col = Gender)

Check the proportion of males

lst <- lapply(c(.1, .6, .8), function(p) {
  m <- sapply(c(100, 500, 1000, 5000),
              function(n) table(simSexHeight(n, p)$Gender) / n)
  colnames(m) <- paste0('n = ', c(100, 500, 1000, 5000))
  return(m)
})
names(lst) <- paste0('p = ', c(0.1, 0.6, 0.8))
lst

$`p = 0.1`
  n = 100 n = 500 n = 1000 n = 5000
F    0.83   0.912    0.902   0.9056
M    0.17   0.088    0.098   0.0944

$`p = 0.6`
  n = 100 n = 500 n = 1000 n = 5000
F    0.42   0.342    0.371   0.3986
M    0.58   0.658    0.629   0.6014

$`p = 0.8`
  n = 100 n = 500 n = 1000 n = 5000
F    0.15   0.226    0.197   0.2008
M    0.85   0.774    0.803   0.7992

DataM: In-class Exercise 0511: Programming

DataM: In-class Exercise 0511: Programming

In-class exercise 1.

The original script

The modified scripts

Method 1

Create the functions

Use the function

Method 2

Create the functions

Use the function

In-class exercise 2.

Load and check

The original script

Get the subset of Asian participants

Method 1 of conducting correlation test

Method 2 of conducting correlation test

Method 3 of conducting correlation test (Use built-in cor.test function)

The modified script for method 1

Define the function

Use the function

In-class exercise 3.

The original script

Load and check the data set

Linear regression analysis

Use the use-defined function to conduct bootstrap

Create the function

Use the function to compute the estimate and CI of the linear model coefficients via bootstrap method

Plot the distribution of four linear model coefficients

The modified script

Function modularization: Create the functions

Use the functions

In-class exercise 4.

The original script

The modified script

Create the function

Use the function

In-class exercise 5.

The original script

The modified script

Create the function

Use the function

Check the distribution of height

Check the proportion of males

Method 3 of conducting correlation test (Use built-in `cor.test` function)