In-class exercise 1.

What does the following R script do? Revise the code so that it does the same job without resorting to the use of a nested ‘for’ loop.

The original script

This script uses animation to demonstrate the process of finding an average value that is significantly different from zero at the default \(\alpha=.05\); that is, its 95% confidence interval does not cover zero. The target CI bars are marked in a distinct color. Since the output in the markdown would be a large number of plots, I do not show it here. In one of the modified scripts below, I turn the output plots into an animated gif to show the process clearly.

The modified scripts

  1. Use several user-defined functions (modularization) and lapply instead of the nested for loop structure (see the sketch below).
  2. Add a new argument seeds so that the user can set a seed to reproduce the same result.
  3. Add a new argument show_ani so that the user can choose whether an animation is generated. If show_ani == TRUE, the output plots are turned into an animated gif to show the process clearly.
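
For concreteness, here is a minimal sketch of the kind of modularized, lapply-based structure described above. All function names, arguments, and the data-generating step are illustrative assumptions, not the actual modified script.

# A minimal sketch (all names and the data-generating step are assumptions)
simulate_ci <- function(n = 30, mu = .3) {
  x  <- rnorm(n, mean = mu)
  ci <- mean(x) + c(-1, 1) * qt(.975, n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = ci[1], upper = ci[2])
}

plot_ci <- function(res) {
  sig <- res[, "lower"] > 0 | res[, "upper"] < 0   # CI does not cover zero
  plot(seq_len(nrow(res)), res[, "mean"], pch = 19,
       ylim = range(res[, c("lower", "upper")]),
       col = ifelse(sig, "red", "grey50"),
       xlab = "Iteration", ylab = "Sample mean")
  arrows(seq_len(nrow(res)), res[, "lower"],
         seq_len(nrow(res)), res[, "upper"],
         angle = 90, code = 3, length = .03,
         col = ifelse(sig, "red", "grey50"))
  abline(h = 0, lty = 2)
}

ci_demo <- function(nIter = 20, n = 30, seeds = NULL, show_ani = FALSE) {
  if (!is.null(seeds)) set.seed(seeds)             # reproducible if seeds is given
  res <- do.call(rbind, lapply(seq_len(nIter), function(i) simulate_ci(n)))
  if (show_ani) {
    # draw one frame per iteration so the plots can be captured as an animation
    lapply(seq_len(nIter), function(i) plot_ci(res[seq_len(i), , drop = FALSE]))
  } else {
    plot_ci(res)
  }
  invisible(res)
}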

Method 1

In this method, functions from the animation package are used to save the animation as a gif file, which is then loaded into the markdown file outside of the R code chunk. This method can show the whole changing process.
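
With the ci_demo sketch above, Method 1 amounts to something like the following (a hedged example; saveGIF requires an external converter such as ImageMagick or FFmpeg to be installed):

library(animation)
# Record the per-iteration plots and write them to a gif file
saveGIF(ci_demo(nIter = 20, seeds = 1, show_ani = TRUE),
        movie.name = "ci_demo.gif", interval = .2)

The resulting ci_demo.gif can then be embedded in the markdown text, outside any R code chunk, with ![](ci_demo.gif).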

Method 2

In this method, ffmpeg is used to render every plot produced in the given R code chunk as an animation. Saving the animation as a gif file and loading it back is not necessary in this method. However, this method shows the changing process of the individual iterations rather than the whole changing process. Note that ffmpeg is not an R package. If you are on macOS, you can install ffmpeg through Homebrew with the bash command brew install ffmpeg in the terminal. For more details: https://blogdown-demo.rbind.io/2018/01/31/gif-animations/.
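
Roughly, this means wrapping the plotting call in a chunk with the animation-related knitr options described in the linked post, for example (options and values as in that post; ci_demo is the hypothetical sketch above):

```{r, fig.show='animate', ffmpeg.format='gif', dev='jpeg'}
ci_demo(nIter = 20, seeds = 1, show_ani = TRUE)
```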

In-class exercise 2.

Use the read and math variables from the high school data example for this problem. First figure out what the following R script does, and then eliminate the for loop in the code segment.

The data set consists of measurements on 11 variables for a sample of 200 students. Source: UCLA Academic Technology Service: R class note.

  • Column 1: Student ID
  • Column 2: Gender, M = Male, F = Female
  • Column 3: Race
  • Column 4: Socioeconomic status
  • Column 5: School type
  • Column 6: Program type
  • Column 7: Reading score
  • Column 8: Writing score
  • Column 9: Math score
  • Column 10: Science score
  • Column 11: Social science studies score

Load and check

'data.frame':   200 obs. of  11 variables:
 $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
 $ female : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
 $ race   : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
 $ ses    : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
 $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
 $ prog   : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
 $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
 $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
 $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
 $ science: int  47 63 58 53 53 63 53 39 58 NA ...
 $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

The original script

Get the subset of Asian participants
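
The subsetting code is not shown; judging from the dta.asian object used below, it is presumably a single line such as:

dta.asian <- subset(dta, race == "asian")   # keep only the Asian participants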

Method 1 of conducting correlation test

First, compute the correlation coefficient of the two variables math and socst in dta.asian and name it r0.

Then, use the following script to count how many times the correlation coefficient computed with the shuffled read is not less than r0. Store the count in cnt, and compute the proportion of cnt out of nIter iterations.

[1] 0.04295704
[1] TRUE
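
For reference, one loop-free way to write this counting step (a sketch, not the original code; it assumes nIter is already defined and follows the variable names in the description above) is:

r0     <- cor(dta.asian$math, dta.asian$socst)                         # as defined above
perm_r <- replicate(nIter, cor(dta.asian$math, sample(dta.asian$read)))
cnt    <- sum(perm_r >= r0)            # shuffled correlations not less than r0
cnt / nIter                            # proportion out of nIter iterations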

Method 2 of conducting correlation test

Shuffle dta.asian$read nIter times and turn the output into a data frame called newread.

Compute the correlation coefficient of dta.asian$math and each column in newread. Since the output of the original lapply command contains too many elements to display easily, I do not show it here.

I replace lapply with sapply so that the output is turned into a vector, and show the first 10 elements for brevity.

         X1          X2          X3          X4          X5          X6 
-0.21891609  0.09966721 -0.05124067 -0.41109792 -0.03834256 -0.09251462 
         X7          X8          X9         X10 
-0.68582764 -0.05639991 -0.04608143 -0.07058783 

Compute the proportion of iterated correlation coefficients that reach r0.

[1] 0.04295704
[1] TRUE
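
A compact, loop-free sketch of Method 2 (hypothetical object names; assumes nIter and r0 as above):

newread <- data.frame(replicate(nIter, sample(dta.asian$read)))   # one shuffled copy of read per column (X1, X2, ...)
perm_r  <- sapply(newread, function(x) cor(dta.asian$math, x))    # correlate each column with math
mean(perm_r >= r0)                                                # proportion reaching r0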

Method 3 of conducting correlation test (Use built-in cor.test function)
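
The built-in test is a single call; the data: line of the output below shows the exact arguments used.

cor.test(dta[dta$race == "asian", "math"],
         dta[dta$race == "asian", "socst"])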


    Pearson's product-moment correlation

data:  dta[dta$race == "asian", "math"] and dta[dta$race == "asian", "socst"]
t = 1.9887, df = 9, p-value = 0.07796
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07083501  0.86552255
sample estimates:
      cor 
0.5525177 

In-class exercise 3.

An example of obtaining (nonparametric) bootstrap estimates of coefficients for a multiple linear regression model for the anorexia{MASS} data set is presented in the following script. Figure out how it works and improve the code.

The original script

Load and check the data set

  Treat        Prewt           Postwt      
 CBT :29   Min.   :70.00   Min.   : 71.30  
 Cont:26   1st Qu.:79.60   1st Qu.: 79.33  
 FT  :17   Median :82.30   Median : 84.05  
           Mean   :82.41   Mean   : 85.17  
           3rd Qu.:86.00   3rd Qu.: 91.55  
           Max.   :94.90   Max.   :103.60  
[1] 72  3
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
[15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
[29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
[43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
[71] "71" "72"

Linear regression analysis


Call:
lm(formula = Postwt ~ Prewt + Treat, data = dta)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.1083  -4.2773  -0.5484   5.4838  15.2922 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  49.7711    13.3910   3.717  0.00041 ***
Prewt         0.4345     0.1612   2.695  0.00885 ** 
TreatCont    -4.0971     1.8935  -2.164  0.03400 *  
TreatFT       4.5631     2.1333   2.139  0.03604 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.978 on 68 degrees of freedom
Multiple R-squared:  0.2777,    Adjusted R-squared:  0.2458 
F-statistic: 8.713 on 3 and 68 DF,  p-value: 5.719e-05
(Intercept)       Prewt   TreatCont     TreatFT 
 49.7711090   0.4344612  -4.0970655   4.5630627 

Use the user-defined function to conduct the bootstrap

Use the function to compute the estimates and CIs of the linear model coefficients via the bootstrap method

(Intercept)       Prewt   TreatCont     TreatFT 
 37.2243800   0.5579126   0.9651200   5.3044083 
            [,1]       [,2]
[1,] 17.20076434 79.0250832
[2,]  0.07060444  0.8292259
[3,] -7.11445046 -0.2725961
[4,] -0.09495921  8.9724349
                 2.5 %     97.5 %
(Intercept) 23.0498681 76.4923499
Prewt        0.1128268  0.7560955
TreatCont   -7.8754712 -0.3186599
TreatFT      0.3060571  8.8200682

The results of the parametric linear regression and the bootstrap are not very similar.

The modified script

Function modularization: Create the functions

# Function for conducting bootstrap method once
bootstrap_single <- function(string_formula, model_data) {
  bootstrap_ids <- sample(seq(nrow(model_data)), nrow(model_data), replace = TRUE)
  bootstrap_data <- model_data[bootstrap_ids,]
  bootstrap_model <- lm(as.formula(string_formula), bootstrap_data)
  return(bootstrap_model$coefficients)
}

# Function for computing bootstrap confidence interval with 
# (1) the given bootstrap output and (2) the specified confidence level
bootstrap_CI <- function(coeff_mtx, level=.95) {
  out <- data.frame(apply(coeff_mtx, 2, function(x) quantile(x, (1 - level)/2)),
                    apply(coeff_mtx, 2, function(x) quantile(x, 1 - (1 - level)/2)))
  colnames(out) <- paste0(c('Lower', 'Upper'), ' bound of ', level*100, '% C.I.')
  rownames(out) <- colnames(coeff_mtx)
  return(out)
}

# Function for conducting the bootstrap a given number of times (default = 1000)
bootstrap_function2  <- function(string_formula, model_data, ndraws=1000, CI=FALSE, level=.95) {
  coeff_mtx <- t(sapply(1:ndraws, function(i) bootstrap_single(string_formula, model_data)))
  if (CI) {return(bootstrap_CI(coeff_mtx=coeff_mtx, level=level))}
  else {return(coeff_mtx)}
}

# Function for plotting the distribution of coefficients from the bootstrap output
plot_coeff_distri <- function(coeff_mtx) {
  nc <- ncol(coeff_mtx)
  par(mfrow=c((nc + 1) %/% 2, 2))
  lapply(1:nc, function(k) hist(coeff_mtx[,k], freq = FALSE, # use 'density' instead of 'frequency' as y
                                main = paste0('Histogram of coeff ', colnames(coeff_mtx)[k]),
                                xlab = colnames(coeff_mtx)[k]))
}

Use the functions

  • Conduct the bootstrap method. The output is a matrix (one row per draw).
[1] 500   4
     (Intercept)     Prewt  TreatCont  TreatFT
[1,]    48.51441 0.4464814 -2.7719579 5.646139
[2,]    34.15512 0.5902138 -2.7889033 7.975395
[3,]    12.42604 0.8788593 -0.7156288 1.850311
[4,]    54.72025 0.3719938 -1.8728123 8.995408
[5,]    71.62591 0.1737466 -3.7647641 6.775371
[6,]    39.65098 0.5454511 -4.1275988 5.411284
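
A call sequence consistent with the matrix above would be (the formula and data come from the lm() fit shown earlier; the exact draws differ because no seed is fixed):

bootstrap_estimates <- bootstrap_function2("Postwt ~ Prewt + Treat", dta, ndraws = 500)
dim(bootstrap_estimates)
head(bootstrap_estimates)

The histograms in the next item would then be produced by plot_coeff_distri(bootstrap_estimates).
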
  • Plot the distribution of coefficients from the bootstrap output.

  • Use bootstrap_function2 to obtain the 95% bootstrap CI for different numbers of bootstrap draws.
                 2.5 %     97.5 %
(Intercept) 23.0498681 76.4923499
Prewt        0.1128268  0.7560955
TreatCont   -7.8754712 -0.3186599
TreatFT      0.3060571  8.8200682
  • Plot to compare the results across different numbers of bootstrap draws.

There are no obvious differences in the coefficients across different numbers of bootstrap draws (ndraws) or between the bootstrap and the parametric linear model. Moreover, once ndraws is large enough, the bootstrap CIs behave like the CIs of the parametric linear model: they do not cover zero.


In-class exercise 4.

What does the following R script do? Replace the while loop in the code with a different iteration control structure.

The original script

This script aims to demonstrate Brownian motion. Given the sample size n and the iteration number nIter, the function Brownian plots n points with text labels at random positions following the standard normal distribution. Further arguments of plot can also be passed to Brownian. Just like in-class exercise 1, the output in the markdown would be a large number of plots, so I do not show it here. In one of the modified scripts below, I turn the output plots into an animation using ffmpeg to show the point-moving process clearly.
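
A minimal sketch of the kind of function described above (the implementation details are assumptions, not the original code); here a for loop stands in for the while loop:

Brownian <- function(n = 10, nIter = 50, ...) {
  # start the n labelled points at standard-normal positions
  x <- rnorm(n)
  y <- rnorm(n)
  for (i in seq_len(nIter)) {
    plot(x, y, type = "n", xlim = c(-10, 10), ylim = c(-10, 10), ...)
    text(x, y, labels = seq_len(n))
    # each point takes a standard-normal step per iteration
    x <- x + rnorm(n)
    y <- y + rnorm(n)
  }
  invisible(NULL)
}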

In-class exercise 5.

Here is an example of simulation using R. Figure out how it works and improve the code.

This example shows how to sample, at random, from a population of 50.5% men and 49.5% women, in which height is normally distributed with a mean of 170 cm and a standard deviation of 7 cm for men, with corresponding values 160 and 5 for women. The output is a data frame, with the first column showing gender (M for male, F for female) and the second column showing height.

The original script

  [,1]  [,2]  [,3]   [,4]
F 0.47 0.508 0.499 0.4986
M 0.53 0.492 0.501 0.5014

This script creates a function which generates a synthetic data set of gender and height. The random variable gender is generated from a uniform distribution, and height is generated from the standard normal distribution and then shifted and scaled according to gender. The proportion of males is set to about \(50.5\%\).

The modified script

  1. Create an index vector gender_idx for setting the gender label (i.e., F or M) and the corresponding height formula.
  2. Use the normal distribution with different parameters (i.e., \(\mu\) and \(\sigma\)) for each gender to replace the if-else structure.
  3. Create a new argument male_proportion to let the user assign the proportion of males instead of fixing it at \(50.5\%\); see the sketch below.
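
A minimal sketch consistent with the three points above (the function name and argument defaults are assumptions):

generate_height <- function(n, male_proportion = .505) {
  gender_idx <- runif(n) < male_proportion            # TRUE = male
  data.frame(gender = ifelse(gender_idx, "M", "F"),
             height = rnorm(n,
                            mean = ifelse(gender_idx, 170, 160),
                            sd   = ifelse(gender_idx, 7, 5)))
}

For instance, generate_height(1000, male_proportion = .6) returns a 1000-row data frame whose observed male proportion is close to 0.6.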

Use the function

Check the distribution of height

Check the proportion of males

$`p = 0.1`
  n = 100 n = 500 n = 1000 n = 5000
F    0.83   0.912    0.902   0.9056
M    0.17   0.088    0.098   0.0944

$`p = 0.6`
  n = 100 n = 500 n = 1000 n = 5000
F    0.42   0.342    0.371   0.3986
M    0.58   0.658    0.629   0.6014

$`p = 0.8`
  n = 100 n = 500 n = 1000 n = 5000
F    0.15   0.226    0.197   0.2008
M    0.85   0.774    0.803   0.7992