1 Group Homework

  • You will work with your group to complete this assignment.

  • Upload your html file on RPubs and include the link when you submit your submission files on Collab.

  • Submit your group’s shared .Rmd AND “knitted”.html files on Collab.

  • Your “knitted .html” submission must be created from your “group .Rmd” but be created on your own computer.

  • Confirm this with the following comment included in your submission text box: “Honor Pledge: I have recreated my group submission using using the tools I have installed on my own computer”

  • Name the files with a group name and YOUR name for your submission.

  • One of the group members will present R codes and plots for Parts 2 and 3 in class on Nov. 2 (Tue). Please e-mail the instructor if you’re a presenter by 11:59pm, Nov. 1. Also, if you’re a presenter, please bring your laptop so that you can share your screen on zoom for the presentation.

2 Part 1

  • Use the SCS Data set you downloaded from the previous group assignments, and then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”).

  • Plot the data, linear line, and bootstrap confidence envelopes. Use 2,000 bootstrap replicates (i.e., R=2000) in function boot, and add appropriate x- and y- labels, and a title to the graph.

  • Please refer to section: Linear regression with bootstrap confidence intervals in DS3003_visualizingerrors_reg_note.html, DS3003_visualizingerrors_reg_code.html, and DS3003_visualizingerrors_reg_code.html

library(foreign)
library(boot)
library(ggplot2)
# reading in data
scs_data <- read.spss('SCS_QE.sav',to.data.frame = TRUE)
# boot function
# b.stat function from notes
b.stat <- function(data, i)
{
   b.dat <- data[i ,]
   out.lm <- lm(mars ~ mathpre, b.dat)
   predict(out.lm, data.frame(mathpre=scs_data2$mathpre))   
}
# first 100 cases
scs_data2 <- scs_data[1:100,]
b.out <- boot(scs_data2, b.stat, R = 2000) # 2000 bootstrap replicates

boot.ci(b.out, index = 1, type = 'perc') # 95% CI for the first observation
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b.out, type = "perc", index = 1)
## 
## Intervals : 
## Level     Percentile     
## 95%   (52.46, 60.28 )  
## Calculations and Intervals on Original Scale
# lower and upper bounds of confidence intervals
b.ci <- t(sapply(1:nrow(scs_data2), function(x) boot.ci(b.out, index = x, type = 'perc')$percent))[, 4:5]
dimnames(b.ci) <- list(rownames(scs_data2), c('lower', 'upper'))

# combining datasets
scs_data3 <- cbind(scs_data2, b.ci)

# plotting data, linear line, and bootstrap confidence intervals
ggplot(scs_data3, aes(x=mathpre, y=mars)) + geom_point(alpha=0.2) + labs(x = 'Math Achievement Score', y = 'Math Anxiety Score', title='Math Anxiety vs. Math Achievement') + theme_bw() + 
        geom_smooth(method='lm', formula= y~x, se = FALSE) +
        geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill="#69b3a2")

As shown in the graph, there is a negative relationship between Math Anxiety Score and Math Achievement Score. The 95% boostrap confidence intervals are smallest between Math Achievement Scores 5 through 10, corresponding to Math Anxiety Scores 50 through 60.

3 Part 2

  • Create WHO Reporting Barplots with error bars separated by WHO region using either facet_grid or facet_wrap.

  • First, get the latest data from from https://covid19.who.int/table.

    • The file should likely be named “WHO COVID-19 global table data October XXth 2021 at XXXXX.csv”

    • Don’t use the data that I uploaded on Collab. It’s not the most recent data.

  • Second, create a subset including 3 countries per WHO region (Africa, Americas, Eastern Mediterranean, Europe, South-East Asia, Western Pacific). You can choose any three countries within each WHO region to compare the mortality rate (mutate(rate = "Deaths - cumulative total"/"Cases - cumulative total")).

  • Third, draw bar plots with error bars using your subset, but adjust the graph in the facets using either facet_grid or facet_wrap (e.g., facet_grid(~ "WHO region", scale="free"). Please include scale="free" in your facet function.

library(DT)
library(dplyr)
library(TSA)
library(tidyverse)
# reading in data
covid_data <- read_csv('WHO_COVID-19_data_October_31st.csv')

# using datatable to help me choose three countries per region
datatable(covid_data)
regions_and_countries <- data.frame('Region' = c('Africa','Africa','Africa', 'Americas','Americas','Americas','Eastern Mediterranean','Eastern Mediterranean','Eastern Mediterranean','Europe','Europe','Europe','South-East Asia','South-East Asia','South-East Asia','Western Pacific', 'Western Pacific', 'Western Pacific'), 
                           'Country' = c('South Africa', 'Ethiopia', 'Kenya', 'Colombia', 'Brazil', 'Argentina', 'Morocco', 'Iraq', 'Pakistan', 'France', 'Spain', 'Turkey', 'India', 'Indonesia', 'Thailand', 'China', 'Malaysia', 'Japan'))

# getting error bars
covid_data2 <- covid_data %>% filter(Name %in% regions_and_countries$Country) %>% 
  mutate(rate = `Deaths - cumulative total`/`Cases - cumulative total`, 
         SE = sqrt(rate*(1-rate)/`Cases - cumulative total`))
# plotting
ggplot(covid_data2, aes(x=rate, y=Name)) + geom_col() + theme_bw() + 
  xlab('Mortality Rate') + ylab('Country') + labs(title='COVID-19 Mortality Rate vs. Country by WHO Region') +
  geom_errorbar(aes(xmin=rate-1.96*SE, xmax=rate+1.96*SE), width=.2) + 
  facet_grid(~ covid_data2$`WHO Region`, scale="free") + coord_flip() + 
  # changing country tick labels and facet grid WHO region label size
  theme(axis.text.x = element_text(angle = 45,hjust=1),strip.text.x = element_text(size = 6.5))

4 Part 3

  • See TABLE 2. COVID-19 vaccine effectiveness against COVID-19–associated hospitalization among adults without immunocompromising conditions, by vaccine product — 21 hospitals in 18 U.S. states, March–August 2021 from a recent study on Comparative Effectiveness of Moderna, Pfizer-BioNTech, and Janssen (Johnson & Johnson) Vaccines.

  • Draw your best plot to visualize results of VE against COVID-19 hospitalization (95% CI), i.e., the third column of TABLE 2.

  • First, save data about Vaccine/Period and VE against COVID-19 hospitalization (95% CI).

    • You could create a .csv file including VE against COVID-19 hospitalization (95% CI) and Vaccine/Period.
    • Or, you could save data in R directly, say using data.frame() or tibble().
  • Second, draw bar plots with error bars, using the data you saved in the first step.

    • You could use the facets function as in Part 2.

    • Or, you could draw separate bar plots for each vaccine and collect three bar plots into a single figure using something like, e.g., gridExtra::grid.arrange(). Here are some helpul notes and explanations.

vaccine_data <- data.frame('Vaccine' = c('Moderna', 'Moderna','Moderna','Pfizer-BioNTech','Pfizer-BioNTech','Pfizer-BioNTech', 'Janssen','Janssen'), 
                           'VaccinePeriod' = c('Full surveillance period','14-120 days after full vaccination','>120 days after full vaccination','Full surveillance period','14-120 days after full vaccination','>120 days after full vaccination','Full surveillance period','>28 days after full vaccination'),
                           'VE' = c(93,93,92,88,91,77,71,68), 
                           'LowerVECI' = c(91,90,87,85,88,67,56,49),
                           'UpperVECI' = c(95,95,96,91,93,84,81,80))

# plotting
attach(vaccine_data)
ggplot(vaccine_data, aes(x=VE, y=VaccinePeriod)) + geom_col() + theme_bw() + 
  xlab('VE against COVID-19 hospitalization') + ylab('Vaccine/Period') + labs(title='VE vs. Vaccine/Period by Vaccine') +
  geom_errorbar(aes(xmin=LowerVECI, xmax=UpperVECI), width=.2) + 
  facet_grid(~ Vaccine, scale="free") + coord_flip() + 
  # changing country tick labels and facet grid WHO region label size
  theme(axis.text.x = element_text(angle = 45,hjust=1),strip.text.x = element_text(size = 7))