Honor Pledge: I have recreated my group submission using using the tools I have installed on my own computer"-Brittany Nguyen

1 Group Homework

  • You will work with your group to complete this assignment.

  • Upload your html file on RPubs and include the link when you submit your submission files on Collab.

  • Submit your group’s shared .Rmd AND “knitted”.html files on Collab.

  • Your “knitted .html” submission must be created from your “group .Rmd” but be created on your own computer.

  • Confirm this with the following comment included in your submission text box: “Honor Pledge: I have recreated my group submission using using the tools I have installed on my own computer”

  • Name the files with a group name and YOUR name for your submission.

  • One of the group members will present R codes and plots for Parts 2 and 3 in class on Nov. 2 (Tue). Please e-mail the instructor if you’re a presenter by 11:59pm, Nov. 1. Also, if you’re a presenter, please bring your laptop so that you can share your screen on zoom for the presentation.

2 Part 1

  • Use the SCS Data set you downloaded from the previous group assignments, and then investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”).

  • Plot the data, linear line, and bootstrap confidence envelopes. Use 2,000 bootstrap replicates (i.e., R=2000) in function boot, and add appropriate x- and y- labels, and a title to the graph.

  • Please refer to section: Linear regression with bootstrap confidence intervals in DS3003_visualizingerrors_reg_note.html, DS3003_visualizingerrors_reg_code.html, and DS3003_visualizingerrors_reg_code.html

library(foreign)
require(car)
## Loading required package: car
## Loading required package: carData
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
library(ggplot2)


SCS <- read.spss('SCS_QE (1).sav',to.data.frame = TRUE)
## re-encoding from CP1252
## Warning in read.spss("SCS_QE (1).sav", to.data.frame = TRUE): Undeclared
## level(s) 0 added in variable: married
b.stat <- function(data, i)
{
   b.dat <- data[i ,]
   out.lm <- lm(mathpre ~ mars, b.dat)
   predict(out.lm, data.frame(mars=SCS$mars))   
}

b.out <- boot(SCS, b.stat, R = 2000) # R = num of replications

boot.ci(b.out, index = 1, type = 'perc') # 95% CI for the first observation
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b.out, type = "perc", index = 1)
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 6.163,  6.935 )  
## Calculations and Intervals on Original Scale
b.ci <- t(sapply(1:nrow(SCS), function(x) boot.ci(b.out, index = x, type = 'perc')$percent))[, 4:5]
dimnames(b.ci) <- list(rownames(SCS), c('lower', 'upper'))
kable(head(b.ci, 4))
lower upper
6.163093 6.934756
5.246654 6.042866
5.466914 6.187408
6.619554 7.773742
SCS2 <- cbind(SCS, b.ci) # combine two datasets
ggplot(SCS2, aes(x=mars, y=mathpre)) + geom_point(alpha=0.2) + labs(x = 'Math Anxiety Score', y = 'Math Achievement Score', title = "The Relationship Between Math Anxiety Score and Math Achievement Score") + theme_bw() + 
        geom_smooth(method='lm', formula= y~x, se = FALSE) +
        geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill="#69b3a2")

3 Part 2

  • Create WHO Reporting Barplots with error bars separated by WHO region using either facet_grid or facet_wrap.

  • First, get the latest data from from https://covid19.who.int/table.

    • The file should likely be named “WHO COVID-19 global table data October XXth 2021 at XXXXX.csv”

    • Don’t use the data that I uploaded on Collab. It’s not the most recent data.

  • Second, create a subset including 3 countries per WHO region (Africa, Americas, Eastern Mediterranean, Europe, South-East Asia, Western Pacific). You can choose any three countries within each WHO region to compare the mortality rate (mutate(rate = "Deaths - cumulative total"/"Cases - cumulative total")).

  • Third, draw bar plots with error bars using your subset, but adjust the graph in the facets using either facet_grid or facet_wrap (e.g., facet_grid(~ "WHO region", scale="free"). Please include scale="free" in your facet function.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()
whocovid <- read.csv('WHO COVID-19.csv')

whocovid <- whocovid[is.element(whocovid$Name, c("South Africa", "Ethiopia", "Nigeria", "United States of America", "Canada", "Mexico", "Iran (Islamic Republic of)", "Iraq", "Pakistan", "Israel", "Belgium", "Kazakhstan", "India", "Indonesia", "Thailand", "Philippines", "Malaysia", "Japan")),]

whocovid <- whocovid %>% mutate(rate = Deaths...cumulative.total / Cases...cumulative.total)

#whocovid
library(ggplot2)

whocovid <- whocovid %>% mutate(se = sqrt(rate * (1 - rate) / Cases...cumulative.total), lower = rate - 1.96 * se, upper = rate + 1.96 * se) #for error bars

ggplot(whocovid, aes(x=Name, y=rate, fill=WHO.Region)) + 
geom_bar(position=position_dodge(), stat="identity",
         colour="black") +     
labs(x="Country", y="Mortality Rate", title="COVID-19 Mortality Rate by Country Across Different Regions", fill="Region") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + facet_grid(~WHO.Region, scales="free") +
geom_errorbar(aes(ymin=lower, ymax=upper),
                size=.3,
                width=.2,
                position=position_dodge(.9))

While the error bars were very small for all regions (indicating small variation per country), we observed that the Americas had a high mortality rate for the countries observed (with a high spike in Mexico). The mortality rate is also high for Indonesia. The smallest mortality rates overall were observed for the Western Pacific Region. Israel had the smallest mortality rate. Much of the mortality rate can be attributed to the country-specific lock down restrictions (Japan’s strict enforcement can be correlated with its low mortality rate).

4 Part 3

  • See TABLE 2. COVID-19 vaccine effectiveness against COVID-19–associated hospitalization among adults without immunocompromising conditions, by vaccine product — 21 hospitals in 18 U.S. states, March–August 2021 from a recent study on Comparative Effectiveness of Moderna, Pfizer-BioNTech, and Janssen (Johnson & Johnson) Vaccines.

  • Draw your best plot to visualize results of VE against COVID-19 hospitalization (95% CI), i.e., the third column of TABLE 2.

  • First, save data about Vaccine/Period and VE against COVID-19 hospitalization (95% CI).

    • You could create a .csv file including VE against COVID-19 hospitalization (95% CI) and Vaccine/Period.
    • Or, you could save data in R directly, say using data.frame() or tibble().
  • Second, draw bar plots with error bars, using the data you saved in the first step.

    • You could use the facets function as in Part 2.

    • Or, you could draw separate bar plots for each vaccine and collect three bar plots into a single figure using something like, e.g., gridExtra::grid.arrange(). Here are some helpul notes and explanations.

getwd()
## [1] "/Users/brittanynguyen/Downloads"
vedata<-read.csv('vedata (1).csv')
str(vedata)
## 'data.frame':    8 obs. of  5 variables:
##  $ Period       : chr  "Full" "14-120 days" ">120 days" "Full" ...
##  $ Effectiveness: int  93 93 92 88 91 77 71 68
##  $ Vaccine      : chr  "Moderna" "Moderna" "Moderna" "Pfizer" ...
##  $ Lower        : int  91 90 87 85 88 67 56 49
##  $ Upper        : int  95 95 96 91 93 84 81 80
vedata$Vaccine<-as.factor(vedata$Vaccine)
vedata$Period<-as.factor(vedata$Period)
vedata
##        Period Effectiveness Vaccine Lower Upper
## 1        Full            93 Moderna    91    95
## 2 14-120 days            93 Moderna    90    95
## 3   >120 days            92 Moderna    87    96
## 4        Full            88  Pfizer    85    91
## 5 14-120 days            91  Pfizer    88    93
## 6   >120 days            77  Pfizer    67    84
## 7        Full            71      JJ    56    81
## 8    >28 days            68      JJ    49    80
str(vedata)
## 'data.frame':    8 obs. of  5 variables:
##  $ Period       : Factor w/ 4 levels ">120 days",">28 days",..: 4 3 1 4 3 1 4 2
##  $ Effectiveness: int  93 93 92 88 91 77 71 68
##  $ Vaccine      : Factor w/ 3 levels "JJ","Moderna",..: 2 2 2 3 3 3 1 1
##  $ Lower        : int  91 90 87 85 88 67 56 49
##  $ Upper        : int  95 95 96 91 93 84 81 80
library(ggplot2)
ggplot(vedata, aes(x = Vaccine, y=Effectiveness, fill=Vaccine)) + 
    geom_bar(position=position_dodge(), stat="identity", colour='black') +
    geom_errorbar(aes(ymin=Lower, ymax=Upper),
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))+
  facet_grid(~Period, scales = "free")+labs(title='Vacccine Effectiveness by Brand and Period with 95% Confidence Intervals', x='Vaccine', y='Vaccine Effectiveness (%)')

ggplot(vedata, aes(x = Vaccine, y=Effectiveness, fill=Vaccine)) + 
    geom_bar(position=position_dodge(), stat="identity",
             colour="black" # Use black outlines,
             ) +     
    geom_errorbar(aes(ymin=Lower, ymax=Upper),
                  size=.3,    # Thinner lines
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9)) +
    labs(x="Vaccine", y="Vaccine Effectiveness (%)", title="Vacccine Effectiveness by Brand and Period with 95% Confidence Intervals") +
    theme_bw()+facet_wrap(~Period, scales="free")

As demonstrated by the above bar graphs, Moderna has the highest Vaccine Efficacy in all periods (except for the >28 day period). It is significantly more effective in the >120 day period than in the other time frames, respectively. Additionally, the 95% error bars illustrate that the Moderna vaccine has the least variability in vaccine efficacy compared to the other two vaccine brands. Less variability indicates a more consistently effective vaccine. While the first plot employs facet_grid() to distinguish by Period, the second plot uses facet_wrap() on Period to achieve a similar effect.