By the end of this lab:
R stores data as dataframe object which contains a collection of vectors representing rows/observations and columns/attributes in the dataframe. In this class, for instance, we have used a number of data files downloaded from the US CDC website to calculate rates and ratios; each of those data files represents an object in R, while the attributes such as jurisdiction, race/ethinicity, fatality status represent the vectors.
Yet Another Markup LanguageYAML Aint Markup LanguageR Markdown document:
R Markdown documents are HTML (such as one we are using for this lab), CSS, PDF and Word documents.Example of default YAML:
---
title: "lab01: Introduction to R and RStudio Utilities"
author: "Elhakim Ibrahim"
date: "10/20/2020"
output: html_document
---Check below links for more details:
R functions work on codes specified in dedicated block called chunk. R allows for additional list of workspace customization options by specifying a global setting options for the entire R session using opts_chunk$set() function:
remove()
rm(list=ls())
knitr::opts_chunk$set(
comment=NULL, message=F, warning=F, error=F, tidy=T,
prompt=F, cache=F, fig.align='center', fig.ext='png')
options(prompt="e>",digits=5,scipen=10,max.print=100000)instal.packages("nameofpackage") functionlibrary(nameofpackage) function; this time without quotes around the name:R has lots of functions embedded in the libraries which enable users to download different data types (e.g. haven()), perform data restructuring, creation, subsetting, recoding, renaming, to mention but a few, as well as basic and complex tabulations. We shall go through some of these basic functions with the US COVID-19 data downloaded from the CDC website.
I called in the United States COVID-19 data (object) directly into my work environment from the CDC website using URL link. This is one of the classical flexibility allowed by R. I also restricted the data to year 2020 only withsubset() function, being the year of interest. The merits of this approach include the ease of access to most recent data from the central database and the tons of space it allows user to save on storage disk.
url_us <- "https://data.cdc.gov/api/views/qfhf-uhaa/rows.csv?accessType=DOWNLOAD"
usdat0 <- read_csv(url_us)
## usdat0<-subset(usdat0, Outcome=='All Cause' & Type=='Unweighted' &
## MMWRYear==2020)
usdat0 <- subset(usdat0, Outcome == "All Cause" & Type == "Predicted (weighted)" &
MMWRYear == 2020 & MMWRWeek <= 39)In some instances, column names could have space or special character in-between which could be a torn in the flesh. Imagine having to manually edit an excel dataset/object with 100 columns with such “disturbing characters” that need be clean before working on the data.
Good news!!! We can acheive this with one line of codes using the gsub() command. The command allows us to format the column names; such as removing trailing spaces and special characters, or replacing them with choice formatting. In this case, I removed the spaces and hyphens in-between column names. I also added two new columns- Rowcount, PriorNumberofDeaths and MMWRMonth to the dataframe:
Using dim() function, it is observed that the file contains 12636 rows (or cases) and 18 columns (or data attributes).
[1] 12636 18
Meanwhile, the names() function is applied to display the dataframe column names:
[1] "Jurisdiction" "WeekEndingDate"
[3] "StateAbbreviation" "MMWRYear"
[5] "MMWRWeek" "RaceEthnicity"
[7] "TimePeriod" "Suppress"
[9] "Note" "Outcome"
[11] "NumberofDeaths" "AverageNumberofDeathsinTimePeriod"
[13] "Differencefrom20152019to2020" "PercentDifferencefrom20152019to2020"
[15] "Type" "Rowcount"
[17] "PriorNumberofDeaths" "MMWRMonth"
Next step is to select the needed columns using a combination of subset() and select() functions, and rename the columns using transmute() function which automatically discards the original columns:
usdat <- subset(usdat0, select = c(Jurisdiction, Outcome, WeekEndingDate, MMWRYear,
MMWRMonth, MMWRWeek, RaceEthnicity, PriorNumberofDeaths, NumberofDeaths, AverageNumberofDeathsinTimePeriod,
Differencefrom20152019to2020, PercentDifferencefrom20152019to2020, Type))
usdat <- usdat %>% transmute(jurisd = Jurisdiction, outcome = Outcome, estimatetype = Type,
wkendate = WeekEndingDate, mmwryr = MMWRYear, mmwrmo = MMWRMonth, mmwrwk = MMWRWeek,
racethn = RaceEthnicity, projected_deaths = PriorNumberofDeaths, observed_deaths = NumberofDeaths,
avrg_observed_deaths = AverageNumberofDeathsinTimePeriod, excess_deaths = Differencefrom20152019to2020,
perc_excess_deaths = PercentDifferencefrom20152019to2020)
rm(usdat0)Then, I validate the transmuted column headers:
[1] "jurisd" "outcome" "estimatetype"
[4] "wkendate" "mmwryr" "mmwrmo"
[7] "mmwrwk" "racethn" "projected_deaths"
[10] "observed_deaths" "avrg_observed_deaths" "excess_deaths"
[13] "perc_excess_deaths"
Lastly, I check the the structure of the data structure using theglimpse() function:
Rows: 12,636
Columns: 13
$ jurisd <chr> "Alabama", "Alabama", "Alabama", "Alabama", "A...
$ outcome <chr> "All Cause", "All Cause", "All Cause", "All Ca...
$ estimatetype <chr> "Predicted (weighted)", "Predicted (weighted)"...
$ wkendate <chr> "01/04/2020", "01/11/2020", "01/18/2020", "01/...
$ mmwryr <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020...
$ mmwrmo <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ mmwrwk <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
$ racethn <chr> "Hispanic", "Hispanic", "Hispanic", "Hispanic"...
$ projected_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ observed_deaths <dbl> NA, NA, NA, NA, 10, 11, NA, 12, NA, NA, NA, 10...
$ avrg_observed_deaths <dbl> NA, NA, NA, NA, 10, 11, NA, 12, NA, NA, NA, 10...
$ excess_deaths <dbl> NA, NA, NA, NA, 2, 4, NA, 5, NA, NA, NA, 6, NA...
$ perc_excess_deaths <dbl> NA, NA, NA, NA, 25.0, 57.1, NA, 71.4, NA, NA, ...
Notes
In the following steps, I generated values formmwrmo column based on thewkendate column to specify the months of data reporting.
wkendate column toPOSIXlt class of data:[1] "01/04/2020" "01/11/2020" "01/18/2020" "01/25/2020" "02/01/2020"
[1] "08/29/2020" "09/05/2020" "09/12/2020" "09/19/2020" "09/26/2020"
[1] "POSIXlt" "POSIXt"
[1] "2020-01-04 CST" "2020-01-11 CST" "2020-01-18 CST" "2020-01-25 CST"
[5] "2020-02-01 CST"
[1] "2020-08-29 CDT" "2020-09-05 CDT" "2020-09-12 CDT" "2020-09-19 CDT"
[5] "2020-09-26 CDT"
mmwrmos of reporting from the wkendate column:Frequencies
usdat$mmwrmo
Type: Character
Freq %
----------- ------- --------
01 1296 10.26
02 1620 12.82
03 1296 10.26
04 1296 10.26
05 1620 12.82
06 1296 10.26
07 1296 10.26
08 1620 12.82
09 1296 10.26
Total 12636 100.00
mmwrmos to a factor class of data with descriptive mmwrmo labels. Then, I reorder the columns such that mmwryr, mmwrwk,mmwrmo and racethn will shift from columns 4, 5, 11 and 6 to occupy columns 6, 4, 5 and 11, respectively:usdat$mmwrmo <- factor(usdat$mmwrmo, levels = c("01", "02", "03", "04", "05", "06",
"07", "08", "09", "10"), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct"))
freq(usdat$mmwrmo, report.nas = F, cumul = F)Frequencies
usdat$mmwrmo
Type: Factor
Freq %
----------- ------- --------
Jan 1296 10.26
Feb 1620 12.82
Mar 1296 10.26
Apr 1296 10.26
May 1620 12.82
Jun 1296 10.26
Jul 1296 10.26
Aug 1620 12.82
Sep 1296 10.26
Oct 0 0.00
Total 12636 100.00
[1] "jurisd" "outcome" "estimatetype"
[4] "wkendate" "mmwryr" "mmwrmo"
[7] "mmwrwk" "racethn" "projected_deaths"
[10] "observed_deaths" "avrg_observed_deaths" "excess_deaths"
[13] "perc_excess_deaths"
Here, I simply recode race/ethnicity to shorter labels and replace NAs with zeros:
Frequencies
usdat$racethn
Type: Character
Freq %
--------------------------------------------------- ------- --------
Hispanic 2106 16.67
Non-Hispanic American Indian or Alaska Native 2106 16.67
Non-Hispanic Asian 2106 16.67
Non-Hispanic Black 2106 16.67
Non-Hispanic White 2106 16.67
Other 2106 16.67
Total 12636 100.00
usdat$racethn <- recode(usdat$racethn, Hispanic = "Hispa", `Non-Hispanic American Indian or Alaska Native` = "IA/AN",
`Non-Hispanic Asian` = "Asian", `Non-Hispanic Black` = "Black", `Non-Hispanic White` = "White")
freq(usdat$racethn, report.nas = F, cumul = F)Frequencies
usdat$racethn
Type: Character
Freq %
----------- ------- --------
Asian 2106 16.67
Black 2106 16.67
Hispa 2106 16.67
IA/AN 2106 16.67
Other 2106 16.67
White 2106 16.67
Total 12636 100.00
usdat <- usdat %>% mutate(projected_deaths = coalesce(projected_deaths, 0), observed_deaths = coalesce(observed_deaths,
0), avrg_observed_deaths = coalesce(avrg_observed_deaths, 0), excess_deaths = coalesce(excess_deaths,
0), perc_excess_deaths = coalesce(perc_excess_deaths, 0))I generated values forprojected_.deaths column representing the difference between observed_.deaths and excess_deaths, then check the summary statistics for number of excess death in 2020:
usdat$projected_deaths_ <- as.numeric(usdat$observed_deaths - usdat$excess_deaths)
summary(usdat$excess_deaths) Min. 1st Qu. Median Mean 3rd Qu. Max.
-1352.0 0.0 0.0 55.4 18.0 11505.0
This implies estimated excess deaths as low as -1352 and as high as 11505 across the United States in year 2020 relative to similar time period (in months) between 2015 and 2019.
The View() function allows for viewing the dataframe:
Here, I attempt to create separate files by race/ethnicity and combine them into a wide-format data sets for the United States and State of Texas.
usdat_ <- subset(usdat, racethn == "Hispa", select = c(jurisd, outcome, estimatetype,
wkendate, mmwryr, mmwrmo, mmwrwk))
dim(usdat_)[1] 2106 7
usdat_ia <- subset(usdat, racethn == "IA/AN")
usdat_ia <- usdat_ia %>% mutate(projected_deaths_ia = projected_deaths_, observed_deaths_ia = observed_deaths,
avrg_observed_deaths_ia = avrg_observed_deaths, excess_deaths_ia = excess_deaths,
perc_excess_deaths_ia = perc_excess_deaths, racethn_ia = racethn)
usdat_ia <- subset(usdat_ia, select = c(racethn_ia, projected_deaths_ia, observed_deaths_ia,
avrg_observed_deaths_ia, excess_deaths_ia, perc_excess_deaths_ia))
dim(usdat_ia)[1] 2106 6
usdat_as <- subset(usdat, racethn == "Asian")
usdat_as <- usdat_as %>% mutate(projected_deaths_as = projected_deaths_, observed_deaths_as = observed_deaths,
avrg_observed_deaths_as = avrg_observed_deaths, excess_deaths_as = excess_deaths,
perc_excess_deaths_as = perc_excess_deaths, racethn_as = racethn)
usdat_as <- subset(usdat_as, select = c(racethn_as, projected_deaths_as, observed_deaths_as,
avrg_observed_deaths_as, excess_deaths_as, perc_excess_deaths_as))
dim(usdat_as)[1] 2106 6
usdat_bl <- subset(usdat, racethn == "Black")
usdat_bl <- usdat_bl %>% mutate(projected_deaths_bl = projected_deaths_, observed_deaths_bl = observed_deaths,
avrg_observed_deaths_bl = avrg_observed_deaths, excess_deaths_bl = excess_deaths,
perc_excess_deaths_bl = perc_excess_deaths, racethn_bl = racethn)
usdat_bl <- subset(usdat_bl, select = c(racethn_bl, projected_deaths_bl, observed_deaths_bl,
avrg_observed_deaths_bl, excess_deaths_bl, perc_excess_deaths_bl))
dim(usdat_bl)[1] 2106 6
usdat_hi <- subset(usdat, racethn == "Hispa")
usdat_hi <- usdat_hi %>% mutate(projected_deaths_hi = projected_deaths_, observed_deaths_hi = observed_deaths,
avrg_observed_deaths_hi = avrg_observed_deaths, excess_deaths_hi = excess_deaths,
perc_excess_deaths_hi = perc_excess_deaths, racethn_hi = racethn)
usdat_hi <- subset(usdat_hi, select = c(racethn_hi, projected_deaths_hi, observed_deaths_hi,
avrg_observed_deaths_hi, excess_deaths_hi, perc_excess_deaths_hi))
dim(usdat_hi)[1] 2106 6
usdat_ot <- subset(usdat, racethn == "Other")
usdat_ot <- usdat_ot %>% mutate(projected_deaths_ot = projected_deaths_, observed_deaths_ot = observed_deaths,
avrg_observed_deaths_ot = avrg_observed_deaths, excess_deaths_ot = excess_deaths,
perc_excess_deaths_ot = perc_excess_deaths, racethn_ot = racethn)
usdat_ot <- subset(usdat_ot, select = c(racethn_ot, projected_deaths_ot, observed_deaths_ot,
avrg_observed_deaths_ot, excess_deaths_ot, perc_excess_deaths_ot))
dim(usdat_ot)[1] 2106 6
usdat_wh <- subset(usdat, racethn == "White")
usdat_wh <- usdat_wh %>% mutate(projected_deaths_wh = projected_deaths_, observed_deaths_wh = observed_deaths,
avrg_observed_deaths_wh = avrg_observed_deaths, excess_deaths_wh = excess_deaths,
perc_excess_deaths_wh = perc_excess_deaths, racethn_wh = racethn)
usdat_wh <- subset(usdat_wh, select = c(racethn_wh, projected_deaths_wh, observed_deaths_wh,
avrg_observed_deaths_wh, excess_deaths_wh, perc_excess_deaths_wh))
dim(usdat_wh)[1] 2106 6
usdat2 <- as.data.frame(cbind(usdat_, usdat_wh, usdat_hi, usdat_bl, usdat_as, usdat_ia,
usdat_ot))
dim(usdat2)[1] 2106 43
[1] 39
[1] "jurisd" "outcome"
[3] "estimatetype" "wkendate"
[5] "mmwryr" "mmwrmo"
[7] "mmwrwk" "racethn_wh"
[9] "projected_deaths_wh" "observed_deaths_wh"
[11] "avrg_observed_deaths_wh" "excess_deaths_wh"
[13] "perc_excess_deaths_wh" "racethn_hi"
[15] "projected_deaths_hi" "observed_deaths_hi"
[17] "avrg_observed_deaths_hi" "excess_deaths_hi"
[19] "perc_excess_deaths_hi" "racethn_bl"
[21] "projected_deaths_bl" "observed_deaths_bl"
[23] "avrg_observed_deaths_bl" "excess_deaths_bl"
[25] "perc_excess_deaths_bl" "racethn_as"
[27] "projected_deaths_as" "observed_deaths_as"
[29] "avrg_observed_deaths_as" "excess_deaths_as"
[31] "perc_excess_deaths_as" "racethn_ia"
[33] "projected_deaths_ia" "observed_deaths_ia"
[35] "avrg_observed_deaths_ia" "excess_deaths_ia"
[37] "perc_excess_deaths_ia" "racethn_ot"
[39] "projected_deaths_ot" "observed_deaths_ot"
[41] "avrg_observed_deaths_ot" "excess_deaths_ot"
[43] "perc_excess_deaths_ot"
[1] 234 14
txdat_ <- subset(txdat, racethn == "Hispa", select = c(jurisd, outcome, estimatetype,
wkendate, mmwryr, mmwrmo, mmwrwk))
dim(txdat_)[1] 39 7
txdat_ia <- subset(txdat, racethn == "IA/AN")
txdat_ia <- txdat_ia %>% mutate(projected_deaths_ia = projected_deaths_, observed_deaths_ia = observed_deaths,
avrg_observed_deaths_ia = avrg_observed_deaths, excess_deaths_ia = excess_deaths,
perc_excess_deaths_ia = perc_excess_deaths, racethn_ia = racethn)
txdat_ia <- subset(txdat_ia, select = c(racethn_ia, projected_deaths_ia, observed_deaths_ia,
avrg_observed_deaths_ia, excess_deaths_ia, perc_excess_deaths_ia))
dim(txdat_ia)[1] 39 6
txdat_as <- subset(txdat, racethn == "Asian")
txdat_as <- txdat_as %>% mutate(projected_deaths_as = projected_deaths_, observed_deaths_as = observed_deaths,
avrg_observed_deaths_as = avrg_observed_deaths, excess_deaths_as = excess_deaths,
perc_excess_deaths_as = perc_excess_deaths, racethn_as = racethn)
txdat_as <- subset(txdat_as, select = c(racethn_as, projected_deaths_as, observed_deaths_as,
avrg_observed_deaths_as, excess_deaths_as, perc_excess_deaths_as))
dim(txdat_as)[1] 39 6
txdat_bl <- subset(txdat, racethn == "Black")
txdat_bl <- txdat_bl %>% mutate(projected_deaths_bl = projected_deaths_, observed_deaths_bl = observed_deaths,
avrg_observed_deaths_bl = avrg_observed_deaths, excess_deaths_bl = excess_deaths,
perc_excess_deaths_bl = perc_excess_deaths, racethn_bl = racethn)
txdat_bl <- subset(txdat_bl, select = c(racethn_bl, projected_deaths_bl, observed_deaths_bl,
avrg_observed_deaths_bl, excess_deaths_bl, perc_excess_deaths_bl))
dim(txdat_bl)[1] 39 6
txdat_hi <- subset(txdat, racethn == "Hispa")
txdat_hi <- txdat_hi %>% mutate(projected_deaths_hi = projected_deaths_, observed_deaths_hi = observed_deaths,
avrg_observed_deaths_hi = avrg_observed_deaths, excess_deaths_hi = excess_deaths,
perc_excess_deaths_hi = perc_excess_deaths, racethn_hi = racethn)
txdat_hi <- subset(txdat_hi, select = c(racethn_hi, projected_deaths_hi, observed_deaths_hi,
avrg_observed_deaths_hi, excess_deaths_hi, perc_excess_deaths_hi))
dim(txdat_hi)[1] 39 6
txdat_ot <- subset(txdat, racethn == "Other")
txdat_ot <- txdat_ot %>% mutate(projected_deaths_ot = projected_deaths_, observed_deaths_ot = observed_deaths,
avrg_observed_deaths_ot = avrg_observed_deaths, excess_deaths_ot = excess_deaths,
perc_excess_deaths_ot = perc_excess_deaths, racethn_ot = racethn)
txdat_ot <- subset(txdat_ot, select = c(racethn_ot, projected_deaths_ot, observed_deaths_ot,
avrg_observed_deaths_ot, excess_deaths_ot, perc_excess_deaths_ot))
dim(txdat_ot)[1] 39 6
txdat_wh <- subset(txdat, racethn == "White")
txdat_wh <- txdat_wh %>% mutate(projected_deaths_wh = projected_deaths_, observed_deaths_wh = observed_deaths,
avrg_observed_deaths_wh = avrg_observed_deaths, excess_deaths_wh = excess_deaths,
perc_excess_deaths_wh = perc_excess_deaths, racethn_wh = racethn)
txdat_wh <- subset(txdat_wh, select = c(racethn_wh, projected_deaths_wh, observed_deaths_wh,
avrg_observed_deaths_wh, excess_deaths_wh, perc_excess_deaths_wh))
dim(txdat_wh)[1] 39 6
txdat2 <- as.data.frame(cbind(txdat_, txdat_wh, txdat_hi, txdat_bl, txdat_as, txdat_ia,
txdat_ot))
dim(txdat2)[1] 39 43
[1] 39
[1] "jurisd" "outcome"
[3] "estimatetype" "wkendate"
[5] "mmwryr" "mmwrmo"
[7] "mmwrwk" "racethn_wh"
[9] "projected_deaths_wh" "observed_deaths_wh"
[11] "avrg_observed_deaths_wh" "excess_deaths_wh"
[13] "perc_excess_deaths_wh" "racethn_hi"
[15] "projected_deaths_hi" "observed_deaths_hi"
[17] "avrg_observed_deaths_hi" "excess_deaths_hi"
[19] "perc_excess_deaths_hi" "racethn_bl"
[21] "projected_deaths_bl" "observed_deaths_bl"
[23] "avrg_observed_deaths_bl" "excess_deaths_bl"
[25] "perc_excess_deaths_bl" "racethn_as"
[27] "projected_deaths_as" "observed_deaths_as"
[29] "avrg_observed_deaths_as" "excess_deaths_as"
[31] "perc_excess_deaths_as" "racethn_ia"
[33] "projected_deaths_ia" "observed_deaths_ia"
[35] "avrg_observed_deaths_ia" "excess_deaths_ia"
[37] "perc_excess_deaths_ia" "racethn_ot"
[39] "projected_deaths_ot" "observed_deaths_ot"
[41] "avrg_observed_deaths_ot" "excess_deaths_ot"
[43] "perc_excess_deaths_ot"
Since I work in R environment, I prefer and recommended saving data in base R format using .rda or .Rdata extension with compress= option. The function load() is used to call .rda files into R session. Saving data in this format compresses the file and saves significant space on disk. The save() command is very flexible and can save any file object.
save(usdat, file = "../ExcessDeaths/data/usdat_long.rda", compress = "xz")
save(usdat2, file = "../ExcessDeaths/data/usdat_wide.rda", compress = "xz")
save(txdat, file = "../ExcessDeaths/data/txdat_long.rda", compress = "xz")
save(txdat2, file = "../ExcessDeaths/data/txdat_wide.rda", compress = "xz")
save(txdat_wh, file = "../ExcessDeaths/data/txdat_white.rda", compress = "xz")
save(txdat_hi, file = "../ExcessDeaths/data/txdat_hispc.rda", compress = "xz")
save(txdat_bl, file = "../ExcessDeaths/data/txdat_black.rda", compress = "xz")
save(txdat_as, file = "../ExcessDeaths/data/txdat_asian.rda", compress = "xz")
save(txdat_ia, file = "../ExcessDeaths/data/txdat_indal.rda", compress = "xz")
save(txdat_ot, file = "../ExcessDeaths/data/txdat_other.rda", compress = "xz")write_csv(usdat, path = "../ExcessDeaths/data/usdat_long.csv")
write_csv(usdat2, path = "../ExcessDeaths/data/usdat_wide.csv")
write_csv(txdat, path = "../ExcessDeaths/data/txdat_long.csv")
write_csv(txdat2, path = "../ExcessDeaths/data/txdat_wide.csv")
write_csv(txdat_wh, path = "../ExcessDeaths/data/txdat_white.csv")
write_csv(txdat_hi, path = "../ExcessDeaths/data/txdat_hispc.csv")
write_csv(txdat_bl, path = "../ExcessDeaths/data/txdat_black.csv")
write_csv(txdat_as, path = "../ExcessDeaths/data/txdat_asian.csv")
write_csv(txdat_ia, path = "../ExcessDeaths/data/txdat_indal.csv")
write_csv(txdat_ot, path = "../ExcessDeaths/data/txdat_other.csv")Any question or suggestion?
I recommend to you all to explore below resources for detailed explanation and easy to adapt examples.
Corey Sparks, Ph.D. lab materials.
Ben Baumer and R. Jordan Crouser lab materials.
CDC/NCHS’s Main Database with major reference to data on Provisional COVID-19 Death Counts by Sex, Age, and State, United States COVID-19 Cases and Deaths by State over Time, Provisional COVID-19 Death Counts by Week Ending Date and State, and Provisional COVID-19 Death Counts in the United States by County.
R package was used for data wrangling and analysis.
Created on October 20, 2020