Data Processing Techniques for Large Data Sets

R Final Project

Tonguç Yaman

2020-04-29

Hypothesis: Data sets will keep getting larger

In other words: disks are always full, and adding disk space does not help for long, because data expands to fill any void (Parkinson’s Law applied to data).

Magnitudes of Storage Capacity for well known systems

Large Complex Datasets

Best practices taught in the class

Libraries used for this presentation:

Epidemiologic datasets can get very large and cumbersome to manipulate

Access the NHANES dataset using:

library(nhanesA)

DEMO_I contains the demographic variables and sample weights for the 2015-2016 cycle

nhanesDemo <- nhanes("DEMO_I")

Demonstrate how population distributions differ between the unweighted interview sample and the weighted interview sample, using survey::svymean
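A minimal sketch of the unweighted versus weighted comparison. To stay runnable offline it uses a small made-up data frame rather than the downloaded table. The column names follow the NHANES demographics file (SDMVSTRA, SDMVPSU, WTINT2YR, RIDAGEYR), but every value is an invented placeholder, not a survey record. The design specification is an assumption and may differ from the one used in the presentation.

```r
library(survey)

# Made-up toy data with NHANES DEMO column names; values are NOT real records.
toy <- data.frame(
  SDMVSTRA = c(1, 1, 1, 1, 2, 2, 2, 2),  # variance stratum
  SDMVPSU  = c(1, 1, 2, 2, 1, 1, 2, 2),  # primary sampling unit within stratum
  WTINT2YR = c(1000, 1500, 8000, 9000, 1200, 1100, 7000, 9500),  # interview weight
  RIDAGEYR = c(70, 65, 30, 25, 72, 68, 35, 28)                   # age in years
)

# Describe the sampling design: PSUs nested in strata, with interview weights.
design <- svydesign(
  ids     = ~SDMVPSU,
  strata  = ~SDMVSTRA,
  weights = ~WTINT2YR,
  nest    = TRUE,
  data    = toy
)

# Unweighted mean treats every respondent as representing one person.
unweighted_mean <- mean(toy$RIDAGEYR)

# Weighted mean lets each respondent stand for WTINT2YR people.
weighted_est  <- svymean(~RIDAGEYR, design)
weighted_mean <- unname(coef(weighted_est))

print(unweighted_mean)
print(weighted_est)
```

In this toy sample the older respondents carry small weights, so the weighted mean age comes out lower than the unweighted one. With the real table, the same calls would take `data = nhanesDemo` in place of `toy`.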

The same code works for a different NHANES cycle after changing one letter

Access the NHANES dataset using:

library(nhanesA)

DEMO_H contains the demographic variables and sample weights for the 2013-2014 cycle

nhanesDemo <- nhanes("DEMO_H")

Demonstrate how population distributions differ between the unweighted interview sample and the weighted interview sample, using survey::svymean
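Because only the cycle suffix changes between "DEMO_I" and "DEMO_H", the table name can be built from that letter. The helper below is hypothetical (`load_demo` and its `loader` argument are names invented for this sketch) and is not part of nhanesA. The `loader` argument defaults to `nhanesA::nhanes` and exists only so the function can be exercised without a network connection.

```r
# Hypothetical helper: build the demographics table name from the cycle letter
# and pass it to a loader function (nhanesA::nhanes by default).
load_demo <- function(cycle_letter, loader = nhanesA::nhanes) {
  table_name <- paste0("DEMO_", cycle_letter)
  loader(table_name)
}

# Offline check: an identity loader just returns the name that would be requested.
print(load_demo("H", loader = identity))
print(load_demo("I", loader = identity))

# Real usage (needs the nhanesA package and network access):
# demo_2013 <- load_demo("H")
# demo_2015 <- load_demo("I")
```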

Flattening the curve before it flattens us: hospital critical care capacity limits and mortality from novel coronavirus (SARS-CoV2) cases in US counties. Branas et al. (2020)

The preprint is based on a simulation of COVID-19 cases in the US with respect to social mobility. The model predicts ICU bed availability at county-level granularity, contingent upon various parameters.

Challenges:

Simulation Data

## [1] "Benchmark 1 (data staging process):"
## Time difference of 35.24 secs
## [1] "Benchmark 2 (read data frame for user interface):"
##    user  system elapsed 
##    1.17    0.38    1.66
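The two benchmark formats above match what R prints for a `difftime` (from subtracting two `Sys.time()` values) and for `system.time()`. The sketch below shows one way such timings could be taken. It assumes the staged file is written with the fst package cited in the references, which the slide itself does not state. The data frame, its column names, and its values are invented stand-ins for the simulation output.

```r
library(fst)

# Stand-in for the simulation output: made-up columns and values.
set.seed(1)
sim <- data.frame(
  fips  = rep(sprintf("%05d", 1:500), each = 100),  # placeholder county codes
  day   = rep(1:100, times = 500),
  cases = rpois(50000, lambda = 20)
)

fst_path <- tempfile(fileext = ".fst")

# Benchmark 1 style: time the one-off staging step with Sys.time().
t0 <- Sys.time()
write_fst(sim, fst_path, compress = 50)
staging_time <- Sys.time() - t0
print("Benchmark 1 (data staging process):")
print(staging_time)

# Benchmark 2 style: time the read that a user interface would perform.
# Reading only the needed columns avoids loading the whole file.
read_timing <- system.time(
  ui_data <- read_fst(fst_path, columns = c("fips", "cases"))
)
print("Benchmark 2 (read data frame for user interface):")
print(read_timing)
```

The timings on this toy frame will be far smaller than those reported above. The point is the split between a slow staging step that runs once and a fast read that runs on every user request.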

Data Files

References

Branas, Charles C, Andrew Rundle, Sen Pei, Wan Yang, Brendan G Carr, Sarah Sims, Alexis Zebrowski, et al. 2020. “Flattening the curve before it flattens us: hospital critical care capacity limits and mortality from novel coronavirus (SARS-CoV2) cases in US counties.” medRxiv, April, 2020.04.01.20049759. https://doi.org/10.1101/2020.04.01.20049759.

Bui, Quoctrung, Josh Katz, Alicia Parlapiano, and Margot Sanger-Katz. 2020. “What 5 Coronavirus Models Say the Next Month Will Look Like.” The New York Times. https://www.nytimes.com/interactive/2020/04/22/upshot/coronavirus-models.html.

CDC. 2020. “NHANES - National Health and Nutrition Examination Survey Homepage.” https://www.cdc.gov/nchs/nhanes/index.htm.

Etherington, Darrell. 2020. “Latest COVID-19 projections from Columbia University show mid-May spike if social distancing is relaxed.” TechCrunch, April. https://techcrunch.com/2020/04/22/latest-covid-19-projections-from-columbia-university-show-mid-may-spike-if-social-distancing-is-relaxed/.

Klik, Mark. 2020. “fst: Lightning Fast Serialization of Data Frames for R.” https://www.fstpackage.org/.

Lumley, Thomas. 2020. “Survey analysis in R.” http://r-survey.r-forge.r-project.org/survey/.

Paul, Timothy S. 2020. “Latest COVID-19 Projections Point to Spring Peak.” Columbia University Mailman School of Public Health. https://www.mailman.columbia.edu/public-health-now/news/latest-covid-19-projections-point-spring-peak.

Yaman, Tonguç. 2020. “Estimated Daily COVID-19 Cases and Available Hospital Critical Care Beds.” https://cuepi.shinyapps.io/COVID-19/.