Charlie Labuzzetta
September 2, 2020
The 3 R's of Data Management:
For important data and files, you need a backup (at least one, but preferably more)!
Example:
Separate the file system by major professional events, then break each down further by courses, research projects, extracurricular programs, and personal files.
phd
Iowa State University provides free unlimited cloud storage via Box.com.
BoxSync is a helpful tool for automatically backing up your data to Box.com.
Follow the link below for more information on installing BoxSync on Windows or Mac computers:
https://support.box.com/hc/en-us/articles/360043697194-Installing-Box-Sync
Within your backed up personal file system, you will likely have data related to your research.
After you have an organized file system, you can worry about your research projects:
GitHub is a web platform for tracking changes to your code and scripts. Use it in combination with RStudio whenever you work with data!
If you don't have an account: https://www.github.com
Once you have an account, add the free education pack through ISU: https://education.github.com/pack
Using GitHub and RStudio, you can create a data project which is:
Redundant: Backed up and tracked via GitHub
Reliable: Your raw data is never altered directly
Reproducible: Other scientists can easily understand and access your work
Project Subdirectories:
data-raw: Put the .csv and other data files from your experiments here, then NEVER TOUCH THEM AGAIN
data: Cleaned data tables, with nicely formatted column names, consistent data types, and outliers removed
R: Metadata files which describe the cleaned data tables. Might include data units or collection methods
vignettes: Scripts that generate your analyses / graphics
presentation: Scripts to generate a presentation with easy access to graphs / tables
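The subdirectory layout above can be created with a short script. This is only a sketch, written in Python for illustration; the function name `create_project_skeleton` and the project folder name are hypothetical, and the same folders can just as easily be made by hand or from RStudio.

```python
from pathlib import Path

# Subdirectory names taken from the project layout listed above.
SUBDIRECTORIES = ["data-raw", "data", "R", "vignettes", "presentation"]


def create_project_skeleton(root):
    """Create the project subdirectories under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in SUBDIRECTORIES:
        path = root / name
        # exist_ok=True makes the script safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `create_project_skeleton("my_data_project")` would create the five folders inside a (hypothetical) `my_data_project` directory.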
Dr. Lisa Schulte-Moore's STRIPS (Science-based Trials of Rowcrops Integrated with Prairie Strips) Project is one group that has transitioned to using the R Data Package Format.
As an example, we will look into the challenges of storing and analyzing (fake) data similar to those collected by the STRIPS project and examine how the R Data Package format would be useful.
When we take a look back at our raw data, there may be many inconsistencies that we'd like to change:
In the column names, look for:
| Observer | Location | SPECIES | Date | Time | distance(m) | How recognized? | making sound |
|---|---|---|---|---|---|---|---|
| JD | Riverside | AMRO | 2016-05-11 | 1899-01-01 19:01:00 | 45 | Visual | 0 |
| JD | Riverside Park | EABL | 05/11/2016 | 1899-01-01 19:53:00 | 90 | visual | 0 |
| mj | Backyard | DOWO | 2016-05-16 | 1899-01-01 16:27:00 | 24 | Visual | 1 |
| Jd | Backyard | AMGO | 05-17-2016 | 1899-01-01 06:55:00 | 53 | Visual | 0 |
| Mj | backyard | CLSW | 2016-06-03 | 1899-01-01 21:43:00 | 31 | visual | 0 |
| Mj | backyard | GHOW | 06-03-2016 | 1899-01-01 22:24:00 | 20 | auditory | 1 |
| jd | Park | EABL | 06/06/2016 | 1899-01-01 18:28:00 | 34 | visual | 0 |
| MJ | in Riverside park | AMRO | 2016-06-17 | 1899-01-01 14:43:00 | 20 | heard it | 1 |
| MJ | Riverside | GBHE | 2016-06-17 | 1899-01-01 15:07:00 | 60 | visual | 0 |
| mj | RP | PUFI | 06/17/2016 | 15:45:00 | 40 | visual | 0 |
| mj | RP | EABL | 2016-06-17 | 16:00:00 | 34 | visual | 0 |
| Jd | yard | AMGO | 2016-06-18 | 07:42:00 | 23 | visual | 0 |
In the column names, look for:
| sighting_id | observer_id | location_id | species_id | date | visual_recognition | audio_recognition |
|---|---|---|---|---|---|---|
| 1 | 492 | 1 | AMRO | 2016-05-11 19:01:00 | yes | no |
| 2 | 492 | 1 | EABL | 2016-05-11 19:53:00 | yes | no |
| 3 | 213 | 2 | DOWO | 2016-05-16 16:27:00 | yes | yes |
| 4 | 492 | 3 | AMGO | 2016-05-17 06:55:00 | yes | no |
| 5 | 213 | 2 | CLSW | 2016-06-03 21:43:00 | yes | no |
| 6 | 213 | 2 | GHOW | 2016-06-03 22:24:00 | no | yes |
| 7 | 492 | 1 | EABL | 2016-06-06 18:28:00 | yes | no |
| 8 | 213 | 1 | AMRO | 2016-06-17 14:43:00 | yes | yes |
| 9 | 213 | 1 | GBHE | 2016-06-17 15:07:00 | yes | no |
| 10 | 213 | 1 | PUFI | 2016-06-17 15:45:00 | yes | no |
| 11 | 213 | 1 | EABL | 2016-06-17 16:00:00 | yes | no |
| 12 | 492 | 3 | AMGO | 2016-06-18 07:42:00 | yes | no |
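
One of the inconsistencies in the raw table is the observer column, where the same initials appear as `JD`, `Jd`, `jd`, `MJ`, `Mj`, and `mj`. A minimal sketch of mapping them to the `observer_id` values in the cleaned table, written in Python for illustration. The initials-to-ID mapping is an assumption inferred by comparing the two tables row by row, and the function name is hypothetical.

```python
# Assumed mapping, inferred by comparing the raw and cleaned tables.
OBSERVER_IDS = {"JD": 492, "MJ": 213}


def observer_id(raw_initials):
    """Return the observer ID for initials written in any letter case."""
    key = raw_initials.strip().upper()
    # A KeyError here flags initials that still need a manual decision.
    return OBSERVER_IDS[key]
```

Failing loudly on unknown initials is a deliberate choice in this sketch: a silent default would hide data-entry problems.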
There are many ways to name files / columns with consistency:
Suggestion: lowercase_underscore
Important People in the “R” World suggest: http://r-pkgs.had.co.nz/style.html
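As a rough illustration of the lowercase_underscore suggestion, the sketch below converts messy column names such as those in the raw table. It is written in Python for illustration, the function name `to_snake_case` is hypothetical, and it does not attempt to split camelCase names.

```python
import re


def to_snake_case(name):
    """Convert a column name to lowercase_underscore form."""
    # Replace each run of non-alphanumeric characters with one underscore,
    # then trim underscores from both ends and lowercase the result.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return cleaned.strip("_").lower()


raw_columns = ["Observer", "SPECIES", "distance(m)", "How recognized?", "making sound"]
clean_columns = [to_snake_case(column) for column in raw_columns]
```

Here `clean_columns` is `["observer", "species", "distance_m", "how_recognized", "making_sound"]`.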
There are so many ways to store dates:
| Observer | Location | SPECIES | Date | Time | distance-meters | How recognized? | making sound |
|---|---|---|---|---|---|---|---|
| JD | Riverside | AMRO | 2016-05-11 | 1899-01-01 19:01:00 | 45 | Visual | 0 |
| JD | Riverside Park | EABL | 05/11/2016 | 1899-01-01 19:53:00 | 90 | visual | 0 |
| mj | Backyard | DOWO | 2016-05-16 | 1899-01-01 16:27:00 | 24 | Visual | 1 |
| Jd | Backyard | AMGO | 05-17-2016 | 1899-01-01 06:55:00 | 53 | Visual | 0 |
| Mj | backyard | CLSW | 2016-06-03 | 1899-01-01 21:43:00 | 31 | visual | 0 |
| Mj | backyard | GHOW | 06-03-2016 | 1899-01-01 22:24:00 | 20 | auditory | 1 |
| jd | Park | EABL | 06/06/2016 | 1899-01-01 18:28:00 | 34 | visual | 0 |
| MJ | in Riverside park | AMRO | 2016-06-17 | 1899-01-01 14:43:00 | 20 | heard it | 1 |
| MJ | Riverside | GBHE | 2016-06-17 | 1899-01-01 15:07:00 | 60 | visual | 0 |
| mj | RP | PUFI | 06/17/2016 | 15:45:00 | 40 | visual | 0 |
| mj | RP | EABL | 2016-06-17 | 16:00:00 | 34 | visual | 0 |
| Jd | yard | AMGO | 2016-06-18 | 07:42:00 | 23 | visual | 0 |
Suggestion: When possible, YYYY-mm-dd or YYYY/mm/dd or YYYYmmdd, and hh:mm:ss or hhmmss
Main objective: Choose one preferred format
Suggestion: If using R, look into the lubridate package
| sighting_id | observer_id | location_id | species_id | date | time | visual_recognition | audio_recognition |
|---|---|---|---|---|---|---|---|
| 1 | 492 | 1 | AMRO | 2016-05-11 | 19:01:00 | yes | no |
| 2 | 492 | 1 | EABL | 2016-05-11 | 19:53:00 | yes | no |
| 3 | 213 | 2 | DOWO | 2016-05-16 | 16:27:00 | yes | yes |
| 4 | 492 | 3 | AMGO | 2016-05-17 | 06:55:00 | yes | no |
| 5 | 213 | 2 | CLSW | 2016-06-03 | 21:43:00 | yes | no |
| 6 | 213 | 2 | GHOW | 2016-06-03 | 22:24:00 | no | yes |
| 7 | 492 | 1 | EABL | 2016-06-06 | 18:28:00 | yes | no |
| 8 | 213 | 1 | AMRO | 2016-06-17 | 14:43:00 | yes | yes |
| 9 | 213 | 1 | GBHE | 2016-06-17 | 15:07:00 | yes | no |
| 10 | 213 | 1 | PUFI | 2016-06-17 | 15:45:00 | yes | no |
| 11 | 213 | 1 | EABL | 2016-06-17 | 16:00:00 | yes | no |
| 12 | 492 | 3 | AMGO | 2016-06-18 | 07:42:00 | yes | no |
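
The move from the mixed raw formats to one preferred format can be sketched as follows. This example uses only the Python standard library for illustration (in R, the lubridate package suggested above fills the same role). It assumes that slash- and dash-separated dates such as `05/11/2016` and `05-17-2016` are month-first, which matches the cleaned table; the function names are hypothetical.

```python
from datetime import datetime

# Formats seen in the raw table. Month-first order for the last two is an
# assumption that agrees with the cleaned table.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y"]


def parse_date(text):
    """Return the date as YYYY-mm-dd, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def parse_time(text):
    """Return the time as hh:mm:ss, dropping any placeholder date prefix."""
    # "1899-01-01 19:01:00" and "19:01:00" both end with the clock time.
    clock = text.strip().split()[-1]
    return datetime.strptime(clock, "%H:%M:%S").time().isoformat()
```

For instance, `parse_date("05/11/2016")` gives `"2016-05-11"` and `parse_time("1899-01-01 19:01:00")` gives `"19:01:00"`.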
How/where should measurement units be stored?
What's wrong with the table below?
| Observer | Location | SPECIES | Date | Time | distance-meters | How recognized? | making sound |
|---|---|---|---|---|---|---|---|
| JD | Riverside | AMRO | 2016-05-11 | 1899-01-01 19:01:00 | 45 | Visual | 0 |
| JD | Riverside Park | EABL | 05/11/2016 | 1899-01-01 19:53:00 | 90 | visual | 0 |
| mj | Backyard | DOWO | 2016-05-16 | 1899-01-01 16:27:00 | 24 | Visual | 1 |
| Jd | Backyard | AMGO | 05-17-2016 | 1899-01-01 06:55:00 | 53 | Visual | 0 |
| Mj | backyard | CLSW | 2016-06-03 | 1899-01-01 21:43:00 | 31 | visual | 0 |
| Mj | backyard | GHOW | 06-03-2016 | 1899-01-01 22:24:00 | 20 | auditory | 1 |
| jd | Park | EABL | 06/06/2016 | 1899-01-01 18:28:00 | 34 | visual | 0 |
| MJ | in Riverside park | AMRO | 2016-06-17 | 1899-01-01 14:43:00 | 20 | heard it | 1 |
| MJ | Riverside | GBHE | 2016-06-17 | 1899-01-01 15:07:00 | 60 | visual | 0 |
| mj | RP | PUFI | 06/17/2016 | 15:45:00 | 40 | visual | 0 |
| mj | RP | EABL | 2016-06-17 | 16:00:00 | 34 | visual | 0 |
| Jd | yard | AMGO | 2016-06-18 | 07:42:00 | 23 | visual | 0 |
Again, when making an R Data Package, look at: https://github.com/jarad/RDataPackageTemplate
How to think about database design:
A Simple Example:
Single Table Format:
| type | color | store_1_price | store_2_price |
|---|---|---|---|
| apple | red | 0.29 | 0.27 |
Instead, we could use a relational database format:
Fruits
| fruit_id | type | color |
|---|---|---|
| 29 | apple | red |
Stores
| fruit_id | price | store_id |
|---|---|---|
| 29 | 0.29 | 1 |
| 29 | 0.27 | 2 |
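
To make the contrast concrete, here is a small sketch of the two relational tables as plain Python data, with a lookup that links them through `fruit_id`. Python is used only for illustration, and the helper name `cheapest_store` is hypothetical.

```python
# The two tables above, one dictionary per row.
fruits = [{"fruit_id": 29, "type": "apple", "color": "red"}]
stores = [
    {"fruit_id": 29, "price": 0.29, "store_id": 1},
    {"fruit_id": 29, "price": 0.27, "store_id": 2},
]


def cheapest_store(fruit_type, fruits, stores):
    """Return (store_id, price) of the lowest price for a fruit type, or None."""
    # Look up matching fruit IDs first, then follow them into the price rows.
    fruit_ids = {row["fruit_id"] for row in fruits if row["type"] == fruit_type}
    offers = [row for row in stores if row["fruit_id"] in fruit_ids]
    if not offers:
        return None
    best = min(offers, key=lambda row: row["price"])
    return best["store_id"], best["price"]
```

With the data shown, `cheapest_store("apple", fruits, stores)` returns `(2, 0.27)`. Adding a third store means appending one more row to `stores`, whereas the single-table format would need a new `store_3_price` column.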
Consider the following scenario:
How could this data be managed?
| Observer | Location | SPECIES_1 | Date_1 | Time_1 | SPECIES_2 | Date_2 | Time_2 |
|---|---|---|---|---|---|---|---|
| JD | Riverside | AMRO | 2016-05-11 | 01/01/1899 19:01:00 | EABL | 2016-05-11 | 01/01/1899 19:53:00 |
| Jd | Backyard | AMGO | 2016-05-17 | 01/01/1899 06:55:00 | AMGO | 2016-06-18 | 01/01/1899 07:42:00 |
| mj | Backyard | DOWO | 2016-05-16 | 01/01/1899 16:27:00 | CLSW | 2016-06-03 | 01/01/1899 21:43:00 |
| MJ | in Riverside park | AMRO | 2016-06-17 | 01/01/1899 14:43:00 | GBHE | 2016-06-17 | 01/01/1899 15:07:00 |
Avoid using repetitive groups of columns
See dataManagement::sightings_really_bad for the full table.
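One way to remove repetitive column groups is to reshape each wide row into one record per sighting. The sketch below, in Python for illustration, assumes the column naming pattern shown in the table above (`SPECIES_1`, `Date_1`, `Time_1`, and so on); the function name and the output keys are hypothetical.

```python
def wide_to_long(row, n_groups=2):
    """Split one wide row into a list with one record per sighting."""
    records = []
    for i in range(1, n_groups + 1):
        species = row.get(f"SPECIES_{i}")
        if not species:
            # Skip groups that were left empty in the wide table.
            continue
        records.append({
            "observer": row["Observer"],
            "location": row["Location"],
            "species": species,
            "date": row[f"Date_{i}"],
            "time": row[f"Time_{i}"],
        })
    return records


wide_row = {
    "Observer": "JD", "Location": "Riverside",
    "SPECIES_1": "AMRO", "Date_1": "2016-05-11", "Time_1": "01/01/1899 19:01:00",
    "SPECIES_2": "EABL", "Date_2": "2016-05-11", "Time_2": "01/01/1899 19:53:00",
}
long_rows = wide_to_long(wide_row)
```

After this step, a third sighting by the same observer is just another record rather than three more columns.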
| observer_id | first_name | last_name | birth_date | email |
|---|---|---|---|---|
| 492 | John | Doe | 1987-09-03 | john.doe@gmail.com |
| 213 | Mary | Jane | 1959-02-27 | mary.jane@gmail.com |
| location_id | location_name | street_address | city | state | country |
|---|---|---|---|---|---|
| 1 | Riverside Park | 100 State St | La Crosse | WI | USA |
| 3 | Backyard | 418 Red Apple Dr | La Crescent | MN | USA |
| sighting_id | observer_id | location_id | species_id | date |
|---|---|---|---|---|
| 11 | 213 | 1 | EABL | 2016-06-17 16:00:00 |
| 12 | 492 | 3 | AMGO | 2016-06-18 07:42:00 |
| species_id | genus | species | common_name |
|---|---|---|---|
| EABL | Sialia | sialis | Eastern Bluebird |
| AMGO | Spinus | tristis | American Goldfinch |
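
The four tables above can be linked back together through their ID columns. The following sketch uses Python's built-in `sqlite3` module with an in-memory database purely for illustration; it loads only a subset of the columns shown, and the SQL table names (`observers`, `locations`, `species`, `sightings`) are assumptions, since the tables are not named here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observers (observer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE locations (location_id INTEGER PRIMARY KEY, location_name TEXT);
CREATE TABLE species (species_id TEXT PRIMARY KEY, common_name TEXT);
CREATE TABLE sightings (
    sighting_id INTEGER PRIMARY KEY,
    observer_id INTEGER REFERENCES observers(observer_id),
    location_id INTEGER REFERENCES locations(location_id),
    species_id TEXT REFERENCES species(species_id),
    date TEXT
);
INSERT INTO observers VALUES (492, 'John', 'Doe'), (213, 'Mary', 'Jane');
INSERT INTO locations VALUES (1, 'Riverside Park'), (3, 'Backyard');
INSERT INTO species VALUES ('EABL', 'Eastern Bluebird'), ('AMGO', 'American Goldfinch');
INSERT INTO sightings VALUES
    (11, 213, 1, 'EABL', '2016-06-17 16:00:00'),
    (12, 492, 3, 'AMGO', '2016-06-18 07:42:00');
""")

# Join each sighting to the rows it references to get a readable report.
report = conn.execute("""
SELECT s.sighting_id,
       o.first_name || ' ' || o.last_name,
       l.location_name,
       sp.common_name,
       s.date
FROM sightings AS s
JOIN observers AS o ON o.observer_id = s.observer_id
JOIN locations AS l ON l.location_id = s.location_id
JOIN species AS sp ON sp.species_id = s.species_id
ORDER BY s.sighting_id
""").fetchall()
```

Each row of `report` spells out the observer, location, and species for one sighting, even though the sightings table itself stores only compact IDs.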
Relational database design helps to:
Look further at the following tables and documentation:
To view documentation for a dataset:
The 3 R's of Data Management:
Organize your personal file system
Backup your data and install BoxSync
Git your GitHub account and GitHub Education Pack
Store research data in R Data Packages
Never touch your raw data again!
Support reproducible research by sharing your repositories on GitHub
Introduction to RStudio
Introduction to GitHub
Making an R Data Package
Introduction to Tidyverse
Using GitHub within RStudio
You can find this presentation and the associated data on my GitHub page:
https://github.com/labuzzetta/dataManagement
Instructions for installing the package from GitHub are listed on the webpage.
A link to an online version of the presentation alone is: