So you want to be a data magician, hey? In today’s workshop, you will be introduced to being mindful of data structure, keeping good documentation of your data cleaning and experiments, and performing some basic manipulations, and we’ll finish off with data visualisations.
Given the scope and the short timeframe, we will barely scratch the surface, but I hope this shows you further uses for R and the RStudio IDE.
…
Before you get started, make sure you have R and RStudio installed on your machine. If you don’t, the following instructions will help you install them.
R is a powerful, open-source statistical programming language that anyone can download for free! You can simply go to CRAN’s website and install R by following these steps:
Select the CRAN mirror that’s nearest to your geographic location.
Install R for the first time.
After you click this link, follow the instructions given in the installation.
Install the R version relevant to your Linux server, and follow the instructions given in the installation.
RStudio is open-source software that makes R much easier to use. Download the free, open-source version from RStudio’s website. The installation steps are very similar to those for R on all operating systems.
If packages and functions stop working after an extended period of time, you may need to update R, your R packages, and RStudio to the latest versions.
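Before updating anything, it can help to check which versions you currently have. This is a minimal sketch using only base R functions; the package queried (`stats`) is just an example that ships with R, so swap in whichever package you care about.

```r
# Report the version of R that is currently running
r_version <- R.version.string
print(r_version)

# Report the installed version of a package
# ("stats" is only an example; it is installed with R itself)
pkg_version <- packageVersion("stats")
print(pkg_version)
```

If the versions printed are much older than the current releases listed on CRAN, that is a hint that an update may be worthwhile.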
To update your version of R on Windows, first close any R or RStudio windows you have open.
Open the R GUI (x64, not i386). This is not the same as RStudio.
Install the installr package into R by typing
install.packages("installr")
into the Console.
After the package is finished installing, call the package by typing
library(installr)
into the Console. You can now update your R software and packages to the latest versions by typing
updateR()
into the Console. Then R will walk you through a detailed and intuitive process of updating your R software and packages to the latest versions.
On a Mac, open RStudio again, and type the following lines of code into the Console:
install.packages('devtools') #assuming it isn't already installed
library(devtools)
install_github('andreacirilloac/updateR')
library(updateR)
updateR(admin_password = "os_admin_user_password")
R will then walk you through a detailed and intuitive process of updating your R software and packages to the latest versions.
This resource will walk you through how to update your R software and packages to the latest versions on Linux.
Updating out-of-date packages that were installed from CRAN (with install.packages()) is easy with the update.packages() function. Type this function into the RStudio Console.
update.packages()
After entering this function, it will ask you what packages you want to update. To update all packages at once, use ask = FALSE.
update.packages(ask = FALSE)
To update packages installed from devtools::install_github(), type the following function into your RStudio Console (I would also recommend saving this function in an R Script for later use):
update_github_pkgs <- function() {
  # devtools provides install_github(); stop early if it is not installed
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("Cannot load the devtools package; run install.packages('devtools') first.")
  }
  # list installed packages along with the RemoteType field recorded at install time
  pkgs <- installed.packages(fields = "RemoteType")
  # keep only the packages that were installed from GitHub
  github_pkgs <- pkgs[pkgs[, "RemoteType"] %in% "github", "Package"]
  print(github_pkgs)
  # reinstall each GitHub package from its original username/repo
  lapply(github_pkgs, function(pac) {
    message("Updating ", pac, " from GitHub...")
    repo <- packageDescription(pac, fields = "GithubRepo")
    username <- packageDescription(pac, fields = "GithubUsername")
    devtools::install_github(repo = paste0(username, "/", repo))
  })
}
Then call the function.
update_github_pkgs()
To update RStudio, open RStudio and go to Help > Check for Updates to install the newest version.
Data structure refers to the way data is organised so that it can be accessed and manipulated efficiently.
When dealing with data structure, we focus not on a single piece of data but on different sets of data and how they relate to one another in an organised manner. Good data structure is the foundation of all analyses.
Spreadsheets are good for data entry. Therefore, we generally store a lot of data in spreadsheets. Much of a researcher’s time will be spent in the preparation phases (see below).
It’s not the most fun you can have with data, but it’s necessary. Since it takes up so much time, it is good to have a process/pipeline for doing this that you follow, with thorough documentation. You’ll also love yourself for doing it correctly the first time (oh, I have stories!).
There are many things to consider when you are preparing your dataset for not only analysis but storage, since (in theory) it should be easy for someone to pick up where you left off.
Is it in the right format (e.g. did you convert numbers stored as text strings into numeric values, format dates, etc.)?
Is it consistent and comparable?
Are there spelling mistakes in the variables or the values?
Do you have missing data? Have you considered imputation?
Are you certain the data you are cleaning was collected correctly?
Were your actions to clean the data justified or have you maybe gone too far (beware!!)?
Is what you did in the data cleaning steps repeatable?
Did you get rid of redundant blank spaces?
Did you remove duplicates?
Have you changed text to lower/upper/proper case consistently?
Etc…
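A few of the checks above can be sketched in base R. The data frame below is toy data invented purely for illustration (it is not the workshop dataset), and the column names are hypothetical.

```r
# Toy data invented for illustration only
raw <- data.frame(
  id     = c(1, 2, 2, 3),
  weight = c("40", " 48", " 48", "n/a"),  # numbers stored as text
  sex    = c("M", "f ", "f ", "F"),       # inconsistent case and stray spaces
  stringsAsFactors = FALSE
)

clean <- raw

# Remove leading/trailing blank spaces from every text column
is_text <- vapply(clean, is.character, logical(1))
clean[is_text] <- lapply(clean[is_text], trimws)

# Make the case consistent
clean$sex <- toupper(clean$sex)

# Convert text to numbers; anything that is not a number becomes NA
clean$weight <- suppressWarnings(as.numeric(clean$weight))

# Drop exact duplicate rows
clean <- clean[!duplicated(clean), ]

# Count missing values per column so you can decide how to handle them
missing_counts <- colSums(is.na(clean))

print(clean)
print(missing_counts)
```

Keeping `raw` untouched and working on a copy means every step is repeatable, and you can always compare the cleaned data against what you started with.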
One final note before we get our hands stuck into some ‘terrible’ data: the data cleaning process begins before any data is collected (well, at least it should).
In fact, a significant part of my role as a Data Officer at the Raine Study has been preparing and checking data collection protocols and forms for new follow-ups to make the data cleaning phases easier (e.g. if we ask for dates, electronic fields will only accept date values, etc.). Many of these skills come with practice, so the more you practise and brainstorm how you want your data to look in the end, the easier data cleaning will become.
NOTE: Remember… Shit in, shit out!
Below I’ll show you some practices for data organisation. From here on, the material has been modified to give an example as discussed in the workshop delivered on 26 October 2019 at UWA. This dataset will be used in a proposed Part 2 workshop.
Before we download the data, we’re going to create a folder structure for storing everything.
Create a subfolder in your Documents folder called EcologyPractise, and inside it the subfolders EcologyPractise/data and EcologyPractise/images.
Download the following files from this link and save them in the data subfolder:
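If you would rather create the folders from R than by hand, something like the following should work. It is a sketch, not the workshop’s own method: `create_project_folders` is a hypothetical helper name, and the path assumes your Documents folder sits directly under your home folder (on Windows, R often treats `~` as the Documents folder already, so adjust the path if needed).

```r
# Base folder for the project; adjust this path to suit your machine
project_dir <- file.path("~", "Documents", "EcologyPractise")

# Hypothetical helper: creates the project folder and its subfolders
create_project_folders <- function(base_dir, subfolders = c("data", "images")) {
  paths <- file.path(base_dir, subfolders)
  for (p in paths) {
    # recursive = TRUE also creates base_dir if it does not exist yet;
    # showWarnings = FALSE stays quiet when a folder is already there
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}
```

Running `create_project_folders(project_dir)` in the Console then creates the data and images subfolders in one step, and is safe to run again later.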
plots.csv
species.csv
surveys.csv
Open the data subfolder to inspect it (it should look something like this):
For the purposes of this example, we’re going to create a main dataset in Excel format with tabs to give value and variable labels, which may be called into R (as discussed in the workshop). This will be used in a Part 2 to this workshop, which will be posted by the end of the month.
Open surveys.csv. Inspecting the data, we can see a few things we need to fix before continuing: the date variable (currently split into day, month and year), numerically coding nominal variables (i.e. species and sex here) and perhaps adding some information about units of measurement (i.e. in variable labels).
… Ctrl + C on keyboard, scroll to the bottom, then on the last cell of the column Ctrl + Shift + V).
Delete day, month and year, and save the file as surveys_data.xlsx in the data subfolder. You will see why shortly.
The rest of the steps will be performed in an RMarkdown document so you can keep a record of your steps. This is an extension of what was presented in the workshop but hopefully easy to follow.
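For comparison, the same date fix and numeric coding can be sketched in base R. The rows below are toy values invented for illustration; only the column names day, month, year and sex follow the description above, and `sex_code` is a hypothetical name for the new coded column.

```r
# Toy rows invented for illustration; the real file has many more rows and columns
surveys <- data.frame(
  day   = c(16, 16, 1),
  month = c(7, 7, 12),
  year  = c(1977, 1977, 2002),
  sex   = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# Combine the three columns into a single Date column (yyyy-mm-dd)
surveys$date <- as.Date(
  paste(surveys$year, surveys$month, surveys$day, sep = "-"),
  format = "%Y-%m-%d"
)

# Numerically code a nominal variable, keeping the labels alongside the codes
surveys$sex <- factor(surveys$sex, levels = c("F", "M"))
surveys$sex_code <- as.integer(surveys$sex)  # F = 1, M = 2

# Drop the now-redundant day, month and year columns
surveys <- surveys[, !(names(surveys) %in% c("day", "month", "year"))]

print(surveys)
```

Doing this in a script rather than in the spreadsheet leaves a written record of exactly how the date column was built.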
An important part of data cleaning is keeping good documentation. One way to do this is with rmarkdown, an R package that lets you combine text, code and output in one document.
NOTE: Today’s workshop was written in rmarkdown and rendered with knitr (back to this later). You can find the cheatsheet here: [https://rmarkdown.rstudio.com/lesson-15.html]
Easy to integrate data directly into other documents
No copy/paste -> less margin for error
Much simpler to learn compared to other tools/languages, like LaTeX
Reference management integration: easy to cite relevant papers and autogenerate bibliographies
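To make the “no copy/paste” point concrete, here is a minimal sketch that writes a tiny R Markdown file from R. The file name, title and contents are all illustrative assumptions; the chunk fence is built with `strrep()` only so that this example can itself be shown inside a code block.

```r
# Three backticks, the marker that opens and closes a code chunk
fence <- strrep("`", 3)

# Lines of a minimal R Markdown document (contents are illustrative)
rmd_lines <- c(
  "---",
  "title: \"Data cleaning log\"",
  "output: html_document",
  "---",
  "",
  "Rows before cleaning:",
  "",
  paste0(fence, "{r load-data}"),
  "dat <- data.frame(x = c(1, 2, NA))",
  "nrow(dat)",
  fence,
  "",
  "The data has `r sum(is.na(dat$x))` missing value(s)."
)

# Write the document to a temporary file
rmd_path <- file.path(tempdir(), "cleaning_log.Rmd")
writeLines(rmd_lines, rmd_path)
```

Opening this file in RStudio and clicking Knit (or calling `rmarkdown::render(rmd_path)`, assuming the rmarkdown package is installed) should produce an HTML page where the missing-value count is computed from the data rather than typed by hand.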
While editing this workshop material, Emi Tanaka (one of the regular contributors to R-Ladies Sydney) uploaded this fantastic workshop on RMarkdown. I learnt a bit from it as well! I highly recommend following this link. It is run using RStudio Cloud, a resource introduced in our first R-Ladies Beginner workshop. It allows you to work through the materials in an RStudio interface in a web browser.
Remember to read the download instructions on the above link before following the slides.
RLadies Global main website: [https://rladies.org/]
Free RMarkdown tutorial on the RStudio Website: [https://rmarkdown.rstudio.com/]
Cheatsheets on DataCamp: [https://www.datacamp.com/community/data-science-cheatsheets]
If you want to do more with the ecology or other data, check out Data Carpentry for more lessons: [https://datacarpentry.org/spreadsheet-ecology-lesson/]
An (albeit older) workshop on reproducible research: [http://bcb.dfci.harvard.edu/~aedin/courses/ReproducibleResearch/]
Reproducible Research Resources in R: [https://cran.r-project.org/web/views/ReproducibleResearch.html]