An Introduction to Data Wizardry

0.1 Introduction

So you want to be a data magician, hey? In today’s workshop, you will be introduced to being mindful about data structure, keeping good documentation of your data cleaning and experiments, and performing some basic manipulations, and we’ll finish off with data visualisations.

Given the scope and the short timeframe, we will barely scratch the surface, but I hope this shows you further uses for R and the RStudio IDE.

…

0.1.1 Prerequisites

Before you get started, we need to make sure you have R and RStudio installed on your machine. If you don’t, follow the instructions below to install them.

0.1.2 Installing R

R is a powerful, open-source statistical programming language that anyone can download for free! You can simply go to CRAN’s website and install R by following these steps:

0.1.2.1 Under the ‘Download’ heading, select ‘CRAN’

0.1.2.2 Install a CRAN Mirror

Install the CRAN mirror that’s nearest to your geographic location.

0.1.2.3 Install your Machine’s Version of R

0.1.2.3.1 R for Windows

Install R for the first time.

After you click this link, follow the instructions given in the installation.

0.1.2.3.2 R for (Mac) OS X

After you click this link, follow the instructions given in the installation.

0.1.2.3.3 R for Linux

Install the R version relevant to your Linux distribution, and follow the instructions given in the installation.

0.1.3 Installing RStudio

RStudio is a professional, open-source IDE that makes R much easier to use. Download the free, open-source version from RStudio’s website. The installation steps are very similar to those for R on all operating systems.

0.1.4 Updating R and RStudio

If packages and functions stop working after an extended period of time, you may need to update R, your R packages, and RStudio to the latest versions.
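Before updating anything, it can help to check which versions you currently have. Here is a minimal sketch using base R only; the stats package is used purely as an example because it ships with R.

```r
# Report the version of R that is currently running
r_version <- R.version.string
print(r_version)

# Report the version of an installed package
# ("stats" is only an example; swap in any package you have installed)
pkg_version <- as.character(packageVersion("stats"))
print(pkg_version)
```

For RStudio itself, the version is shown under Help > About RStudio.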

0.1.4.1 Updating R

To update your version of R, first close any R or RStudio windows you have open.

0.1.4.1.1 Updating R on Windows

Open the R GUI (x64, not i386). This is not the same as RStudio. The R GUI program icon should look very similar to this:

Install the installr package into R by typing

install.packages("installr")

into the Console.

After the package has finished installing, load it by typing

library(installr)

into the Console. You can now update your R software and packages to the latest versions by typing

updateR()

into the Console. Then R will walk you through a detailed and intuitive process of updating your R software and packages to the latest versions.

0.1.4.1.2 Updating R on (Mac) OS X

Open RStudio again, and type the following lines of code into the Console:

install.packages('devtools') #assuming it isn't already installed
library(devtools)
install_github('andreacirilloac/updateR')
library(updateR)
updateR(admin_password = "os_admin_user_password")

R will then walk you through a detailed and intuitive process of updating your R software and packages to the latest versions.

0.1.4.1.3 Updating R on Linux

This resource will walk you through how to update your R software and packages to the latest versions on Linux.

0.1.4.2 Updating R Packages Without Updating R

Updating out-of-date packages that were installed from CRAN (with install.packages()) is easy with the update.packages() function. Type this function into the RStudio Console.

update.packages()

After you enter this function, R will ask which packages you want to update. To update all packages at once, use ask = FALSE.

update.packages(ask = FALSE)

To update packages installed from devtools::install_github(), type the following function into your RStudio Console (I would also recommend saving this function in an R Script for later use):

update_github_pkgs <- function() {
  # stop early if devtools is not installed
  # (require() only warns on failure, so tryCatch(error = ...) would not catch it)
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("Cannot load the devtools package; try install.packages('devtools')")
  }

  # find installed packages whose recorded source is GitHub
  pkgs <- installed.packages(fields = "RemoteType")
  github_pkgs <- pkgs[pkgs[, "RemoteType"] %in% "github", "Package"]

  print(github_pkgs)
  invisible(lapply(github_pkgs, function(pac) {
    message("Updating ", pac, " from GitHub...")

    # read the repository and user name recorded when the package was installed
    repo <- packageDescription(pac, fields = "GithubRepo")
    username <- packageDescription(pac, fields = "GithubUsername")

    devtools::install_github(repo = paste0(username, "/", repo))
  }))
}

Then call the function.

update_github_pkgs()
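If you would rather see which packages would be touched before reinstalling anything, here is a small sketch that only lists them. The function name list_github_pkgs is my own invention, and it relies on the same RemoteType field used above.

```r
# List packages installed from GitHub without updating them
# (list_github_pkgs is a hypothetical helper, not part of any package)
list_github_pkgs <- function() {
  pkgs <- installed.packages(fields = "RemoteType")
  is_github <- pkgs[, "RemoteType"] %in% "github"
  data.frame(
    package = unname(pkgs[is_github, "Package"]),
    version = unname(pkgs[is_github, "Version"]),
    stringsAsFactors = FALSE
  )
}

github_installed <- list_github_pkgs()
print(github_installed)
```

An empty table simply means nothing on your machine was installed from GitHub.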

0.1.4.3 Updating RStudio

To update RStudio, open RStudio and go to Help > Check for Updates to install the newest version.

0.1.5 RStudio Setup and Navigation

The general layout of the RStudio IDE is summarised below:

The top left-hand panel is where you write your code to be saved for future use/reference.
The bottom left-hand panel is where you see what code has been executed.
The top right-hand panel is where you can see what datasets are in your environment, as well as the history of the code you have run.
The bottom right-hand panel is where you can see the files in your current directory, view plots, view loaded packages or update/install packages, and look for help documentation.

We’ll get back to this later!!

0.2 Data Structure

Data structure refers to the way data is organised and manipulated, with the aim of making data access more efficient.

When dealing with data structure, we focus not on one piece of data alone, but on different sets of data and how they relate to one another in an organised manner. Good data structure is the foundation of all analyses.

0.2.1 Best Practices in Data Structuring

Spreadsheets are good for data entry. Therefore, we generally store a lot of data in spreadsheets. Much of a researcher’s time will be spent in the preparation phases (see below).

It’s not the most fun you can have with data, but it’s necessary. Since it takes up so much time, it is good to have a process/pipeline for doing this that you follow, with thorough documentation. You’ll also love yourself for doing it correctly the first time (oh, I have stories!).

0.2.2 Things to Consider when Preparing Data

There are many things to consider when you are preparing your dataset for not only analysis but storage, since (in theory) it should be easy for someone to pick up where you left off.

Is it in the right format (i.e. did you convert numbers stored as text strings into numeric values, format dates, etc.)?
Is it consistent and comparable?
Are there spelling mistakes in the variables or the values?
Do you have missing data? Have you considered imputation?
Are you certain the data you are cleaning was collected correctly?
Were your actions to clean the data justified or have you maybe gone too far (beware!!)?
Is what you did in the data cleaning steps repeatable?
Did you get rid of redundant blank spaces?
Did you remove duplicates?
Have you changed text to lower/upper/proper case consistently?
Etc…
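To make a few of these checks concrete, here is a minimal sketch in base R. The toy data frame and its column names are made up for illustration, so treat it as a starting point rather than a recipe.

```r
# Toy data with typical problems (all values are made up for illustration)
raw <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  sex    = c("F", " f", " f", "M ", "m"),
  weight = c("40", "48", "48", "n/a", "37"),
  date   = c("2019-10-26", "2019-10-27", "2019-10-27", "2019-10-28", "2019-10-29"),
  stringsAsFactors = FALSE
)

clean <- raw

# Remove redundant blank spaces and use one case consistently
clean$sex <- toupper(trimws(clean$sex))

# Convert numbers stored as text to numeric; "n/a" becomes NA
clean$weight <- suppressWarnings(as.numeric(clean$weight))

# Convert text dates to Date values
clean$date <- as.Date(clean$date, format = "%Y-%m-%d")

# Remove exact duplicate rows
clean <- clean[!duplicated(clean), ]

# Count missing values per column
missing_per_column <- colSums(is.na(clean))

print(clean)
print(missing_per_column)
```

Keeping raw untouched and working on a copy means every cleaning step stays visible and repeatable.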

0.2.3 Preparing for Data Analysis Begins Far Sooner

One final note before we get our hands stuck into some ‘terrible’ data: the data cleaning process begins before any data is collected (well, at least it should).

In fact, a significant part of my role as a Data Officer at the Raine Study has been preparing/checking data collection protocols and forms for new follow-ups to make the data cleaning phases easier (i.e. if we ask for dates, electronic fields will only accept date values, etc…). Many of these skills come with practice, so the more you practise and brainstorm how you want your data to look in the end, the easier data cleaning will become.

NOTE: Remember… Shit in, shit out!

Below I’ll show you some practices for data organisation. From here on, the material has been modified to give an example as discussed in the workshop delivered on 26th of October 2019 @ UWA. This dataset will be used in a proposed part 2 workshop.

0.2.4 It’s Spreadsheet Time!

Before we download the data, we’re going to create a folder structure for storing everything.

Create a subfolder in your Documents folder called EcologyPractise, with the subfolders EcologyPractise/data and EcologyPractise/images. Your subfolders should look like this:
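If you prefer to create the folders from R rather than by hand, something like the following should work. The function name create_project_folders is hypothetical, and the demonstration uses a temporary directory so that nothing is written to your Documents folder.

```r
# Create a project folder with data and images subfolders.
# `root` is the parent folder; create_project_folders is a hypothetical helper.
create_project_folders <- function(root, project = "EcologyPractise") {
  folders <- file.path(root, project, c("data", "images"))
  for (folder in folders) {
    # recursive = TRUE also creates the parent project folder if needed
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(folders)
}

# Demonstration in a temporary directory
demo_root <- tempfile("demo_")
created <- create_project_folders(demo_root)
print(dir.exists(created))
```

To create the real folders, pass the path to your own Documents folder as root. Where that folder lives differs between operating systems, so check the path first.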

Download the following files from this link and save them in the data subfolder:

plots.csv
species.csv
surveys.csv

Open the data subfolder to inspect the files (it should look something like this):
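You can also do this inspection from R. The sketch below checks which of the three files are present in a folder and takes a quick look at one of them. The helper name check_data_files and the plot_id column in the demonstration file are assumptions for illustration only.

```r
# Report which of the expected files are present in a folder
# (check_data_files is a hypothetical helper)
check_data_files <- function(data_dir,
                             files = c("plots.csv", "species.csv", "surveys.csv")) {
  present <- file.exists(file.path(data_dir, files))
  names(present) <- files
  present
}

# Demonstration with a temporary folder that holds only one of the files
demo_dir <- tempfile("data_")
dir.create(demo_dir)
write.csv(data.frame(plot_id = 1:2), file.path(demo_dir, "plots.csv"),
          row.names = FALSE)

status <- check_data_files(demo_dir)
print(status)

# Once a file is present, read.csv() and str() give a quick look at its contents
plots <- read.csv(file.path(demo_dir, "plots.csv"))
str(plots)
```

With the real data, point data_dir at your own data subfolder instead of the temporary one.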

0.2.4.1 Reorganising the datasets

For the purposes of this example, we’re going to create a main dataset in Excel format with tabs to give value and variable labels, which may be called into R (as discussed in the workshop). This will be used in a Part 2 to this workshop, which will be posted by the end of the month.

First we will open the dataset surveys.csv. Inspecting the data, we can see a few things we need to fix before continuing: create a single date variable (currently split into day, month and year), numerically code the nominal variables (i.e. species and sex here), and perhaps add some information about units of measurement (i.e. in variable labels).
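As a rough idea of what numerically coding a nominal variable can look like in R, here is a small sketch. The values are made up, and the choice of F = 1 and M = 2 is an assumption for illustration only.

```r
# Toy values standing in for the sex column (made up for illustration)
sex <- c("M", "F", "F", "M", "")

# Treat empty strings as missing before coding
sex[sex == ""] <- NA

# factor() fixes the set of levels; as.integer() gives the numeric codes
sex_factor <- factor(sex, levels = c("F", "M"))
sex_code <- as.integer(sex_factor)

# A lookup table of value labels, handy for a labels tab in a workbook
value_labels <- data.frame(
  code  = seq_along(levels(sex_factor)),
  label = levels(sex_factor),
  stringsAsFactors = FALSE
)

print(sex_code)
print(value_labels)
```

Setting the levels explicitly keeps the codes stable even if a category happens to be absent from a particular file.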

Generate the date first using the Excel function (in picture), then copy the formula down the column (the easiest way is to press Ctrl + C on the keyboard, scroll to the bottom, then on the last cell of the column press Ctrl + Shift + V). Delete day, month and year, and save the file as surveys_data.xlsx in the data subfolder. You will see why shortly.
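For comparison, the same date step could be done in R. This is only a sketch: the three toy rows and the column names day, month and year are assumptions about how the file is laid out.

```r
# Toy rows standing in for surveys.csv (column names and values are assumed)
surveys <- data.frame(
  day   = c(16L, 1L, 29L),
  month = c(7L, 12L, 2L),
  year  = c(1977L, 1990L, 2000L)
)

# Build an ISO-style text date (yyyy-mm-dd), then convert it to a Date
surveys$date <- as.Date(
  sprintf("%04d-%02d-%02d", surveys$year, surveys$month, surveys$day),
  format = "%Y-%m-%d"
)

# Drop the three original columns now that date holds the same information
surveys <- surveys[, !(names(surveys) %in% c("day", "month", "year")), drop = FALSE]

print(surveys)
```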

We are going to copy and paste the plots.csv data into its own tab in the new Excel dataset called plot_info. Likewise, we will copy and paste the species.csv data into its own tab called species_info.
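Once the workbook has its tabs, one way to call them into R is with the readxl package. The sketch below assumes readxl is installed, and the helper name read_workbook_tabs is hypothetical. It reads every tab into a named list.

```r
# Read every tab of an Excel workbook into a named list of data frames.
# Assumes the readxl package is installed; read_workbook_tabs is a hypothetical helper.
read_workbook_tabs <- function(path) {
  if (!requireNamespace("readxl", quietly = TRUE)) {
    stop("Please install the readxl package first: install.packages('readxl')")
  }
  if (!file.exists(path)) {
    stop("Workbook not found: ", path)
  }
  tabs <- readxl::excel_sheets(path)
  # setNames() keeps the tab names so each table can be picked out by name
  lapply(setNames(tabs, tabs), function(tab) readxl::read_excel(path, sheet = tab))
}
```

With the workbook saved as described above, read_workbook_tabs("data/surveys_data.xlsx") would return a list from which you can pick out plot_info and species_info by name. The relative path assumes your working directory is the EcologyPractise folder.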

The rest of the steps will be performed in an RMarkdown document so you can keep a record of your steps. This is an extension of what was presented in the workshop but hopefully easy to follow.

0.3 Sound Documentation

An important part of data cleaning is making sure to keep good documentation. One way to do this is with rmarkdown, an R package developed by RStudio.

NOTE: Today’s workshop was written in rmarkdown and rendered with knitr (back to this later). You can find the cheatsheet here: [https://rmarkdown.rstudio.com/lesson-15.html]

0.3.1 Why R Markdown?

Easy to integrate data directly into other documents
No copy/paste -> less margin for error
Much simpler to learn compared to other tools/languages, like LaTeX
Reference management integration: easy to cite relevant papers and autogenerate bibliographies

0.3.2 Getting Started in RMarkdown

While I was editing this workshop material, Emi Tanaka (one of the regular contributors to R-Ladies Sydney) uploaded this fantastic workshop on RMarkdown. I learnt a bit from it as well! I highly recommend following this link. It is run using RStudio Cloud, a resource introduced in our first R-Ladies Beginner workshop. It allows you to work through the materials via an RStudio interface in a web browser.

Remember to read the download instructions on the above link before following the slides.

0.4 Resources

R-Ladies Global main website: [https://rladies.org/]
Free RMarkdown tutorial on the RStudio Website: [https://rmarkdown.rstudio.com/]
Cheatsheets on DataCamp: [https://www.datacamp.com/community/data-science-cheatsheets]
If you want to do more with the ecology or other data, check out Data Carpentry for more lessons: [https://datacarpentry.org/spreadsheet-ecology-lesson/]
An (albeit older) workshop on reproducible research: [http://bcb.dfci.harvard.edu/~aedin/courses/ReproducibleResearch/]
Reproducible Research Resources in R: [https://cran.r-project.org/web/views/ReproducibleResearch.html]