This session, we will focus on taking baby steps with R.
We will learn how to:
The final step involves three aspects of the data that I think are important to look for.
These are:
You will need to sign up for an RStudio Cloud account when you click on the link for our project here.
R can be unforgiving. It is case-sensitive, and if you don’t get function and object names exactly right, it will not understand what you are trying to tell it. R, although very sophisticated, is a dumb machine! Also, it is useful to litter your code with comments. You can write one simply by putting a # sign in front of what you are writing.
You will notice I have some comments to help orient me in the code on RStudio.
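A minimal sketch of both points, using a made-up object name (my_value is hypothetical): the comment text is ignored by R, and asking for the same name with different capitalisation does not find it.

```r
# Anything after a # is a comment and is ignored by R.
my_value <- 5  # store the number 5 under the name my_value

# Names must match exactly, including upper and lower case.
found_exact <- exists("my_value")   # the name exactly as we typed it
found_caps  <- exists("My_Value")   # same letters, different case

print(found_exact)
print(found_caps)
```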
RStudio has four different parts, and at first it can be a bit intimidating. Let’s explore the different parts of the program before we start. It seems strange, but the Source pane sends commands to the Console or Terminal. This is so that you have a record of what you’ve done in your Source file, which is plain text. The Console is like the brains of the operation and executes your commands. The upper right pane shows the data.frames or datasets that we are working on, and the bottom right pane has different tabs for our files, plots, packages and help.
R is like a go-kart that many, many, many people have contributed to and turned into a Rolls-Royce. Basic mathematical and statistical functions are ‘built in’ to R, but libraries are little ‘packages’ of code with functions that allow us to extend the use of R.
Four libraries will help us achieve our goals today:
rio helps us import a variety of datasets. See here for more information.
magrittr allows us to use the pipe operator in R. More information about the magrittr package is available here. This will be explained later.
visdat helps us see missing data in our data.frame in a visual manner.
skimr is a package for summarising our data.frame. See more information here.
To install libraries, we need to be connected to the internet!
To install and load libraries in R we use the following code:
install.packages("rio")
library(rio)
This is the simple way. If you have already installed rio, you can avoid re-installing it with a bit more code, which I’ll use quite frequently:
if(!require(rio)) install.packages("rio")
library(rio)
This is a bit complex, but illustrates a useful function. In computing we can make the computer do stuff if a certain condition is fulfilled. This is called conditional logic. It is vitally important.
The if(!require(rio)) bit is saying “have a look and see whether you already have the rio package”. The require(rio) bit tries to load the package and returns a value of TRUE if it succeeds (the package is installed on your computer) and FALSE if not. We only want the computer to install the library if it is not there. If it is not there, the result of require(rio) will be FALSE. The if statement only executes the code that follows it when its condition is TRUE, so we need to turn the FALSE into a TRUE and vice versa. This is what the ! does. Don’t worry if you don’t get this. It is a bit of magic that saves you a bit of time loading new libraries! The if statement means that the install.packages() code only runs if you don’t have rio already installed. When you have lots of packages to load, you’ll find this very handy!
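As a minimal sketch of the same logic, the snippet below calls require() on stats (which ships with R) and on a deliberately made-up package name, then shows how ! flips each result. The name notARealPackage123 is hypothetical, chosen on the assumption that nothing with that name is installed. Nothing is downloaded or installed here.

```r
# require() reports whether a package could be loaded.
# "stats" ships with R, so this should succeed.
has_stats <- require("stats", character.only = TRUE, quietly = TRUE)

# A made-up package name (hypothetical) that we assume is not installed.
# suppressWarnings() hides the "there is no package called ..." warning.
has_fake <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)

# ! flips TRUE to FALSE and FALSE to TRUE.
need_stats <- !has_stats
need_fake  <- !has_fake

# The code after if() only runs when the condition is TRUE.
if (need_fake) {
  message("Package missing: this is where install.packages() would run")
}
```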
The library(rio) bit makes sure that the R console can find the function we are after. For example, if we want to use the import() function from the rio package and we haven’t typed library(rio), we will have to tell R where to find the import() function with the following code:
rio::import("./path/to/csv or Excel file.csv")
With the library call, we can now more simply type:
import("./path/to/csv or Excel file.csv")
However, this assumes that no library attached after rio has a function with the same name, import(). If one does, R will use that one instead. This introduces a little bit of ambiguity, which one may wish to avoid.
You can find out which libraries R is looking in during a session by typing:
searchpaths()
The paths are hierarchical: .GlobalEnv (the global environment), which is shown in the top right pane in RStudio, is searched first, then tools:rstudio, then (on my Windows machine) C:/Program Files/R/R-4.1.2/library/stats, etc.
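A small base-R sketch of this lookup order, which should run without any extra packages. It uses search(), a companion to searchpaths() that lists the attached environments by name rather than by file path, and the :: operator to name a package explicitly. The median() function from stats is used purely as an illustration.

```r
# search() lists the environments R looks through, in order.
lookup_order <- search()
print(lookup_order)

# The global environment comes first and the base package comes last.
first_place <- lookup_order[1]
last_place  <- lookup_order[length(lookup_order)]

# Calling a function with and without the package prefix.
# Both find the same function as long as nothing else masks median().
with_prefix    <- stats::median(c(1, 5, 9))
without_prefix <- median(c(1, 5, 9))
```

The prefixed form removes any doubt about which package a function comes from, at the cost of a little more typing.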
Once library(rio) is called, R is constantly looking inside the rio library for any functions, so if you ask for the import() function, R will now find it there. This applies to any other function you want to call without referencing the library first.
To make things a bit confusing, require() will also tell R that you want to use the functions in this package (only if it is already there, not if it fails and the package has to be installed), but there is no harm in adding the library() code to make doubly sure that R can find the functions inside your package.
Now, write some code to install the magrittr, visdat, ggplot2 and skimr packages. Use the if, require, install.packages and library functions.
Write them in the editor (top left-hand pane) and use “CTRL + ENTER” to execute the code.
You’ll see the code execute in the bottom pane.
We will import some data from the internet. It is a cot-death dataset. We will talk about what it means later, but for now will just dive in to have a look at the data. It was a case-control study to look for risk factors for cot-death in the 1980s.
We will use the following code:
df <- rio::import("https://flexiblelearning.auckland.ac.nz/data-analysis/menu/1/files/simple_sids_epiinfo2.xlsx")
This effectively takes our Excel sheet from the web and pulls it into our computer. The <- is an assignment operator, which names our new spreadsheet df. I’ve chosen the name df because the technical word for a spreadsheet in R is a data.frame.
Here, "https://flexiblelearning.auckland.ac.nz/data-analysis/menu/1/files/simple_sids_epiinfo2.xlsx" is an argument, or modifier, to import, the function that does the importing. The argument tells rio::import where to look for the dataset.
You will see our object in the top right pane (Environment/History).
You can then double click it to see what it looks like in a mode similar to Excel.
If you are uncertain of all the different options for the rio::import function, type the following into the editor and execute it.
?rio::import
You’ll see the help documentation pop up in the bottom right-hand pane. This is really useful!
We can now have a look for duplicates in the df object.
df[duplicated(df), ]
## Pipe
df[duplicated(df), ] %>% nrow
## Old-fashioned
nrow(df[duplicated(df), ])
## Remove duplicate rows from data.frame
df <- df[!duplicated(df), ]
Here, we are using square brackets [] to subset the df object. The code in the square brackets is divided by the comma. Code before the comma acts on rows, and code after it acts on columns.
We are essentially saying to R, “look for any duplicated rows in the df object”. This means looking across all columns, because we haven’t limited it further.
If we instead wrote:
df[duplicated(df$Mother_age), ]
df[duplicated(df$Mother_age), ] %>% nrow
Here, the $ sign is used to indicate a column within the data.frame.
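To see what these calls return without the SIDS data, here is a minimal sketch on a made-up data.frame. The object toy and its columns id and age are hypothetical and are not part of the course dataset.

```r
# A small made-up data.frame: row 3 is an exact copy of row 1.
toy <- data.frame(
  id  = c(1, 2, 1, 4),
  age = c(25, 30, 25, 30)
)

# duplicated() marks each row that repeats an earlier one.
dup_rows <- duplicated(toy)        # checks across all columns
dup_age  <- duplicated(toy$age)    # checks the age column only

# Rows go before the comma, columns after; an empty slot means "all".
toy[dup_rows, ]                    # the duplicated rows, all columns
toy[dup_age, ]                     # rows whose age was seen earlier
toy[, "age"]                       # every row, age column only

# Keep only the rows that are not duplicates.
toy_unique <- toy[!dup_rows, ]
```

Checking a single column usually flags more rows than checking whole rows, because two different records can easily share the same age.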
Interpret the output. Note the use of the %>% operator. This takes the output of the code on its left (the duplicated rows in the data.frame) and then passes it to nrow, which counts them. This is nice for avoiding loads of brackets. The alternative way of doing it, nesting one function inside another, is a bit more confusing.
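The idea behind a pipe can be sketched in base R with a home-made operator. The operator %then% below is hypothetical and far simpler than magrittr's %>%; it only hands the value on its left to the function on its right. The toy data.frame is made up for the illustration.

```r
# A home-made, stripped-down pipe (hypothetical; not how magrittr is written).
# It passes the value on the left to the function on the right.
`%then%` <- function(lhs, f) f(lhs)

# Made-up data: row 3 repeats row 1.
toy <- data.frame(id = c(1, 2, 1), age = c(25, 30, 25))

# Nested style: read from the inside out.
n_nested <- nrow(toy[duplicated(toy), ])

# Piped style: read from left to right.
n_piped <- toy[duplicated(toy), ] %then% nrow

print(n_nested)
print(n_piped)
```

Both styles give the same count; the piped version simply puts the steps in the order in which they happen.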
Here’s the code to look at the ranges of values of a data.frame.
df %>% skim
Interpret the output.
What is the object and which is the function in the code here?
Are the ranges sensible?
visdat::vis_miss(df) +
  ggplot2::theme(plot.margin = ggplot2::unit(c(1,3,1,1), "cm"))
Interpret the plot. See if you can figure out how to sort the columns by their missingness. You’ll have to execute ?vis_miss.
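As a numeric cross-check on the plot, missingness can also be tallied in base R. This is a sketch on a made-up data.frame (toy and its columns a, b and c are hypothetical). It does not use visdat and is not the answer to the vis_miss exercise.

```r
# Made-up data with some missing values (NA).
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

# is.na() gives TRUE for each missing cell; colMeans() turns that
# into the proportion of missing values in each column.
prop_missing <- colMeans(is.na(toy))

# Sort the columns from most to least missing.
sorted_missing <- sort(prop_missing, decreasing = TRUE)
print(sorted_missing)
```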
In the data folder is the heart.csv file.
Hint:
df <- rio::import("./data/heart.csv")