Sections

Introduction

Why Use R

There’s a bit of debate about what software you should use to analyse data. Every way of doing it has advantages and disadvantages. We’ll be using R in this course because:

  1. When you type everything you have a written record of everything you’ve done to a piece of data. This makes it easier to check your logic and to preserve the original data.

  2. You will often need to do the same piece of analysis over and over on different datasets. Having the code to do the analysis written out makes it easy to repeat.

  3. R is free and open source.

  4. R has a lot of excellent libraries (downloadable extensions to R) which make generating plots, doing specialised techniques, making reports and even producing interactive dashboards easy.

Learning Outcomes

  1. Understand how to write basic R code and how data exists inside an R session
  2. Be able to read many types of data into R
  3. Be able to carry out a very wide range of data wrangling tasks using dplyr
  4. Know how to re-code variables in R, including missing values.
  5. Create graphs and visualizations in R
  6. Introduction to one of: (1) Statistical Analysis Using R, (2) Automating Reports Using R, (3) Mapping Using R

Installation and Getting Started

Installing R and RStudio

We are going to install R, the programming language. R is the programme that actually does the analysis - you interact with this programme by typing R commands. For example, if you type 2 + 2, R will do the maths and return the result to you.

Then we are going to install an IDE (integrated development environment) called RStudio. RStudio is a very popular way of interacting with R. While you don’t need it to write R code, RStudio makes it easy to do things like:

  1. Write longer pieces of code and get R to run it piece by piece.

  2. Keep track of what data R knows about.

  3. Organise all your R code files.

  4. See plots and tables you’ve created in R.

  5. Make special types of R files e.g. reports, dashboards etc.

  6. Access R’s in-built help

  7. Manage which R packages (extensions to R) you have installed.

R is free and open source. It is written by volunteers, and all the packages you’ll use were also written by volunteers.

RStudio is also free and open source, but is made by a profit-making company. They make their money by selling a professional version of RStudio that runs on a server and has support.

Install R

For Windows Users:

  1. Go to www.r-project.org

  2. Click on the ‘download R’ link in the first paragraph.

  3. Choose one of the links at the top of the page under ‘0-Cloud’.

  4. Choose the ‘base’ subdirectory.

  5. Click the Download R link at the top of the page.

  6. Once the file has downloaded, click on it to open it.

  7. Go through the installer; most of the defaults should be fine.

For Mac Users:

  1. Go to www.r-project.org

  2. Click on the ‘download R’ link in the first paragraph.

  3. Choose one of the links at the top of the page under ‘0-Cloud’.

  4. Download R for Mac.

  5. Choose the newest .pkg file, which should be at the top of the page.

  6. Once the file has downloaded, click on it to open it.

  7. Go through the installer; most of the defaults should be fine.

Install RStudio

For Windows Users:

  1. Go to https://www.rstudio.com

  2. Choose download RStudio at the top of the page.

  3. You will want to download RStudio Desktop, and you will want to pick the free license.

  4. Choose the Windows Vista/7/8/10 installer.

  5. Once it has downloaded, click on it to open it.

  6. Go through the installer; most of the defaults should be fine.

For Mac Users:

  1. Go to https://www.rstudio.com

  2. Choose download RStudio at the top of the page.

  3. You will want to download RStudio Desktop, and you will want to pick the free license.

  4. Choose the Mac OSX installer.

  5. Once it has downloaded, click on it to open it.

  6. Drag RStudio into the Applications folder.

R Basics

Using RStudio

Once you have RStudio installed and opened you should see four panels. We’ll ignore the two right-hand-side panels for just now.

The bottom-left panel is the R interpreter. We can type R commands into the interpreter and R will do some calculations and return an answer. Let’s start with one of the most basic commands possible. Let’s get R to add 2 and 2. Type 2 + 2 into the interpreter and press enter.

You should see something like this:

2 + 2
## [1] 4

After you hit enter, R understood the command and found the answer, returning it almost instantly to you. (The little [1] in front of the answer is the position of the first value shown on that line; here the answer has only one element.)

Now, type the same thing in the top-left panel. When you press enter here, the cursor just moves to a new line and nothing is calculated. The top-left panel is basically just a very simple text editor like Notepad on Windows or TextEdit on a Mac. You can absolutely write R code in a separate text editor and many people do. A big advantage of writing code inside RStudio is that it’s very easy to transfer code from the editor to the R interpreter. Just move your cursor to the line with 2 + 2 and press ctrl and enter at the same time (cmd+enter also works on a Mac). The code you have written will now appear in the interpreter along with the answer.

If you have a longer piece of code that goes across multiple lines you will need to highlight the lines and then press ctrl+enter. Try this just now after typing this in the top left editor.

1 + 2 +
4
## [1] 7

You can save the code you have written in the text editor and come back to it at any time. Just go to File and then Save As…. When going through this unit keep everything you’ve written in the text editor saved.

Assignment and underscore

The assignment operator in R is <-. It is sometimes possible to use = for assignment.
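As a small sketch of assignment in practice (the names x and y are only illustrations, not part of the course data):

```r
# Assign the value 5 to x using the usual assignment operator
x <- 5

# At the top level of a script, = also assigns,
# although <- is the more common style in R code
y = 10

# Use the stored values in a calculation
total <- x + y
total
```

Typing the name of a variable on its own, as in the last line, prints its value.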

Variable names

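R uses $ in a manner analogous to the way other languages use dot: it picks out a named component of an object.

R predefines several one-letter names, such as c, q, t, C, D, F, I, and T. These are not strictly reserved words, but it is best to treat them as reserved and avoid using them as variable names.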

Vectors

The primary data type in R is the vector. This is an ordered collection of values, all of the same type, with no other structure.

The length of a vector is the number of elements in the container.

A vector in R is a container vector, a statistician’s collection of data, not a mathematical vector. The R language is designed around the assumption that a vector is an ordered set of measurements rather than a geometrical position or a physical state.

Vectors are created using the c function. For example, p <- c(2,3,5,7) creates a vector p with four elements. Elements of a vector can be accessed using []. Positions are counted from 1, so in the above example, p[3] is 5.
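A short sketch of creating and indexing a vector, using the same values as p in the text:

```r
# Create a vector with four elements
p <- c(2, 3, 5, 7)

length(p)    # number of elements: 4
p[3]         # third element: 5 (R counts positions from 1, not 0)
p[c(1, 4)]   # first and fourth elements: 2 7
p * 2        # arithmetic is applied to every element: 4 6 10 14
```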

Types

The type of a vector is the type of the elements it contains and must be one of the following: logical, integer, double, complex, character, or raw. All elements of a vector must have the same underlying type. This restriction does not apply to lists.
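One way to check the type of a vector is the typeof function. The sketch below also shows what happens when values of different types are combined with c; the name mixed is just an illustration:

```r
typeof(c(TRUE, FALSE))   # "logical"
typeof(c(1L, 2L))        # "integer" (the L suffix marks an integer)
typeof(c(1.5, 2))        # "double"
typeof(c("a", "b"))      # "character"

# Because all elements must share one type, c() converts
# mixed input to a single common type (here, character)
mixed <- c(1, "a", TRUE)
typeof(mixed)            # "character"
mixed                    # "1" "a" "TRUE"
```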

Boolean operators

You can input T or TRUE for true values and F or FALSE for false values.
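A minimal sketch of the basic logical operators, using made-up values (a, b and ages are hypothetical names):

```r
a <- TRUE
b <- FALSE

a & b    # AND: FALSE
a | b    # OR:  TRUE
!a       # NOT: FALSE

# Comparisons return logical vectors, one result per element
ages <- c(12, 30, 45)
is_adult <- ages > 18
is_adult   # FALSE TRUE TRUE
```

Spelling out TRUE and FALSE is usually the safer habit, since T and F are ordinary variables that can be overwritten.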

Lists

Lists are like vectors, except elements need not all have the same type.

Elements can be accessed by position using [[]]. Named elements may be accessed either by position or by name.
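As an illustration, here is a small list holding three different types. The list person and its contents are invented for this sketch:

```r
# A list can mix character, numeric and logical elements
person <- list(name = "Ada", height = 170, likes_spinach = TRUE)

person[[2]]        # access by position: 170
person[["name"]]   # access by name: "Ada"
person$name        # $ is a shorthand for access by name: "Ada"
```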

Matrices

In a sense, R does not support matrices, only vectors. But you can change the dimension of a vector, essentially making it a matrix.

For example, the following creates a matrix m with 2 rows and 3 columns:

m <- array(c(1, 2, 3, 4, 5, 6), dim = c(2, 3))
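To see how the values end up arranged, the sketch below builds the same matrix and picks out a few elements. R fills a matrix column by column:

```r
m <- array(c(1, 2, 3, 4, 5, 6), dim = c(2, 3))

m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

dim(m)     # 2 rows and 3 columns: 2 3
m[2, 3]    # row 2, column 3: 6
m[, 1]     # the whole first column: 1 2
```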

Missing values and NaNs

As in other programming languages, the result of an operation on numbers may return NaN, the symbol for “not a number.”

R also has a different type of non-number, NA, for “not available.” NA is used to indicate missing data.
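A brief sketch of how NaN and NA behave in practice; the vector scores is made up for illustration:

```r
# An undefined numeric result gives NaN
nan_result <- 0 / 0
is.nan(nan_result)           # TRUE

# NA marks a missing value in the data
scores <- c(3, NA, 5)
is.na(scores)                # FALSE TRUE FALSE

# Most summaries return NA if any value is missing...
mean(scores)                 # NA

# ...unless you ask for missing values to be removed first
mean(scores, na.rm = TRUE)   # 4
```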

Comments

Comments begin with # and continue to the end of the line.

Functions in R

f <- function(a, b)
{
    return (a+b)
}

The function function returns a function, which is usually assigned to a variable, f in this case, but need not be.

Note that return is a function; its argument must be contained in parentheses.

The use of return is optional; otherwise the value of the last line executed in a function is its return value.

f <- function(a, b)
{
    # No explicit return: the value of the last expression is returned
    a + b
}
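Once a function has been defined it can be called by name. The sketch below uses a hypothetical function add_two so that it stands on its own; it also shows a function that is used without ever being assigned to a variable:

```r
# Define a function of two arguments; the last expression is the return value
add_two <- function(a, b) {
  a + b
}

add_two(2, 3)            # arguments matched by position: 5
add_two(b = 10, a = 1)   # arguments matched by name: 11

# A function does not need a name: here one is passed straight to sapply,
# which applies it to each element of the vector
doubled <- sapply(c(1, 2, 3), function(x) x * 2)
doubled                  # 2 4 6
```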

Dataframes

Suppose you ask a group of people three questions:

  1. How tall are you in centimetres?

  2. On a scale from -3 to 3, how much do you like spinach? (On this scale -3 means you hate spinach, 0 means you are neutral on spinach, and 3 means you love spinach.)

  3. On the same scale how much do you like chocolate?

You take their responses on each question and type them into R as vectors:

height <- c(133, 110, 224, 134, 135, 136, 125, 137, 104, 132, 114, 130, 129, 237, 131)
spinach_rating <- c(0,  1, -3,  0, -2,  0, -3,  1,  0, -3, -3,  3,  3,  0,  2)
chocolate_rating <- c(3, 3, 0, 3, -3, 3, 0, 2, 2, 2, 3, 3, 2, 3, 1)

To link data together we use the function data.frame. This takes in multiple vectors of data and converts them into a special data frame object:

data.frame(height, spinach_rating, chocolate_rating)
##    height spinach_rating chocolate_rating
## 1     133              0                3
## 2     110              1                3
## 3     224             -3                0
## 4     134              0                3
## 5     135             -2               -3
## 6     136              0                3
## 7     125             -3                0
## 8     137              1                2
## 9     104              0                2
## 10    132             -3                2
## 11    114             -3                3
## 12    130              3                3
## 13    129              3                2
## 14    237              0                3
## 15    131              2                1
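A few base R functions are handy for checking the size and shape of a data frame. This is only a sketch: it uses the first five responses from the vectors above and stores them under the hypothetical name survey:

```r
# First five responses only, to keep the example short
height <- c(133, 110, 224, 134, 135)
spinach_rating <- c(0, 1, -3, 0, -2)
chocolate_rating <- c(3, 3, 0, 3, -3)

survey <- data.frame(height, spinach_rating, chocolate_rating)

nrow(survey)    # number of rows (respondents): 5
ncol(survey)    # number of columns (questions): 3
names(survey)   # column names, taken from the vector names
survey[2, ]     # all answers from the second respondent
```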

Working with dataframes

If we want to access the individual vectors in a dataframe we need to tell R where that vector is (it’s inside a dataframe). We do this using the $ operator.

df <- data.frame(height, spinach_rating, chocolate_rating)
df$height
##  [1] 133 110 224 134 135 136 125 137 104 132 114 130 129 237 131
# df$height means the vector named height inside the dataframe df. Every time you use the height vector you will need to write df$height.
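Because a column pulled out with $ is an ordinary vector, it can be passed to any function that works on vectors. The sketch below uses a shortened, hypothetical data frame called survey (the first five responses only):

```r
survey <- data.frame(
  height = c(133, 110, 224, 134, 135),
  chocolate_rating = c(3, 3, 0, 3, -3)
)

# Summarise a single column
mean_height <- mean(survey$height)
mean_height   # 147.2

# Use one column to pick out values from another:
# heights of the people who gave chocolate the top rating
top_fans <- survey$height[survey$chocolate_rating == 3]
top_fans      # 133 110 134
```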

Keeping objects in dataframes and accessing them using $ does seem awkward at first. However, it is often useful in the long run, for reasons including:

  1. Some R libraries (dplyr and ggplot2) are designed to work well with dataframes. You really need to have your data in dataframes to use these libraries effectively and in turn they make working with dataframes easier.

  2. Most data gets read into R in the form of a dataframe.

If we want to add an extra vector to our dataframe we can just assign to a name in that dataframe directly.

df$age <- c("Adult", "Child", "Adult", "Adult", "Child", "Adult", "Child",
            "Adult", "Child", "Child", "Child", "Adult", "Adult", "Adult",
            "Adult")

Deleting a vector from a dataframe can be done by setting it to NULL.

df$age <- NULL

A very useful function in R is str. It displays the structure of an object. Using it on a dataframe will give you a compact description of all the vectors in that dataframe.

str(df)
## 'data.frame':    15 obs. of  3 variables:
##  $ height          : num  133 110 224 134 135 136 125 137 104 132 ...
##  $ spinach_rating  : num  0 1 -3 0 -2 0 -3 1 0 -3 ...
##  $ chocolate_rating: num  3 3 0 3 -3 3 0 2 2 2 ...

The ‘Environment’ panel on the top right in RStudio will also give you the descriptions from str if you click on the triangle next to a dataframe. You can also view the entire dataframe by using View (or by clicking on the name of a dataframe in the Environment panel). This is useful if you are familiar with working in Excel, and like to be able to see the data you are working with.

View(df)

Unlike in Excel, you cannot edit the data from the viewer - although this is a good thing because you can’t accidentally mess up the data!

https://docs.google.com/spreadsheets/d/17mC-uKaszxDLxQB5cS7ZoaaYyHT0sHtJ/edit?usp=sharing&ouid=114321043224196873955&rtpof=true&sd=true