There’s a bit of debate about what software you should use to analyse data. Every way of doing it has advantages and disadvantages. We’ll be using R in this course because:
When you type everything you have a written record of everything you’ve done to a piece of data. This makes it easier to check your logic and to preserve the original data.
You will often need to do the same piece of analysis over and over on different datasets. Having the code to do the analysis written out makes it easy to repeat.
R is free and open source.
R has a lot of excellent libraries (downloadable extensions to R) which make generating plots, doing specialised techniques, making reports and even producing interactive dashboards easy.
First we are going to install R, the programming language. R is the programme that actually does the analysis; you interact with it by typing R commands. For example, if you type 2 + 2, R will do the maths and return the result to you.
Then we are going to install an IDE (integrated development environment) called RStudio. RStudio is a very popular way of interacting with R. While you don’t need it to write R code, RStudio makes it easy to do things like:
Write longer pieces of code and get R to run it piece by piece.
Keep track of what data R knows about.
Organise all your R code files.
See plots and tables you’ve created in R.
Make special types of R files e.g. reports, dashboards etc.
Access R’s in-built help.
Manage which R packages (extensions to R) you have installed.
R is free and open source. It is written by volunteers, and all the packages you’ll use were also written by volunteers.
RStudio is also free and open source, but it is made by a for-profit company. They make their money by selling a professional version of RStudio that runs on a server and comes with support.
For Windows Users:
Go to www.r-project.org
Click on the ‘download R’ link in the first paragraph.
Choose one of the links at the top of the page under ‘0-Cloud’.
Choose the ‘base’ subdirectory.
Click the Download R link at the top of the page.
Once the file has downloaded, click on it to start the installer.
Go through the installer; most of the defaults should be fine.
For Mac Users:
Go to www.r-project.org
Click on the ‘download R’ link in the first paragraph.
Choose one of the links at the top of the page under ‘0-Cloud’.
Choose Download R for Mac.
Choose the newest .pkg file, which should be at the top of the page.
Once the file has downloaded, click on it to start the installer.
Go through the installer; most of the defaults should be fine.
For Windows Users:
Go to https://www.rstudio.com
Choose download RStudio at the top of the page.
You will want to download RStudio Desktop, and you will want to pick the free license.
Choose the Windows Vista/7/8/10 installer.
Once the file has downloaded, click on it to start the installer.
Go through the installer; most of the defaults should be fine.
For Mac Users:
Go to https://www.rstudio.com
Choose download RStudio at the top of the page.
You will want to download RStudio Desktop, and you will want to pick the free license.
Choose the Mac OSX installer.
Once the file has downloaded, click on it to open it.
Drag RStudio into the Applications folder.
Once you have RStudio installed and opened you should see four panels. We’ll ignore the two right-hand-side panels for just now.
The bottom-left panel is the R interpreter. We can type R commands into the interpreter and R will do some calculations and return an answer. Let’s start with one of the most basic commands possible. Let’s get R to add 2 and 2. Type 2 + 2 into the interpreter and press enter.
You should see something like this:
2 + 2
## [1] 4
After you hit enter, R understood the command and found the answer, returning it to you almost instantly. (The [1] in front of the answer is the index of the first value printed on that line; here the answer has only one element.)
Now type the same thing in the top-left panel. When you press enter here nothing is run; the cursor just moves to a new line. The top-left panel is basically a simple text editor, like Notepad on Windows or TextEdit on a Mac. You can absolutely write R code in a separate text editor, and many people do. A big advantage of writing code inside RStudio is that it’s very easy to send code from the editor to the R interpreter. Just move your cursor to the line with 2 + 2 and press ctrl and enter at the same time (cmd+enter also works on a Mac). The code you have written will now appear in the interpreter along with the answer.
If you have a longer piece of code that goes across multiple lines, highlight all of the lines and then press ctrl+enter. Try this now by typing the following in the top-left editor:
1 + 2 +
4
## [1] 7
You can save the code you have written in the text editor and come back to it at any time. Just go to File and then Save As…. When going through this unit keep everything you’ve written in the text editor saved.
The assignment operator in R is <-. It is sometimes possible to use = for assignment.
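As a small illustration of the two assignment forms, here is a sketch with made-up variable names (x, y, z and total are not from the course material). It also shows one place where = does something other than assignment: inside a function call it names an argument.

```r
x <- 5          # the usual way to assign a value to a name
y = 10          # = also assigns when used on its own like this
z <- x + y
z
## [1] 15

# Inside a function call, = names an argument instead of creating a variable.
total <- sum(c(1, NA, 3), na.rm = TRUE)
total
## [1] 4
```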
R uses $ in a manner analogous to the way other languages use dot.
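R has several one-letter names that are already taken by built-in functions and variables: c, q, t, C, D, F, I, and T. They are not reserved words in the strict sense (R will let you overwrite them), but it is best to avoid using them as names for your own objects.
The primary data type in R is the vector. This is an ordered collection of values with no other structure.
The length of a vector is the number of elements it contains.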
A vector in R is a container vector, a statistician’s collection of data, not a mathematical vector. The R language is designed around the assumption that a vector is an ordered set of measurements rather than a geometrical position or a physical state.
Vectors are created using the c function. For example, p <- c(2,3,5,7). Elements of a vector can be accessed using [], and positions are counted from 1. So in the above example, p[3] is 5.
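The following sketch reuses the vector p from the example above and tries out length and indexing. The extra lines (selecting several elements at once, and multiplying the whole vector) are illustrative additions rather than part of the original example.

```r
p <- c(2, 3, 5, 7)

length(p)      # number of elements in the vector
## [1] 4
p[1]           # the first element (positions start at 1, not 0)
## [1] 2
p[3]
## [1] 5
p[c(1, 4)]     # several positions can be requested at once
## [1] 2 7
p * 2          # arithmetic is applied to every element
## [1]  4  6 10 14
```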
The type of a vector is the type of the elements it contains and must be one of the following: logical, integer, double, complex, character, or raw. All elements of a vector must have the same underlying type. This restriction does not apply to lists.
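You can input T or TRUE for true values and F or FALSE for false values. TRUE and FALSE are the safer choice, because T and F are ordinary variables that can be overwritten.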
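One way to see the "same underlying type" rule in action is to ask R for the type of a few small vectors with typeof. This is a minimal sketch; the vector called mixed is a made-up example showing that R silently converts everything to a common type when the inputs differ.

```r
typeof(c(1.5, 2))         # numbers are stored as double by default
## [1] "double"
typeof(c(1L, 2L))         # the L suffix asks for integers
## [1] "integer"
typeof(c(TRUE, FALSE))
## [1] "logical"
typeof(c("a", "b"))
## [1] "character"

# Mixing types does not give an error: everything becomes character.
mixed <- c(1, "a", TRUE)
typeof(mixed)
## [1] "character"
mixed
## [1] "1"    "a"    "TRUE"
```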
Lists are like vectors, except elements need not all have the same type.
Elements can be accessed by position using [[]]. Named elements may be accessed either by position or by name.
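A short sketch of a list with named elements follows. The list person and its contents are invented for illustration; the point is only that the elements have different types and can be reached by position or by name.

```r
# A list can hold a character value, a number and a vector side by side.
person <- list(name = "Ada", age = 36, scores = c(90, 85))

person[[1]]          # access by position
## [1] "Ada"
person[["age"]]      # access by name
## [1] 36
person$scores        # $ is a shorthand for access by name
## [1] 90 85
length(person)       # number of elements in the list
## [1] 3
```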
In a sense, R does not support matrices, only vectors. But you can change the dimension of a vector, essentially making it a matrix.
For example, the following creates a matrix m with 2 rows and 3 columns, filled column by column:
m <- array( c(1,2,3,4,5,6),dim=c(2,3))
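To check how the values end up arranged, the sketch below inspects m and then builds the same matrix by setting the dimension of a plain vector directly. The name v is an illustrative assumption.

```r
m <- array(c(1, 2, 3, 4, 5, 6), dim = c(2, 3))

dim(m)        # 2 rows, 3 columns
## [1] 2 3
m[1, 2]       # row 1, column 2 (values fill down the columns first)
## [1] 3
m[2, 3]       # row 2, column 3
## [1] 6

# Setting the dimension of an ordinary vector gives the same result.
v <- c(1, 2, 3, 4, 5, 6)
dim(v) <- c(2, 3)
all(m == v)
## [1] TRUE
```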
As in other programming languages, the result of an operation on numbers may return NaN, the symbol for “not a number.”
R also has a different type of non-number: NA, for “not available.” NA is used to indicate missing data.
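The difference between the two is easiest to see with a small example. In this sketch the vector x is made up; it stands in for a set of measurements where one value was never recorded.

```r
0 / 0                     # an undefined calculation gives NaN
## [1] NaN
is.nan(0 / 0)
## [1] TRUE

x <- c(4, NA, 6)          # the second measurement is missing
is.na(x)
## [1] FALSE  TRUE FALSE
mean(x)                   # a missing value makes the mean unknown too
## [1] NA
mean(x, na.rm = TRUE)     # na.rm = TRUE drops missing values first
## [1] 5
```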
f <- function(a, b)
{
return (a+b)
}
The function keyword creates a function, which is usually assigned to a variable (f in this case) but need not be.
Note that return is a function; its argument must be contained in parentheses.
The use of return is optional; without it, the value of the last expression evaluated in a function is its return value.
f <- function(a, b)
{
  a + b
}
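Here is a brief sketch of calling a function like the ones above, and of using a function without assigning it to a variable first. The names add and x are illustrative choices, not part of the original text.

```r
# Define a function and call it with two arguments.
add <- function(a, b) {
  a + b    # the last expression evaluated is the return value
}
add(2, 3)
## [1] 5

# A function does not need a name: here one is created and called at once.
(function(a, b) a * b)(2, 3)
## [1] 6

# Unnamed functions are often passed straight to other functions.
sapply(c(1, 2, 3), function(x) x^2)
## [1] 1 4 9
```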
Suppose you ask 15 people the following three questions:
How tall are you in centimetres?
On a scale from -3 to 3, how much do you like spinach? (On this scale -3 means you hate spinach, 0 means you are neutral on spinach, and 3 means you love spinach.)
On the same scale, how much do you like chocolate?
You take their responses on each question and type them into R as vectors:
height <- c(133, 110, 224, 134, 135, 136, 125, 137, 104, 132, 114, 130, 129, 237, 131)
spinach_rating <- c(0, 1, -3, 0, -2, 0, -3, 1, 0, -3, -3, 3, 3, 0, 2)
chocolate_rating <- c(3, 3, 0, 3, -3, 3, 0, 2, 2, 2, 3, 3, 2, 3, 1)
To link data together we use the function data.frame. This takes in multiple vectors of data and converts them into a special data frame object:
data.frame(height, spinach_rating, chocolate_rating)
##    height spinach_rating chocolate_rating
## 1     133              0                3
## 2     110              1                3
## 3     224             -3                0
## 4     134              0                3
## 5     135             -2               -3
## 6     136              0                3
## 7     125             -3                0
## 8     137              1                2
## 9     104              0                2
## 10    132             -3                2
## 11    114             -3                3
## 12    130              3                3
## 13    129              3                2
## 14    237              0                3
## 15    131              2                1
If we want to access the individual vectors in a dataframe we need to tell R where that vector is (it’s inside a dataframe). We do this using the $ operator.
df <- data.frame(height, spinach_rating, chocolate_rating)
df$height
## [1] 133 110 224 134 135 136 125 137 104 132 114 130 129 237 131
Here df$height means the vector named height inside the dataframe df. Every time you use the height vector from the dataframe you will need to write df$height.
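Because df$height is just a vector, anything that works on a vector works on it too. The sketch below rebuilds the dataframe from the survey data above so that it runs on its own; the choice of mean, max and nrow is only for illustration.

```r
height <- c(133, 110, 224, 134, 135, 136, 125, 137, 104, 132, 114, 130, 129, 237, 131)
spinach_rating <- c(0, 1, -3, 0, -2, 0, -3, 1, 0, -3, -3, 3, 3, 0, 2)
chocolate_rating <- c(3, 3, 0, 3, -3, 3, 0, 2, 2, 2, 3, 3, 2, 3, 1)
df <- data.frame(height, spinach_rating, chocolate_rating)

nrow(df)             # one row per person
## [1] 15
df$height[3]         # the third person's height
## [1] 224
max(df$height)       # the tallest person
## [1] 237
mean(df$height)      # the average height
## [1] 140.7333
```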
Keeping objects in dataframes and accessing them using $ does seem awkward at first. However, it is often useful in the long run, for example:
Some R libraries (such as dplyr and ggplot2) are designed to work well with dataframes. You really need to have your data in dataframes to use these libraries effectively, and in turn they make working with dataframes easier.
If we want to add an extra vector to our dataframe we can just assign to a name in that dataframe directly.
df$age <- c("Adult", "Child", "Adult", "Adult", "Child", "Adult", "Child",
"Adult", "Child", "Child", "Child", "Adult", "Adult", "Adult",
"Adult")
Deleting a vector from a data frame can be done by setting that vector to NULL.
df$age <- NULL
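To confirm that a column really has been added or removed, you can look at the column names with names. This is a minimal sketch on a cut-down dataframe; small_df is a hypothetical name, and its values are the first three rows of the survey data.

```r
small_df <- data.frame(height = c(133, 110, 224))
names(small_df)
## [1] "height"

small_df$age <- c("Adult", "Child", "Adult")   # add a column
names(small_df)
## [1] "height" "age"

small_df$age <- NULL                           # remove it again
names(small_df)
## [1] "height"
```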
A very useful function in R is str (short for “structure”). It gives you a compact summary of what is in an object. Using it on a dataframe will give you a compact description of all the vectors in that dataframe.
str(df)
## 'data.frame': 15 obs. of 3 variables:
## $ height : num 133 110 224 134 135 136 125 137 104 132 ...
## $ spinach_rating : num 0 1 -3 0 -2 0 -3 1 0 -3 ...
## $ chocolate_rating: num 3 3 0 3 -3 3 0 2 2 2 ...
The ‘Environment’ panel on the top right in RStudio will also give you the descriptions from str if you click on the triangle next to a dataframe. You can also view the entire dataframe by using View (or by clicking on the name of a dataframe in the Environment panel). This is useful if you are familiar with working in Excel, and like to be able to see the data you are working with.
View(df)
Unlike in Excel, you cannot edit the data from the view mode, although this is a good thing because you can’t accidentally mess up the data!
Comments
Comments begin with # and continue to the end of the line.
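A tiny sketch of both places a comment can appear (the variable x is only for illustration):

```r
# This whole line is a comment, so R ignores it.
x <- 2 + 2   # a comment can also follow code on the same line
x
## [1] 4
```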