Getting started

1 Assessment

This session is assessed using MCQs (questions highlighted below). The actual MCQs can be found on the BS1040 Blackboard site under Assessments and Feedback/Data analysis MCQs. The deadline is listed here and on the front page of the BS1040 Blackboard site. This assessment contributes 2.5% of module marks. You will receive feedback on this assessment after the submission deadline.

2 Getting R and Rstudio onto your computer

This of course depends on what type of computer you have.

2.1 Mac

2.1.1 Install the latest version of R.

  • On a web browser open up https://cran.r-project.org/bin/macosx/
  • Click on the latest binary (R 4.0.2.pkg, when this was written). You can find it in blue writing on the left-hand side
  • Follow the installation instructions

2.1.2 Install the latest version of Rstudio.

2.2 Windows

2.2.1 Install the latest version of R.

  • On a web browser open up https://cran.r-project.org/bin/windows/base/
  • Click on the ‘Download R 4.0.2 for Windows’ link (as of time of writing)
  • R is distributed as an installer, R-4.0.2-win.exe. Run it and follow the Windows-style installation steps.

2.2.2 Install the latest version of Rstudio.

2.3 Linux

  • R is part of many Linux distributions, so check your Linux package management system first.
  • If you don’t have it, have a look here https://cloud.r-project.org/ for your distribution.

2.3.1 Install the latest version of Rstudio.

2.4 Chromebook/Android Tablet/iPad (Actually any computer)

A Chromebook doesn’t really allow you to download software. It wants you to work on the cloud. So let’s do that.

  • On a web browser open in a new window
  • Log in or sign up
  • Create a new project (Blue New Project button)
  • On the cloud you’ll first have to upload (bottom right pane) the fish data before you import it (see below).

2.5 University PCs (for when you want to work in the computer rooms)

2.5.1 Install the latest version of R.

  • You can install R using the Software center (click on the windows icon and look at the second column) on CFS machines
  • Choose R and click install
  • R should now be in your program list.

2.5.2 Install the latest version of Rstudio.

  • Same as above, but you might need to search for it in the Software center. It wasn’t on the front page when I looked.
  • Choose Rstudio and click install
  • Rstudio should now be in your program list. Open it.

3 R, Rstudio what?

R is the programming language we will be using during this course. Rstudio is an integrated development environment (IDE) for R. They are separate programs, but we will use R through Rstudio. This is an easier and all-round better way of using R. You can use R by itself, but everything is harder. The console on the bottom left in Rstudio is what R looks like by itself. You need to install both R and Rstudio.

4 Getting the data into R

There are lots of ways of getting data into R. It’s one of the most annoying things about it as a beginner. During most of the sessions, I’m going to try and use built-in datasets. But for some things in the course, and for your own data, you are going to need to know how to get data into R. Most people will have their data initially in an Excel sheet (.xlsx). But a .csv (comma-separated values) file is a simpler and more useful format to keep data in. So the first thing you’ll usually have to do when you are doing a data analysis in the wild is to convert .xlsx to .csv. Full instructions here. BUT you don’t need to do this during this course, as any external data I give you will already be in .csv form.

  1. Look at the top right-hand window in Rstudio and find the Import Dataset button. Use it to import the data as a text file (From Text (base) in newer versions). Make sure that the heading option is on.
  2. Notice that what you really did was displayed in the console:
fish.data <- read.csv("~/Dropbox/BS1070/nonsense_data_2015.csv")
  3. That means that if you typed that into the console you would get the same effect (with your filepath, not mine).
  4. Have a look at the data; it should have 235 observations of 10 variables.
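The same kind of import can be done entirely from the console. The example below is only a minimal sketch: it writes a tiny made-up table to a temporary .csv file (the names toy, toy.data, csv_path and the values are invented for illustration, they are not the course’s fish data), reads it back with read.csv, and then checks the number of observations and variables.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(id = 1:3, length_mm = c(41.2, 38.5, 44.0))

# Write it to a temporary .csv file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# header = TRUE treats the first row as column names (the 'heading' option)
toy.data <- read.csv(csv_path, header = TRUE)

dim(toy.data)   # number of observations (rows), then number of variables (columns)
str(toy.data)   # the type of each variable
head(toy.data)  # the first few rows
```

For your own file, replace csv_path with the path to your .csv, in quotes.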

5 Some R basics

5.1 Using R as a ridiculously overpowered calculator

In the console window, type some mathematical operations. I’ll get you started.

2 * 3
[1] 6
sin(2 * pi)
[1] -2.449294e-16

Try adding, subtracting and dividing. What does log do? Hint: not what you think.
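You might have expected sin(2 * pi) to be exactly 0. The value shown, -2.449294e-16, is scientific notation for a number extremely close to zero; the tiny difference comes from computers storing numbers with limited precision. A small sketch of the basic operators, and of comparing such a result with zero using a tolerance (all.equal is a base R function):

```r
7 + 2    # addition
7 - 2    # subtraction
7 / 2    # division
7 ^ 2    # raising to a power

# An exact comparison with 0 is FALSE because of the tiny rounding error
sin(2 * pi) == 0

# all.equal() compares within a small tolerance, so it treats the result as zero
isTRUE(all.equal(sin(2 * pi), 0))
```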

5.2 Creating variables

An important skill you need is creating variables. At its simplest, I want x to be equal to 2.

x <- 2

What’s with <-? <- is called the assignment operator. Why don’t we use =? Short answer: that’s just the way R is. Long answer, if you’re interested.

A bit more complicated: make y equal to 0, 2, 4, 6, 8, 10.

y <- seq(0, 10, 2)

You just used a function. It’s called seq. Want to find out about it? Use your Google-fu skills. More and more during the practicals, we won’t give you the answers. Instead, we’ll expect you to find them yourself. Why? Because that’s how everyone does analysis. Once you figure that out, you’ll realise that with the basics we are teaching you and the ability to look things up, no analysis or complicated data visualization is beyond your abilities.
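Besides searching the web, R’s built-in help is a quick first stop. Here is a minimal sketch of looking up seq and of writing the same call with a named argument (by and length.out are argument names documented on the seq help page):

```r
?seq                      # opens the help page for seq
args(seq.default)         # lists the arguments seq accepts

y <- seq(0, 10, by = 2)   # same result as seq(0, 10, 2)
y
length(y)                 # how many values y holds

# length.out asks for a number of values instead of a step size
seq(0, 10, length.out = 6)
```

Naming the argument makes the code easier to read later, which matters once a function takes more than two or three inputs.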

Task: Calculate z which is the product of x and y. (Hint: In mathematics, a product is the result of multiplying)

Blackboard MCQ: z is a vector with six values. What is the highest value?

5.3 Comments

Comments are remarks in a program that are intended to help human readers understand what is going on, but are ignored by the computer. Comments in R start with a # character and run to the end of the line. Why do we use comments? To remind ourselves or to tell our collaborators what a line of code was meant to do. Remember the adage: we write comments for the next idiot to read the code, because it will probably be us.

demo(graphics)  # Note for students, run this command, it's cool.
# Also note the computer will completely ignore what comes after the #.  Let me
# demonstrate: It was the best of times, it was the worst of times, it was the
# age of wisdom, it was the age of foolishness, it was the epoch of belief, it
# was the epoch of incredulity, it was the season of Light, it was the season
# of Darkness, it was the spring of hope, it was the winter of despair, we had
# everything before us, we had nothing before us, we were all going direct to
# Heaven, we were all going direct the other way – in short, the period was so
# far like the present period, that some of its noisiest authorities insisted
# on its being received, for good or for evil, in the superlative degree of
# comparison only.

5.4 R packages

What we have been using so far is called base R. It’s the stuff that comes working out of the box with R. You can do an amazing amount with this. But the capabilities of R have been extended hugely over the years with packages. The below is from Hadley Wickham’s R package book:

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. As of January 2015, there were over 6,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.

  • You install them from CRAN with install.packages("x").
  • You use them in R with library("x").
  • You get help on them with package?x and help(package = "x").

So let’s use a package called skimr. The install.packages command only has to be used the first time you use a package (from then on it’s on your computer). After that, the library command turns it on.

install.packages("skimr")  #You will only have to do this once on your computer.
library("skimr")  #You will only have to do this once per session. It 'turns on' the package
skim(iris)  #skim is a function from the skimr package. iris is a built-in dataset we are going to use.
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Species                0              1  FALSE           3  set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd   p0  p25   p50  p75  p100  hist
Sepal.Length           0              1  5.84  0.83  4.3  5.1  5.80  6.4   7.9  ▆▇▇▅▂
Sepal.Width            0              1  3.06  0.44  2.0  2.8  3.00  3.3   4.4  ▁▆▇▂▁
Petal.Length           0              1  3.76  1.77  1.0  1.6  4.35  5.1   6.9  ▇▁▆▇▂
Petal.Width            0              1  1.20  0.76  0.1  0.3  1.30  1.8   2.5  ▇▁▇▅▃

I haven’t run this package on all possible computers. If you are having problems with it, you can get similar information using base R with a command unsurprisingly called summary.

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
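If you want one piece of code that works whether or not skimr is available, one possible approach is to check for the package first and fall back to summary. This is only a sketch and not part of the course instructions: requireNamespace is a base R function, and the name overview is made up for the example.

```r
# requireNamespace() returns TRUE if the package is installed, without attaching it
if (requireNamespace("skimr", quietly = TRUE)) {
  overview <- skimr::skim(iris)   # pkg::function calls a function without library()
} else {
  overview <- summary(iris)       # base R fallback, always available
}
print(overview)
```

The same pattern should work for your own data by swapping iris for the name of your data frame.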
                
                
                

Task: Perform the skim (or summary if there is a problem) command on your fish.data (the data you loaded from a csv file in Section 4)

Blackboard MCQ: What’s the mean value of Standard.Length? (Clue: the mean value for Sepal.Length was 5.84 from skim and 5.843 from summary.)

5.5 Getting your session info

Task: Finally, as a test that you have R up and running on your computer, I’d like you to type the following command

sessionInfo()

Blackboard MCQ: What version of R do you have? For example, if you got R version 1.4.3 (2010-04-26), select 1 as the answer.
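If the sessionInfo() output feels crowded, the version details are also stored in base R’s R.version list. A small sketch that pulls out just the version line and the major version number (the first number in the version):

```r
R.version.string             # the full version line, e.g. "R version x.y.z (date)"
R.version$major              # the major version, stored as a character string
as.numeric(R.version$major)  # the same value converted to a number

# getRversion() returns the version as an object that can be compared
getRversion() >= "3.0.0"
```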