Getting Started With R

What is R

R is

“a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc”

It has a 'base' system with thousands of additional packages
You need to install the base system, and then any further packages
The Comprehensive R Archive Network (CRAN) is at http://cran.r-project.org/index.html
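As a rough sketch of this two-step workflow: a package is installed once from CRAN and then loaded in each session. The package name "sp" is only an illustrative assumption (it is not named in these slides), and the install line is commented out so that nothing is downloaded by accident.

```r
# Install once (needs an internet connection); "sp" is an illustrative choice
# install.packages("sp")

# Load a package at the start of a session; 'stats' ships with base R
library(stats)

# Check whether a package is available without raising an error
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
```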

Downloading R

From CRAN click on
- Download R for windows
- Install R for the first time
- Download R 3.3.2 for Windows (or possibly Mac)
- The Windows and Mac versions are compatible: the same R code runs on both
If using Windows
- Right-click on the R icon to bring up properties
- Add --internet2 to the end of the Target string on the Shortcut tab
- Click on OK
Installing RStudio
- https://www.rstudio.com/products/rstudio/download/

Getting Help

Quick-R: http://www.statmethods.net/index.html
Barry Rowlingson’s R Spatial Cheatsheet http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/cheatsheet.html
R Bloggers http://www.r-bloggers.com/
My site http://rpubs.com/chrisbrunsdon

R Basics

Inside R (most) data and results are stored in objects
Objects can be scalars, vectors, matrices, text strings, lists or more generic objects
Generic objects such as:
- data frames
- spatial objects similar to shapefiles
- graphs
R is a programming language allowing new functions to be created
- Some are already provided…
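A minimal sketch of the object types listed above, followed by a user-defined function. All object names and values here (a_scalar, double_it and so on) are hypothetical and chosen only for illustration.

```r
a_scalar <- 3.5                        # a single number (a vector of length 1)
a_vector <- c(1, 2, 3)                 # a numeric vector
a_matrix <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns
a_string <- "hello"                    # a text string
a_list   <- list(num = 1, txt = "a")   # a list can mix types
a_frame  <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))  # a data frame

# A new function: doubles whatever numeric input it is given
double_it <- function(x) {
  x * 2
}
double_it(a_vector)
## [1] 2 4 6
```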

Vectors and Functions

heights <- c(4.7, 5.3, 6.2, 3.8, 4.4, 7.1, 2.5)
heights

## [1] 4.7 5.3 6.2 3.8 4.4 7.1 2.5

length(heights)

## [1] 7

sum(heights)

## [1] 34

mean(heights)

## [1] 4.857143

maxHeight <- max(heights)
maxHeight

## [1] 7.1
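Arithmetic on vectors is applied element by element, so results like these need no explicit loop. The sketch below reuses the heights vector; the names doubled and deviations are hypothetical.

```r
heights <- c(4.7, 5.3, 6.2, 3.8, 4.4, 7.1, 2.5)

doubled    <- heights * 2               # every element multiplied by 2
deviations <- heights - mean(heights)   # distance of each value from the mean

range(heights)                          # smallest and largest value
sort(heights, decreasing = TRUE)        # largest first
```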

Subsets of Vectors - 1

heights

## [1] 4.7 5.3 6.2 3.8 4.4 7.1 2.5

heights[5]

## [1] 4.4

heights[3:6]

## [1] 6.2 3.8 4.4 7.1

heights[c(3,6,1)]

## [1] 6.2 7.1 4.7

heights[c(1,1,2,1,6)]

## [1] 4.7 4.7 5.3 4.7 7.1

Subsets of Vectors - 2

heights

## [1] 4.7 5.3 6.2 3.8 4.4 7.1 2.5

heights[c(-1,-4)]

## [1] 5.3 6.2 4.4 7.1 2.5

heights > 6

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

which(heights > 6)

## [1] 3 6

heights[heights > 6]

## [1] 6.2 7.1
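Logical conditions can also be combined with & (and) or | (or), and summing a logical vector counts its TRUE values. A short sketch, with the names mid and n_tall chosen only for illustration:

```r
heights <- c(4.7, 5.3, 6.2, 3.8, 4.4, 7.1, 2.5)

# Keep values strictly between 4 and 6
mid <- heights[heights > 4 & heights < 6]
mid
## [1] 4.7 5.3 4.4

# Count how many values exceed 6 (TRUE counts as 1, FALSE as 0)
n_tall <- sum(heights > 6)
n_tall
## [1] 2
```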

Testing conditions

if (heights[1] > 6) cat("Large\n") else cat("Small\n")

## Small

# Can spread this over lines
if (heights[6] > 6) 
  cat("Large\n") else
  cat("Small\n")

## Large

# Can do multiple instructions
if (heights[1] < 6) {
  cat('Number is very small\n')
  cat('Here are all the heights in order',sort(heights))
}

## Number is very small
## Here are all the heights in order 2.5 3.8 4.4 4.7 5.3 6.2 7.1
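More than two outcomes can be handled by chaining else if. The thresholds and the variable names h and label below are assumptions made for this sketch only.

```r
heights <- c(4.7, 5.3, 6.2, 3.8, 4.4, 7.1, 2.5)
h <- heights[3]   # 6.2

# Conditions are tested in order; the first TRUE one wins
if (h > 7) {
  label <- "very tall"
} else if (h > 6) {
  label <- "tall"
} else {
  label <- "short"
}
cat(label, "\n")
```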

Repetition - Loops 1

N <- length(heights)
for (i in 1:N) {
    if (heights[i] > 6) {
       cat(" Tree", i, "- tall.")
       } else {
       cat(" Tree", i, "- short.")
       }
  }

##  Tree 1 - short. Tree 2 - short. Tree 3 - tall. Tree 4 - short. Tree 5 - short. Tree 6 - tall. Tree 7 - short.

Deterministic loop - you know in advance how many cycles there will be (N here)
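For comparison, the same labelling can be sketched without an explicit loop by using the vectorised ifelse function. This is offered as an alternative illustration, not as a replacement for the loop shown above; the name labels is hypothetical.

```r
heights <- c(4.7, 5.3, 6.2, 3.8, 4.4, 7.1, 2.5)

# ifelse tests every element at once and returns one label per element
labels <- ifelse(heights > 6, "tall", "short")

# seq_along(heights) gives 1, 2, ..., length(heights)
cat(paste0(" Tree ", seq_along(heights), " - ", labels, "."), sep = "")
```

seq_along is often preferred to 1:N because it yields an empty sequence when the vector is empty, whereas 1:0 counts down from 1 to 0.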

Repetition - Loops 2

z <- 1
count <- 0
repeat {
  if (z > 100) break
  z <- z * 2
  count <- count + 1
  cat(z,' ')
}

## 2  4  8  16  32  64  128

cat('\n It took',count,' doubles to exceed 100.')

## 
##  It took 7  doubles to exceed 100.

Non-deterministic loop - you don't know in advance how many cycles there will be
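The same doubling process can also be sketched with a while loop, which tests its condition before each cycle instead of relying on break:

```r
z <- 1
count <- 0
# Keep doubling for as long as z has not yet exceeded 100
while (z <= 100) {
  z <- z * 2
  count <- count + 1
}
count
## [1] 7
```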

Basic Types of Data

Sometimes called levels of measurement
Continuous (numerical)
- ratio data (has a well-defined zero), e.g. height
- interval data (no well-defined zero), e.g. date
Categorical (factor)
- Ordinal data (rankable), e.g. 'strongly disagree', …, 'strongly agree'
- Nominal data (no intrinsic order), e.g. 'red', 'blue', 'yellow'
Influences the type of analysis and visualisation
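As a small illustration of how the data type steers the analysis: a mean is a sensible summary for continuous data, whereas categorical data are usually summarised by counts. The values and variable names below are made up for this sketch.

```r
height <- c(4.7, 5.3, 6.2)                  # continuous (ratio) data
colour <- factor(c("red", "blue", "red"))   # nominal data

mean(height)    # an average is meaningful for continuous data
table(colour)   # for categories, count how often each one occurs

# Interval data: differences are meaningful even without a true zero
as.Date("2020-01-10") - as.Date("2020-01-01")
```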

Factors in R

x <- c("MacOS","Linux","Windows","Windows","MacOS")
xf <- factor(x)
xf # Nominal

## [1] MacOS   Linux   Windows Windows MacOS  
## Levels: Linux MacOS Windows

y <- c("good","very good","good","average","poor","very poor","average")
yf <- factor(y,levels=c("very poor","poor","average","good","very good"),ordered=TRUE)
yf # Ordinal

## [1] good      very good good      average   poor      very poor average  
## Levels: very poor < poor < average < good < very good

yf[1] > yf[3]

## [1] FALSE

xf[1] > xf[3]

## Warning in Ops.factor(xf[1], xf[3]): '>' not meaningful for factors

## [1] NA
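A short sketch of what an ordered factor stores: each value is held as an integer code pointing into the list of levels, which is why ordered comparisons work. The factor yf is rebuilt here so that the block runs on its own.

```r
y <- c("good", "very good", "good", "average", "poor", "very poor", "average")
yf <- factor(y,
             levels = c("very poor", "poor", "average", "good", "very good"),
             ordered = TRUE)

levels(yf)        # the categories, in their declared order
as.integer(yf)    # position of each value within levels(yf)
## [1] 4 5 4 3 2 1 3

yf >= "good"      # ordered factors can be compared with a level name
table(yf)         # counts per level, reported in level order
```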

Graphics

R has a rich range of tools for visualising data
Visualisation is often an initial step in data exploration
Here are some simple examples - first a demonstration data set

head(mtcars,n=6)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

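mtcars is a built-in data frame. Note that all of its columns are numeric, although some (such as cyl, gear and carb) hold only whole-number values, and the car names are stored as row names rather than as a column. We can use some simple exploratory visualisations…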
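Before plotting, it can help to inspect how a data frame is stored. A quick sketch using standard base R functions:

```r
dim(mtcars)                  # number of rows and columns
sapply(mtcars, class)        # storage class of each column
head(rownames(mtcars), 3)    # the car names live in the row names
str(mtcars)                  # compact overview of the whole data frame
```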

Histogram

hist(mtcars$mpg,main="Fuel Consumption")

Boxplot

boxplot(mtcars$mpg,main="Fuel Consumption")

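Shows the median (central line), the quartiles (box edges), and whiskers reaching to the most extreme values within 1.5 times the interquartile range of the box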
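The numbers behind a boxplot can also be obtained directly. As a sketch, fivenum returns the minimum, lower hinge, median, upper hinge and maximum; the name five is hypothetical.

```r
five <- fivenum(mtcars$mpg)   # min, lower hinge, median, upper hinge, max
five

median(mtcars$mpg)            # matches the third element of five
summary(mtcars$mpg)           # a similar summary that also reports the mean
```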

Boxplot: Comparing Categories

boxplot(mpg~cyl,main='Fuel Consumption',xlab='Cylinders',data=mtcars)

Boxplot: Nicer Horizontally

boxplot(mpg~cyl,main='Fuel Consumption',xlab='Cylinders',data=mtcars, horizontal=TRUE)

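Note that boxplot also flags outliers: by default, points lying more than 1.5 times the interquartile range beyond the box are drawn individually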
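The points a boxplot would draw as outliers can be listed with boxplot.stats, applied here to each cylinder group separately. This is a sketch; the name out_by_cyl is hypothetical.

```r
# Split mpg into one vector per cylinder count, then extract the
# values that fall beyond the whiskers in each group
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
out_by_cyl <- lapply(mpg_by_cyl, function(v) boxplot.stats(v)$out)
out_by_cyl
```

A group with no outliers returns an empty numeric vector.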

Relationships: Scatterplot

plot(mtcars$wt,mtcars$mpg,main="MPG vs Weight")

Heavier cars tend to have poorer fuel economy (fewer miles per gallon)
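One way to put a number on this relationship is a correlation coefficient, with a fitted straight line added to the scatterplot. This is a sketch under the assumption that a linear trend is a reasonable first description; the names r and fit are hypothetical.

```r
# Correlation between weight and mpg; a negative value means
# mpg tends to fall as weight rises
r <- cor(mtcars$wt, mtcars$mpg)
r

# Fit a straight line, mpg = a + b * wt, and draw it on the scatterplot
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

plot(mtcars$wt, mtcars$mpg, main = "MPG vs Weight")
abline(fit)
```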

Space and Time Data

There is an equally rich variety of approaches for handling spatial and temporal data in R
These will be introduced over the next few weeks
Data sets
- Rainfall records 1850 - 2014
- Crime data