Part 1: Introduction and getting started

The aim of this session is to provide you with the basic tools to get started with R. If you have already worked through the materials that were provided in advance of this training (the Introduction to R: basic tools), then this session should be a useful refresher. If this is the first time that you have used R / RStudio then shame on you for not doing your homework, but you should still be able to get up to speed.

This session will reinforce some of what was covered in the Introduction to R: basic tools and cover a number of specific but key activities:

R packages
saving workspaces and projects
using scripts
loading data
simple visual data exploration (basic boxplots, histograms, scatter plots)
summary statistics

1. Why use R

R was initially developed by Robert Gentleman and Ross Ihaka of the Department of Statistics at the University of Auckland. R is increasingly becoming the default software package in many areas of science. There are a number of reasons for this:

R includes a very large number of tools, functions and packages
- packages are libraries of tools
- packages are written by scientists in different subject areas
R has the latest methods and tools
New tools are often in R 10-20 years before commercial software
- e.g. geographically weighted regression (GWR) was around for 10 years before being in ArcGIS
The tools in R are open (i.e. the source code is visible)
- you can see exactly what is being done
- most commercial tools are hidden - black boxes
R is free!

For these reasons R is becoming widely used in many areas of scientific activity and quantitative research.

R can be downloaded from the CRAN website.

2. Working in R

There are 2 key points about working in R.

When working in R, either writing your own code or copying and pasting from these materials, you should write the code into a script or document. Go to File > New File > R Script to open a new R file.

The reason for this is that you get used to using the R console, and running the code yourself will help you understand what it does. To run the code in the R console, a quick way is to highlight it (with the mouse or using the keyboard controls) and then press Ctrl-Enter (Ctrl-R in the standard Windows R editor) or Cmd-Enter on a Mac.

Learning R is like learning to drive. You may pass your test, but to become a good driver it is time behind the wheel that counts. The importance of learning by doing and getting your hands dirty cannot be overstated. Some of the code might look a bit fearsome when first viewed, especially in later sessions, BUT the only really effective way to understand it is to give it a try.

A further minor point is that comments in the code are prefixed by # and are ignored by R when entered into the console.

If you have worked your way through the Introduction to R: basic tools then you will have come across a few things that will be re-capped here:

Assignment: this is the basic process of giving R objects values

vals <- c(4.3,7.1,6.3,5.2,3.2,2.1)

Operations: having assigned values to an object, they can be manipulated

vals*2

## [1]  8.6 14.2 12.6 10.4  6.4  4.2

sum(vals)

## [1] 28.2

mean(vals)

## [1] 4.7

Indexing: individual elements of R objects with multiple data elements can be referred to:

vals[1]    # first element

## [1] 4.3

vals[1:3]   # a subset of elements 1 to 3

## [1] 4.3 7.1 6.3

sqrt(vals[1:3]) #square roots of the subset

## [1] 2.073644 2.664583 2.509980

vals[c(5,3,2)]  # a subset of elements 5,3,2 - note the ordering

## [1] 3.2 6.3 7.1
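Two further forms of indexing are often useful alongside those above: negative indices, which drop elements, and logical conditions, which keep the elements for which the condition is TRUE. The lines below are a brief sketch rather than part of the original materials; vals is re-created so that the code runs on its own, and big is just an illustrative name.

```r
vals <- c(4.3, 7.1, 6.3, 5.2, 3.2, 2.1)

vals[-1]          # everything except the first element
vals[vals > 5]    # only the elements greater than 5
which(vals > 5)   # the positions of those elements

# the result of an indexing operation can be assigned to a new object
big <- vals[vals > 5]
big
```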

There are many different data types in R: character, logical, integer etc - too many to cover here.
There are many different data classes in R: Vectors, Matrices, Factors, Lists
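To make the distinction between types and classes a little more concrete, the sketch below creates one small object of each kind mentioned above and inspects it. The object names are made up for illustration, and the output of class() for a matrix can differ between R versions, so no expected output is shown.

```r
num <- c(4.3, 7.1, 6.3)        # numeric vector
int <- 1:3                     # integer vector
chr <- c("a", "b", "c")        # character vector
lgl <- c(TRUE, FALSE, TRUE)    # logical vector

class(num); class(int); class(chr); class(lgl)

m <- matrix(1:6, nrow = 2)               # matrix: 2 rows, 3 columns
f <- factor(c("low", "high", "low"))     # factor: categories stored as levels
l <- list(values = num, label = "demo")  # list: elements can differ in type

class(m)
levels(f)
names(l)
```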

2.2 R packages

When you install R / RStudio it comes with a large number of tools already (referred to as base functionality).

However, one of the joys of R is the community of users. Users share what they do and create in R in a number of ways. One of these is through packages. Packages are collections of related functions that have been created, tested and supported with help files. These are bundled into a package and shared with other R users, who can download them from the CRAN repository.

There are 1000s of packages in R. These contain sets of tools and can be written by anyone. The number of packages is continually growing. Once packages are installed, they can be called as libraries. The background to R, along with documentation and information about packages as well as the contributors, can be found at the R Project website http://www.r-project.org.

Packages can be found at the CRAN website: https://cran.r-project.org/web/packages/

Users install the package once to mount it on their computer, and then it can be called in R scripts as required.

The basic operations are

install the package before the first time it is used
- you may have to set a mirror site the first time you install a package
- this is only done once
load the package using the library function to use the package tools
- this is done for each R session

So a typical way to do this is

install.packages("<package name>", dep = TRUE)

Replacing "<package name>" with "tidyverse" installs the tidyverse package and all of its dependencies (dep = TRUE). You should install the tidyverse package and load it using the library function, as in the code snippets below. Note the messages telling you which dependency packages are also installed. The tidyverse - see https://www.tidyverse.org - is a collection of R packages designed for data science. We all live in the tidyverse now…!

install.packages("tidyverse", dep = TRUE)

After you have responded to the request from R / RStudio to set a mirror (a site from which to download the package - pick the nearest one!), the package can be loaded at the top of an R script

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
## ✔ tibble  1.4.1          ✔ dplyr   0.7.4     
## ✔ tidyr   0.7.2          ✔ stringr 1.2.0     
## ✔ readr   1.1.1          ✔ forcats 0.2.0

## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::vars()   masks ggplot2::vars()

Also note that packages can also be installed using the RStudio menu: Tools > Install Packages…

I prefer the command line option - if several packages need to be loaded they can be stored in a script and passed on to other users.

Packages can also be installed from other sources such as GitHub or Bioconductor.
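One way to make a shared script friendlier is to check which of the packages it needs are not yet installed before trying to load them. The sketch below is only an illustration: missing_packages is a hypothetical helper written for this example (it is not part of R or of any package), and the install line is left commented out because it needs an internet connection.

```r
# packages that this script needs (names taken from these materials)
pkgs <- c("tidyverse", "repmis")

# hypothetical helper: return the packages in 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
to_install

# uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

Note that requireNamespace() only checks that a package is available; library() is still needed to attach it for use in the session.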

3. Loading data

There are a number of ways of loading data into your R session:

read a local file in a common tabular or proprietary format (e.g. a .csv file or an Excel file)
read a local R formatted binary file (typically with an .rda or .RData extension)
download and manipulate data from the internet (we are not going to do that in this workshop)
read a file from somewhere on the internet (proprietary or R binary format) - we may do some of that

3.1 Loading data from your computer

Loading `.csv` format data

The base installation of R comes with core functions for reading .txt, .csv and other tabular formats. To load data from local files you need to point R / RStudio to the directory that contains the local file. One way is to use the setwd() function, as below:

## Mac
setwd("/Users/geoaco/Desktop/")
## Windows
setwd("C:\\")

Another is to use the menu system

Session > Set Working Directory … which gives you options to choose from.
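Whichever route you take, it can help to confirm where R is looking and whether the file you expect is actually there. The sketch below uses tempdir() purely as a stand-in for your own data folder (an assumption made so that the example runs anywhere), and restores the original working directory at the end.

```r
# setwd() returns the previous working directory, so it can be restored later
old_wd <- setwd(tempdir())        # tempdir() stands in for your data folder

getwd()                           # confirm where R is now looking
list.files(pattern = "\\.csv$")   # list any .csv files in this folder
file.exists("census_data.csv")    # TRUE only if the file is in this folder

setwd(old_wd)                     # return to the original directory
```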

However you do it you should now set your working directory to the folder that contains the file census_data.csv and run the code below to load the data:

cen.dat <- read.csv("census_data.csv")

This loads a data table of 40 rows and 41 columns. These are Unitary Authorities (local government) in the East Midlands region in the UK. You can inspect the data in a few ways.

## dimensions - rows and columns
dim(cen.dat)

## [1] 40 41

## column / variable names
names(cen.dat)

##  [1] "Name"                          "Code"                         
##  [3] "OnePerson"                     "MarriedDependents"            
##  [5] "MarriedNoDependent"            "SameSexPartnershipDependent"  
##  [7] "SameSexPartnershipNoDependent" "CohabitingCoupleDependent"    
##  [9] "CohabitingCoupleNoDependent"   "LoneParentDependent"          
## [11] "LoneParentNoDependent"         "MultiPersonFTStudents"        
## [13] "MultiPersonOther"              "NotDeprived"                  
## [15] "Deprived1"                     "Deprived2"                    
## [17] "Deprived3"                     "Deprived4"                    
## [19] "White"                         "Mixed"                        
## [21] "Asian"                         "Black"                        
## [23] "Other"                         "Owned"                        
## [25] "SharedOwned"                   "SocialRented"                 
## [27] "PrivateRented"                 "LivingRentFree"               
## [29] "ManagersDirectors"             "Professionals"                
## [31] "AssociateProfs"                "Administrative"               
## [33] "SkilledTrades"                 "CaringLeisureService"         
## [35] "Sales"                         "PlantMachine"                 
## [37] "Elementary"                    "SocialGrade_AB"               
## [39] "SocialGrade_C1"                "SocialGrade_C2"               
## [41] "SocialGrade_DE"

## look at the first 6 rows and 7 columns
cen.dat[1:6,1:7]

##           Name      Code OnePerson MarriedDependents MarriedNoDependent
## 1 Amber Valley E07000032 0.2786904         0.1578637          0.3167541
## 2     Ashfield E07000170 0.2803401         0.1477489          0.2985412
## 3    Bassetlaw E07000171 0.2804246         0.1547821          0.3246691
## 4        Blaby E07000129 0.2609471         0.1852350          0.3239932
## 5     Bolsover E07000033 0.2902655         0.1476479          0.3033444
## 6       Boston E07000136 0.2812649         0.1416584          0.3171742
##   SameSexPartnershipDependent SameSexPartnershipNoDependent
## 1                 0.000209141                   0.001483003
## 2                 0.000137441                   0.001178064
## 3                 0.000104894                   0.001111880
## 4                 0.000129246                   0.001318306
## 5                 0.000213408                   0.001128014
## 6                 0.000109926                   0.000989337

## use the summary function for the first 7 columns
summary(cen.dat[,1:7])

##            Name           Code      OnePerson      MarriedDependents
##  Amber Valley: 1   E06000015: 1   Min.   :0.2380   Min.   :0.1149   
##  Ashfield    : 1   E06000016: 1   1st Qu.:0.2693   1st Qu.:0.1517   
##  Bassetlaw   : 1   E06000017: 1   Median :0.2805   Median :0.1610   
##  Blaby       : 1   E06000018: 1   Mean   :0.2827   Mean   :0.1639   
##  Bolsover    : 1   E07000032: 1   3rd Qu.:0.2916   3rd Qu.:0.1773   
##  Boston      : 1   E07000033: 1   Max.   :0.3617   Max.   :0.2142   
##  (Other)     :34   (Other)  :34                                     
##  MarriedNoDependent SameSexPartnershipDependent
##  Min.   :0.1729     Min.   :0.000e+00          
##  1st Qu.:0.2958     1st Qu.:9.705e-05          
##  Median :0.3170     Median :1.231e-04          
##  Mean   :0.3071     Mean   :1.307e-04          
##  3rd Qu.:0.3316     3rd Qu.:1.681e-04          
##  Max.   :0.3662     Max.   :2.990e-04          
##                                                
##  SameSexPartnershipNoDependent
##  Min.   :0.0004686            
##  1st Qu.:0.0009863            
##  Median :0.0011122            
##  Mean   :0.0011189            
##  3rd Qu.:0.0013016            
##  Max.   :0.0016589            
##

The default for read.csv is that the file has a header (i.e. the first row contains the names of the columns) and that the separator between values in any record is a comma. However, these can be changed depending on the nature of the file you are seeking to load into R. A number of different types of files can be read into R. You should examine the help files for reading data in different formats. Enter ??read to see some of these listed. You will note that read.table and write.table require more parameters to be specified than read.csv and write.csv.
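As a small illustration of changing these defaults, the sketch below writes a tiny semicolon-separated file to a temporary location (the file and its contents are made up for this example) and reads it back, first with read.csv and then with the more general read.table.

```r
# write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;value", "a;1.5", "b;2.5"), tmp)

# read.csv with the separator changed from the default comma
d1 <- read.csv(tmp, header = TRUE, sep = ";")

# the read.table call: header and sep have to be given explicitly
d2 <- read.table(tmp, header = TRUE, sep = ";")

d1
all.equal(d1, d2)   # compare the two results
```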

Load R binary files

You can also load R binary files. These have the advantage of being very efficient at storing data and are quicker to load than, for example, .csv files.

load("census.rda")
## use ls to see what is loaded
ls()

You should see that a variable called census has been loaded. It is the same as the data read into cen.dat. You can explore the census R object if you want to, using the functions above that were applied to cen.dat.

3.2 Loading remote files

You can use the read.csv, read.table and load functions to read data directly from a URL.

# the connection is named con so that it does not mask the url() function
con <- url("http://www.people.fas.harvard.edu/~zhukov/Datasets.RData")
load(con)
ls()

The repmis package allows you to store R binary data on Github and make it accessible to your scripts (note the .RData format):

library(repmis)
# load the data 
source_data("https://github.com/lexcomber/CAS_GW_Training/blob/master/Liudaogou.RData?raw=True")

3.3 Other data formats

As you work with R you will want to use all kinds of different data formats - from different flavours of data table (.csv, Excel, SPSS) to explicitly geographical data such as shapefiles and rasters. These can all be loaded directly into R using functions from different packages. There are too many to cover comprehensively. But generally if there is a data format out there, there is also a tool to get it into R!

The foreign package can be used to load many file types (e.g. SPSS and Stata), and a number of different approaches for reading Excel files are listed here: https://www.r-bloggers.com/read-excel-files-from-r/

4. Saving data

4.1 CSV Files

Data can be written into a comma-separated values (.csv) file using the write.csv function, and later read back into a different variable using read.csv. To write the file:

write.csv(census, file = "new_census.csv")

This writes a .csv file into the current working directory. If you open it using a text editor or spreadsheet software, you will see that it has the expected columns plus an index for each record. This is because the default for write.csv is row.names = TRUE. Again, examine the help file for this function. The index can be omitted by setting row.names = FALSE:

write.csv(census, file = "test.csv", row.names = FALSE)
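The effect of row.names is easiest to see by writing a file both ways and reading each one back. The sketch below uses a small made-up data frame and temporary files rather than the census data, so that it can be run on its own.

```r
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(df, file = f1)                      # default: row names are written
write.csv(df, file = f2, row.names = FALSE)   # row names are omitted

names(read.csv(f1))   # an extra first column, named "X" by read.csv
names(read.csv(f2))   # just the original columns
```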

4.2 R Data files

It is possible to save variables that are in your workspace to a designated .rda or .RData file. This can be loaded at the start of your next session. Saving your workspace saves everything that is present in your workspace - as listed by ls() - whilst the save command allows you to specify what variables you wish to save.

There are a number of ways to do this:

You can save the workspace using the drop down menus Session > Save Workspace As…
You can save the workspace using the save.image function

save.image(file = "myworkspace.RData")

You can save individual elements using the save function

# this will save everything in the workspace
save(list = ls(), file = "MyData.RData")
# this will save just census
save(list = "census", file = "MyData.RData")
# this will save vals and census
save(list = c("census", "vals"), file = "MyData.RData")
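A quick way to convince yourself that save and load work as a pair is to save an object, remove it and then load it back. The sketch below does this with vals and a temporary file (the file location is an assumption made so that nothing is written to your working directory).

```r
vals <- c(4.3, 7.1, 6.3, 5.2, 3.2, 2.1)
f <- tempfile(fileext = ".RData")

save(list = "vals", file = f)   # save one named object
rm(vals)                        # remove it from the workspace

loaded <- load(f)               # load() returns the names of what it restored
loaded
vals                            # the object is back, with its original name
```

Note that load restores objects under the names they were saved with, so it can silently overwrite an object of the same name already in the workspace.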

Part 2: Working with data

5. Simple visual data explorations

In the previous sections you loaded a number of data tables. To make sure you have the right data you should clear your workspace and re-load the following data. To clear your workspace either go to Session > Clear Workspace… or enter

rm(list = ls())

Load the census attributes again for the UAs in the English East Midlands

load("census.rda")

The data are all proportions (they are multiplied by 100 in the plotting code below to give percentages) and there are clear groups related to

Household type (columns 3 to 13)
Household deprivation dimensions 0 to 4 (columns 14 to 18)
Ethnic group (columns 19 to 23)
Tenure (columns 24 to 28)
Occupation type (columns 29 to 37)
Social Grade (columns 38 to 41)

The simplest way of summarising the data is to examine the distributions within groups. For example the code below uses the summary() function to look at the distribution of Household deprivation:

summary(census[,14:18])

##   NotDeprived       Deprived1        Deprived2        Deprived3      
##  Min.   :0.3266   Min.   :0.2933   Min.   :0.1288   Min.   :0.01954  
##  1st Qu.:0.4001   1st Qu.:0.3136   1st Qu.:0.1651   1st Qu.:0.03132  
##  Median :0.4480   Median :0.3206   Median :0.1877   Median :0.04025  
##  Mean   :0.4405   Mean   :0.3228   Mean   :0.1903   Mean   :0.04316  
##  3rd Qu.:0.4833   3rd Qu.:0.3287   3rd Qu.:0.2130   3rd Qu.:0.05735  
##  Max.   :0.5482   Max.   :0.3636   Max.   :0.2604   Max.   :0.08053  
##    Deprived4        
##  Min.   :0.0008666  
##  1st Qu.:0.0021438  
##  Median :0.0028259  
##  Mean   :0.0033045  
##  3rd Qu.:0.0041442  
##  Max.   :0.0101442

The hist function can be used to visually summarise the distributions of values amongst the UAs. Try running the code below:

## set the plot parameters
# 1 row and 4 columns 
par(mfrow = c(1,4))
hist(100*census$OnePerson, xlim = c(0,100),
  xlab = "% of Households", main = "One person households", col = "red")
hist(100*census$MarriedNoDependent, xlim = c(0,100),
  xlab = "% of Households", main = "Married, no dependents", col = "#FFFFBF")
hist(100*census$CohabitingCoupleNoDependent, xlim = c(0,100),
  xlab = "% of Households", main = "Cohabiting, no dependents", border = "#FDAE61")
hist(100*census$LoneParentDependent, xlim = c(0,100),
  xlab = "% of Households", main = "Lone parent with dependents", border = "cyan")

## reset par
par(mfrow = c(1,1))

The boxplot() function also provides a useful way of summarising data:

boxplot(100*census$PrivateRented, main = "The distribution of Private Rented Households", ylab="% of all Households")

But it can also be used to explore the interaction between variables and different factors. For the sake of argument, the code below labels each UA with its dominant social grade class

census$social.class <- names(census[38:41])[apply(census[, 38:41], 1, which.max)]
census$social.class <- sub("SocialGrade_","",census$social.class)
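The two lines above pack several steps together. The sketch below unpacks the same idea on a tiny made-up table (the values are invented and only three grades are included), so that each step can be inspected in turn.

```r
# toy proportions for three areas (made-up values, not the census data)
toy <- data.frame(SocialGrade_AB = c(0.40, 0.20, 0.25),
                  SocialGrade_C1 = c(0.30, 0.50, 0.25),
                  SocialGrade_DE = c(0.30, 0.30, 0.50))

# which.max gives, for each row, the position of the largest value
idx <- apply(toy, 1, which.max)
idx

# use those positions to look up the column names, then strip the prefix
dominant <- sub("SocialGrade_", "", names(toy)[idx])
dominant
```

If two columns tie for the largest value in a row, which.max returns the first of them.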

Now we can use this to explore the distribution of different types of housing by dominant social grade. Note the different parameters that are being passed to boxplot. You should explore these in your own time.

par(mar = c(3,8,3,3))
# plot 1
boxplot(100*census$Owned~census$social.class,
  horizontal = TRUE, outline = FALSE, lwd = 0.5, las=1,
  col = c("#D7191C","#FFFFBF", "#2B83BA"),
  main="% of Owned property by dominant social grades")

# plot 2
boxplot(100*census$SocialRented~census$social.class,
  horizontal = TRUE, outline = FALSE, lwd = 0.5, las=1,
  col = c("#D7191C","#FFFFBF", "#2B83BA"),
  main="% of Social Rented property by dominant social grades")

We may also need to be aware of the ecological fallacy here - the dominant social grade may not be dominant by that much - consider a 30:30:40 split. So the 10% or so households in social housing from areas with dominant social group AB may not actually be AB households themselves.
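One simple way to see how dominant a "dominant" class really is would be to compare the largest share in each area with the second largest. The sketch below does this for three made-up areas, the first being the 30:30:40 split mentioned above; it is an illustration of the idea, not something applied to the census data in these materials.

```r
# made-up shares for three areas; the first is the 30:30:40 case
shares <- rbind(c(0.30, 0.30, 0.40),
                c(0.10, 0.15, 0.75),
                c(0.45, 0.35, 0.20))

# largest share in each row, and its lead over the second-largest share
top    <- apply(shares, 1, max)
second <- apply(shares, 1, function(x) sort(x, decreasing = TRUE)[2])
margin <- top - second

round(cbind(top, second, margin), 2)
```

A small margin suggests that the dominant label should be read with caution for that area.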

Finally it is possible to examine the correlations between data using the cor function:

round(cor(census[, c(24:28, 38:41)]), 3)

##                 Owned SharedOwned SocialRented PrivateRented
## Owned           1.000       0.050       -0.913        -0.879
## SharedOwned     0.050       1.000       -0.244         0.130
## SocialRented   -0.913      -0.244        1.000         0.611
## PrivateRented  -0.879       0.130        0.611         1.000
## LivingRentFree -0.223       0.264        0.060         0.297
## SocialGrade_AB  0.610       0.322       -0.663        -0.431
## SocialGrade_C1  0.306       0.211       -0.395        -0.116
## SocialGrade_C2  0.265      -0.222       -0.118        -0.386
## SocialGrade_DE -0.761      -0.284        0.786         0.576
##                LivingRentFree SocialGrade_AB SocialGrade_C1 SocialGrade_C2
## Owned                  -0.223          0.610          0.306          0.265
## SharedOwned             0.264          0.322          0.211         -0.222
## SocialRented            0.060         -0.663         -0.395         -0.118
## PrivateRented           0.297         -0.431         -0.116         -0.386
## LivingRentFree          1.000          0.057         -0.311          0.080
## SocialGrade_AB          0.057          1.000          0.572         -0.477
## SocialGrade_C1         -0.311          0.572          1.000         -0.516
## SocialGrade_C2          0.080         -0.477         -0.516          1.000
## SocialGrade_DE          0.034         -0.930         -0.690          0.242
##                SocialGrade_DE
## Owned                  -0.761
## SharedOwned            -0.284
## SocialRented            0.786
## PrivateRented           0.576
## LivingRentFree          0.034
## SocialGrade_AB         -0.930
## SocialGrade_C1         -0.690
## SocialGrade_C2          0.242
## SocialGrade_DE          1.000

Here we can see that there are some very strong negative and positive associations between the proportions (percentages) of households under different types of tenure and social grade.

Notice that the c(24:28, 38:41) was used to select the data columns to compare. The round function was used to limit the number of decimal places of the output. The names(census) command can help decide which variables to consider.
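Columns can also be selected by name rather than by position, which is often easier to read and less fragile if the column order changes. The sketch below shows this with a small made-up data frame; the column names echo those in the census data but the values are invented.

```r
# toy tenure proportions for four areas (made-up values)
toy <- data.frame(Owned        = c(0.70, 0.60, 0.50, 0.65),
                  SocialRented = c(0.10, 0.20, 0.30, 0.12),
                  Other        = c(0.20, 0.20, 0.20, 0.23))

# select columns by name rather than by position
vars <- c("Owned", "SocialRented")
cm <- round(cor(toy[, vars]), 3)
cm
```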

We can examine these in further detail using the plot() function. This plots all of the variables against each other and is useful for displaying how all variables correlate.

plot(census[,c(24:28, 38:41)], cex = .6)

It is clear that there are some very strong trends between variables here that we may be interested in for our analysis.

Summary

This section has provided some rubrics for working in R (scripts, etc.), shown how to get data in and out of R, and developed some simple visualisations of data.

In later sessions we will expand on this using the ggplot2 package. This comes with the tidyverse and, although it involves a bit of a learning curve, it produces excellent visualisations and is extremely controllable. There is plenty of help and advice on the internet and in core textbooks, as well as the information provided here.

Day 1 (AM): Getting Started

Lex Comber and Chris Brunsdon

July 2018

End of Session 1
