The aim of this session is to provide you with the basic tools to get you started with R. If you have already worked through the materials that were provided to you in advance of this training (the Introduction to R: basic tools), then this session should be a useful refresher. If this is the first time that you have used R / RStudio then shame on you for not doing your homework, but you should be able to get up to speed.
This session will reinforce some of what was covered in the Introduction to R: basic tools and cover a number of specific but key activities:
R was initially developed by Robert Gentleman and Ross Ihaka of the Department of Statistics at the University of Auckland. R is increasingly becoming the default software package in many areas of science. There are a number of reasons for this:
For these reasons R is becoming widely used in many areas of scientific activity and quantitative research.
R can be found at the CRAN website:
There are 2 key points about working in R
The reason for this is that you get used to using the R console, and running the code will help your understanding of the code's functionality. To run code in the R console, a quick way to enter it is to highlight the code (with the mouse or using the keyboard controls) and then press ctrl-R, or cmd-Enter on a Mac.
A further minor point is that comments in the code are prefixed by # and are ignored by R when entered into the console.
If you have worked your way through the Introduction to R: basic tools then you will have come across a few things that will be re-capped here:
vals <- c(4.3,7.1,6.3,5.2,3.2,2.1)
vals*2
## [1] 8.6 14.2 12.6 10.4 6.4 4.2
sum(vals)
## [1] 28.2
mean(vals)
## [1] 4.7
vals[1] # first element
## [1] 4.3
vals[1:3] # a subset of elements 1 to 3
## [1] 4.3 7.1 6.3
sqrt(vals[1:3]) #square roots of the subset
## [1] 2.073644 2.664583 2.509980
vals[c(5,3,2)] # a subset of elements 5,3,2 - note the ordering
## [1] 3.2 6.3 7.1
There are many different data types in R: character, logical, integer etc - too many to cover here.
There are many different data classes in R: Vectors, Matrices, Factors, Lists
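As a rough illustration of the classes listed above, the sketch below builds one small object of each kind and inspects it. The object names and values are made up for this example and are not part of the course data.

```r
# Illustrative objects only; names and values are invented for this sketch
v <- c(4.3, 7.1, 6.3)                      # numeric vector
m <- matrix(1:6, nrow = 2)                 # 2 x 3 matrix of integers
f <- factor(c("AB", "C1", "AB"))           # factor with two levels
l <- list(values = v, label = "example")   # list holding mixed types

class(v)        # class of the vector
is.matrix(m)    # TRUE for a matrix
typeof(m)       # the underlying data type of the matrix elements
levels(f)       # the distinct categories in the factor
class(l)        # class of the list
```

The class() function describes how R treats an object, while typeof() reports how its values are stored, which is one way to see the difference between a data class and a data type.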
When you install R / RStudio it comes with a large number of tools already (referred to as base functionality).
However, one of the joys of R is the community of users. Users share what they do and create in R in a number of ways. One of these is through packages. Packages are collections of related functions that have been created, tested and supported with help files. These are bundled into a package and shared with other R users, who can download it from the CRAN repository.
There are 1000s of packages in R. These contain sets of tools and can be written by anyone. The number of packages is continually growing. When packages are installed these can be called as libraries. The background to R, along with documentation and information about packages as well as the contributors, can be found at the R Project website http://www.r-project.org.
Packages can be found at the CRAN website - https://cran.r-project.org/web/packages/:
Users install the package once to mount it on their computer, and then it can be called in R scripts as required.
The basic operations are to use the install.packages function to install the package, and then the library function to use the package tools.
So a typical way to do this is
install.packages("<package name>", dep = TRUE)
Replacing "<package name>" with "tidyverse" installs the tidyverse package and all of its dependencies (dep = TRUE). You should install the tidyverse package and load it using the library function as in the code snippets below. Note the messages telling you which dependency packages are also installed. The tidyverse - see https://www.tidyverse.org - is a collection of R packages designed for data science. We all live in the tidyverse now…!
install.packages("tidyverse", dep = TRUE)
After you have responded to the request from R / RStudio to set a mirror (a site from which to download the package; pick the nearest one!), the package can be loaded at the top of an R script
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.1 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::vars() masks ggplot2::vars()
Also note that packages can also be downloaded using the RStudio menu: Tools > Install Packages…
I prefer the command line option - if several packages need to be loaded they can be stored in a script, and passed on to other users.
Packages can also be installed from other sources such as GitHub or Bioconductor.
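One way to keep package set-up in a script, so that it can be passed on to other users, is sketched below. The helper name missing_pkgs is hypothetical (it is not a base R function), and the example uses the stats and utils packages, which ship with R, so that running it does not download anything. In your own script you would list the packages you actually need, such as "tidyverse".

```r
# Hypothetical helper: return the packages in pkgs that are not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages this script needs; these two come with R, so nothing is installed here
needed <- c("stats", "utils")

# Install only what is missing
to_install <- missing_pkgs(needed)
if (length(to_install) > 0) {
  install.packages(to_install, dep = TRUE)
}

# Attach each package; character.only = TRUE lets library() accept a string
for (p in needed) {
  library(p, character.only = TRUE)
}
```

Checking before installing means the script can be re-run without re-downloading packages that are already present.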
There are a number of ways of loading data into your R session, including reading a local text file (e.g. a .csv file) and loading an R binary file (with an .rda or .RData extension).
.csv format data
The base installation of R comes with core functions for reading .txt, .csv and other tabular formats. To load data from local files you need to point R / RStudio to the directory that contains the local file. One way is to use the setwd() function as shown below
## Mac
setwd("/Users/geoaco/Desktop/")
## Windows
setwd("C:\\")
Another is to use the menu system
Session > Set Working Directory … which gives you options to choose from.
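Whichever route you use, it can help to check where R is currently pointing and whether it can see the file you expect. The sketch below uses a temporary folder and a placeholder file called demo_file.csv (both invented for this example) in place of your real data folder.

```r
old_wd <- getwd()          # remember the current working directory
demo_dir <- tempdir()      # a temporary folder standing in for your data folder
setwd(demo_dir)

# Write a tiny placeholder file, then confirm that R can see it
writeLines(c("a,b", "1,2"), "demo_file.csv")
file.exists("demo_file.csv")       # is the file in the working directory?
list.files(pattern = "\\.csv$")    # list the .csv files R can see here

setwd(old_wd)              # return to the original working directory
```

If file.exists() returns FALSE for your own file, the working directory is most likely not the folder you intended.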
However you do it you should now set your working directory to the folder that contains the file census_data.csv and run the code below to load the data:
cen.dat <- read.csv("census_data.csv")
This loads a data table of 40 rows and 41 columns. These are Unitary Authorities (local government) in the East Midlands region of England. You can inspect the data in a few ways.
## dimensions - rows and columns
dim(cen.dat)
## [1] 40 41
## column / variable names
names(cen.dat)
## [1] "Name" "Code"
## [3] "OnePerson" "MarriedDependents"
## [5] "MarriedNoDependent" "SameSexPartnershipDependent"
## [7] "SameSexPartnershipNoDependent" "CohabitingCoupleDependent"
## [9] "CohabitingCoupleNoDependent" "LoneParentDependent"
## [11] "LoneParentNoDependent" "MultiPersonFTStudents"
## [13] "MultiPersonOther" "NotDeprived"
## [15] "Deprived1" "Deprived2"
## [17] "Deprived3" "Deprived4"
## [19] "White" "Mixed"
## [21] "Asian" "Black"
## [23] "Other" "Owned"
## [25] "SharedOwned" "SocialRented"
## [27] "PrivateRented" "LivingRentFree"
## [29] "ManagersDirectors" "Professionals"
## [31] "AssociateProfs" "Administrative"
## [33] "SkilledTrades" "CaringLeisureService"
## [35] "Sales" "PlantMachine"
## [37] "Elementary" "SocialGrade_AB"
## [39] "SocialGrade_C1" "SocialGrade_C2"
## [41] "SocialGrade_DE"
## look at the first 6 rows and 7 columns
cen.dat[1:6,1:7]
## Name Code OnePerson MarriedDependents MarriedNoDependent
## 1 Amber Valley E07000032 0.2786904 0.1578637 0.3167541
## 2 Ashfield E07000170 0.2803401 0.1477489 0.2985412
## 3 Bassetlaw E07000171 0.2804246 0.1547821 0.3246691
## 4 Blaby E07000129 0.2609471 0.1852350 0.3239932
## 5 Bolsover E07000033 0.2902655 0.1476479 0.3033444
## 6 Boston E07000136 0.2812649 0.1416584 0.3171742
## SameSexPartnershipDependent SameSexPartnershipNoDependent
## 1 0.000209141 0.001483003
## 2 0.000137441 0.001178064
## 3 0.000104894 0.001111880
## 4 0.000129246 0.001318306
## 5 0.000213408 0.001128014
## 6 0.000109926 0.000989337
## use the summary function for the first 7 columns
summary(cen.dat[,1:7])
## Name Code OnePerson MarriedDependents
## Amber Valley: 1 E06000015: 1 Min. :0.2380 Min. :0.1149
## Ashfield : 1 E06000016: 1 1st Qu.:0.2693 1st Qu.:0.1517
## Bassetlaw : 1 E06000017: 1 Median :0.2805 Median :0.1610
## Blaby : 1 E06000018: 1 Mean :0.2827 Mean :0.1639
## Bolsover : 1 E07000032: 1 3rd Qu.:0.2916 3rd Qu.:0.1773
## Boston : 1 E07000033: 1 Max. :0.3617 Max. :0.2142
## (Other) :34 (Other) :34
## MarriedNoDependent SameSexPartnershipDependent
## Min. :0.1729 Min. :0.000e+00
## 1st Qu.:0.2958 1st Qu.:9.705e-05
## Median :0.3170 Median :1.231e-04
## Mean :0.3071 Mean :1.307e-04
## 3rd Qu.:0.3316 3rd Qu.:1.681e-04
## Max. :0.3662 Max. :2.990e-04
##
## SameSexPartnershipNoDependent
## Min. :0.0004686
## 1st Qu.:0.0009863
## Median :0.0011122
## Mean :0.0011189
## 3rd Qu.:0.0013016
## Max. :0.0016589
##
The default for read.csv is that the file has a header (i.e. the first row contains the names of the columns) and that the separator between values in any record is a comma. However, these can be changed depending on the nature of the file you are seeking to load into R. A number of different types of files can be read into R. You should examine the help files for reading data in different formats. Enter ??read to see some of these listed. You will note that read.table and write.table require more parameters to be specified than read.csv and write.csv.
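To make the point about defaults concrete, here is a minimal sketch that writes a small semicolon-separated file to a temporary location and reads it back both ways. The file contents are made-up values, not the census data.

```r
# Write a small semicolon-separated file to a temporary location (made-up values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name;Owned;SocialRented",
             "AreaA;0.70;0.15",
             "AreaB;0.55;0.25"), tmp)

# read.csv assumes a header and a comma separator; only the separator needs overriding
d1 <- read.csv(tmp, sep = ";")

# read.table assumes no header and whitespace separators, so both are stated explicitly
d2 <- read.table(tmp, header = TRUE, sep = ";")

dim(d1)             # rows and columns read by read.csv
all.equal(d1, d2)   # do the two approaches give the same table?
```

read.csv is essentially read.table with defaults chosen for comma-separated files, which is why it needs fewer parameters.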
You can also load R binary files. These have the advantage of being very efficient at storing data and quicker to load than, for example, .csv files.
load("census.rda")
## use ls to see what is loaded
ls()
You should see that a variable called census has been loaded. It is the same as the data read into cen.dat. You can explore the census R object, if you want to, using the functions above that were applied to cen.dat.
You can use the read.csv, read.table and load functions to read data directly from a URL.
# open a connection to the remote file (named con to avoid masking the url function)
con <- url("http://www.people.fas.harvard.edu/~zhukov/Datasets.RData")
load(con)
ls()
The repmis package allows you to store R binary data on GitHub and make it accessible to your scripts (note the .RData format):
library(repmis)
# load the data
source_data("https://github.com/lexcomber/CAS_GW_Training/blob/master/Liudaogou.RData?raw=True")
As you work with R you will want to use all kinds of different data formats - from different flavours of data table (.csv, Excel, SPSS) to explicitly geographical data such as shapefiles and rasters. These can all be loaded directly into R using functions from different packages. There are too many to cover comprehensively. But generally if there is a data format out there, there is also a tool to get it into R!
The foreign package can be used to load many file types (e.g. SPSS, Stata and SAS), and a number of different approaches for reading Excel files are listed here: https://www.r-bloggers.com/read-excel-files-from-r/
Data can be written into a Comma Separated Values file using the command write.csv and then read back into a different variable, as follows:
write.csv(census, file = "new_census.csv")
This writes a .csv file into the current working directory. If you open it using a text editor or spreadsheet software, you will see that it has the expected columns plus an index for each record. This is because the default for write.csv includes row.names = TRUE. Again, examine the help file for this function.
write.csv(census, file = "test.csv", row.names = FALSE)
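The effect of the row.names argument is easiest to see by reading both versions back in. The sketch below uses a small toy data frame with made-up values and temporary files, so it does not touch your working directory.

```r
# Toy data frame standing in for census (values are made up)
toy <- data.frame(Name = c("AreaA", "AreaB"), Owned = c(0.70, 0.55))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(toy, file = f1)                      # default: row.names = TRUE
write.csv(toy, file = f2, row.names = FALSE)   # index column suppressed

names(read.csv(f1))   # an extra first column appears, which read.csv names "X"
names(read.csv(f2))   # just the original columns
```

Suppressing row names avoids accumulating a new index column each time a table is written out and read back in.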
It is possible to save variables that are in your workspace to a designated .rda or .RData file. This can be loaded at the start of your next session. Saving your workspace saves everything that is present in your workspace - as listed by ls() - whilst the save command allows you to specify what variables you wish to save.
There are a number of ways to do this:
The save.image function:
save.image(file = "mywkspace.RData")
The save function:
# this will save everything in the workspace
save(list = ls(), file = "MyData.RData")
# this will save just census
save(list = "census", file = "MyData.RData")
# this will save vals and census
save(list = c("census", "vals"), file = "MyData.RData")
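A quick way to convince yourself that saving and loading works as described is a round trip: save some objects, remove them, and load them back. The sketch below uses invented objects (vals_demo and tab_demo) and a temporary file rather than your real workspace.

```r
# Toy objects (made up) so the example does not overwrite your real data
vals_demo <- c(4.3, 7.1, 6.3)
tab_demo <- data.frame(id = 1:3, value = vals_demo)

rda_file <- tempfile(fileext = ".RData")
save(list = c("vals_demo", "tab_demo"), file = rda_file)

# Remove the objects, then restore them from the file
rm(vals_demo, tab_demo)
restored <- load(rda_file)   # load() returns the names of the objects it restored
restored
exists("tab_demo")           # the data frame is back in the workspace
```

Note that load() restores objects under the names they were saved with, so it can silently overwrite objects of the same name in your current session.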
In the previous sections you loaded a number of data tables. To make sure you have the right data you should clear your workspace and re-load the following data. To clear your workspace either go to Session > Clear Workspace… or enter
rm(list = ls())
Load the census attributes again for the UAs in the English East Midlands
load("census.rda")
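The data are all proportions (multiplied by 100 in the plots below to give percentages) and there are clear groups of variables related to household composition (columns 3 to 13), household deprivation (14 to 18), ethnicity (19 to 23), housing tenure (24 to 28), occupation (29 to 37) and social grade (38 to 41).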
The simplest way of summarising the data is to examine the distributions within groups. For example the code below uses the summary() function to look at the distribution of Household deprivation:
summary(census[,14:18])
## NotDeprived Deprived1 Deprived2 Deprived3
## Min. :0.3266 Min. :0.2933 Min. :0.1288 Min. :0.01954
## 1st Qu.:0.4001 1st Qu.:0.3136 1st Qu.:0.1651 1st Qu.:0.03132
## Median :0.4480 Median :0.3206 Median :0.1877 Median :0.04025
## Mean :0.4405 Mean :0.3228 Mean :0.1903 Mean :0.04316
## 3rd Qu.:0.4833 3rd Qu.:0.3287 3rd Qu.:0.2130 3rd Qu.:0.05735
## Max. :0.5482 Max. :0.3636 Max. :0.2604 Max. :0.08053
## Deprived4
## Min. :0.0008666
## 1st Qu.:0.0021438
## Median :0.0028259
## Mean :0.0033045
## 3rd Qu.:0.0041442
## Max. :0.0101442
The hist function can be used to visually summarise distributions of values amongst the UAs. Try running the code below:
## set the plot parameters
# 1 row and 4 columns
par(mfrow = c(1,4))
hist(100*census$OnePerson, xlim = c(0,100),
     xlab = "% of Households", main = "One person households", col = "red")
hist(100*census$MarriedNoDependent, xlim = c(0,100),
     xlab = "% of Households", main = "Married, no dependents", col = "#FFFFBF")
hist(100*census$CohabitingCoupleNoDependent, xlim = c(0,100),
     xlab = "% of Households", main = "Cohabiting, no dependents", border = "#FDAE61")
hist(100*census$LoneParentDependent, xlim = c(0,100),
     xlab = "% of Households", main = "Lone parent with dependents", border = "cyan")
## reset par
par(mfrow = c(1,1))
The boxplot() function also provides a useful way of summarising data:
boxplot(100*census$PrivateRented, main = "The distribution of Private Rented Households", ylab="% of all Households")
But it can also be used to explore the interaction between variables and different factors. For the sake of argument, the code below labels each UA with its dominant social grade class
census$social.class <- names(census[38:41])[apply(census[, 38:41], 1, which.max)]
census$social.class <- sub("SocialGrade_","",census$social.class)
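The two lines above are compact, so it may help to see the same steps on a toy table. The sketch below uses made-up shares for three areas; only the column naming mirrors the census data.

```r
# Toy shares for three areas (made-up values, same column naming as census)
toy <- data.frame(SocialGrade_AB = c(0.40, 0.20, 0.25),
                  SocialGrade_C1 = c(0.30, 0.35, 0.25),
                  SocialGrade_C2 = c(0.20, 0.25, 0.20),
                  SocialGrade_DE = c(0.10, 0.20, 0.30))

# which.max returns the position of the largest value in each row
idx <- apply(toy, 1, which.max)

# Use the positions to pick column names, then strip the common prefix
dominant <- sub("SocialGrade_", "", names(toy)[idx])
dominant
```

Where two classes tie for the largest share, which.max returns the first of them, so the label depends on column order in that case.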
Now we can use this to explore the distribution of different types of housing with dominant social grade. Note the different parameters that are being passed to boxplot. You should explore these in your own time.
par(mar = c(3,8,3,3))
# plot 1
boxplot(100*census$Owned~census$social.class,
horizontal = TRUE, outline = FALSE, lwd = 0.5, las=1,
col = c("#D7191C","#FFFFBF", "#2B83BA"),
main="% of Owned property by dominant social grades")
# plot 2
boxplot(100*census$SocialRented~census$social.class,
horizontal = TRUE, outline = FALSE, lwd = 0.5, las=1,
col = c("#D7191C","#FFFFBF", "#2B83BA"),
main="% of Social Rented property by dominant social grades")
We may also need to be aware of the ecological fallacy here - the dominant social grade may not be dominant by that much - consider a 30:30:40 split. So the 10% or so households in social housing from areas with dominant social group AB may not actually be AB households themselves.
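The 30:30:40 case can be worked through directly. The sketch below uses invented shares for a single area and computes how far ahead the dominant class is, and how many households fall outside it.

```r
# Made-up shares for one area with a 30:30:40 split
shares <- c(AB = 0.40, C1 = 0.30, C2 = 0.30)

ranked <- sort(shares, decreasing = TRUE)
dominant <- names(ranked)[1]               # label of the largest class
margin <- unname(ranked[1] - ranked[2])    # lead of the top class over the runner-up
non_dominant <- 1 - unname(ranked[1])      # share of households NOT in the dominant class

dominant
margin
non_dominant
```

Here the area would be labelled AB even though most of its households belong to other classes, which is why inferences about individual households from area labels need care.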
Finally it is possible to examine the correlations between data using the cor function:
round(cor(census[, c(24:28, 38:41)]), 3)
## Owned SharedOwned SocialRented PrivateRented
## Owned 1.000 0.050 -0.913 -0.879
## SharedOwned 0.050 1.000 -0.244 0.130
## SocialRented -0.913 -0.244 1.000 0.611
## PrivateRented -0.879 0.130 0.611 1.000
## LivingRentFree -0.223 0.264 0.060 0.297
## SocialGrade_AB 0.610 0.322 -0.663 -0.431
## SocialGrade_C1 0.306 0.211 -0.395 -0.116
## SocialGrade_C2 0.265 -0.222 -0.118 -0.386
## SocialGrade_DE -0.761 -0.284 0.786 0.576
## LivingRentFree SocialGrade_AB SocialGrade_C1 SocialGrade_C2
## Owned -0.223 0.610 0.306 0.265
## SharedOwned 0.264 0.322 0.211 -0.222
## SocialRented 0.060 -0.663 -0.395 -0.118
## PrivateRented 0.297 -0.431 -0.116 -0.386
## LivingRentFree 1.000 0.057 -0.311 0.080
## SocialGrade_AB 0.057 1.000 0.572 -0.477
## SocialGrade_C1 -0.311 0.572 1.000 -0.516
## SocialGrade_C2 0.080 -0.477 -0.516 1.000
## SocialGrade_DE 0.034 -0.930 -0.690 0.242
## SocialGrade_DE
## Owned -0.761
## SharedOwned -0.284
## SocialRented 0.786
## PrivateRented 0.576
## LivingRentFree 0.034
## SocialGrade_AB -0.930
## SocialGrade_C1 -0.690
## SocialGrade_C2 0.242
## SocialGrade_DE 1.000
Here we can see that there are some very strong negative and positive associations between the proportions (percentages) of households under different types of tenure and social grade.
Notice that the c(24:28, 38:41) was used to select the data columns to compare. The round function was used to limit the number of decimal places of the output. The names(census) command can help decide which variables to consider.
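Column positions are easy to get wrong if the table changes, so an alternative is to select columns by name. The sketch below is one possible approach on a toy table with made-up values; it assumes the social grade columns share the SocialGrade_ prefix, as they do in the names listed earlier.

```r
# Toy table with made-up values; only the column naming mirrors census
toy <- data.frame(Name = c("A", "B", "C", "D"),
                  Owned = c(0.70, 0.60, 0.50, 0.65),
                  SocialRented = c(0.10, 0.20, 0.30, 0.15),
                  SocialGrade_AB = c(0.30, 0.22, 0.15, 0.27),
                  SocialGrade_DE = c(0.20, 0.28, 0.35, 0.22))

# Tenure columns are listed explicitly; social grade columns are matched by pattern
keep <- c("Owned", "SocialRented",
          grep("^SocialGrade_", names(toy), value = TRUE))

round(cor(toy[, keep]), 3)
```

Selecting by name makes the code self-documenting and keeps it working if columns are added or reordered.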
We can examine these in further detail using the plot() function. This plots all of the variables against each other and is useful for displaying how all variables correlate.
plot(census[,c(24:28, 38:41)], cex = .6)
It is clear that there are some very strong trends between variables here that may be of interest in our analysis.
This section has provided some rubrics for working in R (scripts, etc), shown how to get data in and out of R and developed some simple visualisations of data.
In later sessions we will expand on this using the ggplot2 package. This comes with the tidyverse and, although it involves a bit of a learning curve, it produces excellent visualisations and is extremely controllable. There is plenty of help and advice on the internet and in core text books, as well as the information provided here.