March Workshop for R Novices

This is the FULL cribsheet for the Novice workshop. Use this document to follow along during the session. The published version on RPubs is here: http://rpubs.com/crt34/march-workshop-full

Schedule

1st Half

– 6:30-6:35pm: A. RStudio - Basics
– 6:35-6:40pm: B. Pre-loaded data - Examine
– 6:40-6:50pm: C. Make your own data - Geo Chart
– 6:50-6:55pm: D. ggplot2 - Set-up
– 6:55-7:15pm: E. ggplot2 - Data Viz + Extensions

Break

– 7:15-7:30pm

2nd Half

– 7:30-8:15pm: F. Machine Learning technique - Clustering + Extension

“There’s more than one way to skin a cat.”

A. RStudio - Basics

(i) Open & Save new R Script

(ii) Running code

Windows: Ctrl+Enter (Alt+Enter also works) or highlight code and click ‘Run’
Mac: Cmd+Enter or highlight code and click ‘Run’

(iii) Creating R Objects

#Create object
object <- 3 + 5 

#Call object
object

## [1] 8
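
Objects can hold more than a single number. As a small sketch of the same idea (the names scores and average are made up for this example):

```r
#Store several values in one object using c() (combine)
scores <- c(3, 5, 8)

#Reuse the object in a later calculation
average <- mean(scores)
average

#List the objects currently in your workspace
ls()
```

Anything you create with <- stays available for the rest of the session, until you overwrite or remove it.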

B. Pre-loaded Data - Examine

The default installation of R comes with several data sets. Bring up the listing of pre-loaded data sets:

data()

Some of the more popular data sets used in online demos & tutorials are:

data("iris")
data("mtcars")
data("longley")
data("USArrests")
data("VADeaths")

Here are some useful R commands for top-line exploration of a data set (insert the name of the dataset in the brackets, e.g. class(iris)):

class()
dim()
str()
summary()
head()
tail()
View()
?<name of data set>
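
For instance, here is how those commands might look when applied to the iris data set (View() and ? are left out because they open a viewer or help pane rather than printing to the console):

```r
#Load the pre-installed iris data set
data("iris")

class(iris)     #what kind of object is it?
dim(iris)       #number of rows and columns
str(iris)       #structure: column names, types and first values
summary(iris)   #summary statistics for each column
head(iris, 3)   #first 3 rows
tail(iris, 3)   #last 3 rows
```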

C. Make your own data - Geo Chart

(i) Let’s create a data set by hand of the popularity of specific countries as holiday destinations for the room:

Country <-    c("United Kingdom", "France", "Spain", "Germany", "US", "Australia", "Thailand" )
Popularity <- c(20, 25, 22, 15, 5, 5, 5)
geodata <- data.frame(Country, Popularity)

View(geodata)
class(geodata)
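
Once the data frame exists you can already work with it in base R. A minimal sketch that sorts the destinations from most to least popular (the name geodata.sorted is made up for this example):

```r
#Recreate the hand-made data set
Country <- c("United Kingdom", "France", "Spain", "Germany", "US", "Australia", "Thailand")
Popularity <- c(20, 25, 22, 15, 5, 5, 5)
geodata <- data.frame(Country, Popularity)

#Sort the rows by Popularity, largest first
geodata.sorted <- geodata[order(-geodata$Popularity), ]
geodata.sorted

#The most popular destination is now in the first row
geodata.sorted$Country[1]
```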

(ii) Install the “googleVis” package

Download “googleVis” package from CRAN via install.packages:

install.packages("googleVis")

You can check that the package has installed via the RStudio “Packages” tab.

(iii) Load “googleVis” to use in current session

library(googleVis)

(iv) Use the “gvisGeoChart” function from the “googleVis” package. Bring up help on the function:

?gvisGeoChart
args(gvisGeoChart)

Key arguments:
* data = a data.frame, where at least one column has location name.
* locationvar = column name of data with the geo locations to be analysed.
* colorvar = column name of data with the optional numeric column used to assign a color to this marker.

(v) Create Geo Chart from ‘geodata’ data frame object:

geochart <- gvisGeoChart(geodata, 
                         locationvar = "Country",
                         colorvar = "Popularity")
plot(geochart)

D. ggplot2 - Set-Up

ggplot2 is an extremely popular and widely used graphics system in R, created by Hadley Wickham. It is an implementation of the Grammar of Graphics concepts developed by Leland Wilkinson.

Other key graphics systems in R are the base graphics package and the lattice package.

(i) Install the “ggplot2” package:

Download ggplot2 package from CRAN via install.packages:

install.packages("ggplot2")

You can check that the package has installed via the RStudio “Packages” tab.

(ii) Load “ggplot2” to use in current session

library(ggplot2)

(iii) ggplot2 Documentation & example data sets

Documentation: http://docs.ggplot2.org/current/

?ggplot

Load and examine ggplot2 example data sets:

data("economics", "diamonds")

E. ggplot2 - Data Viz

Key Plotting Layers
* data = a data.frame
* aes = short for aesthetics, defines the data to be mapped to the aesthetics of the plot
* geom_xxx = short for geometric objects, defines the type of plot produced

(i) Line Chart

Plot a line graph of population against time:

line.graph <- ggplot(data = economics, aes(x = date, y = pop)) + geom_line()

plot(line.graph)

(ii) Bar Chart

Plot a bar chart showing the count of diamonds by cut:

bar.chart <- ggplot(data = diamonds, aes(x = cut)) + geom_bar()

plot(bar.chart)

(iii) Box Plot

Plot a box plot showing the distribution of prices for each diamond cut:

box.plot <- ggplot(data = diamonds, aes(x = cut, y = price)) + geom_boxplot()

plot(box.plot)

(iv) Scatterplot

Make a scatterplot showing the relationship between two variables, price and carat:

scatterplot <- ggplot(data = diamonds, aes(x = price, y = carat)) + geom_point()

plot(scatterplot)

Extensions!!

– Add Labels by adding these elements to your plot:

+ ggtitle("insert title here")
+ xlab("insert x-axis label here")
+ ylab("insert y-axis label here")

– Change colours: R automatically comes with a base colour palette, “R colors”. There are 657 pre-made colours with given names which can be accessed by the colors() command. This is a useful doc for colour swatches: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Try changing the colour (outline) or fill (solid area) of the geom_xxx, e.g.

+ geom_line(colour = "tomato")
+ geom_bar(colour = "slategrey", fill = "peachpuff")
+ geom_boxplot(colour = "navy", fill = "oldlace")
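
Before using a colour name in a geom, you can look it up from the console. A short sketch using only base R (the name all.colours is made up for this example):

```r
#Store all the built-in colour names
all.colours <- colors()

#How many are there?
length(all.colours)

#Search for names containing "grey"
head(grep("grey", all.colours, value = TRUE))

#Check that a name is valid before using it in a plot
"tomato" %in% all.colours
```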

– Subset existing plots with colour demarcation: You can split your plot by a factor variable by specifying an aes data mapping inside the geom_xxx.
For example:

bar.chart.subset <- ggplot(data = diamonds, aes(x = cut)) + geom_bar(aes(fill = clarity))

plot(bar.chart.subset)

Another example:
NOTE: calling the enhanced scatterplot below takes a little while due to the volume of data, give it a moment!

scatterplot.subset <- ggplot(data = diamonds, aes(x = price, y = carat)) + geom_point(aes(colour = cut))

plot(scatterplot.subset)

– Faceting: Facets display subsets of the data in different panels.

Try reproducing the diamonds bar chart split into 7 panels, one for each of the 7 diamond “color” grades (D-J):

bar.chart.facet <- ggplot(data = diamonds, aes(x = cut)) + geom_bar() + facet_grid(color~.)

Additionally, add an aes(fill = color) inside the geom_bar for extra demarcation:

bar.chart.facet <- ggplot(data = diamonds, aes(x = cut)) + geom_bar(aes(fill = color)) + facet_grid(color~.)

plot(bar.chart.facet)

F. Machine Learning technique - Clustering

Intro + Steps

Clustering is a classic Machine Learning (ML) problem, where the task is to cluster your data points into groups based on similar properties. Common business applications are Market/Customer/Product Segmentation. It is classified as an ‘Unsupervised’ ML problem (vs ‘Supervised’), because your data is ‘unlabeled’ (vs ‘labeled’), i.e. no indication of a target classification. Thus this ML algorithm is used to look for patterns and discover/uncover structure within the data.

There are several classes of clustering methods, including k-means and PAM (Partitioning Around Medoids), both of which require upfront specification of the number of clusters in the final solution. In this exercise, we’ll use the Hierarchical Agglomerative Clustering method, which outputs a complete set of solutions (from a single cluster solution to an n cluster solution, where n = no. of data points).
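
To see what “upfront specification” means in practice, here is a rough k-means sketch for contrast only; it is not part of the workshop exercise. The choice of 3 clusters is arbitrary, and scale() is used with its defaults (mean and standard deviation), which is an assumption made to keep the snippet short:

```r
#Use the mtcars data without the two binary columns, scaled with scale() defaults
data(mtcars)
mtcars.scaled <- scale(mtcars[, -c(8, 9)])

#k-means needs the number of clusters (centers) before it can run
set.seed(42)   #k-means starts from random centres, so fix the seed
km <- kmeans(mtcars.scaled, centers = 3, nstart = 10)

#How many cars ended up in each of the 3 clusters
table(km$cluster)
```

With hierarchical clustering you make that choice after looking at the dendrogram instead.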

For all methods the process of grouping the data points into clusters is based on a numerical measure of ‘distance’ between points, to quantify dis/similarity. In this exercise we calculate Euclidean Distance measures between pairs of data points (based on the square root of the sum of squares of the differences between corresponding elements of two vectors).
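
As a worked example of the Euclidean Distance formula, using two made-up points (a and b are not from the workshop data):

```r
#Two made-up data points with three measurements each
a <- c(1, 2, 3)
b <- c(4, 6, 3)

#Square root of the sum of squared differences: sqrt(9 + 16 + 0)
by.hand <- sqrt(sum((a - b)^2))
by.hand
## [1] 5

#dist() gives the same answer when the points are rows of a matrix
via.dist <- dist(rbind(a, b), method = "euclidean")
via.dist
```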

What: Cluster a set of 32 car models based on road test performance stats (the pre-loaded mtcars data set)
How: Hierarchical Agglomerative Clustering algorithm
Output: Dendrogram data viz, membership classification

R documentation for the hclust Hierarchical Agglomerative Clustering function that we’re using:
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html

DENDROGRAM PREVIEW: The tree construct visualising the complete set of clustering solutions from single cluster (root) to 32 cluster (leaves), and a manually chosen 6 cluster solution:

Steps:
(i) Load & Explore data
(ii) Clean data
(iii) Standardise data
(iv) Calculate Euclidean Distance Matrix
(v) Calculate Clusters
(vi) Make Dendrogram data viz
(vii) Define appropriate no. of Clusters
(viii) Characterise each Cluster by their variable statistics
(ix) Create final Cluster Membership data

NOTE: notice how we don’t need to install & load any packages for this exercise? All the functions we’re using here are base R!

(i) Load & Explore data

Also note the criteria on which we’re clustering the cars:

data(mtcars)
View(mtcars)
?mtcars

(ii) Clean data

Take out the binary variables, “vs” and “am”, by excluding their column positions (columns 8 and 9) from the original data set:

mtcars1 <- mtcars[, -c(8, 9)]

#Eyeball cleaned data
View(mtcars1)
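
Column positions are easy to get wrong, so it can help to confirm which columns sit at positions 8 and 9, or to drop them by name instead. A sketch of both (the name mtcars1.by.name is made up for this example):

```r
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]

#Confirm which columns are at positions 8 and 9
names(mtcars)[c(8, 9)]

#Alternative: drop the same columns by name rather than position
mtcars1.by.name <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]

#Both approaches give the same result
identical(mtcars1, mtcars1.by.name)
```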

(iii) Standardise data

A common data pre-processing step that gives the same importance to all the variables: we standardise the data so that each variable/column has a median of 0 and a comparable range of values:

#Calculate the column-wise medians
medians <- apply(mtcars1, 2, median)

#Calculate the column-wise median absolute deviations (mads)
mads <- apply(mtcars1, 2, mad)

#Update the mtcars1 data set by scaling each column by its median and mad
mtcars2 <- scale(mtcars1, center = medians, scale = mads)

print(mtcars2, digits=2)

##                      mpg   cyl  disp    hp  drat    wt  qsec  gear  carb
## Mazda RX4          0.333  0.00 -0.26 -0.17  0.29 -0.92 -0.88  0.00  1.35
## Mazda RX4 Wag      0.333  0.00 -0.26 -0.17  0.29 -0.59 -0.49  0.00  1.35
## Datsun 710         0.665 -0.67 -0.63 -0.39  0.22 -1.31  0.64  0.00 -0.67
## Hornet 4 Drive     0.407  0.00  0.44 -0.17 -0.87 -0.14  1.22 -0.67 -0.67
## Hornet Sportabout -0.092  0.67  1.17  0.67 -0.77  0.15 -0.49 -0.67  0.00
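
One way to sanity-check the standardisation: after centring on the median and scaling by the median absolute deviation, every column should have a median of 0 and a mad of 1. A quick sketch:

```r
#Rebuild the objects from the earlier steps so this snippet runs on its own
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]
medians <- apply(mtcars1, 2, median)
mads <- apply(mtcars1, 2, mad)
mtcars2 <- scale(mtcars1, center = medians, scale = mads)

#Column-wise medians of the scaled data (should all be 0)
round(apply(mtcars2, 2, median), 10)

#Column-wise mads of the scaled data (should all be 1)
round(apply(mtcars2, 2, mad), 10)
```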

(iv) Calculate Euclidean Distance Matrix

Calculate a matrix of dis/similarity measures between each pair of cars using the dist function:

#The 'method' argument specifies the distance measure to be used from a number of options - here "euclidean" is chosen
mtcars3 <- dist(mtcars2, method = "euclidean")

print(mtcars3, digits=2)
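
The printed output is large, so it can be easier to convert the result to an ordinary matrix and look up individual pairs. A sketch (the name dist.matrix is made up for this example):

```r
#Rebuild the objects from the earlier steps so this snippet runs on its own
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]
mtcars2 <- scale(mtcars1, center = apply(mtcars1, 2, median), scale = apply(mtcars1, 2, mad))
mtcars3 <- dist(mtcars2, method = "euclidean")

#Convert to a full square matrix: one row and one column per car
dist.matrix <- as.matrix(mtcars3)
dim(dist.matrix)

#Look up the distance between two specific cars
dist.matrix["Mazda RX4", "Mazda RX4 Wag"]

#The same number calculated directly from the Euclidean formula
sqrt(sum((mtcars2["Mazda RX4", ] - mtcars2["Mazda RX4 Wag", ])^2))
```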

(v) Calculate Clusters

Once the Euclidean Distances between every pair of cars have been calculated, we call the hclust function. This is where we apply the algorithm to cluster the 32 cars based on their dis/similarity:

#Default agglomeration method is Complete Linkage (distance between clusters based on largest existing pairwise dissimilarity)

#Specifying Ward's minimum variance method here which minimises the total within-cluster variance (see hclust documentation)

clusters <- hclust(mtcars3, method = "ward.D2")
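
If you are curious what hclust returns, the result is a list you can inspect. A short sketch; clusters.complete is a made-up name, included only to show the default method for comparison:

```r
#Rebuild the objects from the earlier steps so this snippet runs on its own
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]
mtcars2 <- scale(mtcars1, center = apply(mtcars1, 2, median), scale = apply(mtcars1, 2, mad))
mtcars3 <- dist(mtcars2, method = "euclidean")
clusters <- hclust(mtcars3, method = "ward.D2")

#Which agglomeration method was used
clusters$method

#One row per merge: 32 cars need 31 merges to reach a single cluster
nrow(clusters$merge)

#The heights at which the first few merges happen
head(clusters$height)

#For comparison only: leaving out 'method' uses Complete Linkage
clusters.complete <- hclust(mtcars3)
clusters.complete$method
```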

(vi) Make Dendrogram data viz

You can plot as-is:

plot(clusters)

Or plot where the labels are bottom-aligned:

plot(clusters, hang = -1)

(vii) Define appropriate no. of Clusters

Choosing a 6 cluster solution here, and visualising this on the dendrogram:

rect.hclust(clusters, 6)

Use cutree function to cut the tree into the 6 groups of data (clusters):

clusters.6 <- cutree(clusters, 6)

Use the table function to build a contingency table of the counts for each value of the new clusters.6 variable, i.e. how many cars are in each cluster:

table(clusters.6)

## clusters.6
## 1 2 3 4 5 6 
## 5 7 7 4 3 6
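
To see which cars are in each cluster, not just how many, one option is to split the car names by cluster number. A sketch (the name members is made up for this example):

```r
#Rebuild the objects from the earlier steps so this snippet runs on its own
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]
mtcars2 <- scale(mtcars1, center = apply(mtcars1, 2, median), scale = apply(mtcars1, 2, mad))
clusters <- hclust(dist(mtcars2, method = "euclidean"), method = "ward.D2")
clusters.6 <- cutree(clusters, 6)

#clusters.6 is a named vector: the names are the cars, the values are cluster numbers
head(clusters.6)

#Group the car names by cluster number
members <- split(names(clusters.6), clusters.6)
members
```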

(viii) Characterise each Cluster by their variable statistics

Calculate & Interpret variable stats in their standardised scale:

#Use aggregate command to compute chosen summary statistics (mean) which is then applied to all subsets (the 6 clusters) of the scaled mtcars data
means.scaled <- aggregate(mtcars2, list(clusters.6), mean)

#Bring up info, formatted to 2 significant digits
options(digits = 2)
means.scaled

##   Group.1   mpg   cyl  disp     hp   drat    wt   qsec  gear  carb
## 1       1  0.10  0.00 -0.26  0.067  0.223 -0.39 -0.335  0.13  1.62
## 2       2  0.48 -0.48 -0.26 -0.352 -0.153 -0.50  1.595 -0.29 -0.39
## 3       3 -0.45  0.67  0.85  0.610 -0.916  0.47 -0.275 -0.67  0.29
## 4       4 -0.85  0.67  1.03  1.936 -0.028  0.28 -1.852  0.00  2.02
## 5       5 -1.36  0.67  1.86  1.215 -0.911  2.63  0.021 -0.67  1.35
## 6       6  2.01 -0.67 -0.78 -0.616  0.790 -1.89  0.486  0.22 -0.34

Calculate & Interpret variable stats in their original scale:

#Use aggregate command to compute chosen summary statistics (mean) which is then applied to all subsets (the 6 clusters) of the un-scaled mtcars data
means.orig <- aggregate(mtcars1, list(clusters.6), mean)

#Bring up info
means.orig

##   Group.1 mpg cyl disp  hp drat  wt qsec gear carb
## 1       1  20 6.0  160 128  3.9 3.0   17  4.2  4.4
## 2       2  22 4.6  160  96  3.6 2.9   20  3.6  1.4
## 3       3  17 8.0  316 170  3.0 3.7   17  3.0  2.4
## 4       4  15 8.0  340 272  3.7 3.5   15  4.0  5.0
## 5       5  12 8.0  457 217  3.1 5.3   18  3.0  4.0
## 6       6  30 4.0   87  76  4.3 1.9   18  4.3  1.5
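
When reading the cluster means, it helps to have the cluster sizes alongside and to order the clusters by a variable of interest. A sketch, where the column names cluster and n.cars are made up for this example:

```r
#Rebuild the objects from the earlier steps so this snippet runs on its own
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]
mtcars2 <- scale(mtcars1, center = apply(mtcars1, 2, median), scale = apply(mtcars1, 2, mad))
clusters <- hclust(dist(mtcars2, method = "euclidean"), method = "ward.D2")
clusters.6 <- cutree(clusters, 6)

#Cluster means in the original scale, with a named grouping column
means.orig <- aggregate(mtcars1, list(cluster = clusters.6), mean)

#Add the number of cars in each cluster
means.orig$n.cars <- as.vector(table(clusters.6))

#Order the clusters from highest to lowest average mpg
means.orig[order(-means.orig$mpg), ]
```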

(ix) Create final Cluster Membership data

Bind the new 6-cluster membership variable to the original (cleaned) mtcars data to create an additional column indicating the computed Cluster Membership of each car:

mtcars.membership <- cbind(clusters.6, mtcars1)

Display the resulting data set:

mtcars.membership

##                   clusters.6 mpg cyl disp  hp drat  wt qsec gear carb
## Mazda RX4                  1  21   6  160 110  3.9 2.6   16    4    4
## Mazda RX4 Wag              1  21   6  160 110  3.9 2.9   17    4    4
## Datsun 710                 2  23   4  108  93  3.9 2.3   19    4    1
## Hornet 4 Drive             2  21   6  258 110  3.1 3.2   19    3    1
## Hornet Sportabout          3  19   8  360 175  3.1 3.4   17    3    2
## Valiant                    2  18   6  225 105  2.8 3.5   20    3    1
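
From here the membership data behaves like any other data frame. A sketch of two common next steps (the name by.cluster is made up for this example):

```r
#Rebuild the objects from the earlier steps so this snippet runs on its own
data(mtcars)
mtcars1 <- mtcars[, -c(8, 9)]
mtcars2 <- scale(mtcars1, center = apply(mtcars1, 2, median), scale = apply(mtcars1, 2, mad))
clusters <- hclust(dist(mtcars2, method = "euclidean"), method = "ward.D2")
clusters.6 <- cutree(clusters, 6)
mtcars.membership <- cbind(clusters.6, mtcars1)

#Sort the cars so that members of the same cluster sit together
by.cluster <- mtcars.membership[order(mtcars.membership$clusters.6), ]
head(by.cluster)

#Pull out the cars belonging to a single cluster, e.g. cluster 1
subset(mtcars.membership, clusters.6 == 1)
```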

Extension!!

Try one of the many packages available to create enhanced dendrogram visualisations. Here we’ll create coloured leaves:

install.packages("sparcl")

library(sparcl)

ColorDendrogram(clusters, y = clusters.6, labels = names(clusters.6), branchlength = 5)

Happy Coding and Happy Easter!!