An introduction to R

Saturday, July 19, 2014

Agenda

Introductory stuff
Connecting to SQL Server
Basic analysis with data.table and ggplot2
Building interactive reports
Dynamically built documents
Q&A & Conclusion

Introduction

What's your background?

R-some puns

I'm giving away a copy of Tribal SQL to the best 'R' or R pun

Questions

Can be asked throughout, but keep substantial questions to the end to ensure we can stay on time

About Steph Locke aka Me

Cardiff
BI & Risk Analytics Developer
Community organiser

SQLRelay.co.uk & your user groups

About R

R is a programming language for analysing data. Fundamentally, R allows you to directly code analysis, reports, and statistical models.

Why is that a good thing?

Code is syntax – easy to apply principals to new language
Time reduction – less clicking means more time thinking
Standards – unit testing, code review, source control etc

Great, so what else?

It's free and totally customisable
It works on Windows and with SQL Server
It looks awesome, so you look awesome

Some R niggles

Like any technology there are some current quirks or limitations

Based on S which was written in the 70s
For a long time, was the sole province of academic statisticians
It's open so there are competing coding paradigms (also a benefit!)
It can take a while to get the hang of googling for R
No robust MDX package as yet

Installing R

Install R r-project.org
Install Rstudio Rstudio.com
You're done

Notes on installing R

R can typically be installed without UAC being triggered
R uses packages to extend itself so you need to be able to download zip files
Rstudio will update on a relatively regularly basis and does hit UAC
R and Rstudio work better on linux as it's easier to get the additional components that can be required for some packages
I would also recommend the installation of
- MikTex (Windows variant of LaTeX) and Pandoc for documentation
- Git for source control (integrates with TFS!)

Working with SQL Server

First things first - get a SQL Server driver

Secondly - install the RODBC package for R and activate it

install.packages("RODBC",repos="http://cran.ma.imperial.ac.uk/")

## Installing package into 'C:/Users/OzandSteph/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)

## package 'RODBC' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\OzandSteph\AppData\Local\Temp\RtmpsjOV8U\downloaded_packages

# Activate the ODBC package for use
library("RODBC")

Thirdly - connect and grab some data!

# Connection courtesy of Redgate bit.ly/1jrl7V6

# Build a connection to the DB for reuse
azure <- odbcDriverConnect(
          "Driver={SQL Server Native Client 11.0};
          Server=mhknbn2kdz.database.windows.net;
          Database=AdventureWorks2012;
          Uid=sqlfamily;
          Pwd=sqlf@m1ly;")

# Store simple SELECT statement results
person <- sqlQuery(azure, 
         "SELECT TOP 5 * FROM [Person].[Person]")

Inline example of dynamic results: 5 records returned on previous page and details below

#Show results
person

##   BusinessEntityID PersonType NameStyle Title FirstName MiddleName
## 1                1         EM         0  <NA>       Ken          J
## 2                2         EM         0  <NA>     Terri        Lee
## 3                3         EM         0  <NA>   Roberto       <NA>
## 4                4         EM         0  <NA>       Rob       <NA>
## 5                5         EM         0   Ms.      Gail          A
##     LastName Suffix EmailPromotion AdditionalContactInfo
## 1    Sánchez     NA              0                    NA
## 2      Duffy     NA              1                    NA
## 3 Tamburello     NA              0                    NA
## 4    Walters     NA              0                    NA
## 5   Erickson     NA              0                    NA
##                                                                                                                                                          Demographics
## 1 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 2 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 3 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 4 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 5 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
##                                rowguid ModifiedDate
## 1 92C4279F-1207-48A3-8448-4636514EB7E2   2003-02-08
## 2 D8763459-8AA8-47CC-AFF7-C9079AF79033   2002-02-24
## 3 E1A2555E-0828-434B-A33B-6F38136A37DE   2001-12-05
## 4 F2D7CE06-38B3-4357-805B-F4B6B71C01FF   2001-12-29
## 5 F3A3F6B4-AE3B-430C-A754-9F2231BA6FEF   2002-01-30

Notes on connecting

You can use DSNs, I've just done it explicitly for proof
You can use the TrustedConnection property for AD authenticating logins
You can write functions to hide the different steps to make it quicker in future
Multiple statements can be troublesome - try to keep to just one

Basic analysis - data.table & ggplot2

data.table and ggplot2 are probably the two most useful packages in R!

data.table

data.table enhances data.frames (tables) to give you grouping, filtering, joins, aggregations and column names.
It's quite SQL-like and is designed to be very fast.

ggplot2

ggplot2 implements a grammar of graphics (gg)
Simple, consistent functions to build charts

data.table

Grab some data

# SQL statement
sales <- sqlQuery(azure, "
SELECT 
soh.OrderDate,
t.Name as Territory,
cr.countryregioncode as Code,
cr.Name as Region,
sum(TotalDue) as Value,
count(*) as Volume
FROM [Sales].[SalesOrderHeader] soh
INNER JOIN [Sales].[SalesTerritory] t 
          on t.territoryid=soh.territoryid
INNER JOIN [Person].[CountryRegion] cr 
          on cr.countryregioncode=t.countryregioncode
WHERE  soh.OrderDate < '2008-07-01'
GROUP BY soh.OrderDate,
t.Name ,
cr.Name,
cr.countryregioncode")

Simples

# Convert to data.table
sales<-data.table(sales)

# Basic GROUP BY aggregation - returns count by region
sales[,.N,by=Region]

##            Region    N
## 1:  United States 1822
## 2:         Canada  607
## 3:        Germany  710
## 4:         France  695
## 5:      Australia 1027
## 6: United Kingdom  758

Row extraction

# .SD returns the subset of rows and is filtered by which.max
kable(sales[,.SD[which.max(Value)],by=Region])

Region	OrderDate	Territory	Code	Value	Volume
United States	2006-09-01	Southwest	US	1158625	36
Canada	2007-12-01	Canada	CA	803033	37
Germany	2007-08-01	Germany	DE	297738	16
France	2007-08-01	France	FR	803118	25
Australia	2007-09-01	Australia	AU	311014	21
United Kingdom	2007-09-01	United Kingdom	GB	561617	15

Aggregation

# ORDER, multiple aggregations, and aliasing
kable(sales[order(OrderDate),list(Value=sum(Value),Volume=sum(Volume))
            ,by=list(Year=year(OrderDate))])

Year	Value	Volume
2005	12693251	1379
2006	34463848	3692
2007	47171490	12443
2008	28832019	12975

Joins

salesbyyear<-sales[,list(AnnualValue=sum(Value),AnnualVolume=sum(Volume))
                   ,keyby=list(Year=year(OrderDate))]
salesbyregion<-sales[,list(Value=sum(Value),Volume=sum(Volume))
                   ,by=list(Region,Year=year(OrderDate))]
#PK assignment
setkey(salesbyregion,Year)
kable(salesbyyear[salesbyregion[Region=="United Kingdom"],
            list(Proportion=percent(Volume/AnnualVolume))])

Year	Proportion
2005	6.96%
2006	8.15%
2007	10.9%
2008	10.7%

ggplot2

Quick chart

# Activate ggplot2
library(ggplot2)

# Quick plot (qplot)
qplot(year(OrderDate), Value,fill=Region,
      data=sales[order(Region)],geom="bar",stat="identity")

plot of chunk unnamed-chunk-10

Chart anatomy

# Chart backbone
ggplot(data=sales[order(Region)],aes(x=year(OrderDate),y=Value,fill=Region))+
  #bar chart type, identity uses contents not count
  geom_bar(stat="identity")+
  #themeing
  scale_fill_brewer(type="qual")+
  theme_minimal()

plot of chunk unnamed-chunk-11

Making a trellis

ggplot(data=sales,aes(x=year(OrderDate),y=Value,fill=Region))+
  geom_bar(stat="identity")+
  # Trellis command
  facet_grid(.~Region)+
  theme(legend.position="none",axis.text.x=element_text(angle=45))

plot of chunk unnamed-chunk-12

Formatting

library(scales)
ggplot(data=sales,aes(x=year(OrderDate),y=Value/1000000,fill=Region))+
  geom_bar(stat="identity")+
  facet_grid(.~Region)+
  theme_minimal(base_size = 10) +
  theme(legend.position="none",axis.text.x=element_text(angle=45,hjust=1))+
  scale_y_continuous(label=dollar)+
  labs(x="Year",y="Value ($m)")

plot of chunk unnamed-chunk-13

Interactive documents / reports

Use knitr with LaTeX or markdown to produce static documents
Use shiny to make properly interactive reports

Shiny overview

A web application framework for R
Turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required

Uses magic and pixie dust to transform your R code into a webpage that provides end-users with the ability to analyse data interactively

Shiny paradigm

It's call and response!

You have client-side controllers that tell server-side code to recalculate content
There are R functions for input controls, preparing content, and functions that output content.

Shiny Server

You need a Linux server for this and your applications can be made internally or externally available.

Shiny is by default open, use either a password system or utilise AD integration in the pro edition

Demo time!

library(shiny)
system.file("examples", package="shiny")
runExample("03_reactivity") # a reactive expression
runExample("05_sliders") # sliders

Dynamic documents

CONCEPT: modular documents that contain text and analysis that update

This is pretty much impossible in the Office paradigm.

You can make refreshable Excel spreadsheets but stuff with words as well is hard to update
It's really hard to collaborate on a single Word document
PowerPoint requires a lot of clicking and C&Ping

R integrates with LaTeX, markdown, and pandoc to enable the blending of code and content to enable documents to be produced from scratch each time.

What are the benefits?

Change a bit of the code and the whole document updates
Modular development enables easier development (particularly in a collaborative setting)
Reduction in errors caused by C&Ping
Consistent looks and feel of documents and reports
Documented as you go
No stress about updating a document
Variety of export formats

How's it done?

You can write LaTeX and markdown (think wikipedia and SO) outside of R
There are packages like knitr designed to allow you to embed your R analysis in a variety of other languages
Work in Rstudio to benefit from their full integration of these packages

Agenda

Introduction

What's your background?

R-some puns

Questions

About Steph Locke aka Me

About R

Why is that a good thing?

Great, so what else?

Some R niggles

Installing R

Notes on installing R

Working with SQL Server

Notes on connecting

Basic analysis - data.table & ggplot2

data.table

ggplot2

data.table

Grab some data

Simples

Row extraction

Aggregation

Joins

ggplot2

Quick chart

Chart anatomy

Making a trellis

Formatting

Interactive documents / reports

Shiny overview

Shiny paradigm

Shiny Server

Demo time!

Dynamic documents

What are the benefits?

How's it done?

Demo time!

Find out more

R resources

Shiny and Knitr

LaTeX

Q&A

Conclusion

Follow up with me