Saturday, July 19, 2014

Agenda

  • Introductory stuff
  • Connecting to SQL Server
  • Basic analysis with data.table and ggplot2
  • Building interactive reports
  • Dynamically built documents
  • Q&A & Conclusion

Introduction

What's your background?

R-some puns

  • I'm giving away a copy of Tribal SQL to the best 'R' or R pun

Questions

  • Can be asked throughout, but keep substantial questions to the end to ensure we can stay on time

About Steph Locke aka Me

  • Cardiff
  • BI & Risk Analytics Developer
  • Community organiser

SQLRelay.co.uk & your user groups

About R

R is a programming language for analysing data. Fundamentally, R allows you to directly code analysis, reports, and statistical models.

Why is that a good thing?

  • Code is syntax – easy to apply principals to new language
  • Time reduction – less clicking means more time thinking
  • Standards – unit testing, code review, source control etc

Great, so what else?

  • It's free and totally customisable
  • It works on Windows and with SQL Server
  • It looks awesome, so you look awesome

Some R niggles

Like any technology there are some current quirks or limitations

  • Based on S which was written in the 70s
  • For a long time, was the sole province of academic statisticians
  • It's open so there are competing coding paradigms (also a benefit!)
  • It can take a while to get the hang of googling for R
  • No robust MDX package as yet

Installing R

Notes on installing R

  • R can typically be installed without UAC being triggered
  • R uses packages to extend itself so you need to be able to download zip files
  • Rstudio will update on a relatively regularly basis and does hit UAC
  • R and Rstudio work better on linux as it's easier to get the additional components that can be required for some packages
  • I would also recommend the installation of
    • MikTex (Windows variant of LaTeX) and Pandoc for documentation
    • Git for source control (integrates with TFS!)

Working with SQL Server

First things first - get a SQL Server driver

Secondly - install the RODBC package for R and activate it

install.packages("RODBC",repos="http://cran.ma.imperial.ac.uk/")
## Installing package into 'C:/Users/OzandSteph/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'RODBC' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\OzandSteph\AppData\Local\Temp\RtmpsjOV8U\downloaded_packages
# Activate the ODBC package for use
library("RODBC")

Thirdly - connect and grab some data!

# Connection courtesy of Redgate bit.ly/1jrl7V6

# Build a connection to the DB for reuse
azure <- odbcDriverConnect(
          "Driver={SQL Server Native Client 11.0};
          Server=mhknbn2kdz.database.windows.net;
          Database=AdventureWorks2012;
          Uid=sqlfamily;
          Pwd=sqlf@m1ly;")

# Store simple SELECT statement results
person <- sqlQuery(azure, 
         "SELECT TOP 5 * FROM [Person].[Person]")

Inline example of dynamic results: 5 records returned on previous page and details below

#Show results
person
##   BusinessEntityID PersonType NameStyle Title FirstName MiddleName
## 1                1         EM         0  <NA>       Ken          J
## 2                2         EM         0  <NA>     Terri        Lee
## 3                3         EM         0  <NA>   Roberto       <NA>
## 4                4         EM         0  <NA>       Rob       <NA>
## 5                5         EM         0   Ms.      Gail          A
##     LastName Suffix EmailPromotion AdditionalContactInfo
## 1    Sánchez     NA              0                    NA
## 2      Duffy     NA              1                    NA
## 3 Tamburello     NA              0                    NA
## 4    Walters     NA              0                    NA
## 5   Erickson     NA              0                    NA
##                                                                                                                                                          Demographics
## 1 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 2 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 3 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 4 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 5 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
##                                rowguid ModifiedDate
## 1 92C4279F-1207-48A3-8448-4636514EB7E2   2003-02-08
## 2 D8763459-8AA8-47CC-AFF7-C9079AF79033   2002-02-24
## 3 E1A2555E-0828-434B-A33B-6F38136A37DE   2001-12-05
## 4 F2D7CE06-38B3-4357-805B-F4B6B71C01FF   2001-12-29
## 5 F3A3F6B4-AE3B-430C-A754-9F2231BA6FEF   2002-01-30

Notes on connecting

  • You can use DSNs, I've just done it explicitly for proof
  • You can use the TrustedConnection property for AD authenticating logins
  • You can write functions to hide the different steps to make it quicker in future
  • Multiple statements can be troublesome - try to keep to just one

Basic analysis - data.table & ggplot2

data.table and ggplot2 are probably the two most useful packages in R!

data.table

  • data.table enhances data.frames (tables) to give you grouping, filtering, joins, aggregations and column names.
  • It's quite SQL-like and is designed to be very fast.

ggplot2

  • ggplot2 implements a grammar of graphics (gg)
  • Simple, consistent functions to build charts

data.table

Grab some data

# SQL statement
sales <- sqlQuery(azure, "
SELECT 
soh.OrderDate,
t.Name as Territory,
cr.countryregioncode as Code,
cr.Name as Region,
sum(TotalDue) as Value,
count(*) as Volume
FROM [Sales].[SalesOrderHeader] soh
INNER JOIN [Sales].[SalesTerritory] t 
          on t.territoryid=soh.territoryid
INNER JOIN [Person].[CountryRegion] cr 
          on cr.countryregioncode=t.countryregioncode
WHERE  soh.OrderDate < '2008-07-01'
GROUP BY soh.OrderDate,
t.Name ,
cr.Name,
cr.countryregioncode")

Simples

# Convert to data.table
sales<-data.table(sales)

# Basic GROUP BY aggregation - returns count by region
sales[,.N,by=Region]
##            Region    N
## 1:  United States 1822
## 2:         Canada  607
## 3:        Germany  710
## 4:         France  695
## 5:      Australia 1027
## 6: United Kingdom  758

Row extraction

# .SD returns the subset of rows and is filtered by which.max
kable(sales[,.SD[which.max(Value)],by=Region])
Region OrderDate Territory Code Value Volume
United States 2006-09-01 Southwest US 1158625 36
Canada 2007-12-01 Canada CA 803033 37
Germany 2007-08-01 Germany DE 297738 16
France 2007-08-01 France FR 803118 25
Australia 2007-09-01 Australia AU 311014 21
United Kingdom 2007-09-01 United Kingdom GB 561617 15

Aggregation

# ORDER, multiple aggregations, and aliasing
kable(sales[order(OrderDate),list(Value=sum(Value),Volume=sum(Volume))
            ,by=list(Year=year(OrderDate))])
Year Value Volume
2005 12693251 1379
2006 34463848 3692
2007 47171490 12443
2008 28832019 12975

Joins

salesbyyear<-sales[,list(AnnualValue=sum(Value),AnnualVolume=sum(Volume))
                   ,keyby=list(Year=year(OrderDate))]
salesbyregion<-sales[,list(Value=sum(Value),Volume=sum(Volume))
                   ,by=list(Region,Year=year(OrderDate))]
#PK assignment
setkey(salesbyregion,Year)
kable(salesbyyear[salesbyregion[Region=="United Kingdom"],
            list(Proportion=percent(Volume/AnnualVolume))])
Year Proportion
2005 6.96%
2006 8.15%
2007 10.9%
2008 10.7%

ggplot2

Quick chart

# Activate ggplot2
library(ggplot2)

# Quick plot (qplot)
qplot(year(OrderDate), Value,fill=Region,
      data=sales[order(Region)],geom="bar",stat="identity")

plot of chunk unnamed-chunk-10

Chart anatomy

# Chart backbone
ggplot(data=sales[order(Region)],aes(x=year(OrderDate),y=Value,fill=Region))+
  #bar chart type, identity uses contents not count
  geom_bar(stat="identity")+
  #themeing
  scale_fill_brewer(type="qual")+
  theme_minimal()

plot of chunk unnamed-chunk-11

Making a trellis

ggplot(data=sales,aes(x=year(OrderDate),y=Value,fill=Region))+
  geom_bar(stat="identity")+
  # Trellis command
  facet_grid(.~Region)+
  theme(legend.position="none",axis.text.x=element_text(angle=45))

plot of chunk unnamed-chunk-12

Formatting

library(scales)
ggplot(data=sales,aes(x=year(OrderDate),y=Value/1000000,fill=Region))+
  geom_bar(stat="identity")+
  facet_grid(.~Region)+
  theme_minimal(base_size = 10) +
  theme(legend.position="none",axis.text.x=element_text(angle=45,hjust=1))+
  scale_y_continuous(label=dollar)+
  labs(x="Year",y="Value ($m)")

plot of chunk unnamed-chunk-13

Interactive documents / reports

  • Use knitr with LaTeX or markdown to produce static documents
  • Use shiny to make properly interactive reports

Shiny overview

A web application framework for R
Turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required

Uses magic and pixie dust to transform your R code into a webpage that provides end-users with the ability to analyse data interactively

Shiny paradigm

It's call and response!

  • You have client-side controllers that tell server-side code to recalculate content
  • There are R functions for input controls, preparing content, and functions that output content.

Shiny Server

You need a Linux server for this and your applications can be made internally or externally available.

Shiny is by default open, use either a password system or utilise AD integration in the pro edition

Demo time!

library(shiny)
system.file("examples", package="shiny")
runExample("03_reactivity") # a reactive expression
runExample("05_sliders") # sliders

Dynamic documents

CONCEPT: modular documents that contain text and analysis that update

This is pretty much impossible in the Office paradigm.

  • You can make refreshable Excel spreadsheets but stuff with words as well is hard to update
  • It's really hard to collaborate on a single Word document
  • PowerPoint requires a lot of clicking and C&Ping

R integrates with LaTeX, markdown, and pandoc to enable the blending of code and content to enable documents to be produced from scratch each time.

What are the benefits?

  • Change a bit of the code and the whole document updates
  • Modular development enables easier development (particularly in a collaborative setting)
  • Reduction in errors caused by C&Ping
  • Consistent looks and feel of documents and reports
  • Documented as you go
  • No stress about updating a document
  • Variety of export formats

How's it done?

  • You can write LaTeX and markdown (think wikipedia and SO) outside of R
  • There are packages like knitr designed to allow you to embed your R analysis in a variety of other languages
  • Work in Rstudio to benefit from their full integration of these packages

Demo time!

Find out more

Q&A

Conclusion