Thursday, July 03, 2014

Agenda

  • Introduction (5 mins)
  • Quick overview of R (5 mins)
  • Getting R running on a desktop (5 mins)
  • Connecting to SQL Server (5 mins)
  • Basic analysis with data.table and ggplot2 (15 mins)
  • A whistlestop intro to one of:
    • Building interactive reports
    • Dynamically built documents
  • Q&A & Conclusion (10 mins)

Introduction

Objective

  • Cover what R is, how you can use it and show you some of it's capabilities that you're most interested in

What's your background?

Arduous puns

  • I'm giving away a copy of Tribal SQL to the best question asker who can do a good 'R' or R pun

Questions

  • Can be asked throughout, but keep substantial questions to the end to ensure we can cover the topics people want to see.

About Steph Locke aka Me!

  • Lives and works in Cardiff
  • Works as BI & Risk Analytics Developer
    • Data warehouse
    • Reporting
    • Training
    • Modelling
    • Development
  • Works with a wide array of technologies
  • Co-author of Tribal SQL
  • Cardiff User Group leader
  • SQLRelay organiser

About R

R is a programming language for analysing data. Fundamentally, R allows you to directly code analysis, reports, and statistical models.

Why is that a good thing?

  • Code is syntax – easy to apply principals to new language
  • Time reduction – less clicking means more time thinking
  • Standards – unit testing, code review, source control etc

Great, so what else?

  • It's free and totally customisable
  • It works on Windows and with SQL Server
  • It looks awesome, so you look awesome

Some R niggles

Like any technology there are some current quirks or limitations

  • Based on S which was written in the 70s
  • For a long time, was the sole province of academic statisticians
  • It's open so there are competing coding paradigms (also a benefit!)
  • It can take a while to get the hang of googling for R
  • No robust MDX package as yet

Installing R

Notes on installing R

  • R can typically be installed without UAC being triggered
  • R uses packages to extend itself so you need to be able to download zip files
  • Rstudio will update on a relatively regularly basis and does hit UAC
  • R and Rstudio work better on linux as it's easier to get the additional components that can be required for some packages
  • I would also recommend the installation of
    • MikTex (Windows variant of LaTeX) and Pandoc for documentation
    • Git for source control (integrates with TFS!)

Working with SQL Server

First things first - get a SQL Server driver

Secondly - install the RODBC package for R and activate it

install.packages("RODBC",repos="http://cran.ma.imperial.ac.uk/")
## Installing package into 'C:/Users/OzandSteph/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'RODBC' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\OzandSteph\AppData\Local\Temp\Rtmp8OgVHl\downloaded_packages
# Activate the ODBC package for use
library("RODBC")

Thirdly - connect and grab some data!

# Connection courtesy of Redgate bit.ly/1jrl7V6

# Build a connection to the DB for reuse
azure <- odbcDriverConnect(
          "Driver={SQL Server Native Client 11.0};
          Server=mhknbn2kdz.database.windows.net;
          Database=AdventureWorks2012;
          Uid=sqlfamily;
          Pwd=sqlf@m1ly;")

# Store simple SELECT statement results
person <- sqlQuery(azure, 
         "SELECT TOP 5 * FROM [Person].[Person]")

Inline example: 5 records returned on previous page and details below

#Show results
person
##   BusinessEntityID PersonType NameStyle Title FirstName MiddleName
## 1                1         EM         0  <NA>       Ken          J
## 2                2         EM         0  <NA>     Terri        Lee
## 3                3         EM         0  <NA>   Roberto       <NA>
## 4                4         EM         0  <NA>       Rob       <NA>
## 5                5         EM         0   Ms.      Gail          A
##     LastName Suffix EmailPromotion AdditionalContactInfo
## 1    Sánchez     NA              0                    NA
## 2      Duffy     NA              1                    NA
## 3 Tamburello     NA              0                    NA
## 4    Walters     NA              0                    NA
## 5   Erickson     NA              0                    NA
##                                                                                                                                                          Demographics
## 1 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 2 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 3 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 4 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
## 5 <IndividualSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"><TotalPurchaseYTD>0</TotalPurchaseYTD></IndividualSurvey>
##                                rowguid ModifiedDate
## 1 92C4279F-1207-48A3-8448-4636514EB7E2   2003-02-08
## 2 D8763459-8AA8-47CC-AFF7-C9079AF79033   2002-02-24
## 3 E1A2555E-0828-434B-A33B-6F38136A37DE   2001-12-05
## 4 F2D7CE06-38B3-4357-805B-F4B6B71C01FF   2001-12-29
## 5 F3A3F6B4-AE3B-430C-A754-9F2231BA6FEF   2002-01-30

Notes on connecting

  • You can use DSNs, I've just done it explicitly for proof
  • You can write functions to hide the different steps to make it quicker in future
  • Multiple statements can be troublesome - try to keep to just one

Basic analysis - data.table & ggplot2

data.table and ggplot2 are probably the two most useful packages in R!

data.table

  • data.table enhances data.frames (tables) to give you grouping, filtering, joins, aggregations and column names.
  • It's quite SQL-like and is designed to be very fast.

ggplot2

  • ggplot2 implements a grammar of graphics (gg) to allow you to grammatically build plots.
  • It is a set of statements that allow you to build awesome charts

data.table

Grab some data

# SQL statement
sales <- sqlQuery(azure, "
SELECT 
soh.OrderDate,
t.Name as Territory,
cr.countryregioncode as Code,
cr.Name as Region,
sum(TotalDue) as Value,
count(*) as Volume
FROM [Sales].[SalesOrderHeader] soh
INNER JOIN [Sales].[SalesTerritory] t 
          on t.territoryid=soh.territoryid
INNER JOIN [Person].[CountryRegion] cr 
          on cr.countryregioncode=t.countryregioncode
WHERE  soh.OrderDate < '2008-07-01'
GROUP BY soh.OrderDate,
t.Name ,
cr.Name,
cr.countryregioncode")

Basics / Prep

# Convert to data.table
sales<-data.table(sales)

# Basic GROUP BY aggregation - returns count by region
sales[,.N,by=Region]
##            Region    N
## 1:  United States 1822
## 2:         Canada  607
## 3:        Germany  710
## 4:         France  695
## 5:      Australia 1027
## 6: United Kingdom  758

Largest Order per Region

# .SD returns the subset of rows and is filtered by which.max
kable(sales[,.SD[which.max(Value)],by=Region])
Region OrderDate Territory Code Value Volume
United States 2006-09-01 Southwest US 1158625 36
Canada 2007-12-01 Canada CA 803033 37
Germany 2007-08-01 Germany DE 297738 16
France 2007-08-01 France FR 803118 25
Australia 2007-09-01 Australia AU 311014 21
United Kingdom 2007-09-01 United Kingdom GB 561617 15

Totals by year

# ORDER, multiple aggregations, and aliasing shown
kable(sales[order(OrderDate),list(Value=sum(Value),Volume=sum(Volume))
            ,by=list(Year=year(OrderDate))])
Year Value Volume
2005 12693251 1379
2006 34463848 3692
2007 47171490 12443
2008 28832019 12975

ggplot2

Basics / Prep

# Activate ggplot2
library(ggplot2)

# Summarise sales data
sales<-sales[,list(Value=sum(Value),Volume=sum(Volume)),
             by=list(Region,Year=year(OrderDate))]

# Quick plot (qplot)
qplot(Year, Value, fill=Region, data=sales,
      facets=Region~. ,geom="bar", stat="identity")

plot of chunk unnamed-chunk-9

Simple anatomy

# Chart backbone
ggplot(data=sales,aes(x=Year,y=Value,fill=Region))+
  #bar chart type, identity uses contents not count
  geom_bar(stat="identity")

plot of chunk unnamed-chunk-10

Modifying geom_bar

ggplot(data=sales,aes(x=Year,y=Value,fill=Region))+
  #use the position argument
  geom_bar(stat="identity", position="dodge")

plot of chunk unnamed-chunk-11

Make a trellis chart

ggplot(data=sales,aes(x=Year,y=Value,fill=Region))+
  geom_bar(stat="identity")+
  # Trellis command
  facet_grid(Region~.)

plot of chunk unnamed-chunk-12

Formatting

library(scales)
ggplot(data=sales,aes(x=Year,y=Value/1000000))+
  geom_bar(stat="identity", fill="grey")+
  facet_grid(Region~.)+
  theme_minimal() +
  theme(legend.position="none")+
  scale_y_continuous(label=dollar)+
  ylab("Value ($m)")

plot of chunk unnamed-chunk-13

Interactive documents / reports

  • Use knitr with LaTeX or markdown to produce static documents
  • Use shiny to make properly interactive reports

Shiny overview

A web application framework for R Turn your analyses into interactive web applications No HTML, CSS, or JavaScript knowledge required

Uses magic and pixie dust to transform your R code into a webpage that provides an interactive version

Shiny paradigm

It's call and response!

  • You have client-side controllers that tell server-side code to recalculate content
  • There are R functions for input controls, preparing content, and functions that output content.

Shiny Server

You need a linux server for this and your applications can be made internally or externally available.

Shiny is by default open, use either a password system or utilise AD integration in the pro edition

Demo time!

library(shiny)
system.file("examples", package="shiny")
runExample("03_reactivity") # a reactive expression

Dynamic documents

CONCEPT: modular documents that contain text and analysis that update

This is pretty much impossible in the Office paradigm.

  • You can make refreshable Excel spreadsheets but stuff with words as well is hard to update
  • It's really hard to collaborate on a single Word document
  • PowerPoint requires a lot of clicking and C&Ping

R integrates with LaTeX, markdown, and pandoc to enable the blending of code and content to enable documents to be produced from scratch each time.

What are the benefits?

  • Change a bit of the code and the whole document updates
  • Modular development enables easier development (particularly in a collaborative setting)
  • Reduction in errors caused by C&Ping
  • Consistent looks and feel of documents and reports
  • Documented as you go
  • No stress about updating a document
  • Variety of export formats

How's it done?

  • You can write LaTeX and markdown (think wikipedia and SO) outside of R.
  • There are packages like knitr designed to allow you to knit R into a variety of other languages.
  • Work in Rstudio to benefit from their full itnegration of these products

Demo time!

Find out more

Q&A

Conclusion