DAS Department Meeting: Introduction to R for Data Science

Verena Haunschmid
16.02.2017

Content

  • Short overview: Regression courses on Coursera
  • Introduction to Data Science with R
    • Data handling with dplyr
    • Data visualisation with ggplot2
    • Model training and evaluation with caret
    • Presentation of results
  • Getting started with R

Online regression courses

Data Science with R

  • Base R can be extended with many great packages (often loosely called libraries)
  • Together they provide the tools for all important data science tasks
    • Import & tidy (=clean) data
    • Transform, visualise and model data
    • Communicate the results
  • This presentation will focus on the highlighted tasks!

Importing data

  • There are tools for reading data from almost any source …
    • read.xlsx(), read.csv(), …
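  • For accessing databases, dplyr is currently the most useful option
library(dplyr)  # provides %>%, tbl(), select(), top_n() and collect()

# Connect to the AdventureWorks2012 database
conn <- RSQLServer::src_sqlserver("tauranga", database = "AdventureWorks2012")

# Lazy references to database views; nothing is downloaded yet
emp_data <- conn %>% tbl("vEmployee")

dept_data <- conn %>% tbl("vEmployeeDepartment") %>%
  select(BusinessEntityID, Department, GroupName, StartDate)

# collect() runs the query and returns the result as a local tibble
dept_data %>% top_n(5) %>% collect()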
# A tibble: 5 × 4
  BusinessEntityID Department                            GroupName
*            <int>      <chr>                                <chr>
1              234  Executive Executive General and Administration
2              286      Sales                  Sales and Marketing
3              288      Sales                  Sales and Marketing
4              285      Sales                  Sales and Marketing
5              284      Sales                  Sales and Marketing
# ... with 1 more variables: StartDate <chr>

Importing data

jobs_per_region <- emp_data %>%
  left_join(dept_data, by = "BusinessEntityID") %>%
  group_by(Department, CountryRegionName) %>%
  summarise(count = n()) %>%
  collect()

jobs_per_region
Source: local data frame [21 x 3]
Groups: Department [16]

                   Department CountryRegionName count
*                       <chr>             <chr> <int>
1                       Sales         Australia     1
2                       Sales            Canada     2
3                       Sales            France     1
4                       Sales           Germany     1
5                       Sales    United Kingdom     1
6            Document Control     United States     5
7                 Engineering     United States     6
8                   Executive     United States     2
9  Facilities and Maintenance     United States     7
10                    Finance     United States    10
# ... with 11 more rows
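
The same join, group and count pipeline also works on ordinary local data frames, which makes it easy to try without a database connection. The sketch below uses two small made-up tables as stand-ins for the database views (their contents are hypothetical, not AdventureWorks data) and assumes dplyr 1.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical stand-ins for the two database views used above
emp_data <- tibble(
  BusinessEntityID  = 1:6,
  CountryRegionName = c("Canada", "Canada", "France",
                        "United States", "United States", "United States")
)
dept_data <- tibble(
  BusinessEntityID = 1:6,
  Department       = c("Sales", "Sales", "Sales",
                       "Engineering", "Engineering", "Finance")
)

# Join on the shared key, then count employees per department and country
jobs_per_region <- emp_data %>%
  left_join(dept_data, by = "BusinessEntityID") %>%
  group_by(Department, CountryRegionName) %>%
  summarise(count = n(), .groups = "drop")

jobs_per_region
```

No `collect()` is needed here because the data is already in memory; with a database source the same verbs are translated to SQL and only the final result is transferred.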

Visualising data

Recommended package: ggplot2 (grammar of graphics)

ggplot(house_prices, aes(x=YearBuilt, y=SalePrice, col=OverallQual)) + geom_point()

[Figure: scatter plot of SalePrice against YearBuilt, coloured by OverallQual]
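
The house_prices data is not included with these slides, so here is a self-contained sketch of the same kind of plot. It assumes the built-in mtcars dataset as a substitute; the variable choices and axis labels are illustrative only.

```r
library(ggplot2)

# mtcars ships with R and stands in for house_prices here
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                               # one point per car
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders")

print(p)
```

The pattern is always the same: the data and the aesthetic mapping go into `ggplot()`, and layers such as `geom_point()` are added with `+`.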

ggplot2 extensions

There are many great extensions for ggplot2, e.g. ggTimeSeries and gganimate. I frequently use ggpairs() from the GGally package:

ggpairs(house_prices[,c("YearBuilt", "OverallQual", "SalePrice")])

[Figure: pairs plot of YearBuilt, OverallQual and SalePrice]

Model training

caret package: Classification and Regression Training

Tools for

  • data splitting
  • preprocessing
  • feature selection
  • model tuning
  • cross validation

233 model classes are available for use with the caret framework, e.g.:

  • AdaBoost
  • linear models, generalized linear models
  • naive Bayes
  • partial least squares
  • decision trees, random forests, …

Model training: An example

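library(caret)
# 10 cross-validation folds; returnTrain = TRUE returns the training rows of each fold
cvSplits <- createFolds(house_prices$SalePrice, k = 10, returnTrain = TRUE)

lmFit <- train(SalePrice ~ YearBuilt + OverallQual, house_prices, method = "lm",
               trControl = trainControl(index = cvSplits))
# Predictions on the training data (in-sample), plotted against the observed prices
pred <- predict(lmFit)
ggplot(data.frame(y = house_prices$SalePrice, pred = pred), aes(x = pred, y = y)) + geom_point()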

[Figure: observed SalePrice against predicted SalePrice]
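
Calling predict() on the training data shows the in-sample fit only. The cross-validated error is stored in the fitted object, and the following sketch shows one way to read it. It assumes mtcars as a stand-in for house_prices, 5 folds instead of 10 because the dataset is small, and `method = "cv"` in trainControl() so that the printed summary is labelled as cross-validation.

```r
library(caret)

set.seed(42)  # fold assignment is random

# returnTrain = TRUE gives the training row indices of each fold,
# which is what trainControl(index = ...) expects
cv_splits <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)

lm_fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
                trControl = trainControl(method = "cv", index = cv_splits))

# Performance averaged over the held-out folds
lm_fit$results[, c("RMSE", "Rsquared")]

# One row of metrics per fold
lm_fit$resample
```

Comparing the spread of the per-fold values in `lm_fit$resample` gives a rough idea of how stable the model is across different subsets of the data.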

Presentation of results

  • rmarkdown reports
    • Hint: this presentation was created in RStudio!
    • LaTeX can be used for equations
    • CSS can be used for styling
  • R Notebooks
    • rmarkdown syntax
  • shiny apps: interactive web applications
    • written in R
    • can be extended with HTML, CSS, JavaScript, …
    • several options for hosting
    • gallery of shiny apps
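
To give an idea of what "written in R" means for a shiny app, here is a minimal sketch with one slider and one plot. The app itself is illustrative and not taken from this presentation; it uses the built-in faithful dataset, and the input and output names ("bins", "hist") are arbitrary.

```r
library(shiny)

# User interface: a slider and a placeholder for the plot
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

app <- shinyApp(ui = ui, server = server)
# Running the app opens it in a browser; uncomment for interactive use:
# runApp(app)
```

The UI and the server are plain R objects, so the whole app fits into a single file that can be run from RStudio.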

Summary

  • Loading data: many options
    • Loading data from databases: dplyr
    • Great flexibility w.r.t. querying data frames
  • Visualising data: ggplot2
    • steep learning curve, but very powerful
  • Training models: caret
  • Reporting results: rmarkdown, R notebooks, shiny

Getting started with R

Thank you for your attention! Questions? Remarks?