Verena Haunschmid
16.02.2017
dplyr, ggplot2, caret
Advanced Linear Models for Data Science 1: Least Squares
read.xlsx(), read.csv(), …
dplyr for accessing databases
conn <- RSQLServer::src_sqlserver("tauranga", database = "AdventureWorks2012")
emp_data <- conn %>% tbl("vEmployee")
dept_data <- conn %>% tbl("vEmployeeDepartment") %>%
  select(BusinessEntityID, Department, GroupName, StartDate)
dept_data %>% top_n(5) %>% collect()
# A tibble: 5 × 4
  BusinessEntityID Department GroupName
*            <int> <chr>      <chr>
1              234 Executive  Executive General and Administration
2              286 Sales      Sales and Marketing
3              288 Sales      Sales and Marketing
4              285 Sales      Sales and Marketing
5              284 Sales      Sales and Marketing
# ... with 1 more variables: StartDate <chr>
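The SQL Server connection above is not available to readers, so here is a minimal sketch of the same lazy workflow against an in-memory SQLite database. It assumes the DBI, RSQLite and dbplyr packages are installed; the table name cars_tbl and the use of the built-in mtcars data are illustrative stand-ins, not part of the talk.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed.
# mtcars and the table name "cars_tbl" are stand-ins for the real database tables.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, name = "cars_tbl")

# tbl() only creates a reference to the remote table; no rows are fetched yet.
cars_db <- tbl(con, "cars_tbl")

# The pipeline is translated to SQL and executed in the database;
# collect() pulls the result into a local tibble.
four_cyl <- cars_db %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt) %>%
  collect()

DBI::dbDisconnect(con)
```

Calling show_query() on the pipeline before collect() should print the generated SQL, which is a handy way to check what is sent to the database.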
jobs_per_region <- emp_data %>%
  left_join(dept_data, by = "BusinessEntityID") %>%
  group_by(Department, CountryRegionName) %>%
  summarise(count = n()) %>%
  collect()
jobs_per_region
Source: local data frame [21 x 3]
Groups: Department [16]
   Department                 CountryRegionName count
*  <chr>                      <chr>             <int>
1  Sales                      Australia             1
2  Sales                      Canada                2
3  Sales                      France                1
4  Sales                      Germany               1
5  Sales                      United Kingdom        1
6  Document Control           United States         5
7  Engineering                United States         6
8  Executive                  United States         2
9  Facilities and Maintenance United States         7
10 Finance                    United States        10
# ... with 11 more rows
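The join-then-count pattern above works the same way on local data frames. The sketch below uses two small hypothetical tables (emp and dept, with made-up rows) in place of vEmployee and vEmployeeDepartment, so it runs without a database.

```r
library(dplyr)

# Hypothetical toy data standing in for vEmployee and vEmployeeDepartment.
emp <- data.frame(
  BusinessEntityID = c(1, 2, 3, 4),
  CountryRegionName = c("Canada", "Canada", "France", "United States")
)
dept <- data.frame(
  BusinessEntityID = c(1, 2, 3, 4),
  Department = c("Sales", "Sales", "Sales", "Finance")
)

# Join on the shared key, then count employees per department and country.
toy_jobs_per_region <- emp %>%
  left_join(dept, by = "BusinessEntityID") %>%
  group_by(Department, CountryRegionName) %>%
  summarise(count = n(), .groups = "drop")
```

The .groups = "drop" argument (available in recent dplyr versions) returns an ungrouped result, which avoids surprises in later steps.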
Recommended package: ggplot2 (grammar of graphics)
ggplot(house_prices, aes(x=YearBuilt, y=SalePrice, col=OverallQual)) + geom_point()
Many great extensions exist for ggplot2, e.g. ggTimeSeries and gganimate. I frequently use ggpairs (from the GGally package):
ggpairs(house_prices[,c("YearBuilt", "OverallQual", "SalePrice")])
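Since house_prices is not bundled with R, the following sketch shows the same grammar-of-graphics pattern (data, aesthetic mapping, geom layer) on the built-in mtcars data set. The choice of columns is purely illustrative.

```r
library(ggplot2)

# mtcars is a stand-in for house_prices; the column choice is illustrative.
# aes() maps columns to visual properties, geom_point() draws the scatter layer.
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", col = "Cylinders")
```

The plot is only built here; print(p), or typing p at the console, renders it.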
caret package: Classification and Regression Training
Tools for
233 model classes available for usage with the caret framework, e.g.:
cvSplits <- createFolds(house_prices$SalePrice, k = 10, returnTrain = TRUE)
lmFit <- train(SalePrice ~ YearBuilt + OverallQual, house_prices, method = "lm",
               trControl = trainControl(index = cvSplits))
pred <- predict(lmFit)
ggplot(data.frame(y = house_prices$SalePrice, pred = pred), aes(x = pred, y = y)) + geom_point()
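A self-contained variant of the cross-validation workflow above, sketched on the built-in mtcars data because house_prices is not included here. The formula mpg ~ wt + hp, the 5 folds and the seed are arbitrary choices for illustration.

```r
library(caret)

set.seed(42)  # fold assignment is random, so fix the seed for repeatability

# returnTrain = TRUE yields the training-row indices of each fold,
# which is the form trainControl(index = ...) expects.
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)

fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             trControl = trainControl(method = "cv", index = folds))

cv_rmse <- fit$results$RMSE             # RMSE averaged over the held-out folds
pred <- predict(fit, newdata = mtcars)  # predictions from the final model
```

Swapping method = "lm" for another supported model name should leave the rest of the code unchanged, which is the main convenience of the caret interface.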
rmarkdown reports
shiny apps: interactive web applications
Loading data: many options, e.g. dplyr for databases
Visualising data: ggplot2
Training models: caret
Reporting results: rmarkdown, R notebooks, shiny
Thank you for your attention! Questions? Remarks?