Jason Freels
16 December 2016
How to install the R toolchain for data science
The R project for statistical computing
The RStudio integrated development environment (IDE)
The Rtools suite of compilers (optional)
Required for building R packages containing compiled code
Required for installing R packages from GitHub containing compiled code
Git/GitHub (optional)
Easily collaborate on R projects with other programmers
Seamlessly integrates with RStudio
The basics of R
What R is and how R differs from many other languages
Advantages and disadvantages of using R
Creating and manipulating R objects
Accessing help resources
Literate programming with R Markdown
code, text, images, and
An open-source, interpreted computer language and suite of statistical operators for calculations on vectors and matrices
An integrated collection of tools and graphical facilities for data analysis and reproducible research
A well-developed programming language which includes conditionals, loops, user-defined recursive functions, and input-output facilities
One of the fastest growing technical programming languages in the world
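A minimal sketch of the features listed above: vectorized arithmetic, a matrix product, a loop with a conditional, and a user-defined recursive function. The names heights, m, and fact are illustrative only and do not come from the original slides.

```r
# Vectorized arithmetic: the operation applies to every element at once
heights <- c(1.60, 1.75, 1.82)
heights_cm <- heights * 100

# Matrix algebra with the %*% operator
m <- matrix(1:4, nrow = 2)   # values are filled column by column
m_squared <- m %*% m

# A loop with a conditional
n_tall <- 0
for (h in heights) {
  if (h > 1.70) n_tall <- n_tall + 1
}

# A user-defined recursive function
fact <- function(n) if (n <= 1) 1 else n * fact(n - 1)

print(heights_cm)
print(m_squared)
print(n_tall)
print(fact(5))
```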
The S programming language
Computing language originally developed at Bell Labs (1976) for data analysis
Licensed by AT&T/Lucent to Insightful Corp. under the product name: S-Plus.
The R programming language
Written by two statisticians, Ross Ihaka and Robert Gentleman, at the University of Auckland
Released as an open-source implementation of the S language (the name "R" plays on the name "S")
Since 1997: international R-core team (~15 people) and thousands of code writers and statisticians share their work via R packages
R packages work like apps for smartphones, extending what R can do
As of December 2016, 9,683 packages are available on the Comprehensive R Archive Network (CRAN)
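As a rough illustration of how packages are used, the sketch below lists the locally installed packages, loads one, and calls a function through the pkg::fun form. It uses the stats package because it ships with R; the commented-out install line uses "ggplot2" purely as an example of a CRAN package name.

```r
# List the packages installed in the local library
installed <- rownames(installed.packages())
print(head(installed))

# Load an installed package so its functions can be called directly;
# "stats" ships with R, so no internet connection is needed
library(stats)
middle <- median(c(1, 3, 5))

# Call a function without attaching its package, using pkg::fun
middle_again <- stats::median(c(1, 3, 5))

# Installing from CRAN needs a network connection, so it is commented out
# install.packages("ggplot2")
```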
Methodical updates (annually/bi-annually)
Consistent syntax/user experience across all functions
Single learning curve to becoming proficient
Strategic improvements, carefully implemented
Won't alienate legacy users with drastic changes
Newest methods may not be available for a while
Fast updates - newest capabilities are released every day
Strategic improvements to R-Core
Package authors are independent - proceed in their own directions
Flexible syntax - often NOT consistent
Multiple learning curves for different packages
Legacy users often frustrated with fast-paced changes
Only open-source language with a standardized set of development tools to extend the environment
Gives full access to the programming environment, easily see what functions are doing, fix bugs, and extend software
Promotes reproducible research by providing open and accessible tools for data analysis and reporting
Provides a forum allowing researchers to explore and expand the methods used to analyze data
Scientists around the world are the co-owners of the software tools needed to carry out research
The product of thousands of experts in many fields - R is CUTTING EDGE
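The point above about seeing what functions are doing can be tried directly. A small sketch, using the built-in sd function as an arbitrary example: printing a function shows its definition, and its arguments and body can be inspected as ordinary objects. Functions implemented in compiled code show only a short stub instead of R source.

```r
# Typing a function's name without parentheses prints its definition
print(sd)

# The pieces of a function are ordinary objects that can be inspected
sd_args <- names(formals(sd))   # argument names
sd_body <- deparse(body(sd))    # source code as character strings
print(sd_args)
print(sd_body)
```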
Many perceive the R learning curve to be steep, with a minimal GUI
Addressed by RStudio
In reality, R's learning curve is no steeper than that of any other language
R has a
Little commercial support; new users complain of hostility on R help sites
Figuring out correct methods to use a function on your own can be frustrating
Working with large datasets is limited by RAM
Data prep can be messier and more mistake prone in R vs. SPSS or SAS
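To get a feel for the RAM limitation mentioned above, the sketch below measures how much memory one object occupies. The vector x is a made-up example; object.size() reports an estimate of the memory used by a single object.

```r
x <- numeric(1e6)                 # one million double-precision values
size_bytes <- object.size(x)
print(size_bytes, units = "Mb")   # 8 bytes per value plus a small header
rm(x)                             # drop the object when it is no longer needed
```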
R is a fast language that has been
Recall, R was written by statisticians - for statisticians
Many languages are designed to produce numerical results as fast as possible
Compiled languages like C, C++, and FORTRAN can produce numerical results extremely fast (about 30-60x faster than R)
However, compiled code can be difficult for statisticians and non-developers to work with
R was built to help non-developers produce results and high quality graphics fast
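One way to see the gap between interpreted and compiled code from inside R is to compare an explicit loop with a built-in function that runs in compiled code. This is only a sketch: loop_sum is a hypothetical helper, and the timings depend on the machine and the R version, so no particular speed-up is claimed.

```r
x <- as.numeric(1:100000)

# Explicit loop, evaluated step by step by the R interpreter
loop_sum <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value
  }
  total
}

total_loop <- loop_sum(x)
total_builtin <- sum(x)   # built-in sum() runs in compiled code

print(total_loop == total_builtin)

# system.time() reports elapsed time; results vary by machine
print(system.time(loop_sum(x)))
print(system.time(sum(x)))
```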
R was initially created as a functional programming language but has since taken on attributes of an object-oriented programming language
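A short sketch of both styles, assuming R's S3 system as the object-oriented example (R has other object systems as well). The names describe, shape, and the class "circle" are invented for illustration.

```r
# Functional style: functions are objects that can be passed to other functions
squares <- sapply(1:3, function(i) i^2)

# Object-oriented style (S3): a generic function dispatches on an object's class
describe <- function(obj) UseMethod("describe")
describe.default <- function(obj) "a generic object"
describe.circle <- function(obj) paste("a circle of radius", obj$radius)

shape <- structure(list(radius = 2), class = "circle")

print(squares)
print(describe(shape))   # uses describe.circle
print(describe(42))      # falls back to describe.default
```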
To understand computations in R, remember these five things
Functions, vectors, datasets, character strings... are stored as objects
Graphics are written out and are NOT stored as objects
Objects are classified by two criteria:
MODE: how objects are stored in R - character, numeric, logical, list, & function
CLASS: how objects are treated by functions - vector, list, matrix, array, factor, data frame & hundreds of other classes created by specific functions
Scripting is the process of matching objects and function calls such that the desired output objects (e.g., statistical results) and graphics are returned.
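The difference between mode and class is easiest to see side by side. In this sketch the objects values, grades, and survey are made up; note that a factor is stored as numeric codes and a data frame is stored as a list, while their classes tell functions how to treat them.

```r
values <- c(1.5, 2.5)
grades <- factor(c("a", "b", "a"))
survey <- data.frame(id = 1:2, score = c(90, 85))

# Collect the mode (storage) and class (treatment) of each object
objects <- list(values = values, grades = grades, survey = survey, mean = mean)
modes   <- sapply(objects, mode)
classes <- sapply(objects, function(obj) class(obj)[1])

print(data.frame(mode = modes, class = classes))
```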
R has five operators for assigning names to objects
Left assignment <-
Left deep assignment <<-
Right assignment ->
Right deep assignment ->>
Equals sign =
In most cases R treats each of these assignment methods the same; however, the R style guide provides a best practice to ensure consistency
Use <- and -> to define objects (e.g., x <- 4)
Use = to assign values to function arguments (e.g., sqrt(x = 4))
The use of <<- and ->> should be minimized, as it can have unintended side effects (if you don't know what it does, don't use it)
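The guidelines above can be sketched as follows. The counter example is a hypothetical illustration of what deep assignment does: <<- changes a variable in an enclosing environment instead of creating a local one, which is why it can cause surprises.

```r
x <- 4                # "<-" defines an object
root <- sqrt(x = 9)   # "=" inside a call names an argument; x itself is unchanged

# Deep assignment modifies a variable outside the function that uses it
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates "count" in the enclosing function
    count
  }
}

next_id <- make_counter()
first  <- next_id()
second <- next_id()

print(x)
print(root)
print(c(first, second))
```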
Object naming rules
Variable names can contain only letters, numbers, "." and "_"
Variable names CANNOT begin with a number or "_"
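A quick way to check a name against these rules is to compare it with the output of make.names(), which rewrites invalid names into valid ones. The candidate names below are invented for illustration.

```r
candidates <- c("meaning.of_life", "Var2", "2fast", "_hidden", "my var")

# make.names() leaves valid names unchanged and repairs invalid ones
repaired <- make.names(candidates)
is_valid <- repaired == candidates

print(data.frame(candidates, repaired, is_valid))
```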
var <- 5 ### Left assignment
200 -> Var ### Right assignment
meaning.of_life = 42 ### Or equal sign
var; Var; meaning.of_life
[1] 5
[1] 200
[1] 42
mode(object.name) ### Returns an object's mode
class(object.name) ### Returns an object's class
attributes(object.name) ### Returns any attributes associated with an object
str(object.name) ### Displays a compact summary of an object's structure
Errors can result if you accidentally overwrite an object in the current environment
These functions are helpful for managing objects defined in the current working environment
ls( ) ### Returns a list of active objects in the current working environment
rm( ) ### Removes an object from the current working environment
rm(list = ls()) ### Removes all objects from the current working environment (use carefully)
Errors can result if a script, saved in one file location, calls a file in another location
These functions are helpful for managing your local file structure
Additionally, creating projects in RStudio can help avoid these errors
getwd( ) ### Returns the location of the current working directory
setwd("C:/Users/Desktop") ### Reassigns the location of the working directory
file.choose( ) ### Opens a new file explorer window to choose a file
?cos ### Opens the help page for "cos" - also help(cos)
??xyz ### Searches the installed documentation for "xyz" - also help.search("xyz")
vignette() ### Lists the vignettes (long-form guides) available in installed packages
help.search("t.test") ### Provides a categorized search of R documentation for "t.test"
Quick R http://www.statmethods.net/
R-bloggers http://www.r-bloggers.com/
StackOverflow http://stackoverflow.com/questions/tagged/r/