Instructor Details

Matthew Dixon

Email: mdixon7@stuart.iit.edu

TA: Bo Wang

Email: bwang54@hawk.iit.edu

Feedback

On completion of this first session, please complete the following anonymous survey. Your detailed feedback helps us to improve your bootcamp experience and is taken very seriously. Thanks!

Introduction

In this bootcamp, we will be using the R language heavily in the notes, examples and lab exercises. R is free and you can install it like any other program on your computer.

  1. Go to the CRAN website and download it for your Mac or PC.

  2. Install the free version of the RStudio Desktop Software.

RStudio makes it very easy to learn and use R, providing a number of useful features that many find indispensable.

Objectives for this session

  • Learn the RStudio interface.
  • Learn how to navigate the Help tab.
  • Learn what packages are and how to install them (with examples in financial and marketing analytics).
  • Learn how to manage the workspace, write and save R scripts.
  • Learn how to import, export, and view text files.
  • Learn how to summarize and visualize data.
  • Navigate additional continued learning resources.

In the next session, we will learn how to program in R.

About the R language and eco-system

R is both a programming language and an environment for statistical computing. R has grown to somewhere between 1M and 2M users worldwide and is used in academia and in industry. It can be used for extracting data from tens of thousands of different online data sources, exploratory data analysis, statistical modeling, reporting and interactive data visualization. You could even build a mobile app and website for powerful data analysis and visualization with R.

There are R packages for sentiment analysis of social media, cryptocurrency market analysis, high-frequency trading data analysis and more. There are even R packages that will run Stata code and read SPSS datasets. There are also vibrant academic and industry research communities publishing research using R, such as R in Finance in Chicago. If you are interested in contributing to the R eco-system as a developer and researcher, you can apply to become a Google Summer of Code paid intern with me this summer.

Using R

If you are used to traditional computing languages, you will find R different in many ways. The basic ideas behind R date back four decades and have a strong flavor of exploration: one can grapple with data, understand its structure, visualize it, summarize it, and so on. A common way to use R is therefore to type a command and immediately see the results. (Of course, scripts can also be written and fed to R for batch execution.)

The core of R itself is reasonably small, but over time, it has also become a vehicle for researchers to disseminate new tools and methodologies via packages. That is one reason for R’s popularity: there are thousands of packages (10,300+ as of this writing) that extend R in many useful ways.

The CRAN website is something you will consult frequently for the software, the documentation, and the packages others have developed.

RStudio

We can only cover some important aspects of RStudio here today. There are a number of resources online, including YouTube videos, DataCamp and many others, that you can consult outside of class. Please see also the list of resources at the bottom of these notes for further links.

When you start RStudio, your screen should look something like:

One can type commands directly into the console window and see results. For example, go ahead and type 1+1 to use R as a calculator and see the result. However, one often wants to write a sequence of commands, execute them and possibly save the commands to run them again another time. That’s what the editor window is for. You can type a series of commands into the editor window and RStudio will offer to save them when you quit, and bring them back when you restart RStudio.

If you type

x = 1 + 1
y = 2 * x
z <- (x + y)

into the editor window, you can press the Run arrow shown and execute each line in the R console, one by one. The figure below shows this and as new variables are created, the workspace panel displays them.

Should I use = or <- for assignment?

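In R, both = and <- can be used to assign a value to a variable. The <- form is generally preferred: = is also used to name arguments in function calls, so f(x = 1) passes an argument named x, whereas f(x <- 1) assigns 1 to x and then passes it.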
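A minimal sketch of that difference, using base R only. The function sum_sq and the variable names are made up for illustration:

```r
# Hypothetical helper used only for this illustration
sum_sq <- function(v) sum(v^2)

a <- sum_sq(v = 1:3)             # '=' names the argument v; no variable v is created
v_after_equals <- exists("v")

b <- sum_sq(v <- 1:3)            # '<-' assigns v in the workspace, then passes it
v_after_arrow <- exists("v")

print(c(a, b))                   # both calls return the same value
print(c(v_after_equals, v_after_arrow))
```

At the top level of a script, x = 1 and x <- 1 do the same thing; the distinction only matters inside function calls.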

Help

Help is available in RStudio through the help tab that you should feel free to investigate.

Getting Help

R has extensive documentation.

Type help(<item>) or ?<item> (where <item> is a placeholder) in the Console window to bring up a help page. Results will appear in the Help tab in the lower-right window. Certain items need quotation marks, such as help("+"). Try the following:

# Example

help("base") 

# or

?base

# or

help("+")

Packages

Packages are collections of additional functions that can be loaded on demand. They commonly include example data that can be used to demonstrate those functions. Although R comes with many common statistical functions and models, most of our work requires additional packages.

Installing Packages

When anyone installs R, there is a set of recommended packages that is always installed. So your installed packages will reflect that. As we proceed, you will have to install many packages and that list will, of course, grow.

There are worldwide R package repositories, the Comprehensive R Archive Network (CRAN) sites, that allow packages to be downloaded and installed. You almost never have to work with them directly, since RStudio makes it easy to install packages as shown in the figure below, where we have clicked on the Packages tab and then on the Install button. Note how, as you type the name of a package, you get auto-completion. (In fact, RStudio provides auto-completion even as you type R commands, showing you various options you can use for the commands!) You can also install a package from the console by typing install.packages("<package>"), where <package> is a placeholder for the package name; you must then load it with library("<package>") before using it.
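A common pattern is to install a package only when it is missing and then load it. The sketch below is one way to do that; use_package is a hypothetical helper written for these notes, not a function provided by R or RStudio:

```r
# Hypothetical helper: install a package only if it is not yet installed,
# then attach it so its functions can be called directly.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

use_package("stats")   # ships with R, so nothing is downloaded here
```

Calling use_package("ggplot2") would download ggplot2 from CRAN the first time and simply load it on later runs.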

Let’s view some of the financial and marketing analytics capabilities available through various packages. Do not concern yourself with the R syntax; we will be covering most of this next session.

Example 1: Viewing crypto-currency history

Quandl is a source of free and premium financial and economic data, including financial market data, housing indices, federal reserve rates and company fundamentals. We can download data via the Quandl RESTful API service and access it directly in R. Let’s collect the history of the Global Digital Asset Exchange (GDAX) rate between Ethereum (a crypto-currency) and the Euro. We will plot the history using the ggplot2 package.

install.packages("Quandl")
install.packages("ggplot2")
if(!require("Quandl"))
  stop('you need to install Quandl first by typing install.packages("Quandl")')
if(!require("ggplot2"))
  stop('you need to install ggplot2 first by typing install.packages("ggplot2")')
mydata <- Quandl("GDAX/ETH_EUR", start_date="2017-7-02")
ggplot(mydata, aes(Date, Open)) + geom_line()  + xlab("") + ylab("Open Price")

Example 2: Viewing a Brand attribute correlation matrix

Let’s load the results of a market survey of various fictitious consumer brands. Each brand is scored across various attributes (or categories) on a scale of 1-10. Let’s view the correlation matrix of brand attribute scores across all brands. The package corrplot has conveniently grouped together the most similar brand attributes by using a hierarchical clustering algorithm.

install.packages("corrplot")
if(!require("corrplot"))
  stop('you need to install corrplot first by typing install.packages("corrplot")')

brand.ratings<-read.csv('data/brand_ratings.csv')
brand.sc <- brand.ratings
brand.sc[, 1:9]<-scale(brand.sc[, 1:9])
corrplot(cor(brand.sc[, 1:9]), order="hclust")
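The two numerical steps in this example, standardizing the columns and computing their correlation matrix, can be tried without the survey file. The sketch below uses three columns of the built-in mtcars dataset as a stand-in for the brand ratings (an assumption made purely for illustration):

```r
# Stand-in data: three numeric columns from the built-in mtcars dataset
scores <- mtcars[, c("mpg", "hp", "wt")]

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled_scores <- scale(scores)

# cor() returns the matrix of pairwise correlations between columns
cor_matrix <- cor(scaled_scores)
print(round(cor_matrix, 2))
```

Correlations do not change when columns are standardized, so cor(scores) gives the same matrix; scaling matters for other analyses that compare attributes on a common scale.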

Example 3: Text Analysis of Social Media

R can access Twitter and provides the capability to perform textual data analysis. For fun, let’s analyze President Donald Trump’s tweets and display the most frequently used words in a word cloud. You could, of course, analyze any Twitter account and perform a financial or marketing study. For example, do the president’s tweets affect the stock market? What impact do Twitter ads have on consumer behavior?

install.packages("wordcloud",dependencies=TRUE)
if(!require("wordcloud"))
  stop('you need to install wordcloud first by typing install.packages("wordcloud")')

tweets<-read.csv('data/tweets.csv',header =TRUE,quote= "")
words<-unlist(strsplit(as.character(tweets$text), " "))
words<-tolower(words)
# drop URLs, usernames, hashtags, umlauts (not displayable in all fonts) and special characters;
# !grepl() is used because words[-grep(...)] would return nothing if no word matched
clean_words<-words[!grepl("http|@|#|ü|ä|ö|&amp", words)]
wordcloud(clean_words, min.freq=50, vfont=c("gothic italian", "plain"))
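A word cloud sizes each word by how often it occurs. The counting step can be sketched in base R on a small made-up vector (the sample text below is illustrative and is not taken from the tweets file):

```r
# Two made-up "tweets" used only for illustration
sample_text <- c("great jobs great economy",
                 "jobs jobs https://t.co/x @someone")

# Split into lower-case words, then drop URLs, usernames and hashtags
words <- tolower(unlist(strsplit(sample_text, " ")))
clean_words <- words[!grepl("http|@|#", words)]

# Count each word and sort from most to least frequent
word_counts <- sort(table(clean_words), decreasing = TRUE)
print(word_counts)
```

The min.freq argument used above then simply hides words whose count falls below the threshold.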

Workspace

As you use RStudio more, you will find yourself creating variables (like x, y, z above, except far more valuable) and it is desirable to save them. When you quit RStudio, you will be given a choice of saving your workspace. It is worth doing so if you have important things created.

RStudio also has a notion of projects, so you can keep project workspaces separate. Each project can be given its own working folder so that x from one workspace does not clobber x from another. You can explore these options via the File menu.

Later, we will see facilities to selectively save and restore some specified objects in our workspace, but not all of them.
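As a preview, here is a minimal sketch using the base R functions save() and load(). The file is written to a temporary location chosen for illustration; in practice you would give your own path:

```r
x <- 1 + 1
y <- 2 * x
z <- x + y

# Save only x and z (not y) to a file
saved_file <- tempfile(fileext = ".RData")
save(x, z, file = saved_file)

# Remove them from the workspace, then restore them from the file
rm(x, z)
load(saved_file)
print(c(x, z))
```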

Writing Scripts

RStudio’s Source tabs serve as a built-in text editor. Before R functions are executed at the Console, commands are typically written down (or scripted). Scripting is essentially showing your work: the sequence of functions necessary to complete a task is scripted in order to document or automate that task. While scripting may seem cumbersome at first, it saves time in the long run, particularly for repetitive tasks. Benefits include:

  • allows others to reproduce your work, which is the foundation of science
  • serves as instruction/reminder on how to perform a task
  • allows rapid iteration, which saves time and allows the evaluation of incremental changes
  • reduces the chance of human error

Basic Tips of Scripting

To write a script, simply open a new R script file by clicking File>New File>R Script. Within the text editor type out a sequence of functions.

  • Place each function (e.g. print('R Bootcamp')) on a separate line.
  • If a function has a long list of arguments, place each argument on a separate line.
  • A command can be executed from the text editor by placing the cursor on a line and typing Ctrl + Enter, or by clicking the Run button.
  • An entire R script file can be executed by clicking the Source button.
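A short script that follows these tips might look like the sketch below. The variable names and numbers are made up for illustration:

```r
# Each function call sits on its own line
print('R Bootcamp')

# A call with several arguments, one argument per line
monthly_returns <- seq(
  from = -0.02,
  to = 0.04,
  length.out = 7
)

# Later lines can use objects created by earlier ones
average_return <- mean(monthly_returns)
print(average_return)
```

Pressing Ctrl + Enter on the first line of the seq() call runs the whole multi-line statement, not just that one line.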

Saving R Files

In R, you can save several types of files to keep track of the work you do. The file types include: workspace, script, history, and graphics. It is important to save often because R, like any other software, may crash periodically. Such problems are especially likely when working with large files. You can save your workspace in R via the command line or the File menu.

R script (.R)

An R script is simply a text file of R commands that you’ve typed.

You may want to save your scripts (whether they were written in R Editor or another program such as Notepad) so that you can reference them in the future, edit them as needed, and keep track of what you’ve done. To save R scripts in RStudio, simply click the save button from your R script tab. Save scripts with the .R extension. R assumes that script files are saved with only that extension. If you are using another text editor, you won’t need to worry about saving your scripts in R. You can open text files in the RStudio text editor, but beware copying and pasting from Word files as discussed below.

To open an R script, click the file icon.

Note on Microsoft Word Files

Using Microsoft Word to write or save R scripts is generally a bad idea. Certain keyboard characters, such as quotation marks, are not stored the same way in Word (e.g. straight "" vs. curly “”). The difference is hard to see by eye, but code containing curly quotes will not run in R.

Workspace (.RData)

The R workspace consists of all the data objects you’ve created or loaded during your R session. When you quit R by either typing q() or closing the application window, R will prompt you to save your workspace. If you choose yes, R saves a file named .RData to your working directory. The next time you open R and reload your .RData workspace, all of your data objects will be available in R and all of the commands that you’ve typed will be accessible by using the up-arrow and down-arrow keys on your keyboard. You can also save or load your workspace at any time during your R session from the menu by clicking Session>Save Workspace As..., or with the save button on the Environment tab.

The R command for saving your workspace is:

save.image(file="workspace.RData")

R history (.Rhistory)

An R history file is a copy of all your keystrokes. You can think of it as a brute-force way of saving your work. It can be useful if you didn’t document all your steps in an R script file. Like an R file, an Rhistory file is simply a text file that lists all of the commands that you’ve executed. It does not keep a record of the results. To load or save your R history from the History tab, click the Open File or Save button. If you load an Rhistory file, your previous commands will again become available with the up-arrow and down-arrow keys.

Datasets

R comes with many datasets built in. These are part of the datasets package that is always loaded in R. For example, the mtcars dataset is a well-known dataset from Motor Trend magazine, documenting fuel consumption and vehicle characteristics for a number of vehicles. At the R console, typing mtcars will print the entire dataset.

You can find help on datasets as usual using the Help tab in RStudio, clicking on the Packages link and navigating to the datasets package.
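Printing a whole dataset is rarely what you want. A few base R calls give a quicker overview; the sketch below uses mtcars, and the choice of the mpg column is just for illustration:

```r
# First three rows instead of the whole table
print(head(mtcars, 3))

# Number of rows and columns
car_dims <- dim(mtcars)
print(car_dims)

# Column names
print(names(mtcars))

# Five-number summary plus the mean of one column (miles per gallon)
print(summary(mtcars$mpg))
```

Typing ?mtcars at the console opens the help page that describes each column.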

Import data

To do any real work, one has to load data from an external source. RStudio provides numerous ways of importing data directly from spreadsheets (such as Excel) and text files. These files can be on your hard drive or located on a remote server, accessible through a URL.

Consider the data set that will be used in Lab 2, which is the 100m times for men and women. We will illustrate importing this data set, step by step.

Step 1

From the Import Dataset menu, select From CSV to get a dialog as shown below and navigate to the folder containing the 100men file.

Note that the import dialog has a number of options, and at the bottom right it shows a preview of the code that will be used to import the data. If one copied and pasted that code into the R console, the result would be the same as what one would get via the dialogs.

By default, RStudio also takes care to name the variable that will hold the data according to R conventions, using Xmen100.

Step 2

When you open the file, RStudio shows a preview of the data in the viewer window.

This is of course not what we want, since a cursory inspection shows that the data appears to contain four columns (Athlete, Nation, Time and Date). So obviously, we have specified something wrong.

Step 3

In the Import Options panel, change the delimiter to Tab and, while we are at it, change the name to data.men. Notice how the code preview reflects changes made to these options.

Step 4

Press the Import button to get the data into R.

The result of the import is a variable called data.men that contains the data. Data formatted this way (tab-delimited, comma-separated, or spreadsheet-like) is so common that R has an abstraction for it: the data frame. You will have more opportunity to learn about data frames in the data parts of the course.

Avoiding dialogs

As one becomes more and more familiar with R, direct code becomes preferable to the slower interactive dialogs. This is one reason that RStudio gives you the code preview, to aid in your learning process.

RStudio makes use of some packages to import data, notably the readr package. Strictly speaking these packages are not necessary for the job, but they include improvements that make them attractive. For example, a vanilla installation of R provides functions like read.csv and read.delim (analogous to read_csv and read_delim) that can also be used. However, in versions of R before 4.0.0 these functions perform some conversions by default, such as treating character variables as factors. That can be troublesome (and computationally expensive) when dealing with large data sets.

So, to get the same effect as the dialog process above, one could have pasted the RStudio code into an R console:

install.packages("readr")
if(!require("readr"))
  stop('you need to install readr first by typing install.packages("readr")')

data.men <- read_delim("data/100men", "\t", escape_double = FALSE, trim_ws = TRUE)
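For comparison, here is a sketch of the same kind of import using only base R. Since the 100men file may not be at hand, it first writes a tiny tab-delimited file to a temporary location, using two rows copied from the head(data.men) output shown later in these notes; the names tmp_file and sample_men are illustrative:

```r
# Write a two-row tab-delimited file to a temporary location
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("Athlete\tNation\tTime\tDate",
             "Usain Bolt\tJamaica\t9.58\t2009-08-16",
             "Maurice Greene\tUSA\t9.79\t1999-06-16"),
           tmp_file)

# Base R import; stringsAsFactors = FALSE keeps text columns as character
# on every version of R
sample_men <- read.delim(tmp_file, stringsAsFactors = FALSE)

# read.delim leaves dates as text, so convert the column explicitly
sample_men$Date <- as.Date(sample_men$Date)
str(sample_men)
```

Unlike read_delim, read.delim returns a plain data frame rather than a tibble and does not detect date columns on its own.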

Viewing and Removing Data

Once the file is imported, it is imperative that you check that R imported your data correctly. Make sure numerical data are imported as numerical, that your column headings are preserved, etc. To view the data, simply click on the data.men dataset listed in the Environment tab. This will open a separate window that displays a spreadsheet-like view.

Additionally you can use the following functions to view your data in R.

Function   Description
print()    prints the entire object (avoid with large tables)
head()     prints the first 6 lines of your data
str()      shows the data structure of an R object
names()    lists the column names (i.e., headers) of your data
ls()       lists all the R objects in your workspace

Try entering the following commands to view the data.men dataset in R:

str(data.men)
## Classes 'tbl_df', 'tbl' and 'data.frame':    20 obs. of  4 variables:
##  $ Athlete: chr  "Usain Bolt" "Usain Bolt" "Usain Bolt" "Asafa Powell" ...
##  $ Nation : chr  "Jamaica" "Jamaica" "Jamaica" "Jamaica" ...
##  $ Time   : num  9.58 9.69 9.72 9.74 9.77 9.79 9.84 9.85 9.86 9.9 ...
##  $ Date   : Date, format: "2009-08-16" "2008-08-16" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 4
##   .. ..$ Athlete: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Nation : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Time   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Date   :List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

You will see that the data consists of 20 observations on 4 variables: Athlete, Nation, Time and Date. Athlete and Nation are character, Time is numeric, and Date is a date.

Now try these commands:

names(data.men)
## [1] "Athlete" "Nation"  "Time"    "Date"
head(data.men)
## # A tibble: 6 x 4
##          Athlete  Nation  Time       Date
##            <chr>   <chr> <dbl>     <date>
## 1     Usain Bolt Jamaica  9.58 2009-08-16
## 2     Usain Bolt Jamaica  9.69 2008-08-16
## 3     Usain Bolt Jamaica  9.72 2008-05-31
## 4   Asafa Powell Jamaica  9.74 2007-09-09
## 5   Asafa Powell Jamaica  9.77 2005-06-14
## 6 Maurice Greene     USA  9.79 1999-06-16
ls(data.men)
## [1] "Athlete" "Date"    "Nation"  "Time"

A data object is anything you’ve created or imported and assigned a name to in R. The Environment tab allows you to see which data objects are in your R session and expand their structure. Right now data.men should be the only data object listed. If you wanted to delete all data objects from your R session, you could click the broom icon on the Environment tab. Otherwise you could type:

# Remove all R objects
rm(list = ls(all = TRUE)) 

# Remove individual objects
rm(data.men)
## Warning in rm(data.men): object 'data.men' not found

Graphs and Plots

Graphing and plotting are among the great strengths of R. There are two main approaches to building graphs and plots.

  1. Using basic functions provided by R itself via the graphics package which has a number of standard facilities. A quick way to familiarize yourself with base graphics is to type the command demo(graphics) at the R console to see its capabilities.

  2. Using a package like ggplot2, which requires a more nuanced understanding of a graphics object. You will have to install this package. ggplot2 implements a grammar of graphics and so takes a bit more work to use, but is quite powerful.

Both approaches allow for step-by-step building up of complex plots, and for creating PDFs or images that can be included in other documents. Although ggplot2 is becoming more popular, many packages do not use ggplot2 for plotting. Furthermore, some special plots created by packages use only one of base graphics or ggplot2, so there isn’t a ready-made equivalent in the other, although one can be constructed with extra work. So you will see both base graphics and ggplot2 used in this course.

For ease of use, ggplot2 provides a function called qplot that can emulate the base graphics plot function capabilities. This offers a quick way to begin using ggplot2, initially.

Description                                  Base Graphics            ggplot2
Plot y versus x using points                 plot(x, y)               qplot(x, y)
Plot y versus x using lines                  plot(x, y, type = "l")   qplot(x, y, geom = "line")
Plot y versus x using both points and lines  plot(x, y, type = "b")   qplot(x, y, geom = c("point", "line"))
Boxplot of x                                 boxplot(x)               qplot(x, geom = "boxplot")
Side-by-side boxplot of x and y              boxplot(x, y)            qplot(x, y, geom = "boxplot")
Histogram of x                               hist(x)                  qplot(x, geom = "histogram")
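A minimal sketch of the base graphics column, written to a PDF so the result can be included in another document. The data and the temporary file location are made up for illustration:

```r
# Made-up data for illustration
x <- 1:10
y <- x^2

# Open a PDF graphics device; every plot until dev.off() goes into this file
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

plot(x, y, type = "b", xlab = "x", ylab = "y", main = "y versus x")  # points and lines
hist(y)                                                              # histogram on a second page

# Close the device so the file is written to disk
dev.off()
```

Without the pdf() and dev.off() calls, the same plot commands draw in the Plots tab of RStudio instead.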

Examples

It is a good idea to try out the functions using the example function. At the R console type,

example(plot)

to see the plot examples.

For ggplot2, you will have to load the library first and then use example.

library(ggplot2)
example(qplot)

Additional Resources

References