Lab 1: Introduction to R

Learning objectives:

setting a working directory (use R Studio projects)
loading data - one of many ways
data frames and variables
using a variable from a data frame
changing variable type
installing R packages
plotting a line graph

1. Setting a working directory

Let’s start a new project. We can call it “ba” for business analytics. Click on the “Project” icon in the upper right corner of R Studio. Select ‘New Project’, then ‘New Directory’, and choose a folder on your computer where you would like to keep your files for this class.
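Opening a project in R Studio sets the working directory to the project folder. As a quick check (a minimal sketch, assuming the project was created as described above), the current working directory and its contents can be printed from the console:

```r
# Print the current working directory; inside an R Studio project this
# should be the project folder (for example, one ending in "ba").
wd <- getwd()
print(wd)

# List the files R can see in that folder.
print(list.files(wd))
```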

2. Loading data

Let’s open a new R Markdown document. R Markdown combines text and R code. R code is enclosed between ```{r} and ``` marks. For example, if we want to read in some data we type the following:

To execute the R code (rather than knit the entire R Markdown document), we put the cursor on the line we want to execute and either click ‘Run’, or hold the ‘control’ key and press ‘enter’.

The function read.csv() reads comma-separated files. It takes the location of the file as an argument. The location can be a path on your hard drive or a URL. In this case the location is the URL of a file on Yahoo!’s website.

The <- operator assigns the result of the read.csv() function to an object we named ‘mydata’. The result of the read.csv() function is an object called a data frame. You can examine this data frame by clicking on its name in the Environment window.
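As a small illustration of read.csv() and the <- operator, the sketch below reads a few made-up rows from a text string rather than from a URL, so it runs without an internet connection. The object names ‘csv_text’ and ‘prices’ and all values are hypothetical; in the lab, the first argument would instead be the quoted URL or file path.

```r
# Made-up CSV content standing in for a downloaded file (hypothetical values).
csv_text <- "Date,Close
2017-01-03,100.5
2017-01-04,101.2
2017-01-05,99.8"

# read.csv() parses the text and returns a data frame,
# which <- assigns to the name 'prices'.
prices <- read.csv(text = csv_text)

class(prices)  # "data.frame"
nrow(prices)   # number of observations: 3
names(prices)  # variable names: "Date" "Close"
```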

3. Data frames and variables

A data frame is an object in R that holds variables and observations. Let’s examine the structure of our data frame ‘mydata’. We can do this by executing the str() function. (We can either enter the function into the console or, if we want it to be part of the markdown document, enter and execute it within the markdown document.)

str(mydata)

## 'data.frame':    16862 obs. of  7 variables:
##  $ Date     : Factor w/ 16862 levels "1950-01-03","1950-01-04",..: 16862 16861 16860 16859 16858 16857 16856 16855 16854 16853 ...
##  $ Open     : num  2268 2262 2252 2252 2250 ...
##  $ High     : num  2272 2273 2264 2254 2255 ...
##  $ Low      : num  2260 2262 2245 2234 2245 ...
##  $ Close    : num  2269 2271 2258 2239 2249 ...
##  $ Volume   : num  3.76e+09 3.76e+09 3.77e+09 2.67e+09 2.34e+09 ...
##  $ Adj.Close: num  2269 2271 2258 2239 2249 ...

The results tell us that we have over 16 thousand observations and 7 variables. They also show the variable names, their types and the first few values of each. We see that the variables Open, High, Low etc. are numerical (‘num’). Date is a Factor. Factors are variables that usually take on a limited number of values. These values are called factor levels. Factor variables are stored as numbers but displayed with character values.
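To see how a factor stores its values, the following sketch builds a small factor by hand (the vector ‘team’ is made up for illustration). Note also that since R 4.0.0, read.csv() keeps text columns as character (‘chr’) by default, so on a recent installation Date may appear as chr rather than Factor unless stringsAsFactors = TRUE is passed.

```r
# A made-up character vector with repeated values.
team <- c("home", "away", "home", "home", "away")

# Convert it to a factor.
team_f <- factor(team)

levels(team_f)      # the distinct values, sorted: "away" "home"
as.integer(team_f)  # the underlying integer codes: 2 1 2 2 1
table(team_f)       # how many observations fall in each level
```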

4. Using a variable from a data frame

If we want to do something with a particular variable we write the name of the data frame, a ‘$’ sign, and the name of the variable. For example, below we calculate the summary of the variable Close in data frame mydata:

summary(mydata$Close)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.66   84.50  156.70  508.70 1008.00 2272.00
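The ‘$’ notation works with any function that accepts a vector. A minimal sketch on a made-up data frame (the name ‘scores’ and its values are hypothetical):

```r
# A small made-up data frame with two variables.
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  points  = c(10, 20, 30, 40)
)

# Pull out one variable with $ and pass it to a function.
mean(scores$points)     # 25
max(scores$points)      # 40
summary(scores$points)  # min, quartiles, median, mean, max

# Double square brackets with the quoted name select the same variable.
identical(scores$points, scores[["points"]])  # TRUE
```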

5. Changing variable type

We can change the type of a variable. For example, in this case we may want to change the Date variable from a ‘Factor’ type to a ‘Date’ type. In other words, we would like R to understand that Date is not just any old set of characters but rather a calendar date. We can do this using the as.Date() function. The function takes as arguments the variable we would like to convert and the format of that variable. We assign the result to the variable Date in data frame mydata, overwriting the existing variable.

mydata$Date <- as.Date(mydata$Date, "%Y-%m-%d")

Let’s check the type of our variable:

str(mydata$Date)

##  Date[1:16862], format: "2017-01-05" "2017-01-04" "2017-01-03" "2016-12-30" ...

We can also run a summary on this variable so we know when our data begin and end.

summary(mydata$Date)

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1950-01-03" "1966-10-07" "1983-07-30" "1983-07-18" "2000-04-03" 
##         Max. 
## "2017-01-05"
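The format argument has to describe how the dates are written in the data. As a sketch, assuming a made-up vector of dates written as month/day/year (not the format used in the lab data), the call would need a different format string:

```r
# Made-up dates stored as text in month/day/year order.
raw_dates <- c("01/03/2017", "01/04/2017", "12/30/2016")

# %m = month, %d = day, %Y = four-digit year; the separators must match too.
dates <- as.Date(raw_dates, format = "%m/%d/%Y")

class(dates)        # "Date"
range(dates)        # earliest and latest date
diff(range(dates))  # time span between them, in days

# A format that does not match the text gives NA rather than an error.
as.Date("01/03/2017", format = "%Y-%m-%d")  # NA
```

Checking for NA values after a conversion is a simple way to catch a format string that does not match the data.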

IN-CLASS EXERCISE

Load in data on the results of the 2016 NHL season from https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1
How many games were there?
What was the average attendance?
Did home teams score on average more goals than visiting teams?

6. Installing R packages

R packages are add-ons to base R. They have to be installed before you can use them. You can install them by clicking on the ‘Packages’ tab in the bottom right panel of R Studio, and clicking on the ‘Install Packages’ button. You can then select a package and click the ‘Install’ button.

The packages that are particularly useful for this class are dplyr, tidyr, lubridate, ggplot2, stargazer. You should install these on your computer. You only need to install a package once. However, you need to ‘load’ the package into each new R session using the library() command. For example, if we plan to use commands from the ggplot2 package we should include the following in our code:

library(ggplot2)
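Packages can also be installed from the console with install.packages(). The sketch below lists which of the class packages are not yet present; the helper name ‘not_installed’ is made up for illustration, and the install step itself (commented out here) assumes an internet connection and a configured CRAN mirror.

```r
# Packages recommended for this class.
class_pkgs <- c("dplyr", "tidyr", "lubridate", "ggplot2", "stargazer")

# Hypothetical helper: return the packages from 'pkgs' that are not installed.
not_installed <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

to_install <- not_installed(class_pkgs)
print(to_install)

# Uncomment to install the missing ones (needed only once per computer):
# install.packages(to_install)

# Loading is still needed in every new session, e.g. library(ggplot2).
```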

7. Plotting data

Let’s plot the closing value of the S&P 500 in our data set. We use a powerful function called ggplot() from the ggplot2 package. The function creates a plot by combining a few components. First, it needs to know the data frame from which to get the data. Second, it needs to know which variables should be on the x and y axes. This is specified in aes() (short for aesthetics). Finally, it needs to know the geometric object (geom) we want to use to represent the data (in this case a line).

ggplot(data=mydata, aes(x=Date, y=Close)) + geom_line()
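Because a plot is built from components, the components can be stored and combined separately. A minimal sketch, assuming ggplot2 is installed and using a made-up data frame ‘toy’ (not the lab data):

```r
library(ggplot2)

# Made-up data: five consecutive days and a value for each.
toy <- data.frame(
  day   = as.Date("2017-01-01") + 0:4,
  value = c(3, 5, 4, 6, 8)
)

# Data and aesthetics only: this draws empty axes, with no data shown yet.
base_plot <- ggplot(data = toy, aes(x = day, y = value))

# Adding a geom tells ggplot how to represent the data.
line_plot  <- base_plot + geom_line()   # a line, as in the lab
point_plot <- base_plot + geom_point()  # the same data shown as points

# Printing a plot object draws it.
print(line_plot)
```

Storing the base plot in an object makes it easy to try different geoms on the same data without retyping the data and aesthetics.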

IN-CLASS EXERCISE

Plot the density of attendance.

Exercises

Write a new R Markdown file that does the analysis and answers the questions below. Knit your R Markdown file, print it, and bring it to class.

Download data from Yahoo! Finance on closing values of NASDAQ Composite Index (hint: the code is ^IXIC)
How many observations are available for the NASDAQ Composite?
Convert Date from a Factor into a Date type variable.
Plot the closing values of the NASDAQ Composite Index. Use this documentation for ggplot2 to add a title to your plot.
Use the same documentation to create a new plot that uses a logarithmic scale on the y axis. Add pretty labels on the y axis (‘breaks’ in ggplot terminology), e.g. 100, 200, 400 etc.
Explain why we see different patterns on the linear versus logarithmic scale plots. (Hint 1: Noting that one is on a logarithmic scale and one is on a linear scale is not a sufficient answer. Hint 2: Think about what the vertical distances between points represent on each of the graphs.)
What are your overall conclusions from these two plots? In other words, what are the key things you feel can be learned about the behavior of the NASDAQ over this time frame?