Lab 2: Manipulating Data I

Learning objectives:

  • dplyr package
  • selecting variables using select()
  • renaming variables using rename()
  • selecting observations using filter()
  • sorting data using arrange()
  • creating new variables
  • summarizing and plotting data

1. Dplyr package

Dplyr package is very useful for manipulating data. Make sure you have it installed. Load it using library(dplyr) at the beginning of each R script or markdown files.

Let’s bring up our S&P 500 closing values data and turn the date variable from a factor type into a date type as we did in the previous lab.

library(dplyr)
library(ggplot2)
mydata <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^GSPC")
mydata$Date <- as.Date(mydata$Date, "%Y-%m-%d")

2. Selecting variables

The mydata data frame contains a number of variables we do not intend to use. Let’s use a function select() to select the variables we plan to use. The function takes the name of a data frame as the first argument, and the names of the variables to keep as the following arguments.

mydata <- select(mydata, Date, Close)

3. Renaming variables

Instead of Close, let’s give our S&P500 variable more informative name:

mydata <- rename(mydata, SP500=Close)

4. Sorting data

mydata <- arrange(mydata, Date)
head(mydata)
##         Date SP500
## 1 1950-01-03 16.66
## 2 1950-01-04 16.85
## 3 1950-01-05 16.93
## 4 1950-01-06 16.98
## 5 1950-01-09 17.08
## 6 1950-01-10 17.03

5. Selecting observations

Let’s look at data starting in January 1, 1990. Function filter() keeps only observations that satisfy the condition inside the function.

mydata <- filter(mydata, Date>=as.Date("1990-01-01"))
head(mydata, n=3)
##         Date  SP500
## 1 1990-01-02 359.69
## 2 1990-01-03 358.76
## 3 1990-01-04 355.67

6. Creating new variables

Let’s calculate the daily price returns on the S&P500, i.e. the percentage change in the index from previous day’s closing value to today’s closing value. We first create a new variable that equals previous day’s closing value. We use the function lag() which takes the previous observation.

mydata$SP500_lag <- lag(mydata$SP500)

And then create another new variable called SP500return

mydata$SP500return <- (mydata$SP500-mydata$SP500_lag)/mydata$SP500_lag*100
head(mydata, n=3)
##         Date  SP500 SP500_lag SP500return
## 1 1990-01-02 359.69        NA          NA
## 2 1990-01-03 358.76    359.69  -0.2585538
## 3 1990-01-04 355.67    358.76  -0.8612992

7. Summarizing and plotting data

summary(mydata$SP500return)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -9.03500 -0.46250  0.05105  0.03344  0.56410 11.58000        1

We see the minimum return was around -9%. Let’s find out when this happened by sorting the data by the return and looking at a few observations:

mydata <- arrange(mydata, SP500return)
head(mydata, n=3)
##         Date   SP500 SP500_lag SP500return
## 1 2008-10-15  907.84    998.01   -9.034978
## 2 2008-12-01  816.21    896.24   -8.929524
## 3 2008-09-29 1106.42   1213.27   -8.806776

Finally, let’s plot the histogram of the daily return.

ggplot(mydata, aes(x=SP500return)) + geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

IN-CLASS EXERCISE

  1. Load in the NHL data from https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1
  2. Create a new variable score which is a difference between goals scored by the home team minus goals scored by the visiting team.
  3. Which two matches were the most lopsided during the 2016 season, i.e. in which two matches did the home team win and lose by the largest margin.

8. Turning daily data into monthly (application of lubridate, month(), lead() and filter())

Let’s turn our daily S&P500 data into monthly data by keeping only the last days of each month. This will allow us to calculate monthly returns from close of the previous month to close of this month. We will use function month() to identify the month component of each date. This function is part of the lubridate package.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
mydata$month=month(mydata$Date)
mydata <- arrange(mydata, Date)
head(mydata,n=25) #show the first 25 observations
##          Date  SP500 SP500_lag SP500return month
## 1  1990-01-02 359.69        NA          NA     1
## 2  1990-01-03 358.76    359.69 -0.25855375     1
## 3  1990-01-04 355.67    358.76 -0.86129917     1
## 4  1990-01-05 352.20    355.67 -0.97562372     1
## 5  1990-01-08 353.79    352.20  0.45144717     1
## 6  1990-01-09 349.62    353.79 -1.17866924     1
## 7  1990-01-10 347.31    349.62 -0.66071650     1
## 8  1990-01-11 348.53    347.31  0.35127149     1
## 9  1990-01-12 339.93    348.53 -2.46750811     1
## 10 1990-01-15 337.00    339.93 -0.86194012     1
## 11 1990-01-16 340.75    337.00  1.11275964     1
## 12 1990-01-17 337.40    340.75 -0.98312722     1
## 13 1990-01-18 338.19    337.40  0.23414583     1
## 14 1990-01-19 339.15    338.19  0.28386173     1
## 15 1990-01-22 330.38    339.15 -2.58587326     1
## 16 1990-01-23 331.61    330.38  0.37229251     1
## 17 1990-01-24 330.26    331.61 -0.40709721     1
## 18 1990-01-25 326.08    330.26 -1.26567640     1
## 19 1990-01-26 325.80    326.08 -0.08586820     1
## 20 1990-01-29 325.20    325.80 -0.18415470     1
## 21 1990-01-30 322.98    325.20 -0.68265711     1
## 22 1990-01-31 329.08    322.98  1.88865434     1
## 23 1990-02-01 328.79    329.08 -0.08811779     2
## 24 1990-02-02 330.92    328.79  0.64783112     2
## 25 1990-02-05 331.85    330.92  0.28103256     2

We only want to keep the observations for which the next month is different. We can access the next observation by using the function lead(). Then we will keep only observations for which the next month is different from this month. Also, let’s put the monthly data into a different data frame.

monthly <- filter(mydata, mydata$month != lead(mydata$month))
monthly <- select(monthly, Date, month, SP500)
head(monthly)
##         Date month  SP500
## 1 1990-01-31     1 329.08
## 2 1990-02-28     2 331.89
## 3 1990-03-30     3 339.94
## 4 1990-04-30     4 330.80
## 5 1990-05-31     5 361.23
## 6 1990-06-29     6 358.02

Excellent, monthly now contains only last days of each month.

IN-CLASS EXERCISE

  1. Create a subset of the NHL data that contains games played by Boston Bruins.
  2. SUPER CHALLENGE: How many games did Boston Bruins win?


HOMEWORK EXERCISE:

Write an R Markdown file that does the analysis and answers the questions below. Knit the R Markdown into an HTML or a pdf file, print it and bring to class.

  1. Load in closing values of NASDAQ Composite. Turn Date into a Date variable.
  2. Rename the Close variable as NASDQ
  3. Calculate a new variable that has daily price returns on NASDQ. (Make sure your data is properly sorted before calculating returns.)
  4. Calculate mean daily return between 1980 and 1990. (Hint: you will need two conditions in your filter() function, Date has to be greater than or equal to 1/1/1980 AND less than 1/1/1990, the AND in R is specified with the & sign)
  5. What was the best and worst day for NASDAQ since 1990? (Hint: the opposite of function head() is tail())
  6. Plot the histogram of daily returns on NASDQ since 1990, and the time plot of daily returns since 1990. Are the two graphs consistent with each other?
  7. Based on the graphs you created, would you say that volatility of NASDQ prices is constant over time?
  8. Create monthly data of (end of the month) closing prices on NASDAQ index.
  9. Calculate monthly price returns on the NASDAQ. What is the average monthly return? Which months were the best and worst?