select()rename()filter()arrange()Dplyr package is very useful for manipulating data. Make sure you have it installed. Load it using library(dplyr) at the beginning of each R script or markdown files.
Let’s bring up our S&P 500 closing values data and turn the date variable from a factor type into a date type as we did in the previous lab.
library(dplyr)
library(ggplot2)
mydata <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^GSPC")
mydata$Date <- as.Date(mydata$Date, "%Y-%m-%d")
The mydata data frame contains a number of variables we do not intend to use. Let’s use a function select() to select the variables we plan to use. The function takes the name of a data frame as the first argument, and the names of the variables to keep as the following arguments.
mydata <- select(mydata, Date, Close)
Instead of Close, let’s give our S&P500 variable more informative name:
mydata <- rename(mydata, SP500=Close)
mydata <- arrange(mydata, Date)
head(mydata)
## Date SP500
## 1 1950-01-03 16.66
## 2 1950-01-04 16.85
## 3 1950-01-05 16.93
## 4 1950-01-06 16.98
## 5 1950-01-09 17.08
## 6 1950-01-10 17.03
Let’s look at data starting in January 1, 1990. Function filter() keeps only observations that satisfy the condition inside the function.
mydata <- filter(mydata, Date>=as.Date("1990-01-01"))
head(mydata, n=3)
## Date SP500
## 1 1990-01-02 359.69
## 2 1990-01-03 358.76
## 3 1990-01-04 355.67
Let’s calculate the daily price returns on the S&P500, i.e. the percentage change in the index from previous day’s closing value to today’s closing value. We first create a new variable that equals previous day’s closing value. We use the function lag() which takes the previous observation.
mydata$SP500_lag <- lag(mydata$SP500)
And then create another new variable called SP500return
mydata$SP500return <- (mydata$SP500-mydata$SP500_lag)/mydata$SP500_lag*100
head(mydata, n=3)
## Date SP500 SP500_lag SP500return
## 1 1990-01-02 359.69 NA NA
## 2 1990-01-03 358.76 359.69 -0.2585538
## 3 1990-01-04 355.67 358.76 -0.8612992
summary(mydata$SP500return)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -9.03500 -0.46250 0.05105 0.03344 0.56410 11.58000 1
We see the minimum return was around -9%. Let’s find out when this happened by sorting the data by the return and looking at a few observations:
mydata <- arrange(mydata, SP500return)
head(mydata, n=3)
## Date SP500 SP500_lag SP500return
## 1 2008-10-15 907.84 998.01 -9.034978
## 2 2008-12-01 816.21 896.24 -8.929524
## 3 2008-09-29 1106.42 1213.27 -8.806776
Finally, let’s plot the histogram of the daily return.
ggplot(mydata, aes(x=SP500return)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
lubridate, month(), lead() and filter())Let’s turn our daily S&P500 data into monthly data by keeping only the last days of each month. This will allow us to calculate monthly returns from close of the previous month to close of this month. We will use function month() to identify the month component of each date. This function is part of the lubridate package.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
mydata$month=month(mydata$Date)
mydata <- arrange(mydata, Date)
head(mydata,n=25) #show the first 25 observations
## Date SP500 SP500_lag SP500return month
## 1 1990-01-02 359.69 NA NA 1
## 2 1990-01-03 358.76 359.69 -0.25855375 1
## 3 1990-01-04 355.67 358.76 -0.86129917 1
## 4 1990-01-05 352.20 355.67 -0.97562372 1
## 5 1990-01-08 353.79 352.20 0.45144717 1
## 6 1990-01-09 349.62 353.79 -1.17866924 1
## 7 1990-01-10 347.31 349.62 -0.66071650 1
## 8 1990-01-11 348.53 347.31 0.35127149 1
## 9 1990-01-12 339.93 348.53 -2.46750811 1
## 10 1990-01-15 337.00 339.93 -0.86194012 1
## 11 1990-01-16 340.75 337.00 1.11275964 1
## 12 1990-01-17 337.40 340.75 -0.98312722 1
## 13 1990-01-18 338.19 337.40 0.23414583 1
## 14 1990-01-19 339.15 338.19 0.28386173 1
## 15 1990-01-22 330.38 339.15 -2.58587326 1
## 16 1990-01-23 331.61 330.38 0.37229251 1
## 17 1990-01-24 330.26 331.61 -0.40709721 1
## 18 1990-01-25 326.08 330.26 -1.26567640 1
## 19 1990-01-26 325.80 326.08 -0.08586820 1
## 20 1990-01-29 325.20 325.80 -0.18415470 1
## 21 1990-01-30 322.98 325.20 -0.68265711 1
## 22 1990-01-31 329.08 322.98 1.88865434 1
## 23 1990-02-01 328.79 329.08 -0.08811779 2
## 24 1990-02-02 330.92 328.79 0.64783112 2
## 25 1990-02-05 331.85 330.92 0.28103256 2
We only want to keep the observations for which the next month is different. We can access the next observation by using the function lead(). Then we will keep only observations for which the next month is different from this month. Also, let’s put the monthly data into a different data frame.
monthly <- filter(mydata, mydata$month != lead(mydata$month))
monthly <- select(monthly, Date, month, SP500)
head(monthly)
## Date month SP500
## 1 1990-01-31 1 329.08
## 2 1990-02-28 2 331.89
## 3 1990-03-30 3 339.94
## 4 1990-04-30 4 330.80
## 5 1990-05-31 5 361.23
## 6 1990-06-29 6 358.02
Excellent, monthly now contains only last days of each month.
Write an R Markdown file that does the analysis and answers the questions below. Knit the R Markdown into an HTML or a pdf file, print it and bring to class.
Date variable.Close variable as NASDQNASDQ. (Make sure your data is properly sorted before calculating returns.)filter() function, Date has to be greater than or equal to 1/1/1980 AND less than 1/1/1990, the AND in R is specified with the & sign)head() is tail())