%>%
full_join()
and inner_join()
is.na()
functionstargazer()
subset()
, or using [,]
indexing, or using select()
write.csv()
Let’s begin by reading in our S&P500 and NASDAQ data.
library(dplyr)
library(lubridate)
library(ggplot2)
sp <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^GSPC")
sp$Date <- as.Date(sp$Date, "%Y-%m-%d")
sp$month <- month(sp$Date)
nasdq <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^IXIC")
nasdq$Date <- as.Date(nasdq$Date, "%Y-%m-%d")
nasdq$month <- month(nasdq$Date)
%>%
operatorThe pipe operator %>%
takes output from preceding function and feeds it as input to the next function. This is useful if we want to apply several functions in succession to a data frame. For example, when we load our S&P500 data we want to arrange
the data frame, then filter
it, then select
some variables, and then rename
one of the variables. We can do this as follows.
sp <- sp %>%
arrange(Date) %>%
filter(month!=lead(month)) %>%
select(Date, Close) %>%
rename(SP500 = Close)
We will do the same for the NASDAQ data frame:
nasdq <- nasdq %>%
arrange(Date) %>%
filter(month!=lead(month)) %>%
select(Date, Close) %>%
rename(NASDQ = Close)
We would like to merge the sp
and nasdq
data frames into one. In this case we want to put them side by side. The two data frames have different number of observations - the S&P500 begins in 1950, the NASDAQ begins in 1971. When putting the data side by side, we need to make sure that rows are correctly aligned by date. We will do this using function full_join()
. The function takes two data frames as the first two arguments and a “by” variable as the third argument.
merged <- full_join(sp, nasdq, by="Date")
str(merged)
## 'data.frame': 804 obs. of 3 variables:
## $ Date : Date, format: "1950-01-31" "1950-02-28" ...
## $ SP500: num 17 17.2 17.3 18 18.8 ...
## $ NASDQ: num NA NA NA NA NA NA NA NA NA NA ...
How many observations should the merged data frame have? Since, NASDAQ has fewer observations, and since for every day that S&P 500 is calculated, NASDAQ is also calculated, the number of observations should be equal to the number of observations in the S&P data frame.
We see that R put NAs
for values of NASDQ for dates that were not in the NASDQ data frame. Let’s filter out those dates. We will use function is.na()
to do so. This function returns value TRUE if the expression inside it is NA, and FALSE if the expression is not NA. We only want to non-missing values of NASDQ so we keep values when is.na(NASDQ)
is FALSE. Note the ==
sign indicates equal, !=
indicates not equal, |
indicates or.
merged <- filter(merged, is.na(NASDQ) == FALSE)
We would have accomplished the same thing using merged <- filter(merged, !is.na(NASDQ))
. The exclamation mark “!” stands for logical expression ‘not’, so !is.na(NASDQ)
is TRUE when NASDQ is not missing.
Finally, if we had known we only wanted to keep observations that are in both data sets, we could have used function inner_join()
which does the same thing as full_join()
except it only merges rows that match in both data frames. Check that merged <- inner_join(mydata, nasdq, by="Date")
give us the same thing as full_join
and filter()
. Quick check: How many observations should the result of inner_join()
have?
Now that we have both SP500 and NASDAQ in one data frame we can plot them together.
ggplot(merged, aes(x=Date)) +
geom_line(aes(y=SP500), color="blue") +
geom_line(aes(y=NASDQ), color="red") +
scale_y_continuous(trans="log", breaks=c(100,200,500,1000,2000, 4000)) +
labs(title="S&P500 and NASDAQ Composite monthly closing values (log scale)") + ylab("") + xlab("")
nhl <- read.csv("https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1")
arenas <- read.csv("https://www.dropbox.com/s/75aomym8bcfv2el/NHLarenas.csv?raw=1")
Let’s calculate create a table with summary statistics of monthly returns on S&P and NASDAQ.
merged$SP500ret <- (merged$SP500 - lag(merged$SP500))/lag(merged$SP500)*100
merged$NASDQret <- (merged$NASDQ - lag(merged$NASDQ))/lag(merged$NASDQ)*100
We can either use the base R function summary()
or function stargazer()
from a package of the same name. The stargazer()
produces tables that are somewhat more readable.
Since we are not interested in the descriptive statistics of the date
variable or the level of the indices, we only ask for summaries of two columns. There are many ways to select columns from a data frame. Below I illustrate using base function subset()
, using [,] which allow us to index rows and columns, or using dplyr
’s select()
function. Notice we nest these functions inside the stargazer()
function.
library(stargazer)
summary(subset(merged,select=c("SP500ret","NASDQret")))
## SP500ret NASDQret
## Min. :-21.7630 Min. :-27.2339
## 1st Qu.: -1.8147 1st Qu.: -2.3111
## Median : 0.9091 Median : 1.2909
## Mean : 0.6687 Mean : 0.9119
## 3rd Qu.: 3.5174 3rd Qu.: 4.4127
## Max. : 16.3047 Max. : 21.9759
## NA's :1 NA's :1
stargazer(merged[,c("SP500ret","NASDQret")], type="text", median=TRUE)
##
## ==================================================
## Statistic N Mean St. Dev. Min Median Max
## --------------------------------------------------
## SP500ret 550 0.669 4.359 -21.763 0.909 16.305
## NASDQret 550 0.912 6.063 -27.234 1.291 21.976
## --------------------------------------------------
stargazer(select(merged, SP500ret,NASDQret), type="text", median=TRUE)
##
## ==================================================
## Statistic N Mean St. Dev. Min Median Max
## --------------------------------------------------
## SP500ret 550 0.669 4.359 -21.763 0.909 16.305
## NASDQret 550 0.912 6.063 -27.234 1.291 21.976
## --------------------------------------------------
Let’s also calculate the correlation coefficient between the returns on the two indices.
cor(merged$SP500ret, merged$NASDQret, use = "complete.obs")
## [1] 0.858424
write.csv()
We did some data manipulation and created new data sets that we may want to use in the next lab. Let’s save the data so that we don’t have to redo all of that manipulation. We’ll use function write.csv()
which takes as arguments the data frame to be saved and the name of the new .csv file. If you don’t include a full path, the file will be saved in your working directory. To see what is your working directory type getwd()
in the console. If you are working within an R Studio project
, the working directory is the `project’ directory. We’ll also use an option that tells the function that we don’t want row numbers in the data.
write.csv(merged, "SP_NASDQ_monthly.csv", row.names = FALSE)
write.csv(nhl, "NHLseason_with_arenas.csv", row.names = FALSE)
Data on New York City weather are placed at this address: “https://www.dropbox.com/s/qzwqhe5i2lfj3r6/NYC%20weather.csv?raw=1”. The data has information on daily precipitation (in tenths of millimeters) and min and max temperatures (in tenths of degree Celsius) as recorded by the Belevedere Tower in Central Park. The data comes from http://www.ncdc.noaa.gov. Note that missing values are indicated as -9999. Load the data into R.
Summarize the variables. Are the values reasonable? Are there any missing values? If so, drop the observations that have missing values. How many observations did you lose by dropping these?
Change the DATE
variable from a ‘numeric’ type to ‘Date’ type. (Hint 1: as.Date()
does not work on numeric, so you first need to turn DATE
into character using as.character()
Hint 2: Note also that the format of the date is “%Y%m%d” i.e. no dashes between year, month and day.)
Plot the minimum and maximum temperatures on a time line for the entire period. Then do the same since 2014.
Load in the daily S&P 500 data from Yahoo! Make sure your ‘date’ variable is of ‘Date’ type. Calculate daily returns on S&P 500. Drop all variables except date and returns. Check the descriptive stats of your daily return.
Merge your daily S&P500 data with your NYC weather data. Keep only dates that are in both data sets. Which observations do you lose from either data set? What is the correlation between temperature and returns?