Lab 3: Merging data frames

Learning objectives:

  • the piping operator %>%
  • merging data frames using full_join() and inner_join()
  • filtering out rows with missing observations using is.na() function
  • creating descriptive statistics table using stargazer()
  • selecting columns using subset(), or using [,] indexing, or using select()
  • saving data using write.csv()

0.Introduction

Let’s begin by reading in our S&P500 and NASDAQ data.

library(dplyr)
library(lubridate)
library(ggplot2)
sp <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^GSPC")
sp$Date <- as.Date(sp$Date, "%Y-%m-%d")
sp$month <- month(sp$Date)
nasdq <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=^IXIC")
nasdq$Date <- as.Date(nasdq$Date, "%Y-%m-%d")
nasdq$month <- month(nasdq$Date)

1. Using the pipe %>% operator

The pipe operator %>% takes output from preceding function and feeds it as input to the next function. This is useful if we want to apply several functions in succession to a data frame. For example, when we load our S&P500 data we want to arrange the data frame, then filter it, then select some variables, and then rename one of the variables. We can do this as follows.

sp <- sp %>% 
  arrange(Date) %>%
  filter(month!=lead(month)) %>%
  select(Date, Close) %>%
  rename(SP500 = Close)

We will do the same for the NASDAQ data frame:

nasdq <- nasdq %>% 
  arrange(Date) %>%
  filter(month!=lead(month)) %>%
  select(Date, Close) %>%
  rename(NASDQ = Close)

2. Merging data frames

We would like to merge the sp and nasdq data frames into one. In this case we want to put them side by side. The two data frames have different number of observations - the S&P500 begins in 1950, the NASDAQ begins in 1971. When putting the data side by side, we need to make sure that rows are correctly aligned by date. We will do this using function full_join(). The function takes two data frames as the first two arguments and a “by” variable as the third argument.

merged <- full_join(sp, nasdq, by="Date")
str(merged)
## 'data.frame':    804 obs. of  3 variables:
##  $ Date : Date, format: "1950-01-31" "1950-02-28" ...
##  $ SP500: num  17 17.2 17.3 18 18.8 ...
##  $ NASDQ: num  NA NA NA NA NA NA NA NA NA NA ...

How many observations should the merged data frame have? Since, NASDAQ has fewer observations, and since for every day that S&P 500 is calculated, NASDAQ is also calculated, the number of observations should be equal to the number of observations in the S&P data frame.

We see that R put NAs for values of NASDQ for dates that were not in the NASDQ data frame. Let’s filter out those dates. We will use function is.na() to do so. This function returns value TRUE if the expression inside it is NA, and FALSE if the expression is not NA. We only want to non-missing values of NASDQ so we keep values when is.na(NASDQ) is FALSE. Note the == sign indicates equal, != indicates not equal, | indicates or.

merged <- filter(merged, is.na(NASDQ) == FALSE)

We would have accomplished the same thing using merged <- filter(merged, !is.na(NASDQ)). The exclamation mark “!” stands for logical expression ‘not’, so !is.na(NASDQ) is TRUE when NASDQ is not missing.

Finally, if we had known we only wanted to keep observations that are in both data sets, we could have used function inner_join() which does the same thing as full_join() except it only merges rows that match in both data frames. Check that merged <- inner_join(mydata, nasdq, by="Date") give us the same thing as full_join and filter(). Quick check: How many observations should the result of inner_join() have?

Now that we have both SP500 and NASDAQ in one data frame we can plot them together.

ggplot(merged, aes(x=Date)) + 
  geom_line(aes(y=SP500), color="blue") + 
  geom_line(aes(y=NASDQ), color="red") +
  scale_y_continuous(trans="log", breaks=c(100,200,500,1000,2000, 4000)) +
  labs(title="S&P500 and NASDAQ Composite monthly closing values (log scale)") + ylab("") + xlab("")

IN-CLASS EXERCISE

  1. Load in the NHL data on games and results, and NHL data on arenas.
nhl <- read.csv("https://www.dropbox.com/s/krlwr6z38ol6wer/NHLseason2016.csv?raw=1")
arenas <- read.csv("https://www.dropbox.com/s/75aomym8bcfv2el/NHLarenas.csv?raw=1")
  1. Create a data frame that includes the percent occupancy rate for each game.

3. Creating a descriptive statistics table

Let’s calculate create a table with summary statistics of monthly returns on S&P and NASDAQ.

merged$SP500ret <- (merged$SP500 - lag(merged$SP500))/lag(merged$SP500)*100
merged$NASDQret <- (merged$NASDQ - lag(merged$NASDQ))/lag(merged$NASDQ)*100

We can either use the base R function summary() or function stargazer() from a package of the same name. The stargazer() produces tables that are somewhat more readable.

Since we are not interested in the descriptive statistics of the date variable or the level of the indices, we only ask for summaries of two columns. There are many ways to select columns from a data frame. Below I illustrate using base function subset(), using [,] which allow us to index rows and columns, or using dplyr’s select() function. Notice we nest these functions inside the stargazer() function.

library(stargazer)
summary(subset(merged,select=c("SP500ret","NASDQret")))
##     SP500ret           NASDQret       
##  Min.   :-21.7630   Min.   :-27.2339  
##  1st Qu.: -1.8147   1st Qu.: -2.3111  
##  Median :  0.9091   Median :  1.2909  
##  Mean   :  0.6687   Mean   :  0.9119  
##  3rd Qu.:  3.5174   3rd Qu.:  4.4127  
##  Max.   : 16.3047   Max.   : 21.9759  
##  NA's   :1          NA's   :1
stargazer(merged[,c("SP500ret","NASDQret")], type="text", median=TRUE) 
## 
## ==================================================
## Statistic  N  Mean  St. Dev.   Min   Median  Max  
## --------------------------------------------------
## SP500ret  550 0.669  4.359   -21.763 0.909  16.305
## NASDQret  550 0.912  6.063   -27.234 1.291  21.976
## --------------------------------------------------
stargazer(select(merged, SP500ret,NASDQret), type="text", median=TRUE) 
## 
## ==================================================
## Statistic  N  Mean  St. Dev.   Min   Median  Max  
## --------------------------------------------------
## SP500ret  550 0.669  4.359   -21.763 0.909  16.305
## NASDQret  550 0.912  6.063   -27.234 1.291  21.976
## --------------------------------------------------

Let’s also calculate the correlation coefficient between the returns on the two indices.

cor(merged$SP500ret, merged$NASDQret, use = "complete.obs")
## [1] 0.858424

4. Saving data using write.csv()

We did some data manipulation and created new data sets that we may want to use in the next lab. Let’s save the data so that we don’t have to redo all of that manipulation. We’ll use function write.csv() which takes as arguments the data frame to be saved and the name of the new .csv file. If you don’t include a full path, the file will be saved in your working directory. To see what is your working directory type getwd() in the console. If you are working within an R Studio project, the working directory is the `project’ directory. We’ll also use an option that tells the function that we don’t want row numbers in the data.

write.csv(merged, "SP_NASDQ_monthly.csv", row.names = FALSE)
write.csv(nhl, "NHLseason_with_arenas.csv", row.names = FALSE)


Exercises

  1. Data on New York City weather are placed at this address: “https://www.dropbox.com/s/qzwqhe5i2lfj3r6/NYC%20weather.csv?raw=1”. The data has information on daily precipitation (in tenths of millimeters) and min and max temperatures (in tenths of degree Celsius) as recorded by the Belevedere Tower in Central Park. The data comes from http://www.ncdc.noaa.gov. Note that missing values are indicated as -9999. Load the data into R.

  2. Summarize the variables. Are the values reasonable? Are there any missing values? If so, drop the observations that have missing values. How many observations did you lose by dropping these?

  3. Change the DATE variable from a ‘numeric’ type to ‘Date’ type. (Hint 1: as.Date() does not work on numeric, so you first need to turn DATE into character using as.character() Hint 2: Note also that the format of the date is “%Y%m%d” i.e. no dashes between year, month and day.)

  4. Plot the minimum and maximum temperatures on a time line for the entire period. Then do the same since 2014.

  5. Load in the daily S&P 500 data from Yahoo! Make sure your ‘date’ variable is of ‘Date’ type. Calculate daily returns on S&P 500. Drop all variables except date and returns. Check the descriptive stats of your daily return.

  6. Merge your daily S&P500 data with your NYC weather data. Keep only dates that are in both data sets. Which observations do you lose from either data set? What is the correlation between temperature and returns?