The easiest way to prepare data for R is to hold the data in a csv file with the dates in the first column and variables in columns. It is also possible to store data in excel but this requires you to download an additional package of functions. We will take a look at that later in the week.
If you store data is a set format it is easier to process. It is best practice to do all the data processing in R so that it can be reproduced easily when the data is updated and so that all the steps are clear.
If you have not yet found a way to manage your information you should find a solution as soon as possible. RStudio can help you with Projects. You can start at new project by selecting File and New Project. You can create a new folder or directory for this or use an existing one. You might then decide to divide this project into set folders or directories that contain things like Data, docs, R and Knitr files. It is good practice to have a README file to explain the layout and details of the directory.
For example, I have the following set up for my RStudio files.
### Practice
Create a project in an appropriate folder or directory
Assuming that you have saved the data into a file called bac.csv in your Data directory, the function to read data into R is read.csv. If it works correctly, you should see the new object called da (or whatever you want to call it) in the Environment. You can take a look at this object using the functions head and str. The first will show the first 6 lines and the second will show the structure.
da <- read.csv('../../Data/bac.csv')
head(da)
## Date BAC SPY
## 1 03/12/2012 11.58 139.59
## 2 01/11/2012 9.82 138.36
## 3 01/10/2012 9.28 137.58
## 4 04/09/2012 8.79 140.13
## 5 01/08/2012 7.95 136.66
## 6 02/07/2012 7.30 133.32
str(da)
## 'data.frame': 156 obs. of 3 variables:
## $ Date: Factor w/ 156 levels "01/02/2000","01/02/2001",..: 152 91 81 156 65 119 47 37 112 20 ...
## $ BAC : num 11.58 9.82 9.28 8.79 7.95 ...
## $ SPY : num 140 138 138 140 137 ...
You can see from these functions that we have a data frame with monthly data for Bank of America and the S&P 500 index with dates in the first column. The price data are numerics (num) while the date is a character (chr). We would like to turn the dates into a date object so that R knows that they run in sequence from yesterday through today. We can doi this using the as.Date function. We need to tell the function the structure of the data. You can find out more about this here:
Mollie's Research Blog: Importing dates
da$Date <- as.Date(da$Date, format = "%d/%m/%Y")
str(da)
## 'data.frame': 156 obs. of 3 variables:
## $ Date: Date, format: "2012-12-03" "2012-11-01" ...
## $ BAC : num 11.58 9.82 9.28 8.79 7.95 ...
## $ SPY : num 140 138 138 140 137 ...
We now have dates and not just words.
To assess the performance of Bank of America and the S&P 500 index we will use the sample (from the past) as a way to understand the population (including the future). If the sample is representative of the future, the mean will give us the most likely outcome. This is the expected return.
First we need to calculate the returns. This is a little more tricky than excel, but once we have written the code once, we can use it over and over again. Next week we will write an return function.
da$BACR <- c(da$BAC[1:length(da$BAC) - 1] / da$BAC[2:length(da$BAC)] - 1, NA)
da$SPYR <- c(da$SPY[1:length(da$SPY) - 1] / da$SPY[2:length(da$SPY)] - 1, NA)
head(da)
## Date BAC SPY BACR SPYR
## 1 2012-12-03 11.58 139.59 0.17922607 0.008889853
## 2 2012-11-01 9.82 138.36 0.05818966 0.005669429
## 3 2012-10-01 9.28 137.58 0.05574516 -0.018197388
## 4 2012-09-04 8.79 140.13 0.10566038 0.025391483
## 5 2012-08-01 7.95 136.66 0.08904110 0.025052505
## 6 2012-07-02 7.30 133.32 -0.10319410 0.011839709
tail(da)
## Date BAC SPY BACR SPYR
## 151 2000-06-01 14.29 112.96 -0.22421281 0.01977070
## 152 2000-05-01 18.42 110.77 0.14197148 -0.01572774
## 153 2000-04-03 16.13 112.54 -0.06546929 -0.03523360
## 154 2000-03-01 17.26 116.65 0.15220294 0.09695317
## 155 2000-02-01 14.98 106.34 -0.05009512 -0.01518800
## 156 2000-01-03 15.77 107.98 NA NA
Now calculate the mean, standard deviation, max and min.
mean(da$BACR, na.rm = TRUE)
## [1] 0.006797748
mean(da$SPYR, na.rm = TRUE)
## [1] 0.00272119
sd(da$BACR, na.rm = TRUE)
## [1] 0.1308765
sd(da$SPYR, na.rm = TRUE)
## [1] 0.04593242
We would like to look at the distribution of the returns. We do do this by constructing a histogram using the hist function.
hist(da$BACR, prob = TRUE, main = "Histogram of BAC Returns",
xlab = "Returns", ylab = 'Probability', col = 'cornflowerblue',
breaks = seq(-0.6, 0.8, by = 0.05))
Now calculate the price over time.
plot(da$Date, da$BAC, type = 'l', main = "Bank of America",
xlab = 'Date', ylab = "Price")
We would like to plot both on the same graph. Remember the re-basing that we did with the GDP data? We can create new series that are re-based to be 100 at the beginning. We do this by dividing by the first price and then multiplying by 100. Apply this formula to all the rows of data.
da$BACRebased <- da$BAC/da$BAC[length(da$BAC)] * 100
da$SPYRebased <- da$SPY/da$SPY[length(da$SPY)] * 100
head(da)
## Date BAC SPY BACR SPYR BACRebased SPYRebased
## 1 2012-12-03 11.58 139.59 0.17922607 0.008889853 73.43056 129.2739
## 2 2012-11-01 9.82 138.36 0.05818966 0.005669429 62.27013 128.1348
## 3 2012-10-01 9.28 137.58 0.05574516 -0.018197388 58.84591 127.4125
## 4 2012-09-04 8.79 140.13 0.10566038 0.025391483 55.73874 129.7740
## 5 2012-08-01 7.95 136.66 0.08904110 0.025052505 50.41218 126.5605
## 6 2012-07-02 7.30 133.32 -0.10319410 0.011839709 46.29042 123.4673
tail(da)
## Date BAC SPY BACR SPYR BACRebased SPYRebased
## 151 2000-06-01 14.29 112.96 -0.22421281 0.01977070 90.61509 104.6120
## 152 2000-05-01 18.42 110.77 0.14197148 -0.01572774 116.80406 102.5838
## 153 2000-04-03 16.13 112.54 -0.06546929 -0.03523360 102.28282 104.2230
## 154 2000-03-01 17.26 116.65 0.15220294 0.09695317 109.44832 108.0293
## 155 2000-02-01 14.98 106.34 -0.05009512 -0.01518800 94.99049 98.4812
## 156 2000-01-03 15.77 107.98 NA NA 100.00000 100.0000
Now plot
plot(da$Date, da$BACRebased, type = 'l', xlab = "Date",
ylab = "Price", main = "BAC and S&P 500", lty = 1)
lines(da$Date, da$SPYRebased, col = 'blue', lty = 2)
legend('topleft', inset = 0.05, legend = c("BAC", "SPY"),
col = c('black', 'blue'), lty = c(1, 2), cex = 0.8)