Quandl: Financial, Economic and Alternative Data

Dataset: Apple - daily stock price data

Open, high, low, close, adjusted data and volumes


(a) Description of the dataset:

Quandl is an open data provider that offers thousands of datasets. These datasets can be imported directly in R through functions contained in the Quandl package.

There are a lot of datasets in Quandl referred to stock price. In these dataset, for every trading day, Quandl collects the date and different information. Particulary these information are:

  1. Date: Registered in format “YYYY-MM-DD”
  2. Open: First price at which the stock traded at the begining of the trading day
  3. High: The highest trading price registered during the day
  4. Low: The lowest trading price registered during the day
  5. Close: Last price at which the stock traded during the day
  6. Volume: Total quantity of shares tradred during the trading day
  7. Dividend: The amount of dividend distributed per share
  8. Split: Number of shares in which the pre-existing share is divided
  9. Adjusted Open: The opening price adjusted for all applicable splits and dividends distributions
  10. Adjusted High: The higest trading price adjusted for all applicable splits and dividends distributions
  11. Adjusted Low: The lowest trading price adjusted for all applicable splits and dividends distributions
  12. Adjusted Close: The last trading price at which the stock traded during the day adjusted for all applicable splits and dividends distributions
  13. Adjusted Volume: Total quantity of shares tradred during the trading day adjusted for all applicable splits

All these informations are arranged by column in a (n by 10) matrix (where n is the number of observation).

Importing Quandl datasets

(b) Install the package that allows to access the datasets

To acces the dataset we need to install the Quandl package and recall it with the library function

#install.packages("Quandl", repos = "http://cran.us.r-project.org")
library(Quandl)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

(c) Variables description:

The aim of our work is computing the rate of return of the Apple and Microsoft stocks and the rate of return of an equal weighted portfolio of these two stocks. We then want to plot the return on 100 dollars invested in each stock and in the portfolio.

To carry out this task we subsetted the two dataframes of the stocks, keeping only these variables:

  • Variable 1: date

  • Variable 2: adj_close apple, that we called ‘Apple’ in the dataset

  • Variable 3: adj_close msft, that we called ‘Msft’ in the dataset

The adjusted close price is the last trading price at which the stock traded during the day adjusted for all applicable splits and dividends distributions.

A more general explaination of the codes and dataframes obtained from Quandl is avaiable at this link:

Quandl package: https://cran.r-project.org/web/packages/Quandl/Quandl.pdf

Apple: https://www.quandl.com/data/EOD/AAPL

Microsoft: https://www.quandl.com/data/EOD/MSFT

Adjusted data: http://www.crsp.com/products/documentation/crsp-calculations

(d) Data download:

In Quandl, every dataset is identified by a unique id. This unique id can be found on the Quandl website as the Quandl Code. You can use this ID to access your dataset in R. For example: Get the Apple or Microsoft dataframes (Quandl code is “EOD/AAPL” and “EOD/MSFT”).

The Quandl package is able to return data in different formats. We choose to use the dataframe format, obtained by specifing “raw” in the input “type” in the Quandl() function.

In order to get access to the data, is necessary to insert the personal key which can be obtained in the website after the registration.

apple <- Quandl("EOD/AAPL", api_key = "WSfEZRAC2EX_nHdvizCq", type="raw")
head(apple)
##         Date   Open    High     Low  Close   Volume Dividend Split Adj_Open
## 1 2017-12-28 171.00 171.850 170.480 171.08 16480187        0     1 165.9712
## 2 2017-12-27 170.10 170.780 169.710 170.60 21498213        0     1 165.0977
## 3 2017-12-26 170.80 171.470 169.679 170.57 33185536        0     1 165.7771
## 4 2017-12-22 174.68 175.424 174.500 175.01 16349444        0     1 169.5430
## 5 2017-12-21 174.17 176.020 174.100 175.01 20949896        0     1 169.0480
## 6 2017-12-20 174.87 175.420 173.250 174.35 23475649        0     1 169.7274
##   Adj_High  Adj_Low Adj_Close Adj_Volume
## 1 166.7962 165.4665  166.0489   16480187
## 2 165.7577 164.7191  165.5830   21498213
## 3 166.4274 164.6891  165.5539   33185536
## 4 170.2651 169.3683  169.8633   16349444
## 5 170.8436 168.9800  169.8633   20949896
## 6 170.2612 168.1550  169.2227   23475649
msft <- Quandl("EOD/MSFT", api_key="WSfEZRAC2EX_nHdvizCq", type = "raw")
head(msft)
##         Date  Open    High    Low Close   Volume Dividend Split Adj_Open
## 1 2017-12-28 85.90 85.9300 85.550 85.72 10594344        0     1 83.48228
## 2 2017-12-27 85.65 85.9800 85.215 85.71 14678025        0     1 83.23932
## 3 2017-12-26 85.31 85.5346 85.030 85.40  9891237        0     1 82.90889
## 4 2017-12-22 85.40 85.6300 84.920 85.51 14145841        0     1 82.99636
## 5 2017-12-21 86.05 86.1000 85.400 85.50 17990745        0     1 83.62806
## 6 2017-12-20 86.20 86.3000 84.710 85.52 23674931        0     1 83.77384
##   Adj_High  Adj_Low Adj_Close Adj_Volume
## 1 83.51144 83.14214  83.30735   10594344
## 2 83.56003 82.81656  83.29763   14678025
## 3 83.12717 82.63677  82.99636    9891237
## 4 83.21988 82.52987  83.10326   14145841
## 5 83.67666 82.99636  83.09354   17990745
## 6 83.87103 82.32578  83.11298   23674931
dataset <- data.frame(Date = apple$Date, Apple = apple$Adj_Close, Msft = msft$Adj_Close)
head(dataset)
##         Date    Apple     Msft
## 1 2017-12-28 166.0489 83.30735
## 2 2017-12-27 165.5830 83.29763
## 3 2017-12-26 165.5539 82.99636
## 4 2017-12-22 169.8633 83.10326
## 5 2017-12-21 169.8633 83.09354
## 6 2017-12-20 169.2227 83.11298

Obtaining basic information about the dataset

To show the data structure and some meaningful values of the dataset, we used the functions summary() and str()

summary(dataset[,-1])
##      Apple             Msft      
##  Min.   : 57.53   Min.   :27.11  
##  1st Qu.: 89.57   1st Qu.:38.83  
##  Median :104.05   Median :46.20  
##  Mean   :106.41   Mean   :48.85  
##  3rd Qu.:118.47   3rd Qu.:57.61  
##  Max.   :171.23   Max.   :84.41
str(dataset)
## 'data.frame':    1090 obs. of  3 variables:
##  $ Date : Date, format: "2017-12-28" "2017-12-27" ...
##  $ Apple: num  166 166 166 170 170 ...
##  $ Msft : num  83.3 83.3 83 83.1 83.1 ...

(e) operations, subset, statistics and graph:

After taking a look at the Apple dataframe we notice a significant change in the stock price, caused by a split of the shares. We then decided to find when it occourred and divide the dataset into two periods: before and after the split.

# creating subset 
split_date <- subset(apple, Split != 1, select = Date)   # finding the split date
split_date
##           Date
## 898 2014-06-09
pre_split_dataset <- subset(dataset, as.numeric(Date) < as.numeric(split_date))
post_split_dataset <- subset(dataset, as.numeric(Date) >= as.numeric(split_date))


To show how the two stocks and the portfolio performed in the period that we are studying, we compute the return of both.

# Apple return
apple_final_price <- dataset$Apple[-length(dataset$Apple)]
apple_initial_price <- dataset$Apple[-1]
apple_return_list <- (apple_final_price - apple_initial_price) / apple_initial_price

# Microsoft return
msft_return_list <- (dataset$Msft[-length(dataset$Msft)] - dataset$Msft[-1]) / dataset$Msft[-length(dataset$Msft)]

# Equal weighted portfolio return 
portfolio_return_list <- 0.50 * apple_return_list + 0.5* msft_return_list

# Collecting the return in a dataframe
return_dataframe <- data.frame(Date = dataset$Date[-length(dataset$Date)], Apple.Return = apple_return_list, Msft.Return = msft_return_list, Portfolio.Return = portfolio_return_list)

Basic Statistics About Returns

To have a better rapresentation of the range of values as well as the generic situation of the returns of the shares that we are analyzing, we compute some basic statistics.

Mean <- apply(return_dataframe[,-1], 2, mean)
Stdv <- apply(return_dataframe[,-1], 2, sd)
Min <- apply(return_dataframe[,-1], 2, min)
Max <- apply(return_dataframe[,-1], 2, max)
Corr <- cor(return_dataframe$Apple.Return, return_dataframe$Msft.Return)

statistics <- rbind(Mean,Stdv,Min,Max, Corr)
statistics
##      Apple.Return   Msft.Return Portfolio.Return
## Mean  0.001000488  0.0009166808     0.0009585843
## Stdv  0.014285053  0.0135815776     0.0116909252
## Min  -0.079927339 -0.1019690577    -0.0684911601
## Max   0.081981896  0.0946312931     0.0612704854
## Corr  0.407674976  0.4076749756     0.4076749756

Prepare data for the plot and plot

We have to set the returns in a chronological order, from the oldest to the latest, since the starting dataset is in the opposite order.

# Reverse the vectors of returns from the oldest to the latest return
apple_return_list <- rev(apple_return_list)
msft_return_list <- rev(msft_return_list)

Then we calculate the cumulative returns of both stocks and the portfolio, starting from a base value of 100 which is the initial investment, to see more clearly the trends and make them more comparable.

apple_cumulative_return_list <- c()
apple_cumulative_return <- 100
for(i in 1:length(apple_return_list)){
  apple_cumulative_return <- apple_cumulative_return * (1+apple_return_list[i])
  apple_cumulative_return_list <- c(apple_cumulative_return_list, apple_cumulative_return)
  }

msft_cumulative_return_list <- c()
msft_cumulative_return <- 100
for(i in 1:length(msft_return_list)){
  msft_cumulative_return <- msft_cumulative_return * (1+msft_return_list[i])
  msft_cumulative_return_list <- c(msft_cumulative_return_list, msft_cumulative_return)
}

pf_cumulative_return_list <- c()
pf_cumulative_return <- 100
for(i in 1:length(portfolio_return_list)){
  pf_cumulative_return <- pf_cumulative_return * (1+portfolio_return_list[i])
  pf_cumulative_return_list <- c(pf_cumulative_return_list, pf_cumulative_return)
}

Lastly we plot the returns of the stocks and the portfolio.

plot(rev(apple$Date[-1]), apple_cumulative_return_list , "l", xlab = "Date", ylab = "Returns")
lines(rev(apple$Date[-1]), msft_cumulative_return_list, "l", col = "red")
lines(rev(apple$Date[-1]), pf_cumulative_return_list, "l", col = "blue", lwd = 1.7)
legend(x = "topleft", legend = c("AAPL", "MSFT", "Equal weighted portfolio"), lty = 1, col = c("black", "red", "blue"))

(f) Identifying the nature of the dataset:

The native format of our dataset is wide becuase we have for every observation (one day) 2 different variables for the prices of the two different stock. In order to transform the format from wide to long we have to create a column containing all the values and another column listing the context of the value.

(g) Transform the dataset into the other representation and output a small part of the dataset in both representations:

#install.packages("tidyr")
library(tidyr)   # we use the gather function which is in the tidyr library
dataset_long <- gather(dataset, Company, Price, -Date)
dataset_long <- dataset_long[order(dataset_long$Date, decreasing = TRUE),]

head(dataset)
##         Date    Apple     Msft
## 1 2017-12-28 166.0489 83.30735
## 2 2017-12-27 165.5830 83.29763
## 3 2017-12-26 165.5539 82.99636
## 4 2017-12-22 169.8633 83.10326
## 5 2017-12-21 169.8633 83.09354
## 6 2017-12-20 169.2227 83.11298
head(dataset_long)
##            Date Company     Price
## 1    2017-12-28   Apple 166.04885
## 1091 2017-12-28    Msft  83.30735
## 2    2017-12-27   Apple 165.58297
## 1092 2017-12-27    Msft  83.29763
## 3    2017-12-26   Apple 165.55385
## 1093 2017-12-26    Msft  82.99636

Part 2

(a) Find a data source that is not mentioned on the class slides:

We find a dataframe about Billboard 100 top song’s. We retrived it from

URL: https://github.com/mikkelkrogsholm/billboard

and the datas are originally taken from Spotify, here

URL: https://developer.spotify.com/web-api/get-audio-features/

(b) One paragraph about the dataset:

This dataset contains informations regarding songs on the Billboard Hot 100 list from 1960 to 2016. The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States. (From: Wikipedia, the free encyclopedia, 10 November 2019,

URL: https://en.wikipedia.org/w/index.php?title=Billboard_Hot_100&oldid=925689701

(c) Install the package that allows to access your dataset.:

We installed also the package matlib in order to perform some matrix calculation which will be usefull, later on the exercise, to create a linear regression model between two variables.

#install.packages("billboard", repos = "http://cran.us.r-project.org")
#install.packages("matlib", repos = "http://cran.us.r-project.org")
library(billboard)
library(matlib)

(d) Choose an interesting example:

The data frame is composed by 5497 rows and 23 variables, but we choose to utilize only 10 variables and only 1272 rows (years from 2000 till the end of 2012) The variables that we are using are:

year = year

artist_name = the artist of the song

explicit = if the track is rated as explicit

track_name = the name of the track

danceability = Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

energy = Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

loudness = The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.

mode = Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness = Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

valence = A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

duration_ms = The duration of the track in milliseconds.

for the complete informations about all the variables in the dataset consult: https://cran.r-project.org/web/packages/billboard/billboard.pdf

(e) Download the data:

songs_data <- spotify_track_data

(f) Perform some basic operations:

As in the previous exercise, we compute some basic statistic operations on the datas, grouping the results in a matrix.

songs_data <- subset(songs_data, year >= 2000 & year < 2013, select = c(year, artist_name, track_name, explicit, danceability, energy, loudness, speechiness, duration_ms,valence))

list_mean <- c(mean(songs_data$danceability), mean(songs_data$energy),
               mean(songs_data$loudness), mean(songs_data$speechiness),
               mean(songs_data$duration_ms), mean(songs_data$valence))

list_std <- c(sd(songs_data$danceability), sd(songs_data$energy),
               sd(songs_data$loudness), sd(songs_data$speechiness),
               sd(songs_data$duration_ms), sd(songs_data$valence))

list_min <- c(min(songs_data$danceability), min(songs_data$energy),
               min(songs_data$loudness), min(songs_data$speechiness),
               min(songs_data$duration_ms), min(songs_data$valence))

list_max <- c(max(songs_data$danceability), max(songs_data$energy),
               max(songs_data$loudness), max(songs_data$speechiness),
               max(songs_data$duration_ms), max(songs_data$valence))

songs_statistics <- cbind(list_mean, list_std, list_min, list_max)

colnames(songs_statistics) <- c("Mean", "St Dev", "Min", "Max")
rownames(songs_statistics) <- c("Danceability", "Energy", "Loudness", "Speechiness", "Duration", "Valence")
songs_statistics
##                       Mean       St Dev        Min        Max
## Danceability  6.564914e-01 1.424895e-01     0.2090      0.973
## Energy        7.094308e-01 1.495904e-01     0.1320      0.991
## Loudness     -5.584553e+00 1.894350e+00   -14.0010     -0.166
## Speechiness   1.093427e-01 1.031828e-01     0.0236      0.576
## Duration      2.412545e+05 3.738438e+04 78200.0000 448573.000
## Valence       5.697276e-01 2.228602e-01     0.0395      0.967

Prepare data for the plot and plot

Finally we decided to analyze the datas performing a linear regression. As parameters we choose the variables ‘loudness’ and ‘energy’, and decided to discover how much the first value can influence and forecast the second one. As shown in the regression graph, we can say that there is a good linear relationship between the variables.

# Performing an ols regression model in sample and test it out of sample
vector_of_ones <- rep(1, length(songs_data$energy))

X_loudness <- cbind(vector_of_ones, songs_data$loudness)

colnames(X_loudness) <- c("Intercept", "Loudness")

Y <- songs_data$energy

ols_estimator_loudness <- inv(t(X_loudness) %*% X_loudness) %*% t(X_loudness) %*% Y
regression_loudness <- X_loudness %*% ols_estimator_loudness

plot(songs_data$loudness, songs_data$energy, xlab = "Loudness", ylab = "Energy")
lines(songs_data$loudness, regression_loudness, type = "l", col = "red",lwd = 2.5)

(g) Identifying the nature of the dataset:

The dataset has a wide format because for every variable there is a column containing the value for each song.