Quandl is an open data provider that offers thousands of datasets, which can be imported directly into R through functions contained in the Quandl package.
Many Quandl datasets refer to stock prices. For every trading day, these datasets record the date together with price and volume information: Date, Open, High, Low, Close, Volume, Dividend, Split, and the corresponding adjusted series (Adj_Open, Adj_High, Adj_Low, Adj_Close, Adj_Volume).
All this information is arranged by column in an n-by-13 data frame (where n is the number of observations).
To access a dataset we need to install the Quandl package and load it with the library() function.
#install.packages("Quandl", repos = "http://cran.us.r-project.org")
library(Quandl)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
The aim of our work is to compute the rate of return of the Apple and Microsoft stocks and the rate of return of an equally weighted portfolio of these two stocks. We then want to plot the return on 100 dollars invested in each stock and in the portfolio.
To carry out this task we subsetted the two data frames of the stocks, keeping only these variables:
Variable 1: date
Variable 2: adjusted close price of Apple, which we called ‘Apple’ in the dataset
Variable 3: adjusted close price of Microsoft, which we called ‘Msft’ in the dataset
The adjusted close price is the last price at which the stock traded during the day, adjusted for all applicable splits and dividend distributions.
A more general explanation of the codes and data frames obtained from Quandl is available at these links:
Quandl package:
Apple:
Microsoft:
Adjusted data:
In Quandl, every dataset is identified by a unique id, which can be found on the Quandl website as the Quandl Code. You can use this ID to access your dataset in R, for example to get the Apple or Microsoft data frames (the Quandl codes are “EOD/AAPL” and “EOD/MSFT”).
The Quandl package can return data in several formats. We chose the data frame format, obtained by specifying type = "raw" in the Quandl() function.
In order to get access to the data, it is necessary to provide a personal API key, which can be obtained from the website after registration.
apple <- Quandl("EOD/AAPL", api_key = "WSfEZRAC2EX_nHdvizCq", type="raw")
head(apple)
## Date Open High Low Close Volume Dividend Split Adj_Open
## 1 2017-12-28 171.00 171.850 170.480 171.08 16480187 0 1 165.9712
## 2 2017-12-27 170.10 170.780 169.710 170.60 21498213 0 1 165.0977
## 3 2017-12-26 170.80 171.470 169.679 170.57 33185536 0 1 165.7771
## 4 2017-12-22 174.68 175.424 174.500 175.01 16349444 0 1 169.5430
## 5 2017-12-21 174.17 176.020 174.100 175.01 20949896 0 1 169.0480
## 6 2017-12-20 174.87 175.420 173.250 174.35 23475649 0 1 169.7274
## Adj_High Adj_Low Adj_Close Adj_Volume
## 1 166.7962 165.4665 166.0489 16480187
## 2 165.7577 164.7191 165.5830 21498213
## 3 166.4274 164.6891 165.5539 33185536
## 4 170.2651 169.3683 169.8633 16349444
## 5 170.8436 168.9800 169.8633 20949896
## 6 170.2612 168.1550 169.2227 23475649
msft <- Quandl("EOD/MSFT", api_key="WSfEZRAC2EX_nHdvizCq", type = "raw")
head(msft)
## Date Open High Low Close Volume Dividend Split Adj_Open
## 1 2017-12-28 85.90 85.9300 85.550 85.72 10594344 0 1 83.48228
## 2 2017-12-27 85.65 85.9800 85.215 85.71 14678025 0 1 83.23932
## 3 2017-12-26 85.31 85.5346 85.030 85.40 9891237 0 1 82.90889
## 4 2017-12-22 85.40 85.6300 84.920 85.51 14145841 0 1 82.99636
## 5 2017-12-21 86.05 86.1000 85.400 85.50 17990745 0 1 83.62806
## 6 2017-12-20 86.20 86.3000 84.710 85.52 23674931 0 1 83.77384
## Adj_High Adj_Low Adj_Close Adj_Volume
## 1 83.51144 83.14214 83.30735 10594344
## 2 83.56003 82.81656 83.29763 14678025
## 3 83.12717 82.63677 82.99636 9891237
## 4 83.21988 82.52987 83.10326 14145841
## 5 83.67666 82.99636 83.09354 17990745
## 6 83.87103 82.32578 83.11298 23674931
dataset <- data.frame(Date = apple$Date, Apple = apple$Adj_Close, Msft = msft$Adj_Close)
head(dataset)
## Date Apple Msft
## 1 2017-12-28 166.0489 83.30735
## 2 2017-12-27 165.5830 83.29763
## 3 2017-12-26 165.5539 82.99636
## 4 2017-12-22 169.8633 83.10326
## 5 2017-12-21 169.8633 83.09354
## 6 2017-12-20 169.2227 83.11298
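Note that building the dataset this way assumes that the Apple and Microsoft data frames cover exactly the same trading days in the same order. A more defensive sketch (not what we used above) joins the two series explicitly by date with the base merge() function; merge() sorts the result by Date in ascending order, so it would need re-sorting to match the rest of the workflow.
# Hypothetical alternative: match the two series by trading day rather than by row position
merged <- merge(apple[, c("Date", "Adj_Close")], msft[, c("Date", "Adj_Close")],
                by = "Date", suffixes = c(".Apple", ".Msft"))
dataset_merged <- data.frame(Date  = merged$Date,
                             Apple = merged$Adj_Close.Apple,
                             Msft  = merged$Adj_Close.Msft)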
To show the data structure and some meaningful values of the dataset, we used the functions summary() and str().
summary(dataset[,-1])
## Apple Msft
## Min. : 57.53 Min. :27.11
## 1st Qu.: 89.57 1st Qu.:38.83
## Median :104.05 Median :46.20
## Mean :106.41 Mean :48.85
## 3rd Qu.:118.47 3rd Qu.:57.61
## Max. :171.23 Max. :84.41
str(dataset)
## 'data.frame': 1090 obs. of 3 variables:
## $ Date : Date, format: "2017-12-28" "2017-12-27" ...
## $ Apple: num 166 166 166 170 170 ...
## $ Msft : num 83.3 83.3 83 83.1 83.1 ...
After taking a look at the Apple data frame we noticed a significant change in the stock price, caused by a split of the shares. We then decided to find when it occurred and divide the dataset into two periods: before and after the split.
# creating subset
split_date <- subset(apple, Split != 1, select = Date) # finding the split date
split_date
## Date
## 898 2014-06-09
pre_split_dataset <- subset(dataset, as.numeric(Date) < as.numeric(split_date))
post_split_dataset <- subset(dataset, as.numeric(Date) >= as.numeric(split_date))
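As a quick sanity check (not used further in the analysis), we can look at the size and the adjusted price level of the two sub-periods:
nrow(pre_split_dataset)            # trading days before the split
nrow(post_split_dataset)           # trading days from the split date onwards
summary(pre_split_dataset$Apple)   # adjusted close before the split
summary(post_split_dataset$Apple)  # adjusted close after the split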
To show how the two stocks and the portfolio performed in the period we are studying, we compute their simple daily returns, defined as (final price - initial price) / initial price, using the adjusted close prices.
# Apple return
apple_final_price <- dataset$Apple[-length(dataset$Apple)]
apple_initial_price <- dataset$Apple[-1]
apple_return_list <- (apple_final_price - apple_initial_price) / apple_initial_price
# Microsoft return (same (final - initial) / initial formula as for Apple)
msft_return_list <- (dataset$Msft[-length(dataset$Msft)] - dataset$Msft[-1]) / dataset$Msft[-1]
# Equally weighted portfolio return
portfolio_return_list <- 0.5 * apple_return_list + 0.5 * msft_return_list
# Collecting the return in a dataframe
return_dataframe <- data.frame(Date = dataset$Date[-length(dataset$Date)], Apple.Return = apple_return_list, Msft.Return = msft_return_list, Portfolio.Return = portfolio_return_list)
To get a better picture of the range of values and the overall behaviour of the returns we are analyzing, we compute some basic statistics.
Mean <- apply(return_dataframe[,-1], 2, mean)
Stdv <- apply(return_dataframe[,-1], 2, sd)
Min <- apply(return_dataframe[,-1], 2, min)
Max <- apply(return_dataframe[,-1], 2, max)
Corr <- cor(return_dataframe$Apple.Return, return_dataframe$Msft.Return) # single Apple-Msft correlation
statistics <- rbind(Mean, Stdv, Min, Max, Corr) # note: the scalar Corr is recycled across the three columns
statistics
## Apple.Return Msft.Return Portfolio.Return
## Mean 0.001000488 0.0009166808 0.0009585843
## Stdv 0.014285053 0.0135815776 0.0116909252
## Min -0.079927339 -0.1019690577 -0.0684911601
## Max 0.081981896 0.0946312931 0.0612704854
## Corr 0.407674976 0.4076749756 0.4076749756
We have to put the returns in chronological order, from oldest to latest, since the original dataset is sorted the opposite way (newest first).
# Reverse the vectors of returns from the oldest to the latest return
apple_return_list <- rev(apple_return_list)
msft_return_list <- rev(msft_return_list)
portfolio_return_list <- rev(portfolio_return_list) # the portfolio returns must be reversed as well
Then we calculate the cumulative returns of both stocks and of the portfolio, starting from a base value of 100 (the initial investment), to see the trends more clearly and make them comparable.
apple_cumulative_return_list <- c()
apple_cumulative_return <- 100
for(i in 1:length(apple_return_list)){
apple_cumulative_return <- apple_cumulative_return * (1+apple_return_list[i])
apple_cumulative_return_list <- c(apple_cumulative_return_list, apple_cumulative_return)
}
msft_cumulative_return_list <- c()
msft_cumulative_return <- 100
for(i in 1:length(msft_return_list)){
msft_cumulative_return <- msft_cumulative_return * (1+msft_return_list[i])
msft_cumulative_return_list <- c(msft_cumulative_return_list, msft_cumulative_return)
}
pf_cumulative_return_list <- c()
pf_cumulative_return <- 100
for(i in 1:length(portfolio_return_list)){
pf_cumulative_return <- pf_cumulative_return * (1+portfolio_return_list[i])
pf_cumulative_return_list <- c(pf_cumulative_return_list, pf_cumulative_return)
}
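The three loops above can also be written as vectorized one-liners; a minimal sketch with base R's cumprod(), producing the same series, is:
# Vectorized equivalent of the loops: cumulative product of (1 + daily return), scaled by the initial 100 dollars
apple_cumulative_return_list <- 100 * cumprod(1 + apple_return_list)
msft_cumulative_return_list <- 100 * cumprod(1 + msft_return_list)
pf_cumulative_return_list <- 100 * cumprod(1 + portfolio_return_list)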
Lastly, we plot the value over time of 100 dollars invested in each stock and in the portfolio.
plot(rev(apple$Date[-1]), apple_cumulative_return_list, type = "l", xlab = "Date", ylab = "Value of 100 dollars invested",
     ylim = range(apple_cumulative_return_list, msft_cumulative_return_list, pf_cumulative_return_list))
lines(rev(apple$Date[-1]), msft_cumulative_return_list, col = "red")
lines(rev(apple$Date[-1]), pf_cumulative_return_list, col = "blue", lwd = 1.7)
legend(x = "topleft", legend = c("AAPL", "MSFT", "Equal weighted portfolio"), lty = 1, col = c("black", "red", "blue"))
The native format of our dataset is wide because for every observation (one day) there are two different variables holding the prices of the two stocks. In order to transform the format from wide to long we have to create a column containing all the values and another column identifying which stock each value belongs to.
#install.packages("tidyr")
library(tidyr) # we use the gather function which is in the tidyr library
dataset_long <- gather(dataset, Company, Price, -Date)
dataset_long <- dataset_long[order(dataset_long$Date, decreasing = TRUE),]
head(dataset)
## Date Apple Msft
## 1 2017-12-28 166.0489 83.30735
## 2 2017-12-27 165.5830 83.29763
## 3 2017-12-26 165.5539 82.99636
## 4 2017-12-22 169.8633 83.10326
## 5 2017-12-21 169.8633 83.09354
## 6 2017-12-20 169.2227 83.11298
head(dataset_long)
## Date Company Price
## 1 2017-12-28 Apple 166.04885
## 1091 2017-12-28 Msft 83.30735
## 2 2017-12-27 Apple 165.58297
## 1092 2017-12-27 Msft 83.29763
## 3 2017-12-26 Apple 165.55385
## 1093 2017-12-26 Msft 82.99636
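The same reshaping can also be done with tidyr's newer pivot_longer() function (available in tidyr 1.0.0 and later); a sketch equivalent to the gather() call above is:
# Equivalent long-format reshaping with pivot_longer()
dataset_long2 <- pivot_longer(dataset, cols = c(Apple, Msft),
                              names_to = "Company", values_to = "Price")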
We found a data frame of Billboard Hot 100 top songs. We retrieved it from
URL:
and the data are originally taken from Spotify, here
URL:
This dataset contains information regarding songs on the Billboard Hot 100 list from 1960 to 2016. The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States. (From: Wikipedia, the free encyclopedia, 10 November 2019,
URL:
We also installed the matlib package in order to perform some matrix calculations, which will be useful later in the exercise to build a linear regression model between two variables.
#install.packages("billboard", repos = "http://cran.us.r-project.org")
#install.packages("matlib", repos = "http://cran.us.r-project.org")
library(billboard)
library(matlib)
The data frame is composed of 5497 rows and 23 variables, but we chose to use only 10 variables and only 1272 rows (years from 2000 to the end of 2012). The variables that we are using are:
year = year
artist_name = the artist of the song
explicit = if the track is rated as explicit
track_name = the name of the track
danceability = Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy = Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
loudness = The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode = Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness = Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
valence = A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
duration_ms = The duration of the track in milliseconds.
For complete information about all the variables in the dataset, consult:
songs_data <- spotify_track_data
As in the previous exercise, we compute some basic statistics on the data, collecting the results in a matrix.
songs_data <- subset(songs_data, year >= 2000 & year < 2013, select = c(year, artist_name, track_name, explicit, danceability, energy, loudness, speechiness, duration_ms,valence))
list_mean <- c(mean(songs_data$danceability), mean(songs_data$energy),
mean(songs_data$loudness), mean(songs_data$speechiness),
mean(songs_data$duration_ms), mean(songs_data$valence))
list_std <- c(sd(songs_data$danceability), sd(songs_data$energy),
sd(songs_data$loudness), sd(songs_data$speechiness),
sd(songs_data$duration_ms), sd(songs_data$valence))
list_min <- c(min(songs_data$danceability), min(songs_data$energy),
min(songs_data$loudness), min(songs_data$speechiness),
min(songs_data$duration_ms), min(songs_data$valence))
list_max <- c(max(songs_data$danceability), max(songs_data$energy),
max(songs_data$loudness), max(songs_data$speechiness),
max(songs_data$duration_ms), max(songs_data$valence))
songs_statistics <- cbind(list_mean, list_std, list_min, list_max)
colnames(songs_statistics) <- c("Mean", "St Dev", "Min", "Max")
rownames(songs_statistics) <- c("Danceability", "Energy", "Loudness", "Speechiness", "Duration", "Valence")
songs_statistics
## Mean St Dev Min Max
## Danceability 6.564914e-01 1.424895e-01 0.2090 0.973
## Energy 7.094308e-01 1.495904e-01 0.1320 0.991
## Loudness -5.584553e+00 1.894350e+00 -14.0010 -0.166
## Speechiness 1.093427e-01 1.031828e-01 0.0236 0.576
## Duration 2.412545e+05 3.738438e+04 78200.0000 448573.000
## Valence 5.697276e-01 2.228602e-01 0.0395 0.967
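The same summary matrix can be built without listing every variable by hand; a compact sketch using base R's sapply() over the numeric columns is:
# Compact alternative: apply the four statistics to each numeric column at once
num_vars <- c("danceability", "energy", "loudness", "speechiness", "duration_ms", "valence")
songs_statistics_alt <- t(sapply(songs_data[, num_vars],
                                 function(x) c(Mean = mean(x), "St Dev" = sd(x), Min = min(x), Max = max(x))))
songs_statistics_alt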
Finally, we decided to analyze the data by performing a linear regression. As variables we chose ‘loudness’ and ‘energy’, and we set out to discover how well the first can explain and predict the second. As shown in the regression plot, there is a clear positive linear relationship between the two variables.
# Performing an OLS regression of energy on loudness via the normal equations
vector_of_ones <- rep(1, length(songs_data$energy))
X_loudness <- cbind(vector_of_ones, songs_data$loudness)
colnames(X_loudness) <- c("Intercept", "Loudness")
Y <- songs_data$energy
ols_estimator_loudness <- inv(t(X_loudness) %*% X_loudness) %*% t(X_loudness) %*% Y
regression_loudness <- X_loudness %*% ols_estimator_loudness
plot(songs_data$loudness, songs_data$energy, xlab = "Loudness", ylab = "Energy")
lines(songs_data$loudness, regression_loudness, type = "l", col = "red",lwd = 2.5)
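As a cross-check of the matrix-algebra estimate, the same coefficients can be obtained with base R's lm() function, or with solve() in place of matlib::inv(); a minimal sketch is:
# Built-in OLS fit; coef(fit) should match ols_estimator_loudness
fit <- lm(energy ~ loudness, data = songs_data)
coef(fit)
# Equivalent normal-equations solution without forming an explicit inverse
solve(t(X_loudness) %*% X_loudness, t(X_loudness) %*% Y)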
The dataset has a wide format because each variable has its own column, with one value per song (row).