The aim of this project is to prepare, evaluate and analyse stock market data and recommend an optimal portfolio consisting of two stocks, We will apply both statistical and mathematical principles.The original question we are trying to solve can be seen in this Link
We have three specific stocks for analysis which are Comcast corp (CMCSA),International Paper Co (IP) and Exxon Mobil Corp (XOM)
Loading necessary libraries
# packages
library(tidyquant) # for importing stock data
library(tidyverse) # for working with data
library(broom) # for tidying output from various statistical procedures
library(knitr) # for tables
library(kableExtra) # for improving the appearance of tables
library(ggplot2)
library(psych)
library(dplyr)
## The beginnig date is January 01,2000
## The ending date is the date I knitted and submitted the project
StockData<-c("CMCSA", "IP","XOM") %>%
tq_get(get = "stock.prices", from = "2000-01-01")%>%
select(symbol, date, adjusted)
## output the first 6 rows of your data frame:
head(StockData, n = 6 )%>%
kable(caption = "Stock Market Data")
symbol | date | adjusted |
---|---|---|
CMCSA | 2000-01-03 | 11.27721 |
CMCSA | 2000-01-04 | 10.43293 |
CMCSA | 2000-01-05 | 10.16155 |
CMCSA | 2000-01-06 | 10.55354 |
CMCSA | 2000-01-07 | 10.22186 |
CMCSA | 2000-01-10 | 11.53351 |
## The beginning date is January 01, 2000
StockData <- c("CMCSA", "IP", "XOM") %>%
tq_get(get = "stock.prices", from = "2000-01-01") %>%
select(symbol, date, adjusted)
## Filter and combine the data for CMCSA, IP, and XOM
CombinedData <- StockData %>%
filter(symbol %in% c("CMCSA", "IP", "XOM")) %>%
group_by(symbol) %>%
slice(1:3) %>%
ungroup()
## Output the combined data
CombinedData %>%
kable(caption = "Combined Stock Market Data")
symbol | date | adjusted |
---|---|---|
CMCSA | 2000-01-03 | 11.27721 |
CMCSA | 2000-01-04 | 10.43293 |
CMCSA | 2000-01-05 | 10.16155 |
IP | 2000-01-03 | 22.41751 |
IP | 2000-01-04 | 21.86212 |
IP | 2000-01-05 | 22.61948 |
XOM | 2000-01-03 | 18.32869 |
XOM | 2000-01-04 | 17.97762 |
XOM | 2000-01-05 | 18.95770 |
When plotting a price over time, you’re essentially creating a time series plot.
These equations can vary depending on the specific characteristics of the time series and the modeling approach used. Here are some common types of time series equations:
Autoregressive (AR) Model: In an autoregressive model of order p, denoted AR(p), the value of the time series at any given time depends linearly on its previous p values, plus some random noise.
Equation: \[X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + ... + \phi_p X_{t-p} + \varepsilon_t\]
Where:
Moving Average (MA) Model: In a moving average model of order q, denoted MA(q), the value of the time series at any given time depends linearly on the current and previous q random errors.
Equation: \[X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + ... + \theta_q \varepsilon_{t-q}\]
Where:
Autoregressive Moving Average (ARMA) Model: The ARMA model combines both autoregressive and moving average components.
Equation: \[X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + ... + \phi_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + ... + \theta_q \varepsilon_{t-q}\]
Where the terms are similar to those in the AR and MA models.
Autoregressive Integrated Moving Average (ARIMA) Model: The ARIMA model includes differencing of the time series to make it stationary.
Equation: \[X_t = c + \phi_1 \Delta X_{t-1} + \phi_2 \Delta X_{t-2} + ... + \phi_p \Delta X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + ... + \theta_q \varepsilon_{t-q}\]
Where \(\Delta X_t = X_t - X_{t-1}\) is the differenced series.
These equations are fundamental in time series analysis and forecasting, and they serve as the basis for more advanced models and techniques.
Plot the prices of each asset over time separately.
# Convert date to proper date format
StockData$date <- as.Date(StockData$date)
# Plotting prices over time for each asset separately
ggplot(StockData, aes(x = date, y = adjusted, color = symbol)) +
geom_line() +
facet_wrap(~symbol, scales = "free_y", ncol = 1) +
labs(title = "Stock Prices Over Time",
x = "Date",
y = "Adjusted Price",
color = "Asset") +
theme_minimal()
## CMCSA
The stock price for CMCSA has shown a steady increase over time, with minor fluctuations. It started below 20 and has risen to above 40. Overall, CMCSA has experienced positive growth, although there have been occasional dips.
IP’s stock price remained relatively stable until around 2015, where it experienced a significant drop but recovered partially afterwards. The stability in the early years contrasts with the more volatile period around 2015.
XOM’s stock price increased steadily until around 2014, after which it faced a decline, hitting its lowest point around 2020 before starting to recover. The decline in XOM’s stock price during the mid-period is noteworthy, but recent signs of recovery indicate a potential upward trend.
The daily percentage returns of each asset uses below formula: \[ r_t = 100*\ln\Big(\frac{P_t}{P_{t-1}}\Big) \] Where \(P_t\) is the asset price at time \(t\). Then plot the returns for each asset over time.
# Calculate daily percentage returns for each asset
StockData <- StockData %>%
group_by(symbol) %>%
mutate(daily_return = 100 * log(adjusted / lag(adjusted)))
# Plotting daily percentage returns over time for each asset
ggplot(StockData, aes(x = date, y = daily_return, color = symbol)) +
geom_line() +
labs(title = "Daily Percentage Returns",
x = "Date",
y = "Percentage Returns",
color = "Asset") +
theme_minimal()
# Calculate daily percentage returns
StockData <- StockData %>%
arrange(symbol, date) %>%
group_by(symbol) %>%
mutate(daily_return = 100 * log(adjusted / lag(adjusted))) %>%
ungroup()
# Plotting returns over time for each asset separately
ggplot(StockData, aes(x = date, y = daily_return, color = symbol)) +
geom_line() +
facet_wrap(~symbol, scales = "free_y", ncol = 1) +
labs(title = "Daily Percentage Returns Over Time",
x = "Date",
y = "Daily Percentage Returns",
color = "Asset") +
theme_minimal()
Histogram of returns
Creating a histogram for each of the returns series.
# Calculate the square root of the total number of data points
n_bins <- ceiling(sqrt(nrow(StockData)))
# Plotting histograms of daily percentage returns for each asset
ggplot(StockData, aes(x = daily_return, fill = symbol)) +
geom_histogram(bins = n_bins, color = "black", alpha = 0.5, position = "identity") +
facet_wrap(~symbol, ncol = 1) +
labs(title = "Histogram of Daily Percentage Returns",
x = "Daily Percentage Returns",
y = "Frequency",
fill = "Asset") +
theme_minimal()
Explanation for the choice of bins:
The square root rule is a simple and widely used method for determining the number of bins in a histogram. It balances the trade-off between having too few bins, which can obscure underlying patterns in the data, and having too many bins, which can result in noise and make interpretation difficult. By taking the square root of the total number of data points, we get a reasonable approximation of the optimal number of bins that can effectively represent the distribution of the data without oversimplifying or overcomplicating it.
Summary table of returns
Reporting the descriptive statistics in a single table which includes the mean, median, variance, standard deviation, skewness and kurtosis for each series.
# Computing descriptive statistics for each series
summary_stats <- StockData %>%
group_by(symbol) %>%
summarise(
Mean = mean(daily_return, na.rm= TRUE),
Median = median(daily_return, na.rm=TRUE),
Variance = var(daily_return,na.rm= TRUE),
Std_Deviation = sd(daily_return, na.rm=TRUE),
Skewness = skewness(daily_return),
Kurtosis = kurtosis(daily_return)
)
# Create a table of descriptive statistics
summary_table <- as.data.frame(t(summary_stats[-1]))
colnames(summary_table) <- summary_stats$symbol
# Add descriptive statistics names as row names
row.names(summary_table) <- c("Mean", "Median", "Variance", "Std Deviation", "Skewness", "Kurtosis")
# Print the summary table
summary_table
From the provided descriptive statistics for each asset’s daily percentage returns:
Under the assumption that the returns of each asset are drawn from an independently and identically distributed normal distribution, are the expected returns of each asset statistically different from zero at the 1% level of significance?
Null Hypothesis (\(H_0\)): The expected return of each asset is equal to zero.
Alternative Hypothesis (\(H_1\)): The expected return of each asset is not equal to zero.
The level of significance, denoted as \(\alpha\), is set to 0.01 (1%).
Since we are testing the population mean of normally distributed data and the sample size is relatively large (assuming it meets the conditions for a z-test), we can use the z-test statistic.
The equation for the z-test statistic for testing the population mean (\(\mu\)) when the population standard deviation (\(\sigma\)) is known is: \[Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\]
Since we’re conducting a two-tailed test at the 1% significance level (\(\alpha = 0.01\)), we need to find the critical z-value(s) that correspond to the tails of the standard normal distribution.
The critical z-values are obtained from the standard normal distribution table or using statistical software. For a two-tailed test at \(\alpha = 0.01\), the critical z-values are approximately -2.576 and 2.576.
If the calculated z-test statistic falls within the critical region (i.e., outside the critical z-values), we reject the null hypothesis and conclude that the expected return of the asset is statistically different from zero at the 1% level of significance.
# Pairwise comparison of mean returns for each asset
pairwise_tests <- pairwise.t.test(StockData$daily_return, StockData$symbol, p.adjust.method = "bonferroni")
# Extract relevant information from the pairwise tests
test_results <- tidy(pairwise_tests)
# Print the hypothesis testing results
test_results %>%
kable(caption = "Pairwise Comparison of Mean Returns")
group1 | group2 | p.value |
---|---|---|
IP | CMCSA | 1 |
XOM | CMCSA | 1 |
XOM | IP | 1 |
Based on the pairwise comparison of mean returns:
anova_results <- StockData %>%
group_by(symbol) %>%
summarise(mean_daily_return = mean(daily_return, na.rm = TRUE)) %>%
aov(mean_daily_return ~ symbol, data = .) %>%
tidy()
# Print ANOVA test results
anova_results %>%
kable(caption = "ANOVA Test Results")
term | df | sumsq | meansq |
---|---|---|---|
symbol | 2 | 0.0002321 | 0.000116 |
Refers to the degree to which two or more variables are related or move together. It measures the strength and direction of the linear relationship between variables.
Correlation between two variables, \(X\) and \(Y\) is described by the Pearson Correlation Coefficient:
\[ \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \] where \(cov(X,Y)\) refers to the covariance of \(X\) and \(Y\).
Correlation within our sample is described by summary statistic, \(\rho\) (the Pearson correlation coefficient):
\[ \hat{\rho} = \frac{\overbrace{\sum_{i = 1}^n(x_i - \bar{x})(y_i - \bar{y})}^{\text{sample covariance}}}{\underbrace{\sqrt{\sum_{i = 1}^n (x_i - \bar{x})^2}}_{\text{sample 1 standard deviation }}\underbrace{\sqrt{\sum_{i = 1}^n (y_i - \bar{y})^2}}_{\text{ sample 2 standard deviation}}} \]
# Ensure daily_return is numeric
StockData$daily_return <- as.numeric(StockData$daily_return)
# Calculate correlation matrix
correlation_matrix <- StockData %>%
select(symbol, date, daily_return) %>%
spread(symbol, daily_return) %>%
select(-date) %>%
cor(use = "pairwise.complete.obs")
# Print correlation matrix
correlation_matrix %>%
kable(caption = "Correlation Matrix of Daily Returns")
CMCSA | IP | XOM | |
---|---|---|---|
CMCSA | 1.0000000 | 0.4143364 | 0.3999493 |
IP | 0.4143364 | 1.0000000 | 0.4459397 |
XOM | 0.3999493 | 0.4459397 | 1.0000000 |
The correlation matrix of daily returns provides a summary of the pairwise correlations between the daily returns of the assets in your dataset. Each cell in the matrix represents the correlation coefficient between two assets’ daily returns.
CMCSA:
IP:
XOM:
Suppose that an investor has asked you to assist them in choosing two of these three stocks to include in their portfolio. The portfolio is defined by
\[ r = w_1r_1 + w_2r_2 \] Where \(r_1\) and \(r_2\) represent the returns from the first and second stock, respectively, and \(w_1\) and \(w_2\) represent the proportion of the investment placed in each stock. The entire investment is allocated between the two stocks, so \(w_1+w_2=1\).
The investor favours the combination of stocks that provides the highest return, but dislikes risk. Thus the investor’s happiness is a function of the portfolio, \(r\): \[ h(r) = \mathbb E(r) - \mathbb Var(r) \]
Where \(\mathbb E(r)\) is the expected return of the portfolio, and \(\mathbb Var(r)\) is the variance of the portfolio.1
Given your values for \(\mathbb E(r_1)\), \(\mathbb E(r_2)\), \(\mathbb Var(r_1)\), \(\mathbb Var(r_2)\) and \(\mathbb Cov(r_1, r_2)\) which portfolio would you recommend to the investor? What is the expected return to this portfolio?
To recommend the optimal portfolio to the investor, we’ll calculate the expected return and variance for each possible combination of the two stocks. Then, we’ll determine the portfolio that maximizes the investor’s happiness function.
Let Our goal is to maximize the happiness function \(h(r)\) with respect to \(w_1\) and \(w_2\), subject to the constraint \(w_1 + w_2 = 1\).
Let’s calculate the expected return and variance for each possible combination of the two stocks, and then find the optimal portfolio.
To be Continued
Note that \(\mathbb E(r) = w_1 E(r_1) + w_2 \mathbb E(r_2)\), and \(\mathbb Var(r) = w_1^2\mathbb Var(r_1) + w_2^2 \mathbb Var(r_2) + 2w_1w_2 \mathbb Cov (r_1,r_2)\)↩︎