Introduction

The aim of this project is to prepare, evaluate and analyse stock market data and recommend an optimal portfolio consisting of two stocks, We will apply both statistical and mathematical principles.The original question we are trying to solve can be seen in this Link

We have three specific stocks for analysis which are Comcast corp (CMCSA),International Paper Co (IP) and Exxon Mobil Corp (XOM)

Loading necessary libraries

# packages
library(tidyquant) # for importing stock data

library(tidyverse)   # for working with data
 library(broom)     # for tidying output from various statistical procedures
library(knitr)       # for tables
library(kableExtra)  # for improving the appearance of tables
library(ggplot2)
library(psych)
library(dplyr)

Data Overview

## The beginnig date is January 01,2000

## The ending date is the date I knitted and submitted the project

 StockData<-c("CMCSA", "IP","XOM") %>%
   tq_get(get = "stock.prices", from = "2000-01-01")%>%
  select(symbol, date, adjusted)

## output the first 6 rows of your data frame:

head(StockData, n = 6 )%>%
  kable(caption = "Stock Market Data")
Stock Market Data
symbol date adjusted
CMCSA 2000-01-03 11.27721
CMCSA 2000-01-04 10.43293
CMCSA 2000-01-05 10.16155
CMCSA 2000-01-06 10.55354
CMCSA 2000-01-07 10.22186
CMCSA 2000-01-10 11.53351
## The beginning date is January 01, 2000

StockData <- c("CMCSA", "IP", "XOM") %>%
  tq_get(get = "stock.prices", from = "2000-01-01") %>%
  select(symbol, date, adjusted)

## Filter and combine the data for CMCSA, IP, and XOM
CombinedData <- StockData %>%
  filter(symbol %in% c("CMCSA", "IP", "XOM")) %>%
  group_by(symbol) %>%
  slice(1:3) %>%
  ungroup()

## Output the combined data
CombinedData %>%
  kable(caption = "Combined Stock Market Data")
Combined Stock Market Data
symbol date adjusted
CMCSA 2000-01-03 11.27721
CMCSA 2000-01-04 10.43293
CMCSA 2000-01-05 10.16155
IP 2000-01-03 22.41751
IP 2000-01-04 21.86212
IP 2000-01-05 22.61948
XOM 2000-01-03 18.32869
XOM 2000-01-04 17.97762
XOM 2000-01-05 18.95770

The Analysis

Plotting prices over time

When plotting a price over time, you’re essentially creating a time series plot.

  • Time Series Plot: is a graphical representation that shows how a particular variable, such as stock price, changes over time. It’s a useful tool for observing trends, patterns, and potential anomalies in data.

These equations can vary depending on the specific characteristics of the time series and the modeling approach used. Here are some common types of time series equations:

  1. Autoregressive (AR) Model: In an autoregressive model of order p, denoted AR(p), the value of the time series at any given time depends linearly on its previous p values, plus some random noise.

    Equation: \[X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + ... + \phi_p X_{t-p} + \varepsilon_t\]

    Where:

    • \(X_t\) is the value of the time series at time \(t\).
    • \(c\) is a constant.
    • \(\phi_1, \phi_2, ..., \phi_p\) are the parameters of the model.
    • \(\varepsilon_t\) is white noise (random error) at time \(t\).
  2. Moving Average (MA) Model: In a moving average model of order q, denoted MA(q), the value of the time series at any given time depends linearly on the current and previous q random errors.

    Equation: \[X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + ... + \theta_q \varepsilon_{t-q}\]

    Where:

    • \(X_t\) is the value of the time series at time \(t\).
    • \(\mu\) is the mean of the time series.
    • \(\varepsilon_t, \varepsilon_{t-1}, ..., \varepsilon_{t-q}\) are white noise terms at times \(t, t-1, ..., t-q\).
    • \(\theta_1, \theta_2, ..., \theta_q\) are the parameters of the model.
  3. Autoregressive Moving Average (ARMA) Model: The ARMA model combines both autoregressive and moving average components.

    Equation: \[X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + ... + \phi_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + ... + \theta_q \varepsilon_{t-q}\]

    Where the terms are similar to those in the AR and MA models.

  4. Autoregressive Integrated Moving Average (ARIMA) Model: The ARIMA model includes differencing of the time series to make it stationary.

    Equation: \[X_t = c + \phi_1 \Delta X_{t-1} + \phi_2 \Delta X_{t-2} + ... + \phi_p \Delta X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + ... + \theta_q \varepsilon_{t-q}\]

    Where \(\Delta X_t = X_t - X_{t-1}\) is the differenced series.

These equations are fundamental in time series analysis and forecasting, and they serve as the basis for more advanced models and techniques.

Plot the prices of each asset over time separately.

# Convert date to proper date format
StockData$date <- as.Date(StockData$date)

# Plotting prices over time for each asset separately
ggplot(StockData, aes(x = date, y = adjusted, color = symbol)) +
  geom_line() +
  facet_wrap(~symbol, scales = "free_y", ncol = 1) +
  labs(title = "Stock Prices Over Time",
       x = "Date",
       y = "Adjusted Price",
       color = "Asset") +
  theme_minimal()

## CMCSA

The stock price for CMCSA has shown a steady increase over time, with minor fluctuations. It started below 20 and has risen to above 40. Overall, CMCSA has experienced positive growth, although there have been occasional dips.

IP

IP’s stock price remained relatively stable until around 2015, where it experienced a significant drop but recovered partially afterwards. The stability in the early years contrasts with the more volatile period around 2015.

XOM

XOM’s stock price increased steadily until around 2014, after which it faced a decline, hitting its lowest point around 2020 before starting to recover. The decline in XOM’s stock price during the mid-period is noteworthy, but recent signs of recovery indicate a potential upward trend.

Calculating returns and plot returns over time

The daily percentage returns of each asset uses below formula: \[ r_t = 100*\ln\Big(\frac{P_t}{P_{t-1}}\Big) \] Where \(P_t\) is the asset price at time \(t\). Then plot the returns for each asset over time.

# Calculate daily percentage returns for each asset
StockData <- StockData %>%
  group_by(symbol) %>%
  mutate(daily_return = 100 * log(adjusted / lag(adjusted)))

# Plotting daily percentage returns over time for each asset
ggplot(StockData, aes(x = date, y = daily_return, color = symbol)) +
  geom_line() +
  labs(title = "Daily Percentage Returns",
       x = "Date",
       y = "Percentage Returns",
       color = "Asset") +
  theme_minimal()

# Calculate daily percentage returns
StockData <- StockData %>%
  arrange(symbol, date) %>%
  group_by(symbol) %>%
  mutate(daily_return = 100 * log(adjusted / lag(adjusted))) %>%
  ungroup()

# Plotting returns over time for each asset separately
ggplot(StockData, aes(x = date, y = daily_return, color = symbol)) +
  geom_line() +
  facet_wrap(~symbol, scales = "free_y", ncol = 1) +
  labs(title = "Daily Percentage Returns Over Time",
       x = "Date",
       y = "Daily Percentage Returns",
       color = "Asset") +
  theme_minimal()

Histogram of returns

Creating a histogram for each of the returns series.

# Calculate the square root of the total number of data points
n_bins <- ceiling(sqrt(nrow(StockData)))

# Plotting histograms of daily percentage returns for each asset
ggplot(StockData, aes(x = daily_return, fill = symbol)) +
  geom_histogram(bins = n_bins, color = "black", alpha = 0.5, position = "identity") +
  facet_wrap(~symbol, ncol = 1) +
  labs(title = "Histogram of Daily Percentage Returns",
       x = "Daily Percentage Returns",
       y = "Frequency",
       fill = "Asset") +
  theme_minimal()

Explanation for the choice of bins:

The square root rule is a simple and widely used method for determining the number of bins in a histogram. It balances the trade-off between having too few bins, which can obscure underlying patterns in the data, and having too many bins, which can result in noise and make interpretation difficult. By taking the square root of the total number of data points, we get a reasonable approximation of the optimal number of bins that can effectively represent the distribution of the data without oversimplifying or overcomplicating it.

Summary table of returns

Reporting the descriptive statistics in a single table which includes the mean, median, variance, standard deviation, skewness and kurtosis for each series.

# Computing descriptive statistics for each series
summary_stats <- StockData %>%
  group_by(symbol) %>%
  summarise(
    Mean = mean(daily_return, na.rm= TRUE),
    Median = median(daily_return, na.rm=TRUE),
    Variance = var(daily_return,na.rm= TRUE),
    Std_Deviation = sd(daily_return, na.rm=TRUE),
    Skewness = skewness(daily_return),
    Kurtosis = kurtosis(daily_return)
  )

# Create a table of descriptive statistics
summary_table <- as.data.frame(t(summary_stats[-1]))
colnames(summary_table) <- summary_stats$symbol

# Add descriptive statistics names as row names
row.names(summary_table) <- c("Mean", "Median", "Variance", "Std Deviation", "Skewness", "Kurtosis")

# Print the summary table
summary_table

From the provided descriptive statistics for each asset’s daily percentage returns:

  1. Mean and Median:
    • The mean daily percentage returns range from 0.0094 to 0.0305.
    • The median daily percentage returns range from 0.0245 to 0.0431.
    • Median values are generally higher than mean values, indicating potential right-skewness in the distributions.
  2. Variance and Standard Deviation:
    • Variance measures the spread of data points around the mean.
    • Standard deviation provides a measure of the dispersion of data points from the mean.
    • XOM has the lowest variance and standard deviation, indicating relatively lower volatility compared to the other assets.
  3. Skewness:
    • Skewness measures the asymmetry of the distribution.
    • All assets have skewness close to zero, indicating approximately symmetric distributions.
  4. Kurtosis:
    • Kurtosis measures the peakedness or flatness of the distribution.
    • All assets have kurtosis values significantly higher than 3, suggesting heavier tails and more peaked distributions compared to a normal distribution.

Hypothesis Testing

Under the assumption that the returns of each asset are drawn from an independently and identically distributed normal distribution, are the expected returns of each asset statistically different from zero at the 1% level of significance?

  • Null Hypothesis (\(H_0\)): The expected return of each asset is equal to zero.

    • \(H_0: \mu = 0\)
  • Alternative Hypothesis (\(H_1\)): The expected return of each asset is not equal to zero.

    • \(H_1: \mu \neq 0\)
  • The level of significance, denoted as \(\alpha\), is set to 0.01 (1%).

  • Since we are testing the population mean of normally distributed data and the sample size is relatively large (assuming it meets the conditions for a z-test), we can use the z-test statistic.

  • The equation for the z-test statistic for testing the population mean (\(\mu\)) when the population standard deviation (\(\sigma\)) is known is: \[Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

    • \(\bar{x}\) is the sample mean,
    • \(\mu\) is the hypothesized population mean under the null hypothesis,
    • \(\sigma\) is the population standard deviation (assumed to be known),
    • \(n\) is the sample size.
  • Since we’re conducting a two-tailed test at the 1% significance level (\(\alpha = 0.01\)), we need to find the critical z-value(s) that correspond to the tails of the standard normal distribution.

  • The critical z-values are obtained from the standard normal distribution table or using statistical software. For a two-tailed test at \(\alpha = 0.01\), the critical z-values are approximately -2.576 and 2.576.

  • If the calculated z-test statistic falls within the critical region (i.e., outside the critical z-values), we reject the null hypothesis and conclude that the expected return of the asset is statistically different from zero at the 1% level of significance.

# Pairwise comparison of mean returns for each asset
pairwise_tests <- pairwise.t.test(StockData$daily_return, StockData$symbol, p.adjust.method = "bonferroni")

# Extract relevant information from the pairwise tests
test_results <- tidy(pairwise_tests)

# Print the hypothesis testing results
test_results %>%
  kable(caption = "Pairwise Comparison of Mean Returns")
Pairwise Comparison of Mean Returns
group1 group2 p.value
IP CMCSA 1
XOM CMCSA 1
XOM IP 1

Based on the pairwise comparison of mean returns:

  1. The mean returns of IP and CMCSA are not statistically different from each other at the 1% level of significance (p-value = 1).
  2. The mean returns of XOM and CMCSA are not statistically different from each other at the 1% level of significance (p-value = 1).
  3. The mean returns of XOM and IP are not statistically different from each other at the 1% level of significance (p-value = 1).
anova_results <- StockData %>%
  group_by(symbol) %>%
  summarise(mean_daily_return = mean(daily_return, na.rm = TRUE)) %>%
  aov(mean_daily_return ~ symbol, data = .) %>%
  tidy()

# Print ANOVA test results
anova_results %>%
  kable(caption = "ANOVA Test Results")
ANOVA Test Results
term df sumsq meansq
symbol 2 0.0002321 0.000116

Correlations

Refers to the degree to which two or more variables are related or move together. It measures the strength and direction of the linear relationship between variables.

Correlation between two variables, \(X\) and \(Y\) is described by the Pearson Correlation Coefficient:

\[ \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \] where \(cov(X,Y)\) refers to the covariance of \(X\) and \(Y\).

Correlation within our sample is described by summary statistic, \(\rho\) (the Pearson correlation coefficient):

\[ \hat{\rho} = \frac{\overbrace{\sum_{i = 1}^n(x_i - \bar{x})(y_i - \bar{y})}^{\text{sample covariance}}}{\underbrace{\sqrt{\sum_{i = 1}^n (x_i - \bar{x})^2}}_{\text{sample 1 standard deviation }}\underbrace{\sqrt{\sum_{i = 1}^n (y_i - \bar{y})^2}}_{\text{ sample 2 standard deviation}}} \]

# Ensure daily_return is numeric
StockData$daily_return <- as.numeric(StockData$daily_return)

# Calculate correlation matrix
correlation_matrix <- StockData %>%
  select(symbol, date, daily_return) %>%
  spread(symbol, daily_return) %>%
  select(-date) %>%
  cor(use = "pairwise.complete.obs")

# Print correlation matrix
correlation_matrix %>%
  kable(caption = "Correlation Matrix of Daily Returns")
Correlation Matrix of Daily Returns
CMCSA IP XOM
CMCSA 1.0000000 0.4143364 0.3999493
IP 0.4143364 1.0000000 0.4459397
XOM 0.3999493 0.4459397 1.0000000

The correlation matrix of daily returns provides a summary of the pairwise correlations between the daily returns of the assets in your dataset. Each cell in the matrix represents the correlation coefficient between two assets’ daily returns.

  • CMCSA:

    • The correlation coefficient between CMCSA’s daily returns and IP’s daily returns is approximately 0.414. This positive value indicates a moderate positive correlation between the daily returns of CMCSA and IP.
    • The correlation coefficient between CMCSA’s daily returns and XOM’s daily returns is approximately 0.400. Similarly, this positive value indicates a moderate positive correlation between the daily returns of CMCSA and XOM.
  • IP:

    • The correlation coefficient between IP’s daily returns and CMCSA’s daily returns is approximately 0.414, as mentioned earlier.
    • The correlation coefficient between IP’s daily returns and XOM’s daily returns is approximately 0.446, indicating a moderate positive correlation between the daily returns of IP and XOM.
  • XOM:

    • The correlation coefficient between XOM’s daily returns and CMCSA’s daily returns is approximately 0.400, as mentioned earlier.
    • The correlation coefficient between XOM’s daily returns and IP’s daily returns is approximately 0.446, as mentioned earlier.

Advising an investor

Suppose that an investor has asked you to assist them in choosing two of these three stocks to include in their portfolio. The portfolio is defined by

\[ r = w_1r_1 + w_2r_2 \] Where \(r_1\) and \(r_2\) represent the returns from the first and second stock, respectively, and \(w_1\) and \(w_2\) represent the proportion of the investment placed in each stock. The entire investment is allocated between the two stocks, so \(w_1+w_2=1\).

The investor favours the combination of stocks that provides the highest return, but dislikes risk. Thus the investor’s happiness is a function of the portfolio, \(r\): \[ h(r) = \mathbb E(r) - \mathbb Var(r) \]

Where \(\mathbb E(r)\) is the expected return of the portfolio, and \(\mathbb Var(r)\) is the variance of the portfolio.1

Given your values for \(\mathbb E(r_1)\), \(\mathbb E(r_2)\), \(\mathbb Var(r_1)\), \(\mathbb Var(r_2)\) and \(\mathbb Cov(r_1, r_2)\) which portfolio would you recommend to the investor? What is the expected return to this portfolio?

To recommend the optimal portfolio to the investor, we’ll calculate the expected return and variance for each possible combination of the two stocks. Then, we’ll determine the portfolio that maximizes the investor’s happiness function.

Let Our goal is to maximize the happiness function \(h(r)\) with respect to \(w_1\) and \(w_2\), subject to the constraint \(w_1 + w_2 = 1\).

Let’s calculate the expected return and variance for each possible combination of the two stocks, and then find the optimal portfolio.

To be Continued


  1. Note that \(\mathbb E(r) = w_1 E(r_1) + w_2 \mathbb E(r_2)\), and \(\mathbb Var(r) = w_1^2\mathbb Var(r_1) + w_2^2 \mathbb Var(r_2) + 2w_1w_2 \mathbb Cov (r_1,r_2)\)↩︎