This research attempts at understanding the relation between the price of common shares, of companies in the S&P 500 Index, and their fundamental indicators of value. We believed that the price of common shares was more closely related to each company’s fundamental indicators instead of the technical indicators. However, for our research we employed the use of both such indicators of value and analyzed their individual relationships to price. Our exploratory analysis indicates that each variable has its own data distribution - with price, volume, and market cap being more log normally distributed then beta, which seems to have an approximately normal distribution. Upon model building and analysis we found that both fundamental as well as technical indicators can be used to estimate the price of common shares and that these indicators are reflective of this information in the estimated price of common shares for these companies. In our model, the fundamental indicators which show this relationship to price include Dividends, Earnings per Share, Price to Earnings ratio, Dividend yield, and Market Capitalizations whereas Volume is the technical indicator which in indicative of this relationship.
The purpose of our research is to conduct an anlysis of the value, as reflected in the price of common shares, of dividend paying companies in the S&P 500 Index. Our objective is to understand the relationship between the price of common shares and fundamental indicators of value. We plan on investigating and understanding this relationship in three sequential steps:
Over the years in the investment industry, value based investors have found significant success through various valuation techniques and methodologies. It is our intention to assess the validity of this phenomenon by analyzing whether if this information, through fundamental indicators, is truly indicated in common stock prices of pulic companies in S&P 500.
This dataset was collected from Data Hub by making custom filters to select the variables of interest for this research. This dataset is not licensed.
Dataset includes the following variables:
| Variable | Description |
|---|---|
| Sector | Sector/Industry to which the company belongs |
| Major Sector | Major Sector to which the company belongs |
| Price | Price of a common share |
| PEG | Price to earnings to growth ratio |
| Return on Assets (before taxes) | Return on assets for the company |
| 5 Year Return | 5 year return of common shares |
| % insider | Percentage of insider ownership |
| % institutional | Percentage of institutional ownership |
| % 5 Year dividend growth | 5 year dividend growth rate |
| % 5 Year earnings growth | 5 year earnings growth rate |
| % 5 Year revenue growth | 5 year revenue growth rate |
| Volume | Average volume traded for the stock |
| Dividend | Most recent dividend paid on the stock |
| Beta | Measure of volatility or market risk for the stock |
| Net Income | Most recent net income of the company |
| Price/Earnings | Price to earnings ratio of the stock |
| Dividend Yield | Dividend yield on the stock |
| Dividend Payout | Dividend payout for the stock |
| Earnings/Share | Earnings per share |
| Market Cap | Market capitalization of the company |
| EBITDA | Earnings Before Interest, Taxes, Depreciation, and Amortization |
| Price/Sales | Price to sales ratio of the stock |
| Price/Book | Price to book ratio of the stock |
data1 <- read_csv("SP500_Fundamentals.csv")
data2 <- data1[, c(5:25)]
data2 <- as.data.frame(lapply(data2[,c(1:20)], as.numeric))
sp500_train <- cbind(data1$Sector, data1$`Major Sector`, data2)
colnames(sp500_train) <- c("Sector","Major Sector", colnames(data2))
sp500_train <- na.omit(sp500_train)
sp500_train$`Major Sector`<- factor(sp500_train$`Major Sector`)
sp500_train$Sector<- factor(sp500_train$Sector)
sp500_train <- filter(sp500_train, sp500_train$Dividend != 0 &
sp500_train$Dividend.Yield != 0 &
sp500_train$EBITDA != 0 &
sp500_train$Price.Earnings >= 0)
sp500 <- sp500_train[c(2:300),c(1:22)]
#Test Observation
sp500_test_P <- sp500_train[1,c(3:22)]
Scatter Plots with Linear Line fit:
The following scatter plots show the distribution of each variable against price. The lines presented in each scatter plot show a linear least squares fit between price and the corresponding variable.
ggplot(aes(y=Price, x=PEG), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Return.on.Assets..Before.Tax.), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=X5.Year.Return), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=X..Insider), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=X..Institutional), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=X..5.year.Dividend.Growth), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=X..5.year.Earnings.Growth), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=X..5.year.Revenue.Growth), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Volume), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Dividend), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Beta), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Net.Income), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Price.Earnings), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Dividend.Yield), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Divide.Payout), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Earnings.Share), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Market.Cap), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=EBITDA), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
ggplot(aes(y=Price, x=Price.Sales), data=sp500) +
geom_point() +
stat_smooth(method='lm', se=FALSE)
In examining the distributions of our variables, some are generally normally distributed, whereas many are right-skewed. The X..Institutional variable is left-skewed. There are many outliers present in the distributions as well which can potentially have a high leverage effect in a regression model. For the Major Sector variable, it is important to note that a large number of the data collected falls in the Secondary and Tertiary major sector. Beta is the most normally distributed variable.
ggplot(aes(x=PEG), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Price), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Return.on.Assets..Before.Tax.), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=X5.Year.Return), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=X..Insider), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=X..Institutional), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=X..5.year.Dividend.Growth), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=X..5.year.Earnings.Growth), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=X..5.year.Revenue.Growth), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Volume), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)+
scale_x_continuous(labels = scales::comma)
ggplot(aes(x=Dividend), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Beta), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Net.Income), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
#scale_x_continuous(labels = scales::comma)
ggplot(aes(x=Price.Earnings), data=sp500) +
geom_histogram(color="black", fill="light green", bins=40)
ggplot(aes(x=Dividend.Yield), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Divide.Payout), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Earnings.Share), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Market.Cap), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=EBITDA), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=Price.Sales), data=sp500) +
geom_histogram(color="black", fill="light green", bins=12)
ggplot(aes(x=`Major Sector`), data=sp500) +
geom_histogram(color="black", fill="light green", stat = "count")
ggplot(aes(x=`Sector`), data=sp500) +
geom_histogram(color="black", fill="light green", stat = "count")+
coord_flip()
In analyzing the correlations between the variables in relation to Price, it is necessary to first acknowledge that the correlation matrix shows no strong correlations between any set of variables. Hence, there is no clear multicollinearity present. There are a few nagative correlations between Price and Dividened.Yield, Volume, PEG, and there are a few that seem to have a horizontal correlation, meaning they might not affect Price directly. These include the variables X..Insider, X..5.year.Earnings.Growth, and Divide.Payout. The variables which seem to show a positive correlation to price include Price.Sales, Market.Cap, Earnings.Share, Net.Income, Beta, X5.Year.Return, Return.on.Assets..Before.Tax,and Dividend. These variables have the strongest positive correlation with Price. The remaining variables have a slight positive relationship with Price, but one could argue they are close to a horizontal correlation. The only notable possible example of multicollinearity is between the variables Market.Cap, and EBITDA, showing a correlation value of approximately 0.8. Net.Income also seems to have a relatively strong correlation with Market.Cap and EBITDA. The following is a clustered plot of correlations between the variables in the dataset:
library(corrplot)
corrplot(cor(sp500[,c(3:21)],
method = "pearson"),
title = "Correlation Plot",
method = "circle",
outline = T,
order="hclust",
mar = c(0,0,0.95,0),
cl.pos = "b",
tl.col = "indianred4",
tl.cex = .6,
cl.cex = .6)
Major Sectors
Categorizing data by Major Sector shows that the all major sectors have medians relatively close to each other. However, Primary and Quarternary major sectors have a different data distribution in comparison to Secondary and Tertiary major sectors. Primary and Secondary sectors show the median values to be closer to the upper quartile which shows a negative skew, whereas Secondary and Tertiary sectors have their medians closer to the lower quartile showing a positive skew. Secondary and Tertiary sectors also have more outliers as compared to Primary and Quaternary
Sectors
Categorizing data into sectors shows minor variation of medians across sectors. The inter quartile ranges for all sectors seem relatively equal except fpr Telecommunication services, Real Estate, Industrials, and Financials. The real estate sector has the largest outlier, with matrials haveing and negative skew, and industrials having a positive skew.
ggplot(sp500, aes(x=sp500$`Major Sector`, y=sp500$`Price`)) +
stat_boxplot(geom='errorbar', width=0.5) +
geom_boxplot(outlier.size = 1, aes(fill= `Major Sector`)) +
labs(x="Major Sector", y="Price", title="Price by Major Sector")
ggplot(sp500, aes(x=sp500$`Sector`, y=sp500$Price)) +
stat_boxplot(geom='errorbar', width=0.5)+
geom_boxplot(outlier.size = 1, aes(fill= Sector)) +
labs(x = "Sector", y = "Price", title = "Price by Sector")+
coord_flip()
Model Selection and Data Transformation:
In building the regression model our objective was to maximize the Adjusted R2 value while minimizing the number of insignificant variables. To test the significance of each variable’s P-Value our Alpha was set at 0.05. In order to determine the best Adjusted R2 value with statistically significant P-Values we used backward elimination. While employing the use of this technique we eliminated insignificant variables until our Adjusted R2 was maximized and our model had statistically significant variables. The variables which were eliminated from the model include:
Observations:
Our model estimates the log normal value of the price of each share based on the dividend paid, earnings per share, log normal value of the average volume traded, price to earnings ratio, dividend yield, and log normal value of the market capitalization of the company.
After our initial data exploration we observed price, volume, and market capitalization to be log normally distributed with a positive skew. For this purpose we tranformed these values to their log normal values. This tranformation increased the accuracy of our estimated values and improved our overall Adjusted R2.
| Metric Observed | Value | Explanation |
|---|---|---|
| Multiple R-squared | 0.8759 | 87.59% of the variablity is described by the variables in this model |
| Adjusted R-squared | 0.8733 | 87.33% of the variablity is described by the variables in this model |
| F-statistic | 343.5 | Significant F-Statistics shows overall significance of the model |
| P-value | < 2.2e-16 | Significant P-Value shows overall significance of the model |
#Build the regression model
model.fit <- lm(log(Price)~
Dividend+
Earnings.Share+
log(Volume)+
Price.Earnings+
Dividend.Yield+
log(Market.Cap), sp500)
#sp500$Beta+
#sp500$Sector+
#sp500$`Major Sector`+
#sp500$PEG+
#sp500$Return.on.Assets..Before.Tax.+
#sp500$X5.Year.Return+
#sp500$X..Insider+
#sp500$X..Institutional+
#sp500$X..5.year.Dividend.Growth+
#sp500$X..5.year.Earnings.Growth+
#sp500$X..5.year.Revenue.Growth+
#log(sp500$Net.Income)+
#sp500$Divide.Payout)))
The following table shows the coeffecients of our model associated to each variable, and their associated significant P-Values:
#Model Summary
library(knitr)
kable(summary(model.fit)$coef, digits=3)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 2.549 | 0.351 | 7.272 | 0.000 |
| Dividend | 0.201 | 0.013 | 15.093 | 0.000 |
| Earnings.Share | 0.036 | 0.005 | 7.755 | 0.000 |
| log(Volume) | -0.269 | 0.023 | -11.856 | 0.000 |
| Price.Earnings | 0.001 | 0.001 | 2.047 | 0.042 |
| Dividend.Yield | -0.196 | 0.013 | -14.963 | 0.000 |
| log(Market.Cap) | 0.230 | 0.022 | 10.265 | 0.000 |
#Leverage Plots
leveragePlots(model.fit, pch = 16)
The P-Value of 0.0989 from Shapiro-Wilk normality test indicates that our residuals are normally distributed. This can also be observed from the QQ-Plot below, which shows a close distribution of the residuals along the line, and also from the residual plot showing an approximately even distribution of residuals above and below the y = 0 line.
#Test for normality of residuals
shapiro.test(model.fit$residuals)
##
## Shapiro-Wilk normality test
##
## data: model.fit$residuals
## W = 0.99185, p-value = 0.0989
Anova:
As P-Values of the variables in the ANOVA table are less than 0.05, we conclude that there are significant differences between the variables being observed.
#Anova
aov <- aov(model.fit)
summary(aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Dividend 1 48.78 48.78 930.70 < 2e-16 ***
## Earnings.Share 1 17.55 17.55 334.90 < 2e-16 ***
## log(Volume) 1 9.81 9.81 187.16 < 2e-16 ***
## Price.Earnings 1 2.55 2.55 48.62 2.08e-11 ***
## Dividend.Yield 1 23.80 23.80 454.08 < 2e-16 ***
## log(Market.Cap) 1 5.52 5.52 105.38 < 2e-16 ***
## Residuals 292 15.30 0.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
QQ-Plot and Residual Plot:
#QQPLOT for residuals
qqPlot(model.fit, simulate = TRUE, main = 'Q-Q Plot', labels = FALSE)
## 36 159
## 35 158
##Residual Plot
ggplot(model.fit)+
geom_point(aes(x = model.fit$fitted.values, y = model.fit$residuals))+
geom_hline(yintercept = 0)+
labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")
#Non constant variance test
ncvTest(model.fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.6416114, Df = 1, p = 0.42313
No apprent cases of multicollinearity are shown.
#Test for multicollinearity exp(model.fit$fitted.values), "Predicted Price",
vif <- vif(model.fit)
vif <- data.frame(vif)
kable(vif)
| vif | |
|---|---|
| Dividend | 2.387901 |
| Earnings.Share | 1.452027 |
| log(Volume) | 2.946814 |
| Price.Earnings | 1.079173 |
| Dividend.Yield | 1.735286 |
| log(Market.Cap) | 2.726576 |
Applying our test observation to the regression model gives us a predicted price of $267.38 vs. the actual price of $222.89. The variance of predicted from the actual price is $44.49. This test case was used to predict the price of 3M Company(TKR: MMM).
There is a considerable difference between the actual and the predicted price for MMM. This effect could result from the element of uncertainty associated with our model, and the model’s inability to more strongly capture the variablity of price in relation to the variables used in building the model.
The prediction interval has a very wide range for the price of each share of MMM. We do not believe that this model precisely predicts the price of MMM’s share, but we do believe that it can be used an approximation tool.
The follwing table presents predicted vs. actual prices for MMM, along with the intervals:
#Transform data in the test observation to fit the model requirements
sp500_test <- cbind(sp500_test_P$Dividend,
sp500_test_P$Earnings.Share,
sp500_test_P$Volume,
sp500_test_P$Price.Earnings,
sp500_test_P$Dividend.Yield,
sp500_test_P$Market.Cap)
#Renaming columns in the test dataframe to match the model column names
colnames(sp500_test) <- c("Dividend",
"Earnings.Share",
"Volume",
"Price.Earnings",
"Dividend.Yield",
"Market.Cap")
sp500_test <- data.frame(sp500_test)
pred <- predict(model.fit, sp500_test, interval = "prediction")
#Actual vs. Predicted Price
avp <- data.frame(cbind(sp500_test_P$Price,exp(pred),log(sp500_test_P$Price), pred))
colnames(avp) <- c("Price ($)",
"Predicted Price ($)",
"Lower ($)",
"Upper ($)",
"Log Normal Price",
"Log Normal Predicted Price",
"Lower",
"Upper")
kable(avp)
| Price ($) | Predicted Price ($) | Lower ($) | Upper ($) | Log Normal Price | Log Normal Predicted Price | Lower | Upper |
|---|---|---|---|---|---|---|---|
| 222.89 | 262.3761 | 166.2026 | 414.2005 | 5.406678 | 5.569779 | 5.113207 | 6.02635 |
There are certain limitations and assumptions which restrict this model to more accurately predict the price of shares of the companies contained in the dataset which is used to build this model. These limitations include the fact that some of the data points in our dataset contained strong outlier values which had to be eliminated in order to have the residuals of our model be more normally distributed. Other outlier values which were not too far from the remaining dataset, and did not have a relatively high leverage, were still included in the final dataset. In doing so, we tried to capture the random nature of these prices as much as we could. This model also assumes that fundamental and technical indicators are the only predictors of prices of common shares. This assumption can also cause the predictions resulting from this model to not accurately match the current prevailing price of the shares of the company being analyzed. The reason for that is that this model does not account for human emotions/sentiments or other aspects of behavioral finance in the valuation of prices.