S&P 500 Fundamentals

Abstract:

This research attempts at understanding the relation between the price of common shares, of companies in the S&P 500 Index, and their fundamental indicators of value. We believed that the price of common shares was more closely related to each company’s fundamental indicators instead of the technical indicators. However, for our research we employed the use of both such indicators of value and analyzed their individual relationships to price. Our exploratory analysis indicates that each variable has its own data distribution - with price, volume, and market cap being more log normally distributed then beta, which seems to have an approximately normal distribution. Upon model building and analysis we found that both fundamental as well as technical indicators can be used to estimate the price of common shares and that these indicators are reflective of this information in the estimated price of common shares for these companies. In our model, the fundamental indicators which show this relationship to price include Dividends, Earnings per Share, Price to Earnings ratio, Dividend yield, and Market Capitalizations whereas Volume is the technical indicator which in indicative of this relationship.

Introduction:

The purpose of our research is to conduct an anlysis of the value, as reflected in the price of common shares, of dividend paying companies in the S&P 500 Index. Our objective is to understand the relationship between the price of common shares and fundamental indicators of value. We plan on investigating and understanding this relationship in three sequential steps:

Visual Exploration of Data
Correlation Analysis
Multiple Regression Modeling

Over the years in the investment industry, value based investors have found significant success through various valuation techniques and methodologies. It is our intention to assess the validity of this phenomenon by analyzing whether if this information, through fundamental indicators, is truly indicated in common stock prices of pulic companies in S&P 500.

Overview of the dataset:

This dataset was collected from Data Hub by making custom filters to select the variables of interest for this research. This dataset is not licensed.

Dataset includes the following variables:

Variable	Description
Sector	Sector/Industry to which the company belongs
Major Sector	Major Sector to which the company belongs
Price	Price of a common share
PEG	Price to earnings to growth ratio
Return on Assets (before taxes)	Return on assets for the company
5 Year Return	5 year return of common shares
% insider	Percentage of insider ownership
% institutional	Percentage of institutional ownership
% 5 Year dividend growth	5 year dividend growth rate
% 5 Year earnings growth	5 year earnings growth rate
% 5 Year revenue growth	5 year revenue growth rate
Volume	Average volume traded for the stock
Dividend	Most recent dividend paid on the stock
Beta	Measure of volatility or market risk for the stock
Net Income	Most recent net income of the company
Price/Earnings	Price to earnings ratio of the stock
Dividend Yield	Dividend yield on the stock
Dividend Payout	Dividend payout for the stock
Earnings/Share	Earnings per share
Market Cap	Market capitalization of the company
EBITDA	Earnings Before Interest, Taxes, Depreciation, and Amortization
Price/Sales	Price to sales ratio of the stock
Price/Book	Price to book ratio of the stock

data1 <- read_csv("SP500_Fundamentals.csv")
data2 <- data1[, c(5:25)]
data2 <- as.data.frame(lapply(data2[,c(1:20)], as.numeric))
sp500_train <- cbind(data1$Sector, data1$`Major Sector`, data2)
colnames(sp500_train) <- c("Sector","Major Sector", colnames(data2))
sp500_train <- na.omit(sp500_train)
sp500_train$`Major Sector`<- factor(sp500_train$`Major Sector`)
sp500_train$Sector<- factor(sp500_train$Sector)
sp500_train <- filter(sp500_train, sp500_train$Dividend != 0 & 
                      sp500_train$Dividend.Yield != 0 & 
                      sp500_train$EBITDA != 0 & 
                      sp500_train$Price.Earnings >= 0)
sp500 <- sp500_train[c(2:300),c(1:22)]

#Test Observation
sp500_test_P <- sp500_train[1,c(3:22)]

Exploratory Analysis:

Scatter Plots with Linear Line fit:

The following scatter plots show the distribution of each variable against price. The lines presented in each scatter plot show a linear least squares fit between price and the corresponding variable.

ggplot(aes(y=Price, x=PEG), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Return.on.Assets..Before.Tax.), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=X5.Year.Return), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=X..Insider), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=X..Institutional), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=X..5.year.Dividend.Growth), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=X..5.year.Earnings.Growth), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=X..5.year.Revenue.Growth), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Volume), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Dividend), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Beta), data=sp500) +
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Net.Income), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Price.Earnings), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Dividend.Yield), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Divide.Payout), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Earnings.Share), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Market.Cap), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=EBITDA), data=sp500) + 
  geom_point() + 
  stat_smooth(method='lm', se=FALSE)

ggplot(aes(y=Price, x=Price.Sales), data=sp500) + 
  geom_point() +
  stat_smooth(method='lm', se=FALSE)

Histograms:

In examining the distributions of our variables, some are generally normally distributed, whereas many are right-skewed. The X..Institutional variable is left-skewed. There are many outliers present in the distributions as well which can potentially have a high leverage effect in a regression model. For the Major Sector variable, it is important to note that a large number of the data collected falls in the Secondary and Tertiary major sector. Beta is the most normally distributed variable.

ggplot(aes(x=PEG), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Price), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Return.on.Assets..Before.Tax.), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=X5.Year.Return), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=X..Insider), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=X..Institutional), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=X..5.year.Dividend.Growth), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=X..5.year.Earnings.Growth), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=X..5.year.Revenue.Growth), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Volume), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)+
  scale_x_continuous(labels = scales::comma)

ggplot(aes(x=Dividend), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Beta), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Net.Income), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

  #scale_x_continuous(labels = scales::comma)

ggplot(aes(x=Price.Earnings), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=40)

ggplot(aes(x=Dividend.Yield), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Divide.Payout), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Earnings.Share), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Market.Cap), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=EBITDA), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=Price.Sales), data=sp500) + 
  geom_histogram(color="black", fill="light green", bins=12)

ggplot(aes(x=`Major Sector`), data=sp500) + 
  geom_histogram(color="black", fill="light green", stat = "count")

ggplot(aes(x=`Sector`), data=sp500) + 
  geom_histogram(color="black", fill="light green", stat = "count")+ 
  coord_flip()

Correlation Matrix:

In analyzing the correlations between the variables in relation to Price, it is necessary to first acknowledge that the correlation matrix shows no strong correlations between any set of variables. Hence, there is no clear multicollinearity present. There are a few nagative correlations between Price and Dividened.Yield, Volume, PEG, and there are a few that seem to have a horizontal correlation, meaning they might not affect Price directly. These include the variables X..Insider, X..5.year.Earnings.Growth, and Divide.Payout. The variables which seem to show a positive correlation to price include Price.Sales, Market.Cap, Earnings.Share, Net.Income, Beta, X5.Year.Return, Return.on.Assets..Before.Tax,and Dividend. These variables have the strongest positive correlation with Price. The remaining variables have a slight positive relationship with Price, but one could argue they are close to a horizontal correlation. The only notable possible example of multicollinearity is between the variables Market.Cap, and EBITDA, showing a correlation value of approximately 0.8. Net.Income also seems to have a relatively strong correlation with Market.Cap and EBITDA. The following is a clustered plot of correlations between the variables in the dataset:

library(corrplot)
corrplot(cor(sp500[,c(3:21)], 
         method = "pearson"),
         title = "Correlation Plot", 
         method = "circle", 
         outline = T, 
         order="hclust", 
         mar = c(0,0,0.95,0), 
         cl.pos = "b", 
         tl.col = "indianred4", 
         tl.cex = .6, 
         cl.cex = .6)

Analysis of the categorical variable:

Major Sectors

Categorizing data by Major Sector shows that the all major sectors have medians relatively close to each other. However, Primary and Quarternary major sectors have a different data distribution in comparison to Secondary and Tertiary major sectors. Primary and Secondary sectors show the median values to be closer to the upper quartile which shows a negative skew, whereas Secondary and Tertiary sectors have their medians closer to the lower quartile showing a positive skew. Secondary and Tertiary sectors also have more outliers as compared to Primary and Quaternary

Sectors

Categorizing data into sectors shows minor variation of medians across sectors. The inter quartile ranges for all sectors seem relatively equal except fpr Telecommunication services, Real Estate, Industrials, and Financials. The real estate sector has the largest outlier, with matrials haveing and negative skew, and industrials having a positive skew.

ggplot(sp500, aes(x=sp500$`Major Sector`, y=sp500$`Price`)) +
  stat_boxplot(geom='errorbar', width=0.5) +
  geom_boxplot(outlier.size = 1, aes(fill= `Major Sector`)) +
  labs(x="Major Sector", y="Price",  title="Price by Major Sector")

ggplot(sp500, aes(x=sp500$`Sector`, y=sp500$Price)) +
  stat_boxplot(geom='errorbar', width=0.5)+
  geom_boxplot(outlier.size = 1, aes(fill= Sector)) +
  labs(x = "Sector", y = "Price", title = "Price by Sector")+
  coord_flip()

Regression Model:

Model Selection and Data Transformation:

In building the regression model our objective was to maximize the Adjusted R² value while minimizing the number of insignificant variables. To test the significance of each variable’s P-Value our Alpha was set at 0.05. In order to determine the best Adjusted R² value with statistically significant P-Values we used backward elimination. While employing the use of this technique we eliminated insignificant variables until our Adjusted R² was maximized and our model had statistically significant variables. The variables which were eliminated from the model include:

Beta
Sector
Major Sector
PEG
Return.on.Assets..Before.Tax.
X5.Year.Return
X..Insider
X..Institutional
X..5.year.Dividend.Growth
X..5.year.Earnings.Growth
X..5.year.Revenue.Growth
Net.Income
Divide.Payout

Observations:

Our model estimates the log normal value of the price of each share based on the dividend paid, earnings per share, log normal value of the average volume traded, price to earnings ratio, dividend yield, and log normal value of the market capitalization of the company.

After our initial data exploration we observed price, volume, and market capitalization to be log normally distributed with a positive skew. For this purpose we tranformed these values to their log normal values. This tranformation increased the accuracy of our estimated values and improved our overall Adjusted R².

Metric Observed	Value	Explanation
Multiple R-squared	0.8759	87.59% of the variablity is described by the variables in this model
Adjusted R-squared	0.8733	87.33% of the variablity is described by the variables in this model
F-statistic	343.5	Significant F-Statistics shows overall significance of the model
P-value	< 2.2e-16	Significant P-Value shows overall significance of the model

#Build the regression model
model.fit <- lm(log(Price)~
             Dividend+
             Earnings.Share+
             log(Volume)+
             Price.Earnings+
             Dividend.Yield+
             log(Market.Cap), sp500)
             #sp500$Beta+
             #sp500$Sector+
             #sp500$`Major Sector`+
             #sp500$PEG+
             #sp500$Return.on.Assets..Before.Tax.+
             #sp500$X5.Year.Return+
             #sp500$X..Insider+
             #sp500$X..Institutional+
             #sp500$X..5.year.Dividend.Growth+
             #sp500$X..5.year.Earnings.Growth+
             #sp500$X..5.year.Revenue.Growth+
             #log(sp500$Net.Income)+
             #sp500$Divide.Payout)))

The following table shows the coeffecients of our model associated to each variable, and their associated significant P-Values:

#Model Summary
library(knitr)
kable(summary(model.fit)$coef, digits=3)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	2.549	0.351	7.272	0.000
Dividend	0.201	0.013	15.093	0.000
Earnings.Share	0.036	0.005	7.755	0.000
log(Volume)	-0.269	0.023	-11.856	0.000
Price.Earnings	0.001	0.001	2.047	0.042
Dividend.Yield	-0.196	0.013	-14.963	0.000
log(Market.Cap)	0.230	0.022	10.265	0.000

#Leverage Plots
leveragePlots(model.fit, pch = 16)

Test for Normality of Residuals:

The P-Value of 0.0989 from Shapiro-Wilk normality test indicates that our residuals are normally distributed. This can also be observed from the QQ-Plot below, which shows a close distribution of the residuals along the line, and also from the residual plot showing an approximately even distribution of residuals above and below the y = 0 line.

#Test for normality of residuals
shapiro.test(model.fit$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model.fit$residuals
## W = 0.99185, p-value = 0.0989

Anova:

As P-Values of the variables in the ANOVA table are less than 0.05, we conclude that there are significant differences between the variables being observed.

#Anova
aov <- aov(model.fit)
summary(aov)

##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## Dividend          1  48.78   48.78  930.70  < 2e-16 ***
## Earnings.Share    1  17.55   17.55  334.90  < 2e-16 ***
## log(Volume)       1   9.81    9.81  187.16  < 2e-16 ***
## Price.Earnings    1   2.55    2.55   48.62 2.08e-11 ***
## Dividend.Yield    1  23.80   23.80  454.08  < 2e-16 ***
## log(Market.Cap)   1   5.52    5.52  105.38  < 2e-16 ***
## Residuals       292  15.30    0.05                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

QQ-Plot and Residual Plot:

#QQPLOT for residuals
qqPlot(model.fit, simulate = TRUE, main = 'Q-Q Plot', labels = FALSE)

##  36 159 
##  35 158

##Residual Plot
ggplot(model.fit)+
geom_point(aes(x = model.fit$fitted.values, y = model.fit$residuals))+
geom_hline(yintercept = 0)+
labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")

Test for non-constant variance:

#Non constant variance test
ncvTest(model.fit)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.6416114, Df = 1, p = 0.42313

Test for Multicollinearity:

No apprent cases of multicollinearity are shown.

#Test for multicollinearity exp(model.fit$fitted.values), "Predicted Price", 
vif <- vif(model.fit)
vif <- data.frame(vif)
kable(vif)

	vif
Dividend	2.387901
Earnings.Share	1.452027
log(Volume)	2.946814
Price.Earnings	1.079173
Dividend.Yield	1.735286
log(Market.Cap)	2.726576

Application of the model:

Applying our test observation to the regression model gives us a predicted price of $267.38 vs. the actual price of $222.89. The variance of predicted from the actual price is $44.49. This test case was used to predict the price of 3M Company(TKR: MMM).

There is a considerable difference between the actual and the predicted price for MMM. This effect could result from the element of uncertainty associated with our model, and the model’s inability to more strongly capture the variablity of price in relation to the variables used in building the model.

The prediction interval has a very wide range for the price of each share of MMM. We do not believe that this model precisely predicts the price of MMM’s share, but we do believe that it can be used an approximation tool.

The follwing table presents predicted vs. actual prices for MMM, along with the intervals:

#Transform data in the test observation to fit the model requirements
sp500_test <- cbind(sp500_test_P$Dividend,
                    sp500_test_P$Earnings.Share,
                    sp500_test_P$Volume,
                    sp500_test_P$Price.Earnings,
                    sp500_test_P$Dividend.Yield,
                    sp500_test_P$Market.Cap)

#Renaming columns in the test dataframe to match the model column names
colnames(sp500_test) <- c("Dividend",
                    "Earnings.Share",
                    "Volume",
                    "Price.Earnings",
                    "Dividend.Yield",
                    "Market.Cap")

sp500_test <- data.frame(sp500_test)

pred <- predict(model.fit, sp500_test, interval = "prediction")

#Actual vs. Predicted Price
avp <- data.frame(cbind(sp500_test_P$Price,exp(pred),log(sp500_test_P$Price), pred))

colnames(avp) <- c("Price ($)", 
                   "Predicted Price ($)",
                   "Lower ($)",
                   "Upper ($)",
                   "Log Normal Price",
                   "Log Normal Predicted Price",
                   "Lower",
                   "Upper")
kable(avp)

Price ($)	Predicted Price ($)	Lower ($)	Upper ($)	Log Normal Price	Log Normal Predicted Price	Lower	Upper
222.89	262.3761	166.2026	414.2005	5.406678	5.569779	5.113207	6.02635

Limitations and Assuptions:

There are certain limitations and assumptions which restrict this model to more accurately predict the price of shares of the companies contained in the dataset which is used to build this model. These limitations include the fact that some of the data points in our dataset contained strong outlier values which had to be eliminated in order to have the residuals of our model be more normally distributed. Other outlier values which were not too far from the remaining dataset, and did not have a relatively high leverage, were still included in the final dataset. In doing so, we tried to capture the random nature of these prices as much as we could. This model also assumes that fundamental and technical indicators are the only predictors of prices of common shares. This assumption can also cause the predictions resulting from this model to not accurately match the current prevailing price of the shares of the company being analyzed. The reason for that is that this model does not account for human emotions/sentiments or other aspects of behavioral finance in the valuation of prices.