Abstract

The aim of the present work is to analyze the productivity of companies within the S&P 500 Index by sector, with a specific focus on the Healthcare sector. This analysis is based on various variables derived from a data set obtained from https://www.kaggle.com/, which contains real-time financial data that is continuously updated. The goal is to identify which variables are directly related to a company’s productivity index, as well as to determine which sectors are the most productive within the S&P 500. The Healthcare sector was chosen as a reference, as it aligns with my MBA focus in Health Systems.

Introduction

The data set of the work was taken from Kaggle after a research of free and available financial that could help us with the analysis. The site from which the data set was downloaded is https://www.kaggle.com/datasets/franoisgeorgesjulien/s-and-p-500-companies-with-financial-information. This site allows to free download data sets in .csv format. The description of the set is:

“This dataset provides comprehensive financial information for companies listed in the S&P 500 index. The dataset encompasses a range of fundamental financial metrics and attributes, making it a valuable resource for financial analysis, investment research, and market insights.”

Analysis and Results

After downloading the set and saving it in the workspace, the first step is to tidy up the data.

First, lets include all the necessary libraries

library(dplyr)
library(datasets)
library(GGally)
library(ggplot2)
library(tidyr)
library(tidyselect)
library(rgl)

After downloading the set and saving it in the workspace, the first step is to tidy up the data. Then we load up the data, show the heads and check all the data types.

SP500 <- read.csv("S&P500_financials.csv")
str(SP500)
## 'data.frame':    505 obs. of  14 variables:
##  $ Symbol        : chr  "MMM" "AOS" "ABT" "ABBV" ...
##  $ Name          : chr  "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ...
##  $ Sector        : chr  "Industrials" "Industrials" "Health Care" "Health Care" ...
##  $ Price         : num  222.9 60.2 56.3 108.5 150.5 ...
##  $ Price.Earnings: num  24.3 27.8 22.5 19.4 25.5 ...
##  $ Dividend.Yield: num  2.33 1.15 1.91 2.5 1.71 ...
##  $ Earnings.Share: num  7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ...
##  $ X52.Week.Low  : num  259.8 68.4 64.6 125.9 162.6 ...
##  $ X52.Week.High : num  175.5 48.9 42.3 60 114.8 ...
##  $ Market.Cap    : num  1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ...
##  $ EBITDA        : num  9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ...
##  $ Price.Sales   : num  4.39 3.58 3.74 6.29 2.6 ...
##  $ Price.Book    : num  11.34 6.35 3.19 26.14 10.62 ...
##  $ SEC.Filings   : chr  "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...

After a research, it is determined that the Earnings.Share variable is the most representative value to manage as a productivity index, because it reflects how the operations generate gains by unit. For simplicity we will recall it EPS and will be defined as our dependant variable (x).

colnames(SP500)[colnames(SP500) == "Earnings.Share"] <- "EPS"

As we see, there is unnecesary data on the set, so we start tyding and retrieving all the non-numerical data, and separating the Healthcare sector data that is out matter

SP500 <- SP500 %>% dplyr::select(-SEC.Filings) #Erase column
healthcare <- SP500 %>%
  filter(Sector == "Health Care") #Filter Healthcare sector
rownames(healthcare) <- healthcare$Name
healthcare_num <- healthcare[, -which(names(healthcare) == "Name")] #Erasing names row from data

healthcare_num <- healthcare_num %>% select(where(is.numeric)) #Selecting only numeric data
healthcare_num <- na.omit(healthcare_num) #Deletes empty rows

After having our data set tidy, now we proceed to do Linear Regression to all the posible predictors, so we can give ourselves and idea of which variables could be more related in a lineal model.

model1 <- lm(EPS ~ ., data = healthcare_num)
summary(model1)
## 
## Call:
## lm(formula = EPS ~ ., data = healthcare_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2024  -2.4581  -0.5838   1.0121  27.7624 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     2.275e+00  1.783e+00   1.277    0.208  
## Price          -4.489e-02  3.922e-02  -1.145    0.258  
## Price.Earnings  2.074e-03  2.185e-02   0.095    0.925  
## Dividend.Yield  7.335e-01  8.745e-01   0.839    0.406  
## X52.Week.Low    4.273e-02  3.737e-02   1.143    0.258  
## X52.Week.High   3.422e-02  4.814e-02   0.711    0.481  
## Market.Cap     -1.220e-12  2.666e-11  -0.046    0.964  
## EBITDA         -8.210e-11  3.120e-10  -0.263    0.794  
## Price.Sales    -4.471e-01  2.274e-01  -1.966    0.055 .
## Price.Book     -1.101e-02  2.994e-02  -0.368    0.715  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.486 on 49 degrees of freedom
## Multiple R-squared:  0.3114, Adjusted R-squared:  0.1849 
## F-statistic: 2.462 on 9 and 49 DF,  p-value: 0.02109

As we can observe, the two predictors with the lowest p value are Price.Sales that represents the price-to-sales ratio, which compares the stock’s price with the companies gains per action, and X52.Week.Low that represents the lowest price at which the stock has traded over the past 52 weeks.

Now we proceed to obtain the best model with the R ‘step’ function:

bestmodel <- step(model1)

Finally, the predictors with the lowest AIC value are determined by the step function

summary(bestmodel)
## 
## Call:
## lm(formula = EPS ~ X52.Week.Low + Price.Sales, data = healthcare_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2616 -2.0858 -0.5304  0.9680 30.9430 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.000457   1.292116   2.322   0.0239 *  
## X52.Week.Low  0.026533   0.006006   4.418 4.62e-05 ***
## Price.Sales  -0.499376   0.189376  -2.637   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.276 on 56 degrees of freedom
## Multiple R-squared:  0.2721, Adjusted R-squared:  0.2461 
## F-statistic: 10.47 on 2 and 56 DF,  p-value: 0.0001374

As we can see in the summary the p value of both variables is very low. This give us enough evidence to reject the Null Hypothesis (H0). So our Alternative Hypothesis (HA) would be that these two predictors have a significant impact on the dependant variable (EPS).

Now we proceed to do linear regression of the two predictors combined, and independently to see which model gives us the lowest R2. Subsequently we plot all the models in a dispersion graph.

EPS vs. Price Sales

The \(R^2\) of the Price.Sales predictor is equal to: 0.0184296.

EPS vs. X52.Week.Low

The \(R^2\) of the X52.Week.Low predictor is equal to: 0.1817196.

And now with model obtained with the step function.The \(R^2\) of the X52.Week.Low+Price.Sales predictors is equal to: 0.2721029.

As we can determine the \(R^2\) is higher.

So now we plot the model with both predictors:

## Warning in abline(bestmodel, col = "orange"): only using the first two of 3
## regression coefficients

As we can observe, the model has a better adjustment.

Due to the step model obtained has 3 variables, it is better to plot the function as a surface:

Analysis by sector

Afterwards we are going to do a final analysis in which we’ll see if the company with the highest EPS belongs to the most productive sector using ANOVA.

First, we propose our 2 hypothesis:

First we determine the most productive sector averaging the EPS:

Productivy Mean by Sectors
Sector EPS Mean
Industrials 5.344030
Materials 4.932000
Consumer Discretionary 4.879048
Financials 4.436912
Telecommunication Services 4.426667
Health Care 4.340492
Consumer Staples 3.659412
Information Technology 3.110000
Real Estate 2.391212
Utilities 2.014643
Energy -1.648438

Now we do the ANOVA test with the EPS and Sector data to extract the p value:

##                Pr(>F)    
## Sector      8.856e-07 ***
## Residuals                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p value is equal to: 8.856x10-7. Therefor, p is lower than 0.05, so the Null Hypothesis will be rejected.

So, then we probe the Alternative Hypothesis with acquiring the Company with the maximum EPS:

The company with the max EPS is: AutoZone Inc and belongs to the Consumer Discretionary sector. The most productive sector with the highest EPS value is Industrials, therefor the Alternative Hypothesis is proven.

Conclussions

Of this analysis we can conclude many interesting facts. The first one is that we can complete relate the increasing or decreasing of the productivity of a Company in the S&P 500 Index through the EPS, acquiring the best predictors in a data set with the step function.

However, the \(R^2\) of the model is still low, rounding the 27% from the function acquired with the step function. It is known that this value in Financial Data sets for linear model is commonly low because of volatility of the stock market. Maybe another type of modeling such exponential or logarithmic would give us a higher \(R^2\) value.

Finally, we figured out that with an ANOVA test, considering 3 or more groups or categories of continuous data type, is very helpful at working with hypothesis probes, as we could reject the alternative hypothesis in the analysis.