Service as a Freelancer

If you want hire me as a freelancer please use the following links.

Introduction

Here, I describe the stock analysis of 100 companies from the period of 2018 through 2020. The goals of this analysis is to understand the trends in the stock values of these companies and identify between stock prices and ESG ratings. The scope of this analysis is solely descriptive and is to pave a way which leads to asking questions that would encourage further investigation. Moreover, I have also performed Panel data regression analysis to test the findings.

The dataset is obtained from . The data set has the company Stock Name, Year, Annual Stock Price ,Annual Sustainalytics Rank and Annual Instit Owner % Shares Out. The dataset records values from 2018 to Dec. 2020. The dataset consists of 5 columns and the names of the 100 companies are enlisted by finding the unique entries in the Stock Name column. There are 300 observations. Before going to perform the analysis, I have changed the column names of dataset as follows.

Description about the fields in the Dataset
Attribute Description
Sname Stock Name
Year Year
ASP Annual Stock Price
ASR Annual Sustainalytics Rank
AIOSO Annual Instit Owner % Shares Out
Number of Entries in the columns of Dataset
Attribute Total Entries
Sname 300
Year 300
ASP 299
ASR 275
AIOSO 299

As we have seen in the above that the dataset needs to be preprocessed by changing the null values to zero. Before going to the processing, here shwoing the summary and statistics of the dataset. In the below table, showing the yearly average ASP, ASR and AIOSO of each company.

summary(df)
##     Sname                Year           ASP                ASR        
##  Length:300         Min.   :2018   Min.   :   36.44   Min.   : 14.80  
##  Class :character   1st Qu.:2018   1st Qu.:  455.15   1st Qu.: 55.21  
##  Mode  :character   Median :2019   Median : 1187.00   Median : 72.72  
##                     Mean   :2019   Mean   : 1873.33   Mean   : 69.81  
##                     3rd Qu.:2020   3rd Qu.: 2373.00   3rd Qu.: 85.60  
##                     Max.   :2020   Max.   :15110.00   Max.   :100.00  
##                                    NA's   :1          NA's   :25      
##      AIOSO       
##  Min.   : 16.19  
##  1st Qu.: 72.68  
##  Median : 94.15  
##  Mean   : 88.79  
##  3rd Qu.:107.46  
##  Max.   :162.86  
##  NA's   :1
df2 = df %>% group_by(Sname)
kable(df2[1:15,], caption = "Average ASP, ASR and AIOSO of first 5 Stock company")
Average ASP, ASR and AIOSO of first 5 Stock company
Sname Year ASP ASR AIOSO
AAL 2020 2424.5 89.60 110.695
AAL 2019 2173.0 85.28 108.878
AAL 2018 1747.8 95.19 126.537
ABF 2020 1938.5 44.25 96.679
ABF 2019 2343.0 44.42 96.469
ABF 2018 2236.0 38.84 90.081
ADM 2020 2906.0 27.83 78.236
ADM 2019 2309.0 26.87 73.921
ADM 2018 2047.0 24.45 72.774
AHT 2020 2175.0 39.00 99.140
AHT 2019 2122.0 29.18 91.987
AHT 2018 2033.0 21.96 88.046
ANTO 2020 1440.5 83.81 101.889
ANTO 2019 916.8 19.78 100.552
ANTO 2018 783.2 14.80 93.864

Preprocessing

Data processing is the process of making the raw data suitable for machine learning models. It is the first and crucial step while building the models. Mostly the data consist of noises, missing values, and maybe in an unusable format which cannot be directly used for machine learning models. As we seen in the previous section, our dataset also contained missing values in the column of ASP, ASR and AIOSO. In order to handle the missing values, we have added 0 on the place of missing values. The code is as follows.

df[is.na(df)] = 0

Data Analysis

The dataset being a time series dataset, I have plotted the statistics about stock companies, their prices and ranks from 2018 to 2020. The plot shows some spikes and interesting facts about the relations between various variables (i.e., stock price, sustain rank and share out percentage). All the graphs shows linear and non linear relations. ### Charts of Variables

#display the dependent variables 
hist(df$ASP, freq = FALSE, col = 6, xlab = "ASP", main = "Anual Stock Price")

hist(df$ASR, freq = FALSE, col = 5, xlab = "ASR", main = "Annual Sustainalytics Rank")

hist(df$AIOSO, freq = FALSE, col = 5, xlab = "AIOSO", main = "Annual Instit Owner % Shares Out")

plot(df$ASP~df$ASR,data = df, xlab ="ASR", ylab = "ASP", col=1, main = "ASP vs ASR")

plot(df$ASP~df$AIOSO,data = df, xlab ="AIOSO", ylab = "ASP", col=6, main ="ASP vs AIOSO")

Relationship Between Variables

Using the dataset, it is possible to view the relationship between Anual Stock Price, Annual Sustainalytics Rank and Annual Instit Owner % Shares Out. To do this, a left join is used to combine. The outcome of the join is then used to plot the relationship between these variables.

df %>% ggplot(aes(ASR, ASP)) + geom_point(color="blue") + geom_smooth(model=lm, color = "black") +
   labs(title =paste0("STOCK",": Relationship between ASP and ASR"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df %>% ggplot(aes(AIOSO, ASP)) + geom_point(color="blue") + geom_smooth(model=lm, color = "black") +
   labs(title =paste0("STOCK",": Relationship between ASP and AIOSO"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df %>% ggplot(aes(AIOSO, ASR)) + geom_point(color="blue") + geom_smooth(model=lm, color = "black") +
   labs(title =paste0("STOCK",": Relationship between ASR and AIOSO"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The outcome of the above figures shows that the relationship between ASP and ASR as well as between ASP and AIOSO is somehow linear. However, we can see that few values within these variables are out of the scope. In other words, when the anual stock pice goes high, it disturn the linear relationship between the variables. In case of ASP and AIOSO, the relationship is smooth an linear, however, the relationshihp showing high density when the prices goes 0 to 6000 and share out goes 50 to 125. Similarly, in case of AIOSO and ASR, figure is not showing the linear relation. Figure showing the high density when the ASR goes between 50 to 100 and AIOSO goes 50 to 120. So, by seeing the relationship between ASP and ASR, we can say that Anual Stock Price depends on the Sustain Rank, however, when the prices goes above the 6000 it does not follow the relationship behaviour.

Stats Stock Companies

ggplot(df, aes(Year, ASP)) + geom_bar(stat = "identity", aes(fill=Sname))  + ggtitle(paste0("STOCK", ": Anual Stock Price")) 

ggplot(df, aes(Year, ASR)) + geom_bar(stat = "identity", aes(fill=Sname))  + ggtitle(paste0("STOCK", ": Annual Sustainalytics Rank")) 

ggplot(df, aes(Year, AIOSO)) + geom_bar(stat = "identity", aes(fill=Sname))  + ggtitle(paste0("STOCK", ": Annual Instit Owner % Shares Out")) 

First Five Stock Companies Analysis

In this section, I have shown the stats about first 5 stoc companies. The left side in the Figure showing the ASP, right side is showing the AIOS and the middle Figure showing the ASR for each company. We can see that the company SPX increased their stock price exponantially, while, the rest of the companies did not. Moreover, the AAL have high stock price than rest of the companies. In case of Rank, AAL have the highest rank, expect in year 2019, than rest of the companies. However, in 2019, CRH have the highest rank. Similarly, in case of share out percentage, AAL and SPX have high share out percentage.

stck_5 <- subset(df, 
                         (df$Sname == "AAL" | 
                          df$Sname == "ABF" | 
                          df$Sname == "ADM" |
                          df$Sname == "PRU" |
                          df$Sname == "SPX" |
                          df$Sname == "CRH"))

#Five Stock Companies

v1 = ggplot(data = stck_5, aes(x = Year, y = ASP)) +
   geom_line(aes(color = Sname)) +
   xlab("Years") +
   ylab("ASP") +
   ggtitle("Annual Stock Price")

v2 = ggplot(data = stck_5, aes(x = Year, y = ASR)) +
   geom_line(aes(color = Sname)) +
   xlab("Years") +
   ylab("ASR") +
   ggtitle("Annual Sustainalytics Rank")

v3 = ggplot(data = stck_5, aes(x = Year, y = AIOSO)) +
   geom_line(aes(color = Sname)) +
   xlab("Years") +
   ylab("AIOSO") +
   ggtitle("Annual Instit Owner % Shares Out")
grid.arrange(v1, v2, v3, ncol = 3)

Comparisons between First 10 and Last 10 Stock Companies

Here, in this section, the comparisons between stoc companies are perforemd over ASR. The X-axis showing the ASP while the Y-axis showing the ASR. The left Figure showing the first 10 stock companies, while, the right Figure showing the last 10 stock companies. As we can see in the right Figure, in case of TW, the stock company have rank between 0 to 25 along with stock price between 0 to 6000. Similarly, the first 10 companies have low having low stock prices with low rank. The same behaviour can be shown in the rest of the companies.

hlfdf = subset(df,df$Sname %in% unique(df$Sname)[1:10])
hlfdf2 = subset(df,df$Sname %in% unique(df$Sname)[91:100])

k1 = ggplot(hlfdf, aes(x=ASP, y=ASR, color=factor(Sname))) + geom_boxplot() + geom_jitter(width=.1) + facet_wrap(factor(Sname) ~ .) +labs(x='Anual Stock Price', y='Anual Sustain Rank') + theme(legend.position='none')

k2 = ggplot(hlfdf2, aes(x=ASP, y=ASR, color=factor(Sname))) + geom_boxplot() + geom_jitter(width=.1) + facet_wrap(factor(Sname) ~ .) +labs(x='Anual Stock Price', y='Anual Sustain Rank') + theme(legend.position='none')

grid.arrange(k1,k2,ncol=2)

Regression Analysis

Regression using panel data may mitigate omitted variable bias when there is no information on variables that correlate with both the regressors of interest and the independent variable and if these variables are constant in the time dimension or across entities. Provided that panel data is available panel regression methods may improve upon multiple regression models which produce results that are not internally valid in such a setting. So, here we have applied build panel modesl on the given dataset. For the setting we have used Stock price as dependent variable, while, ASP and AIOSO as independent variables.

We have used plm(), a convenient R function that enables us to estimate linear panel regression models which comes with the package plm (Croissant, Millo, and Tappe 2020). Usage of plm() is very similar as for the function lm() which have been used by the majority of the developer in the literature. Sometimes panel data is also called longitudinal data as it adds a temporal dimension to cross-sectional data. Let us have a look at the dataset by checking its structure and listing the first few observations.

head(df)
##   Sname Year    ASP   ASR   AIOSO
## 1   AAL 2020 2424.5 89.60 110.695
## 2   AAL 2019 2173.0 85.28 108.878
## 3   AAL 2018 1747.8 95.19 126.537
## 4   ABF 2020 1938.5 44.25  96.679
## 5   ABF 2019 2343.0 44.42  96.469
## 6   ABF 2018 2236.0 38.84  90.081
str(df)
## 'data.frame':    300 obs. of  5 variables:
##  $ Sname: chr  "AAL" "AAL" "AAL" "ABF" ...
##  $ Year : num  2020 2019 2018 2020 2019 ...
##  $ ASP  : num  2424 2173 1748 1938 2343 ...
##  $ ASR  : num  89.6 85.3 95.2 44.2 44.4 ...
##  $ AIOSO: num  110.7 108.9 126.5 96.7 96.5 ...

We find that the dataset consists of 300 observations on 5 variables. Since all variables are observed for all entities and over all time periods, the panel is balanced. If there were missing data for at least one entity in at least one time period we would call the panel unbalanced. As in the preprocessing step we have made the data balanced by adding 0 to the null values. Now, the data is in the shape that is suitable for the regression analysis.

Pooled Model

#Model 3, OLS Pooled Model with ASP, ASR and AIOSO
mod3 <- plm(ASP ~ ASR + AIOSO,
            data = df, index = c("Sname"), model = "pooling")
summary(mod3)
## Pooling Model
## 
## Call:
## plm(formula = ASP ~ ASR + AIOSO, data = df, model = "pooling", 
##     index = c("Sname"))
## 
## Balanced Panel: n = 100, T = 3, N = 300
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -2181.23 -1338.50  -644.25   549.85 13297.10 
## 
## Coefficients:
##             Estimate Std. Error t-value Pr(>|t|)   
## (Intercept) 305.0623   548.2243  0.5565 0.578319   
## ASR           7.2956     4.5250  1.6123 0.107962   
## AIOSO        12.3757     4.5895  2.6965 0.007407 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    1319400000
## Residual Sum of Squares: 1281800000
## R-Squared:      0.028488
## Adj. R-Squared: 0.021946
## F-statistic: 4.35449 on 2 and 297 DF, p-value: 0.013679
yhat <- mod3$df.residual
ggplot(df, aes(x = ASR + AIOSO, y = ASP))+
   geom_point() +
   geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Fixed Effect Model

#Model 4, Fixed Effect Model with ASP, ASR and AIOSO
mod4 <- plm(ASP ~ ASR + AIOSO,
            data = df, index = c("Sname"), model = "within")
summary(mod4)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = ASP ~ ASR + AIOSO, data = df, model = "within", 
##     index = c("Sname"))
## 
## Balanced Panel: n = 100, T = 3, N = 300
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -3790.1989  -124.0785     7.5393   143.4727  4795.7467 
## 
## Coefficients:
##       Estimate Std. Error t-value Pr(>|t|)
## ASR   -0.11823    3.29004 -0.0359   0.9714
## AIOSO -4.85512    3.89020 -1.2480   0.2135
## 
## Total Sum of Squares:    122300000
## Residual Sum of Squares: 121340000
## R-Squared:      0.0078273
## Adj. R-Squared: -0.49828
## F-statistic: 0.781021 on 2 and 198 DF, p-value: 0.45934
yhat <- mod4$df.residual
ggplot(df, aes(x = ASR + AIOSO, y = ASP))+
   geom_point() +
   geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Results Descussion

The first model is Pooled model with ASP, ASR and AIOS, while, the second is fixed effect model with the same variables. In case of fixed effect model, the sampling distribution of the OLS estimator in the fixed effects regression model is normal in large samples. In this case, each stock company is allowed its own intercept, so that the slope on the ASP variable can be interpreted as the effect of a change in the ASR and AIOSO within stock company. The model implicitly assumes that within each stock company, the ASP has the same marginal impact on ASR and AIOSO. The variance of the estimates can be estimated and we can compute standard errors,t-statistics and confidence intervals for coefficients. As you can see, the estimated coeffients for the ASR and AIOSO are negative, where, the estimated value for ASR is -0.11 and for AIOSO is -4.85. Similarly, R suared is 0.007 and P value is 0.45 which is not significant. Results show that the fixed effect model with proportional odds is not good than the null model as indicated by the p-value, which is greater than 0.01.

Similarly, in case of pooled model, the P value is less than or equal to 0.01 which is equal to the null hypothesis ans suggested that pooled model is better than the fixed effect model.

In both plots, each point represents observations of ASP, ASR and AIOSO for a given Stock company in the respective year. The regression results indicate a positive relationship between the ASP, ASR and AIOSO for all the years (i.e., 2018, 2019 and 2020), however, in the fixed effect model it is negative relationship. The estimated coefficient on ASR is possitive and significantly not different from zero at 5%. Its interpretation is that raising the stock price causes ASR and AIOSO to decrease.

Testing the models

To decide between fixed effect or pooled model you can run a Hausman test where the null hypothesis is that the preferred model is pooled model vs. the alternative. It basically tests whether the unique errors are correlated with the regressors, the null hypothesis is they are not. If the p-value is significant (for example <0.05) then use pooled model, if not use fixed effects.

phtest(mod3,mod4)
## 
##  Hausman Test
## 
## data:  ASP ~ ASR + AIOSO
## chisq = 51.343, df = 2, p-value = 7.096e-12
## alternative hypothesis: one model is inconsistent

In the results, we can see that p value is less than 0.05 which is 7.096e-12. Which is showing that pooled model is better to used.