If you want hire me as a freelancer please use the following links.
Here, I describe the stock analysis of 100 companies from the period of 2018 through 2020. The goals of this analysis is to understand the trends in the stock values of these companies and identify between stock prices and ESG ratings. The scope of this analysis is solely descriptive and is to pave a way which leads to asking questions that would encourage further investigation. Moreover, I have also performed Panel data regression analysis to test the findings.
The dataset is obtained from
| Attribute | Description |
|---|---|
| Sname | Stock Name |
| Year | Year |
| ASP | Annual Stock Price |
| ASR | Annual Sustainalytics Rank |
| AIOSO | Annual Instit Owner % Shares Out |
| Attribute | Total Entries |
|---|---|
| Sname | 300 |
| Year | 300 |
| ASP | 299 |
| ASR | 275 |
| AIOSO | 299 |
As we have seen in the above that the dataset needs to be preprocessed by changing the null values to zero. Before going to the processing, here shwoing the summary and statistics of the dataset. In the below table, showing the yearly average ASP, ASR and AIOSO of each company.
summary(df)## Sname Year ASP ASR
## Length:300 Min. :2018 Min. : 36.44 Min. : 14.80
## Class :character 1st Qu.:2018 1st Qu.: 455.15 1st Qu.: 55.21
## Mode :character Median :2019 Median : 1187.00 Median : 72.72
## Mean :2019 Mean : 1873.33 Mean : 69.81
## 3rd Qu.:2020 3rd Qu.: 2373.00 3rd Qu.: 85.60
## Max. :2020 Max. :15110.00 Max. :100.00
## NA's :1 NA's :25
## AIOSO
## Min. : 16.19
## 1st Qu.: 72.68
## Median : 94.15
## Mean : 88.79
## 3rd Qu.:107.46
## Max. :162.86
## NA's :1
df2 = df %>% group_by(Sname)
kable(df2[1:15,], caption = "Average ASP, ASR and AIOSO of first 5 Stock company")| Sname | Year | ASP | ASR | AIOSO |
|---|---|---|---|---|
| AAL | 2020 | 2424.5 | 89.60 | 110.695 |
| AAL | 2019 | 2173.0 | 85.28 | 108.878 |
| AAL | 2018 | 1747.8 | 95.19 | 126.537 |
| ABF | 2020 | 1938.5 | 44.25 | 96.679 |
| ABF | 2019 | 2343.0 | 44.42 | 96.469 |
| ABF | 2018 | 2236.0 | 38.84 | 90.081 |
| ADM | 2020 | 2906.0 | 27.83 | 78.236 |
| ADM | 2019 | 2309.0 | 26.87 | 73.921 |
| ADM | 2018 | 2047.0 | 24.45 | 72.774 |
| AHT | 2020 | 2175.0 | 39.00 | 99.140 |
| AHT | 2019 | 2122.0 | 29.18 | 91.987 |
| AHT | 2018 | 2033.0 | 21.96 | 88.046 |
| ANTO | 2020 | 1440.5 | 83.81 | 101.889 |
| ANTO | 2019 | 916.8 | 19.78 | 100.552 |
| ANTO | 2018 | 783.2 | 14.80 | 93.864 |
Data processing is the process of making the raw data suitable for machine learning models. It is the first and crucial step while building the models. Mostly the data consist of noises, missing values, and maybe in an unusable format which cannot be directly used for machine learning models. As we seen in the previous section, our dataset also contained missing values in the column of ASP, ASR and AIOSO. In order to handle the missing values, we have added 0 on the place of missing values. The code is as follows.
df[is.na(df)] = 0The dataset being a time series dataset, I have plotted the statistics about stock companies, their prices and ranks from 2018 to 2020. The plot shows some spikes and interesting facts about the relations between various variables (i.e., stock price, sustain rank and share out percentage). All the graphs shows linear and non linear relations. ### Charts of Variables
#display the dependent variables
hist(df$ASP, freq = FALSE, col = 6, xlab = "ASP", main = "Anual Stock Price")hist(df$ASR, freq = FALSE, col = 5, xlab = "ASR", main = "Annual Sustainalytics Rank")hist(df$AIOSO, freq = FALSE, col = 5, xlab = "AIOSO", main = "Annual Instit Owner % Shares Out")plot(df$ASP~df$ASR,data = df, xlab ="ASR", ylab = "ASP", col=1, main = "ASP vs ASR")plot(df$ASP~df$AIOSO,data = df, xlab ="AIOSO", ylab = "ASP", col=6, main ="ASP vs AIOSO")Using the dataset, it is possible to view the relationship between Anual Stock Price, Annual Sustainalytics Rank and Annual Instit Owner % Shares Out. To do this, a left join is used to combine. The outcome of the join is then used to plot the relationship between these variables.
df %>% ggplot(aes(ASR, ASP)) + geom_point(color="blue") + geom_smooth(model=lm, color = "black") +
labs(title =paste0("STOCK",": Relationship between ASP and ASR"))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df %>% ggplot(aes(AIOSO, ASP)) + geom_point(color="blue") + geom_smooth(model=lm, color = "black") +
labs(title =paste0("STOCK",": Relationship between ASP and AIOSO"))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df %>% ggplot(aes(AIOSO, ASR)) + geom_point(color="blue") + geom_smooth(model=lm, color = "black") +
labs(title =paste0("STOCK",": Relationship between ASR and AIOSO"))## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The outcome of the above figures shows that the relationship between ASP and ASR as well as between ASP and AIOSO is somehow linear. However, we can see that few values within these variables are out of the scope. In other words, when the anual stock pice goes high, it disturn the linear relationship between the variables. In case of ASP and AIOSO, the relationship is smooth an linear, however, the relationshihp showing high density when the prices goes 0 to 6000 and share out goes 50 to 125. Similarly, in case of AIOSO and ASR, figure is not showing the linear relation. Figure showing the high density when the ASR goes between 50 to 100 and AIOSO goes 50 to 120. So, by seeing the relationship between ASP and ASR, we can say that Anual Stock Price depends on the Sustain Rank, however, when the prices goes above the 6000 it does not follow the relationship behaviour.
ggplot(df, aes(Year, ASP)) + geom_bar(stat = "identity", aes(fill=Sname)) + ggtitle(paste0("STOCK", ": Anual Stock Price")) ggplot(df, aes(Year, ASR)) + geom_bar(stat = "identity", aes(fill=Sname)) + ggtitle(paste0("STOCK", ": Annual Sustainalytics Rank")) ggplot(df, aes(Year, AIOSO)) + geom_bar(stat = "identity", aes(fill=Sname)) + ggtitle(paste0("STOCK", ": Annual Instit Owner % Shares Out")) In this section, I have shown the stats about first 5 stoc companies. The left side in the Figure showing the ASP, right side is showing the AIOS and the middle Figure showing the ASR for each company. We can see that the company SPX increased their stock price exponantially, while, the rest of the companies did not. Moreover, the AAL have high stock price than rest of the companies. In case of Rank, AAL have the highest rank, expect in year 2019, than rest of the companies. However, in 2019, CRH have the highest rank. Similarly, in case of share out percentage, AAL and SPX have high share out percentage.
stck_5 <- subset(df,
(df$Sname == "AAL" |
df$Sname == "ABF" |
df$Sname == "ADM" |
df$Sname == "PRU" |
df$Sname == "SPX" |
df$Sname == "CRH"))
#Five Stock Companies
v1 = ggplot(data = stck_5, aes(x = Year, y = ASP)) +
geom_line(aes(color = Sname)) +
xlab("Years") +
ylab("ASP") +
ggtitle("Annual Stock Price")
v2 = ggplot(data = stck_5, aes(x = Year, y = ASR)) +
geom_line(aes(color = Sname)) +
xlab("Years") +
ylab("ASR") +
ggtitle("Annual Sustainalytics Rank")
v3 = ggplot(data = stck_5, aes(x = Year, y = AIOSO)) +
geom_line(aes(color = Sname)) +
xlab("Years") +
ylab("AIOSO") +
ggtitle("Annual Instit Owner % Shares Out")
grid.arrange(v1, v2, v3, ncol = 3)Here, in this section, the comparisons between stoc companies are perforemd over ASR. The X-axis showing the ASP while the Y-axis showing the ASR. The left Figure showing the first 10 stock companies, while, the right Figure showing the last 10 stock companies. As we can see in the right Figure, in case of TW, the stock company have rank between 0 to 25 along with stock price between 0 to 6000. Similarly, the first 10 companies have low having low stock prices with low rank. The same behaviour can be shown in the rest of the companies.
hlfdf = subset(df,df$Sname %in% unique(df$Sname)[1:10])
hlfdf2 = subset(df,df$Sname %in% unique(df$Sname)[91:100])
k1 = ggplot(hlfdf, aes(x=ASP, y=ASR, color=factor(Sname))) + geom_boxplot() + geom_jitter(width=.1) + facet_wrap(factor(Sname) ~ .) +labs(x='Anual Stock Price', y='Anual Sustain Rank') + theme(legend.position='none')
k2 = ggplot(hlfdf2, aes(x=ASP, y=ASR, color=factor(Sname))) + geom_boxplot() + geom_jitter(width=.1) + facet_wrap(factor(Sname) ~ .) +labs(x='Anual Stock Price', y='Anual Sustain Rank') + theme(legend.position='none')
grid.arrange(k1,k2,ncol=2)Regression using panel data may mitigate omitted variable bias when there is no information on variables that correlate with both the regressors of interest and the independent variable and if these variables are constant in the time dimension or across entities. Provided that panel data is available panel regression methods may improve upon multiple regression models which produce results that are not internally valid in such a setting. So, here we have applied build panel modesl on the given dataset. For the setting we have used Stock price as dependent variable, while, ASP and AIOSO as independent variables.
We have used plm(), a convenient R function that enables us to estimate linear panel regression models which comes with the package plm (Croissant, Millo, and Tappe 2020). Usage of plm() is very similar as for the function lm() which have been used by the majority of the developer in the literature. Sometimes panel data is also called longitudinal data as it adds a temporal dimension to cross-sectional data. Let us have a look at the dataset by checking its structure and listing the first few observations.
head(df)## Sname Year ASP ASR AIOSO
## 1 AAL 2020 2424.5 89.60 110.695
## 2 AAL 2019 2173.0 85.28 108.878
## 3 AAL 2018 1747.8 95.19 126.537
## 4 ABF 2020 1938.5 44.25 96.679
## 5 ABF 2019 2343.0 44.42 96.469
## 6 ABF 2018 2236.0 38.84 90.081
str(df)## 'data.frame': 300 obs. of 5 variables:
## $ Sname: chr "AAL" "AAL" "AAL" "ABF" ...
## $ Year : num 2020 2019 2018 2020 2019 ...
## $ ASP : num 2424 2173 1748 1938 2343 ...
## $ ASR : num 89.6 85.3 95.2 44.2 44.4 ...
## $ AIOSO: num 110.7 108.9 126.5 96.7 96.5 ...
We find that the dataset consists of 300 observations on 5 variables. Since all variables are observed for all entities and over all time periods, the panel is balanced. If there were missing data for at least one entity in at least one time period we would call the panel unbalanced. As in the preprocessing step we have made the data balanced by adding 0 to the null values. Now, the data is in the shape that is suitable for the regression analysis.
#Model 3, OLS Pooled Model with ASP, ASR and AIOSO
mod3 <- plm(ASP ~ ASR + AIOSO,
data = df, index = c("Sname"), model = "pooling")
summary(mod3)## Pooling Model
##
## Call:
## plm(formula = ASP ~ ASR + AIOSO, data = df, model = "pooling",
## index = c("Sname"))
##
## Balanced Panel: n = 100, T = 3, N = 300
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2181.23 -1338.50 -644.25 549.85 13297.10
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 305.0623 548.2243 0.5565 0.578319
## ASR 7.2956 4.5250 1.6123 0.107962
## AIOSO 12.3757 4.5895 2.6965 0.007407 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 1319400000
## Residual Sum of Squares: 1281800000
## R-Squared: 0.028488
## Adj. R-Squared: 0.021946
## F-statistic: 4.35449 on 2 and 297 DF, p-value: 0.013679
yhat <- mod3$df.residual
ggplot(df, aes(x = ASR + AIOSO, y = ASP))+
geom_point() +
geom_smooth(method=lm)## `geom_smooth()` using formula 'y ~ x'
#Model 4, Fixed Effect Model with ASP, ASR and AIOSO
mod4 <- plm(ASP ~ ASR + AIOSO,
data = df, index = c("Sname"), model = "within")
summary(mod4)## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = ASP ~ ASR + AIOSO, data = df, model = "within",
## index = c("Sname"))
##
## Balanced Panel: n = 100, T = 3, N = 300
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -3790.1989 -124.0785 7.5393 143.4727 4795.7467
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## ASR -0.11823 3.29004 -0.0359 0.9714
## AIOSO -4.85512 3.89020 -1.2480 0.2135
##
## Total Sum of Squares: 122300000
## Residual Sum of Squares: 121340000
## R-Squared: 0.0078273
## Adj. R-Squared: -0.49828
## F-statistic: 0.781021 on 2 and 198 DF, p-value: 0.45934
yhat <- mod4$df.residual
ggplot(df, aes(x = ASR + AIOSO, y = ASP))+
geom_point() +
geom_smooth(method=lm)## `geom_smooth()` using formula 'y ~ x'
The first model is Pooled model with ASP, ASR and AIOS, while, the second is fixed effect model with the same variables. In case of fixed effect model, the sampling distribution of the OLS estimator in the fixed effects regression model is normal in large samples. In this case, each stock company is allowed its own intercept, so that the slope on the ASP variable can be interpreted as the effect of a change in the ASR and AIOSO within stock company. The model implicitly assumes that within each stock company, the ASP has the same marginal impact on ASR and AIOSO. The variance of the estimates can be estimated and we can compute standard errors,t-statistics and confidence intervals for coefficients. As you can see, the estimated coeffients for the ASR and AIOSO are negative, where, the estimated value for ASR is -0.11 and for AIOSO is -4.85. Similarly, R suared is 0.007 and P value is 0.45 which is not significant. Results show that the fixed effect model with proportional odds is not good than the null model as indicated by the p-value, which is greater than 0.01.
Similarly, in case of pooled model, the P value is less than or equal to 0.01 which is equal to the null hypothesis ans suggested that pooled model is better than the fixed effect model.
In both plots, each point represents observations of ASP, ASR and AIOSO for a given Stock company in the respective year. The regression results indicate a positive relationship between the ASP, ASR and AIOSO for all the years (i.e., 2018, 2019 and 2020), however, in the fixed effect model it is negative relationship. The estimated coefficient on ASR is possitive and significantly not different from zero at 5%. Its interpretation is that raising the stock price causes ASR and AIOSO to decrease.
To decide between fixed effect or pooled model you can run a Hausman test where the null hypothesis is that the preferred model is pooled model vs. the alternative. It basically tests whether the unique errors are correlated with the regressors, the null hypothesis is they are not. If the p-value is significant (for example <0.05) then use pooled model, if not use fixed effects.
phtest(mod3,mod4)##
## Hausman Test
##
## data: ASP ~ ASR + AIOSO
## chisq = 51.343, df = 2, p-value = 7.096e-12
## alternative hypothesis: one model is inconsistent
In the results, we can see that p value is less than 0.05 which is 7.096e-12. Which is showing that pooled model is better to used.