Data Exercise - Global Investment Bank

Paul Jozefek

2020-08-18

During the interview process with a global investment bank I received the following dataset and questions. The answers were meant to be brief (completed in a few hours) and the analysis reflects this restriction.

Question 1: Which normalization measure would you use as you investigate the relationship between the use of alternative data and the accuracy of analysts forecasts?

Data Exploration

I began by importing the dataset into R and simplifying the column names. The data table is displayed below:

Variable	Description
analyst	Identifies analyst
issuer	Company variable
industry	Identifies the company’s industry
currency	Identifies the company’s currency
earnings	What total earnings turned out to be
days	The number of days between the forecast and earnings announcement
alt	True if alternative data was used
news	Analyst update during a period of extraordinary news flow
report	Which quarter has just been announced relative to updated quarter
fe1	normalized measure of the difference between forecast and actual earnings, Data Scientist 1
fe2	normalized measure of the difference between forecast and actual earnings, Data Scientist 2

Visualizations

We can see from a histogram of forecast error 1 that the normalized errors are clustered near 0, with most below 5. The largest value is just below 21

The normalized errors are also clustered near 0 for forecast error 2, but there are many extreme values, with a maximum near 1,861

It is difficult to see the extreme values for forecast error 2, but we can zoom in by shortening the y-axis

y-axis shortened

We can also see the extreme values for forecast 2 when viewed by alt data

Conclusion to Question 1:The forecast error is supposed to be normalized11 example: % change. We would expect a very wide range of values if it was an absolute deviation, like a dollar amount.22 company earnings vary greatly In this case, forecast error 1 aligns more closely with a normalized value, and so that is what I will analyze.

Question 2: Build a model that describes the relationship between the use of alternative data and forecast error.

Before creating a model I changed report to FALSE if it was an NA or TRUE otherwise, added a column with days as a quartile33 the most negative as quartile 1, and changed currency, industry, issuer, and analyst to factors. I was then able to view how some of these factors impacted the forecast error and differed by alt.

We can see that there tends to be a noticeable improvement in forecast error when alternative data is used, and that in some instances the improvement varies by factor.

There doesn’t appear to be a very strong linear relationship between days and forecast error, but we can see a difference in effect by alternative data

Model

In constructing a model my objective was to analyze which factors influence forecast error 1, with particular attention on the impact from alternative data. I made the follwing assumptions/changes:

I changed report to a boolean “True” or “False”. I thought that was more relevant than the current format.
I converted the qualitative variables to factors.
For my initial simple model I excluded analyst, industry, and issuer. There were too many different levels.
I did not evaluate all possible interactions. I only evaluated the interactions with alternative data.


Call:
lm(formula = fe1 ~ currency + days + alt + news + report + alt:news + 
    alt:report + alt:days + alt:currency, data = forecasts)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7751 -0.4361 -0.2459  0.0667 20.3120 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.5296536  0.0547938   9.666  < 2e-16 ***
currency2          -0.0401229  0.1337458  -0.300 0.764200    
currency3          -0.0846681  0.0718474  -1.178 0.238701    
currency4          -0.1507086  0.1140637  -1.321 0.186500    
days               -0.0008398  0.0002503  -3.355 0.000803 ***
altTRUE            -0.2503118  0.0973847  -2.570 0.010201 *  
newsTRUE           -0.0127584  0.0447830  -0.285 0.775743    
reportTRUE         -0.1564258  0.0448914  -3.485 0.000499 ***
altTRUE:newsTRUE   -0.1023730  0.0795601  -1.287 0.198272    
altTRUE:reportTRUE  0.0893810  0.0806973   1.108 0.268108    
days:altTRUE       -0.0007374  0.0004719  -1.563 0.118228    
currency2:altTRUE   0.0254025  0.2990128   0.085 0.932302    
currency3:altTRUE   0.1033533  0.1133841   0.912 0.362078    
currency4:altTRUE   0.0868071  0.1757551   0.494 0.621401    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 3470 degrees of freedom
Multiple R-squared:  0.01789,   Adjusted R-squared:  0.01421 
F-statistic: 4.862 on 13 and 3470 DF,  p-value: 1.691e-08

Analysis of Model Output

We can see from the p-values44 many >0.05, which is the typical cutoff when evaluating statistical significance that most ceofficients are not significant. Instead of visually selecting which to remove, I used step-wise selection to find the best model.

Step-Wise Selection

Stepwise regression iteratively adds and removes predictors until the best model is found. It begins with only the intercept, adds the best variable55 as measured by the impact to R-squared, and continues to add variables in order of their impact. If a variable becomes insignificant as more variables are added66 the p-value becomes too high it will be removed. The process continues until the best model is found.

The process is presented below:

Start:  AIC=436.3
fe1 ~ 1

           Df Sum of Sq    RSS    AIC
+ days      1    32.910 3913.6 409.13
+ alt       1    15.735 3930.8 424.38
+ report    1    12.929 3933.6 426.87
+ news      1     4.279 3942.3 434.52
<none>                  3946.5 436.30
+ currency  3     4.289 3942.2 438.51

Step:  AIC=409.13
fe1 ~ days

           Df Sum of Sq    RSS    AIC
+ alt       1    13.103 3900.5 399.44
+ report    1    13.080 3900.5 399.46
<none>                  3913.6 409.13
+ news      1     2.135 3911.5 409.23
+ currency  3     3.515 3910.1 412.00
- days      1    32.910 3946.5 436.30

Step:  AIC=399.44
fe1 ~ days + alt

           Df Sum of Sq    RSS    AIC
+ report    1   12.9488 3887.6 389.86
+ days:alt  1    2.7464 3897.8 398.99
<none>                  3900.5 399.44
+ news      1    1.7893 3898.7 399.84
+ currency  3    2.3119 3898.2 403.38
- alt       1   13.1030 3913.6 409.13
- days      1   30.2786 3930.8 424.38

Step:  AIC=389.86
fe1 ~ days + alt + report

             Df Sum of Sq    RSS    AIC
+ days:alt    1    2.9610 3884.6 389.20
<none>                    3887.6 389.86
+ news        1    1.7942 3885.8 390.25
+ alt:report  1    1.5094 3886.1 390.50
+ currency    3    2.5756 3885.0 393.55
- report      1   12.9488 3900.5 399.44
- alt         1   12.9714 3900.5 399.46
- days        1   30.4351 3918.0 415.03

Step:  AIC=389.2
fe1 ~ days + alt + report + days:alt

             Df Sum of Sq    RSS    AIC
<none>                    3884.6 389.20
+ news        1    1.8338 3882.8 389.56
- days:alt    1    2.9610 3887.6 389.86
+ alt:report  1    1.4290 3883.2 389.92
+ currency    3    2.9570 3881.7 392.55
- report      1   13.1633 3897.8 398.99

Best Model


Call:
lm(formula = fe1 ~ days + alt + report + days:alt, data = forecasts)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7505 -0.4345 -0.2528  0.0770 20.3329 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.4919317  0.0477794  10.296  < 2e-16 ***
days         -0.0008837  0.0002476  -3.570 0.000362 ***
altTRUE      -0.2503091  0.0822265  -3.044 0.002351 ** 
reportTRUE   -0.1277128  0.0371961  -3.433 0.000603 ***
days:altTRUE -0.0007624  0.0004682  -1.628 0.103519    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 3479 degrees of freedom
Multiple R-squared:  0.01569,   Adjusted R-squared:  0.01456 
F-statistic: 13.86 on 4 and 3479 DF,  p-value: 3.193e-11

Residual Diagnostics

Unfortunately the model suffers from two big problems. A very low adjusted R-squared77 Adj. R-squared 0.01456, and poor residual diagnostics. The residuals should have a constant mean, a constant variance, and no clear pattern. We can see from the plot that this is not the case.

The residuals do not have constant variance

Questions

Do forecasts for outcomes that are far off in the future tend to be more/less accurate than for outcomes that are sooner? If so, how would you quantify the difference?

Forecasts that are far off tend to be less accurate, but we can see from the scatterplot of days vs. fe188 Figure 5 that the relationship isn’t strong. The best linear model includes days as a statistically significant indicator, but the coefficient is minimal at -0.0008837. Since it is multiplied by days, which is a negative number, a date that is farther off will raise the forecast error by a greater amount.

Do forecasts that are based on alternative data tend to be more/less accurate than similar forecasts that are not? If so, how would you quantify the difference?

Forecasts based on alternative tend to be more accurate. It is evident from the plots99 Boxplots on page 4 and from the linear model, where the coefficient is -0.2503091 for a value of “TRUE”.

Discuss at least two patterns that you think are important to understanding the nature of the relationship between the use of alternative data and the accuracy of analysts’ forecasts. Explain why these are important.

We have determined that alternative data is an important factor in evaluating forecast error, but it is useful to explore whether the use of alternative data differs by another factor. To judge this I visualized the interactions with plots and included interaction terms in the linear regresson model.
Most interactions are not statistically significant, but days:altTRUE has been included in the final model, which signifies that the effect of alternative data differs by days. In this case the coefficient is negative1010 days:altTRUE coefficent = negative 0.0007624, but it can’t be viewed independently because altTRUE is also included in the model. The net impact of alternative data is an improvement in forecast error.
Although I didn’t include analyst and industry in the linear model1111 100 analysts and 112 industries we can see from scatterplots that the effect of alternative data does differ.

Scatterplots of Industry and Analyst

If we could narrow down these effects to determine why there are differences the use of alternative data could be tailored.

Question 3: Provide a summary of your findings and guidance on what to do next.

It is clear that an investment in alternative data is beneficial. We can see both visually and statistically that it is an important indiactor of forecasting error, with its use leading to improved outcomes.

However, the model might be improved if additional data is analyzed

It is possible that the types of data available differ by industry, but unfortunately this dataset has 112 groups. It would be more beneficial to segment the data into the main eleven GICS sectors1212 Energy, Materials, Industrials, Consumer Discretionary, Consumer Staples, Health Care, Financials, Information Technology, Communication Services, Utilities, Real Estate. It would then be possible to examine which types of data are the most useful.
This dataset only evaluates the forecast error for one year. If alternative data helps provide some insight into the long term outlook for a company it would not be adequately captured. Future forecasts, price targets, and stock ratings could all change. These are all meaningful attributes and should be studied.
An analyst might not use alternative data to change earnings or ratings. They may use it in a research note to educate clients. That could lead to higher commissions. If it was possible to track client interactions1313 phone calls or meetings or commissions, before and after, that would be helpful.
An analyst might use alternative data to help educate them on a stock or industry before launching coverage. That would be difficult to capture without an analyst satisfaction variable.

R Code

Written in R Markdown

Import, simplify, and display the dataset

library(DT)
#Import as CSV
forecasts <- read.csv("~/UBS/forecasts.csv")
forecasts <-forecasts[,-1] #Remove row num column
colnames(forecasts)[5] <- "earnings" #simplify column name
colnames(forecasts)[6] <- "days" #simplify column name
colnames(forecasts)[7] <- "alt" #simplify column name
colnames(forecasts)[8] <- "news" #simplify column name
colnames(forecasts)[9] <- "report" #simplify column name
colnames(forecasts)[10] <- "fe1" #simplify column name
colnames(forecasts)[11] <- "fe2" #simplify column name
#Create Datatable
datatable(forecasts, filter="top", 
      options = list(pageLength = 10, scrollX=T) )

Histogram Fe1

library(tidyverse)
#histogram fe1
ggplot(data = forecasts) + 
  geom_histogram(mapping = aes(x=fe1
  ), binwidth=1,fill="red") + 
  xlab("forecaset error 1") +
  ggtitle("Histogram fe1") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))

Histogram Fe2

#histogram fe2
ggplot(data = forecasts) + 
  geom_histogram(mapping = aes(x=fe2
  ), binwidth=5,fill="red") + 
  xlab("forecaset error 2") +
  ggtitle("Histogram fe2") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))

Histogram Fe2(Outliers)

#histogram fe2 outliers
ggplot(data = forecasts) + 
  geom_histogram(mapping = aes(x=fe2
  ), binwidth=5,fill="red")+
  coord_cartesian(ylim=c(0,5))+
  xlab("forecast error 2") +
  ggtitle("Histogram fe2 (Outliers)") + 
  theme(plot.title = element_text(hjust = 0.5))+
   theme(axis.title=element_text(size=5))+
  theme(axis.text=element_text(size=5))+
  theme(plot.title = element_text(size=5))

Boxplots

library(gridExtra)

#Boxplot fe1
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1))+
  geom_boxplot(color="blue")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Boxplot fe1") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p1

#Boxplot Fe2
ggplot(data=forecasts, mapping = aes(x=alt, y=fe2))+
  geom_boxplot(color="blue")+ 
  xlab("alt data") +
  ylab("forecast error 2")+
  ggtitle("Boxplot fe2") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))  ->p2

grid.arrange(p1, p2, ncol = 2)

Boxplots by Factor

#Change values to factors
forecasts$currency = as.factor(forecasts$currency)
forecasts$report = ifelse(is.na(forecasts$report),
                          "FALSE","TRUE")
forecasts$report = as.logical(forecasts$report)
forecasts$industry = as.factor(forecasts$industry)
forecasts$issuer = as.factor(forecasts$issuer)
forecasts$analyst = as.factor(forecasts$analyst)

#Boxplot color by news
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=news))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by News") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p1

#Boxplot color by report
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=report))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by Report") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p2

#Boxplot color by currency
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=currency))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by Currency") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p3

#Make quartiles for dates
forecasts <- forecasts %>% mutate(quartile = 
                                    ntile(days, 4))
forecasts$quartile = as.factor(forecasts$quartile)

#Boxplot color by quartile
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=quartile))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by Days Quartile") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p4


grid.arrange(p1, p2, p3, p4, ncol = 2, nrow=2)

Scatterplot of days and fe1 by alt

ggplot(data=forecasts)+geom_point(mapping = 
                    aes(x=days, y=fe1, color=alt))+ 
  ylab("forecast error 1")+
  ggtitle("Fe1 by Days") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))

Initial Linear Regression Model

#LM model
lm.fit = lm(fe1~currency+days+alt+news+report+
              alt:news+alt:report+alt:days+
              alt:currency,data=forecasts)
summary(lm.fit)

Stepwise Selection

library(leaps)
nullmod = lm(fe1~1, data=forecasts) # Intercept only model
fullmod = lm(fe1~currency+days+alt+news+report+
               alt:news+alt:report+alt:days+
               alt:currency,data=forecasts) # all parameters
stepwisereg = step(nullmod, scope=list(lower=nullmod, 
                    upper=fullmod), direction='both')

Best Model

summary(stepwisereg)

Residual Diagnostics

# plot residuals
plot(stepwisereg, which=1)

Scatterplots of Industry and Analyst

#Scatter Analyst vs Fe1
ggplot(data=forecasts)+geom_point(mapping = 
                aes(x=analyst, y=fe1, color=alt))+
theme(axis.text.x=element_blank(),
      axis.ticks.x=element_blank())+ 
  ylab("forecast error 1")+
  ggtitle("Analyst vs. Fe1 by alt") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p1

#Boxplot Fe2
#Scatter Industry vs Fe1
ggplot(data=forecasts)+geom_point(mapping = 
                aes(x=industry, y=fe1, color=alt))+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+ 
  ylab("forecast error 1")+
  ggtitle("Industry vs. Fe1 by alt") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p2

grid.arrange(p1, p2, ncol = 2)