Data Exercise - Global Investment Bank

Paul Jozefek

2020-08-18


During the interview process with a global investment bank I received the following dataset and questions. The answers were meant to be brief (completed in a few hours) and the analysis reflects this restriction.


Question 1: Which normalization measure would you use as you investigate the relationship between the use of alternative data and the accuracy of analysts forecasts?

Data Exploration

I began by importing the dataset into R and simplifying the column names. The data table is displayed below:


Variable Description
analyst Identifies analyst
issuer Company variable
industry Identifies the company’s industry
currency Identifies the company’s currency
earnings What total earnings turned out to be
days The number of days between the forecast and earnings announcement
alt True if alternative data was used
news Analyst update during a period of extraordinary news flow
report Which quarter has just been announced relative to updated quarter
fe1 normalized measure of the difference between forecast and actual earnings, Data Scientist 1
fe2 normalized measure of the difference between forecast and actual earnings, Data Scientist 2

Visualizations

We can see from a histogram of forecast error 1 that the normalized errors are clustered near 0, with most below 5. The largest value is just below 21

We can see from a histogram of forecast error 1 that the normalized errors are clustered near 0, with most below 5. The largest value is just below 21

The normalized errors are also clustered near 0 for forecast error 2, but there are many extreme values, with a maximum near 1,861

The normalized errors are also clustered near 0 for forecast error 2, but there are many extreme values, with a maximum near 1,861

It is difficult to see the extreme values for forecast error 2, but we can zoom in by shortening the y-axis

y-axis shortened

y-axis shortened

 We can also see the extreme values for forecast 2 when viewed by alt data

We can also see the extreme values for forecast 2 when viewed by alt data


Conclusion to Question 1:The forecast error is supposed to be normalized1 example: % change. We would expect a very wide range of values if it was an absolute deviation, like a dollar amount.2 company earnings vary greatly In this case, forecast error 1 aligns more closely with a normalized value, and so that is what I will analyze.

Question 2: Build a model that describes the relationship between the use of alternative data and forecast error.

Before creating a model I changed report to FALSE if it was an NA or TRUE otherwise, added a column with days as a quartile3 the most negative as quartile 1, and changed currency, industry, issuer, and analyst to factors. I was then able to view how some of these factors impacted the forecast error and differed by alt.

We can see that there tends to be a noticeable improvement in forecast error when alternative data is used, and that in some instances the improvement varies by factor.

There doesn’t appear to be a very strong linear relationship between days and forecast error, but we can see a difference in effect by alternative data

There doesn't appear to be a very strong linear relationship between days and forecast error, but we can see a difference in effect by alternative data

Model

In constructing a model my objective was to analyze which factors influence forecast error 1, with particular attention on the impact from alternative data. I made the follwing assumptions/changes:


Call:
lm(formula = fe1 ~ currency + days + alt + news + report + alt:news + 
    alt:report + alt:days + alt:currency, data = forecasts)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7751 -0.4361 -0.2459  0.0667 20.3120 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.5296536  0.0547938   9.666  < 2e-16 ***
currency2          -0.0401229  0.1337458  -0.300 0.764200    
currency3          -0.0846681  0.0718474  -1.178 0.238701    
currency4          -0.1507086  0.1140637  -1.321 0.186500    
days               -0.0008398  0.0002503  -3.355 0.000803 ***
altTRUE            -0.2503118  0.0973847  -2.570 0.010201 *  
newsTRUE           -0.0127584  0.0447830  -0.285 0.775743    
reportTRUE         -0.1564258  0.0448914  -3.485 0.000499 ***
altTRUE:newsTRUE   -0.1023730  0.0795601  -1.287 0.198272    
altTRUE:reportTRUE  0.0893810  0.0806973   1.108 0.268108    
days:altTRUE       -0.0007374  0.0004719  -1.563 0.118228    
currency2:altTRUE   0.0254025  0.2990128   0.085 0.932302    
currency3:altTRUE   0.1033533  0.1133841   0.912 0.362078    
currency4:altTRUE   0.0868071  0.1757551   0.494 0.621401    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 3470 degrees of freedom
Multiple R-squared:  0.01789,   Adjusted R-squared:  0.01421 
F-statistic: 4.862 on 13 and 3470 DF,  p-value: 1.691e-08

Analysis of Model Output

We can see from the p-values4 many >0.05, which is the typical cutoff when evaluating statistical significance that most ceofficients are not significant. Instead of visually selecting which to remove, I used step-wise selection to find the best model.

Step-Wise Selection

Stepwise regression iteratively adds and removes predictors until the best model is found. It begins with only the intercept, adds the best variable5 as measured by the impact to R-squared, and continues to add variables in order of their impact. If a variable becomes insignificant as more variables are added6 the p-value becomes too high it will be removed. The process continues until the best model is found.

The process is presented below:

Start:  AIC=436.3
fe1 ~ 1

           Df Sum of Sq    RSS    AIC
+ days      1    32.910 3913.6 409.13
+ alt       1    15.735 3930.8 424.38
+ report    1    12.929 3933.6 426.87
+ news      1     4.279 3942.3 434.52
<none>                  3946.5 436.30
+ currency  3     4.289 3942.2 438.51

Step:  AIC=409.13
fe1 ~ days

           Df Sum of Sq    RSS    AIC
+ alt       1    13.103 3900.5 399.44
+ report    1    13.080 3900.5 399.46
<none>                  3913.6 409.13
+ news      1     2.135 3911.5 409.23
+ currency  3     3.515 3910.1 412.00
- days      1    32.910 3946.5 436.30

Step:  AIC=399.44
fe1 ~ days + alt

           Df Sum of Sq    RSS    AIC
+ report    1   12.9488 3887.6 389.86
+ days:alt  1    2.7464 3897.8 398.99
<none>                  3900.5 399.44
+ news      1    1.7893 3898.7 399.84
+ currency  3    2.3119 3898.2 403.38
- alt       1   13.1030 3913.6 409.13
- days      1   30.2786 3930.8 424.38

Step:  AIC=389.86
fe1 ~ days + alt + report

             Df Sum of Sq    RSS    AIC
+ days:alt    1    2.9610 3884.6 389.20
<none>                    3887.6 389.86
+ news        1    1.7942 3885.8 390.25
+ alt:report  1    1.5094 3886.1 390.50
+ currency    3    2.5756 3885.0 393.55
- report      1   12.9488 3900.5 399.44
- alt         1   12.9714 3900.5 399.46
- days        1   30.4351 3918.0 415.03

Step:  AIC=389.2
fe1 ~ days + alt + report + days:alt

             Df Sum of Sq    RSS    AIC
<none>                    3884.6 389.20
+ news        1    1.8338 3882.8 389.56
- days:alt    1    2.9610 3887.6 389.86
+ alt:report  1    1.4290 3883.2 389.92
+ currency    3    2.9570 3881.7 392.55
- report      1   13.1633 3897.8 398.99

Best Model


Call:
lm(formula = fe1 ~ days + alt + report + days:alt, data = forecasts)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7505 -0.4345 -0.2528  0.0770 20.3329 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.4919317  0.0477794  10.296  < 2e-16 ***
days         -0.0008837  0.0002476  -3.570 0.000362 ***
altTRUE      -0.2503091  0.0822265  -3.044 0.002351 ** 
reportTRUE   -0.1277128  0.0371961  -3.433 0.000603 ***
days:altTRUE -0.0007624  0.0004682  -1.628 0.103519    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 3479 degrees of freedom
Multiple R-squared:  0.01569,   Adjusted R-squared:  0.01456 
F-statistic: 13.86 on 4 and 3479 DF,  p-value: 3.193e-11

Residual Diagnostics

Unfortunately the model suffers from two big problems. A very low adjusted R-squared7 Adj. R-squared 0.01456, and poor residual diagnostics. The residuals should have a constant mean, a constant variance, and no clear pattern. We can see from the plot that this is not the case.

The residuals do not have constant variance

The residuals do not have constant variance

Questions

Do forecasts for outcomes that are far off in the future tend to be more/less accurate than for outcomes that are sooner? If so, how would you quantify the difference?

Do forecasts that are based on alternative data tend to be more/less accurate than similar forecasts that are not? If so, how would you quantify the difference?

Discuss at least two patterns that you think are important to understanding the nature of the relationship between the use of alternative data and the accuracy of analysts’ forecasts. Explain why these are important.

Scatterplots of Industry and Analyst

Scatterplots of Industry and Analyst

Question 3: Provide a summary of your findings and guidance on what to do next.

It is clear that an investment in alternative data is beneficial. We can see both visually and statistically that it is an important indiactor of forecasting error, with its use leading to improved outcomes.

However, the model might be improved if additional data is analyzed

R Code

Written in R Markdown

Import, simplify, and display the dataset

library(DT)
#Import as CSV
forecasts <- read.csv("~/UBS/forecasts.csv")
forecasts <-forecasts[,-1] #Remove row num column
colnames(forecasts)[5] <- "earnings" #simplify column name
colnames(forecasts)[6] <- "days" #simplify column name
colnames(forecasts)[7] <- "alt" #simplify column name
colnames(forecasts)[8] <- "news" #simplify column name
colnames(forecasts)[9] <- "report" #simplify column name
colnames(forecasts)[10] <- "fe1" #simplify column name
colnames(forecasts)[11] <- "fe2" #simplify column name
#Create Datatable
datatable(forecasts, filter="top", 
      options = list(pageLength = 10, scrollX=T) )


Histogram Fe1

library(tidyverse)
#histogram fe1
ggplot(data = forecasts) + 
  geom_histogram(mapping = aes(x=fe1
  ), binwidth=1,fill="red") + 
  xlab("forecaset error 1") +
  ggtitle("Histogram fe1") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))


Histogram Fe2

#histogram fe2
ggplot(data = forecasts) + 
  geom_histogram(mapping = aes(x=fe2
  ), binwidth=5,fill="red") + 
  xlab("forecaset error 2") +
  ggtitle("Histogram fe2") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))


Histogram Fe2(Outliers)

#histogram fe2 outliers
ggplot(data = forecasts) + 
  geom_histogram(mapping = aes(x=fe2
  ), binwidth=5,fill="red")+
  coord_cartesian(ylim=c(0,5))+
  xlab("forecast error 2") +
  ggtitle("Histogram fe2 (Outliers)") + 
  theme(plot.title = element_text(hjust = 0.5))+
   theme(axis.title=element_text(size=5))+
  theme(axis.text=element_text(size=5))+
  theme(plot.title = element_text(size=5))


Boxplots

library(gridExtra)

#Boxplot fe1
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1))+
  geom_boxplot(color="blue")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Boxplot fe1") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p1

#Boxplot Fe2
ggplot(data=forecasts, mapping = aes(x=alt, y=fe2))+
  geom_boxplot(color="blue")+ 
  xlab("alt data") +
  ylab("forecast error 2")+
  ggtitle("Boxplot fe2") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))  ->p2

grid.arrange(p1, p2, ncol = 2)


Boxplots by Factor

#Change values to factors
forecasts$currency = as.factor(forecasts$currency)
forecasts$report = ifelse(is.na(forecasts$report),
                          "FALSE","TRUE")
forecasts$report = as.logical(forecasts$report)
forecasts$industry = as.factor(forecasts$industry)
forecasts$issuer = as.factor(forecasts$issuer)
forecasts$analyst = as.factor(forecasts$analyst)

#Boxplot color by news
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=news))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by News") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p1

#Boxplot color by report
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=report))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by Report") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p2

#Boxplot color by currency
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=currency))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by Currency") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p3

#Make quartiles for dates
forecasts <- forecasts %>% mutate(quartile = 
                                    ntile(days, 4))
forecasts$quartile = as.factor(forecasts$quartile)

#Boxplot color by quartile
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1, 
                                     color=quartile))+
  geom_boxplot(fill="peachpuff")+ 
  xlab("alt data") +
  ylab("forecast error 1")+
  ggtitle("Color by Days Quartile") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p4


grid.arrange(p1, p2, p3, p4, ncol = 2, nrow=2)


Scatterplot of days and fe1 by alt

ggplot(data=forecasts)+geom_point(mapping = 
                    aes(x=days, y=fe1, color=alt))+ 
  ylab("forecast error 1")+
  ggtitle("Fe1 by Days") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))


Initial Linear Regression Model

#LM model
lm.fit = lm(fe1~currency+days+alt+news+report+
              alt:news+alt:report+alt:days+
              alt:currency,data=forecasts)
summary(lm.fit)


Stepwise Selection

library(leaps)
nullmod = lm(fe1~1, data=forecasts) # Intercept only model
fullmod = lm(fe1~currency+days+alt+news+report+
               alt:news+alt:report+alt:days+
               alt:currency,data=forecasts) # all parameters
stepwisereg = step(nullmod, scope=list(lower=nullmod, 
                    upper=fullmod), direction='both')


Best Model

summary(stepwisereg)


Residual Diagnostics

# plot residuals
plot(stepwisereg, which=1)


Scatterplots of Industry and Analyst

#Scatter Analyst vs Fe1
ggplot(data=forecasts)+geom_point(mapping = 
                aes(x=analyst, y=fe1, color=alt))+
theme(axis.text.x=element_blank(),
      axis.ticks.x=element_blank())+ 
  ylab("forecast error 1")+
  ggtitle("Analyst vs. Fe1 by alt") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p1

#Boxplot Fe2
#Scatter Industry vs Fe1
ggplot(data=forecasts)+geom_point(mapping = 
                aes(x=industry, y=fe1, color=alt))+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+ 
  ylab("forecast error 1")+
  ggtitle("Industry vs. Fe1 by alt") + 
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.title = element_text(size=10))->p2

grid.arrange(p1, p2, ncol = 2)