During the interview process with a global investment bank I received the following dataset and questions. The answers were meant to be brief (completed in a few hours) and the analysis reflects this restriction.
Question 1: Which normalization measure would you use as you investigate the relationship between the use of alternative data and the accuracy of analysts forecasts?
Data Exploration
I began by importing the dataset into R and simplifying the column names. The data table is displayed below:
| Variable | Description |
|---|---|
| analyst | Identifies analyst |
| issuer | Company variable |
| industry | Identifies the company’s industry |
| currency | Identifies the company’s currency |
| earnings | What total earnings turned out to be |
| days | The number of days between the forecast and earnings announcement |
| alt | True if alternative data was used |
| news | Analyst update during a period of extraordinary news flow |
| report | Which quarter has just been announced relative to updated quarter |
| fe1 | normalized measure of the difference between forecast and actual earnings, Data Scientist 1 |
| fe2 | normalized measure of the difference between forecast and actual earnings, Data Scientist 2 |
Visualizations
We can see from a histogram of forecast error 1 that the normalized errors are clustered near 0, with most below 5. The largest value is just below 21
The normalized errors are also clustered near 0 for forecast error 2, but there are many extreme values, with a maximum near 1,861
It is difficult to see the extreme values for forecast error 2, but we can zoom in by shortening the y-axis
y-axis shortened
We can also see the extreme values for forecast 2 when viewed by alt data
Conclusion to Question 1:The forecast error is supposed to be normalized1 example: % change. We would expect a very wide range of values if it was an absolute deviation, like a dollar amount.2 company earnings vary greatly In this case, forecast error 1 aligns more closely with a normalized value, and so that is what I will analyze.
Question 2: Build a model that describes the relationship between the use of alternative data and forecast error.
Before creating a model I changed report to FALSE if it was an NA or TRUE otherwise, added a column with days as a quartile3 the most negative as quartile 1, and changed currency, industry, issuer, and analyst to factors. I was then able to view how some of these factors impacted the forecast error and differed by alt.
We can see that there tends to be a noticeable improvement in forecast error when alternative data is used, and that in some instances the improvement varies by factor.
There doesn’t appear to be a very strong linear relationship between days and forecast error, but we can see a difference in effect by alternative data
Model
In constructing a model my objective was to analyze which factors influence forecast error 1, with particular attention on the impact from alternative data. I made the follwing assumptions/changes:
I changed report to a boolean “True” or “False”. I thought that was more relevant than the current format.
I converted the qualitative variables to factors.
For my initial simple model I excluded analyst, industry, and issuer. There were too many different levels.
I did not evaluate all possible interactions. I only evaluated the interactions with alternative data.
Call:
lm(formula = fe1 ~ currency + days + alt + news + report + alt:news +
alt:report + alt:days + alt:currency, data = forecasts)
Residuals:
Min 1Q Median 3Q Max
-0.7751 -0.4361 -0.2459 0.0667 20.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5296536 0.0547938 9.666 < 2e-16 ***
currency2 -0.0401229 0.1337458 -0.300 0.764200
currency3 -0.0846681 0.0718474 -1.178 0.238701
currency4 -0.1507086 0.1140637 -1.321 0.186500
days -0.0008398 0.0002503 -3.355 0.000803 ***
altTRUE -0.2503118 0.0973847 -2.570 0.010201 *
newsTRUE -0.0127584 0.0447830 -0.285 0.775743
reportTRUE -0.1564258 0.0448914 -3.485 0.000499 ***
altTRUE:newsTRUE -0.1023730 0.0795601 -1.287 0.198272
altTRUE:reportTRUE 0.0893810 0.0806973 1.108 0.268108
days:altTRUE -0.0007374 0.0004719 -1.563 0.118228
currency2:altTRUE 0.0254025 0.2990128 0.085 0.932302
currency3:altTRUE 0.1033533 0.1133841 0.912 0.362078
currency4:altTRUE 0.0868071 0.1757551 0.494 0.621401
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.057 on 3470 degrees of freedom
Multiple R-squared: 0.01789, Adjusted R-squared: 0.01421
F-statistic: 4.862 on 13 and 3470 DF, p-value: 1.691e-08
Analysis of Model Output
We can see from the p-values4 many >0.05, which is the typical cutoff when evaluating statistical significance that most ceofficients are not significant. Instead of visually selecting which to remove, I used step-wise selection to find the best model.
Step-Wise Selection
Stepwise regression iteratively adds and removes predictors until the best model is found. It begins with only the intercept, adds the best variable5 as measured by the impact to R-squared, and continues to add variables in order of their impact. If a variable becomes insignificant as more variables are added6 the p-value becomes too high it will be removed. The process continues until the best model is found.
The process is presented below:
Start: AIC=436.3
fe1 ~ 1
Df Sum of Sq RSS AIC
+ days 1 32.910 3913.6 409.13
+ alt 1 15.735 3930.8 424.38
+ report 1 12.929 3933.6 426.87
+ news 1 4.279 3942.3 434.52
<none> 3946.5 436.30
+ currency 3 4.289 3942.2 438.51
Step: AIC=409.13
fe1 ~ days
Df Sum of Sq RSS AIC
+ alt 1 13.103 3900.5 399.44
+ report 1 13.080 3900.5 399.46
<none> 3913.6 409.13
+ news 1 2.135 3911.5 409.23
+ currency 3 3.515 3910.1 412.00
- days 1 32.910 3946.5 436.30
Step: AIC=399.44
fe1 ~ days + alt
Df Sum of Sq RSS AIC
+ report 1 12.9488 3887.6 389.86
+ days:alt 1 2.7464 3897.8 398.99
<none> 3900.5 399.44
+ news 1 1.7893 3898.7 399.84
+ currency 3 2.3119 3898.2 403.38
- alt 1 13.1030 3913.6 409.13
- days 1 30.2786 3930.8 424.38
Step: AIC=389.86
fe1 ~ days + alt + report
Df Sum of Sq RSS AIC
+ days:alt 1 2.9610 3884.6 389.20
<none> 3887.6 389.86
+ news 1 1.7942 3885.8 390.25
+ alt:report 1 1.5094 3886.1 390.50
+ currency 3 2.5756 3885.0 393.55
- report 1 12.9488 3900.5 399.44
- alt 1 12.9714 3900.5 399.46
- days 1 30.4351 3918.0 415.03
Step: AIC=389.2
fe1 ~ days + alt + report + days:alt
Df Sum of Sq RSS AIC
<none> 3884.6 389.20
+ news 1 1.8338 3882.8 389.56
- days:alt 1 2.9610 3887.6 389.86
+ alt:report 1 1.4290 3883.2 389.92
+ currency 3 2.9570 3881.7 392.55
- report 1 13.1633 3897.8 398.99
Best Model
Call:
lm(formula = fe1 ~ days + alt + report + days:alt, data = forecasts)
Residuals:
Min 1Q Median 3Q Max
-0.7505 -0.4345 -0.2528 0.0770 20.3329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4919317 0.0477794 10.296 < 2e-16 ***
days -0.0008837 0.0002476 -3.570 0.000362 ***
altTRUE -0.2503091 0.0822265 -3.044 0.002351 **
reportTRUE -0.1277128 0.0371961 -3.433 0.000603 ***
days:altTRUE -0.0007624 0.0004682 -1.628 0.103519
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.057 on 3479 degrees of freedom
Multiple R-squared: 0.01569, Adjusted R-squared: 0.01456
F-statistic: 13.86 on 4 and 3479 DF, p-value: 3.193e-11
Residual Diagnostics
Unfortunately the model suffers from two big problems. A very low adjusted R-squared7 Adj. R-squared 0.01456, and poor residual diagnostics. The residuals should have a constant mean, a constant variance, and no clear pattern. We can see from the plot that this is not the case.
The residuals do not have constant variance
Questions
Do forecasts for outcomes that are far off in the future tend to be more/less accurate than for outcomes that are sooner? If so, how would you quantify the difference?
days vs. fe18 Figure 5 that the relationship isn’t strong. The best linear model includes days as a statistically significant indicator, but the coefficient is minimal at -0.0008837. Since it is multiplied by days, which is a negative number, a date that is farther off will raise the forecast error by a greater amount.Do forecasts that are based on alternative data tend to be more/less accurate than similar forecasts that are not? If so, how would you quantify the difference?
Discuss at least two patterns that you think are important to understanding the nature of the relationship between the use of alternative data and the accuracy of analysts’ forecasts. Explain why these are important.
We have determined that alternative data is an important factor in evaluating forecast error, but it is useful to explore whether the use of alternative data differs by another factor. To judge this I visualized the interactions with plots and included interaction terms in the linear regresson model.
Most interactions are not statistically significant, but days:altTRUE has been included in the final model, which signifies that the effect of alternative data differs by days. In this case the coefficient is negative10 days:altTRUE coefficent = negative 0.0007624, but it can’t be viewed independently because altTRUE is also included in the model. The net impact of alternative data is an improvement in forecast error.
Although I didn’t include analyst and industry in the linear model11 100 analysts and 112 industries we can see from scatterplots that the effect of alternative data does differ.
Scatterplots of Industry and Analyst
Question 3: Provide a summary of your findings and guidance on what to do next.
It is clear that an investment in alternative data is beneficial. We can see both visually and statistically that it is an important indiactor of forecasting error, with its use leading to improved outcomes.
However, the model might be improved if additional data is analyzed
It is possible that the types of data available differ by industry, but unfortunately this dataset has 112 groups. It would be more beneficial to segment the data into the main eleven GICS sectors12 Energy, Materials, Industrials, Consumer Discretionary, Consumer Staples, Health Care, Financials, Information Technology, Communication Services, Utilities, Real Estate. It would then be possible to examine which types of data are the most useful.
This dataset only evaluates the forecast error for one year. If alternative data helps provide some insight into the long term outlook for a company it would not be adequately captured. Future forecasts, price targets, and stock ratings could all change. These are all meaningful attributes and should be studied.
An analyst might not use alternative data to change earnings or ratings. They may use it in a research note to educate clients. That could lead to higher commissions. If it was possible to track client interactions13 phone calls or meetings or commissions, before and after, that would be helpful.
An analyst might use alternative data to help educate them on a stock or industry before launching coverage. That would be difficult to capture without an analyst satisfaction variable.
R Code
Written in R Markdown
Import, simplify, and display the dataset
library(DT)
#Import as CSV
forecasts <- read.csv("~/UBS/forecasts.csv")
forecasts <-forecasts[,-1] #Remove row num column
colnames(forecasts)[5] <- "earnings" #simplify column name
colnames(forecasts)[6] <- "days" #simplify column name
colnames(forecasts)[7] <- "alt" #simplify column name
colnames(forecasts)[8] <- "news" #simplify column name
colnames(forecasts)[9] <- "report" #simplify column name
colnames(forecasts)[10] <- "fe1" #simplify column name
colnames(forecasts)[11] <- "fe2" #simplify column name
#Create Datatable
datatable(forecasts, filter="top",
options = list(pageLength = 10, scrollX=T) )
Histogram Fe1
library(tidyverse)
#histogram fe1
ggplot(data = forecasts) +
geom_histogram(mapping = aes(x=fe1
), binwidth=1,fill="red") +
xlab("forecaset error 1") +
ggtitle("Histogram fe1") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))
Histogram Fe2
#histogram fe2
ggplot(data = forecasts) +
geom_histogram(mapping = aes(x=fe2
), binwidth=5,fill="red") +
xlab("forecaset error 2") +
ggtitle("Histogram fe2") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))
Histogram Fe2(Outliers)
#histogram fe2 outliers
ggplot(data = forecasts) +
geom_histogram(mapping = aes(x=fe2
), binwidth=5,fill="red")+
coord_cartesian(ylim=c(0,5))+
xlab("forecast error 2") +
ggtitle("Histogram fe2 (Outliers)") +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.title=element_text(size=5))+
theme(axis.text=element_text(size=5))+
theme(plot.title = element_text(size=5))
Boxplots
library(gridExtra)
#Boxplot fe1
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1))+
geom_boxplot(color="blue")+
xlab("alt data") +
ylab("forecast error 1")+
ggtitle("Boxplot fe1") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p1
#Boxplot Fe2
ggplot(data=forecasts, mapping = aes(x=alt, y=fe2))+
geom_boxplot(color="blue")+
xlab("alt data") +
ylab("forecast error 2")+
ggtitle("Boxplot fe2") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10)) ->p2
grid.arrange(p1, p2, ncol = 2)
Boxplots by Factor
#Change values to factors
forecasts$currency = as.factor(forecasts$currency)
forecasts$report = ifelse(is.na(forecasts$report),
"FALSE","TRUE")
forecasts$report = as.logical(forecasts$report)
forecasts$industry = as.factor(forecasts$industry)
forecasts$issuer = as.factor(forecasts$issuer)
forecasts$analyst = as.factor(forecasts$analyst)
#Boxplot color by news
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1,
color=news))+
geom_boxplot(fill="peachpuff")+
xlab("alt data") +
ylab("forecast error 1")+
ggtitle("Color by News") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p1
#Boxplot color by report
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1,
color=report))+
geom_boxplot(fill="peachpuff")+
xlab("alt data") +
ylab("forecast error 1")+
ggtitle("Color by Report") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p2
#Boxplot color by currency
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1,
color=currency))+
geom_boxplot(fill="peachpuff")+
xlab("alt data") +
ylab("forecast error 1")+
ggtitle("Color by Currency") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p3
#Make quartiles for dates
forecasts <- forecasts %>% mutate(quartile =
ntile(days, 4))
forecasts$quartile = as.factor(forecasts$quartile)
#Boxplot color by quartile
ggplot(data=forecasts, mapping = aes(x=alt, y=fe1,
color=quartile))+
geom_boxplot(fill="peachpuff")+
xlab("alt data") +
ylab("forecast error 1")+
ggtitle("Color by Days Quartile") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p4
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow=2)
Scatterplot of days and fe1 by alt
ggplot(data=forecasts)+geom_point(mapping =
aes(x=days, y=fe1, color=alt))+
ylab("forecast error 1")+
ggtitle("Fe1 by Days") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))
Initial Linear Regression Model
#LM model
lm.fit = lm(fe1~currency+days+alt+news+report+
alt:news+alt:report+alt:days+
alt:currency,data=forecasts)
summary(lm.fit)
Stepwise Selection
library(leaps)
nullmod = lm(fe1~1, data=forecasts) # Intercept only model
fullmod = lm(fe1~currency+days+alt+news+report+
alt:news+alt:report+alt:days+
alt:currency,data=forecasts) # all parameters
stepwisereg = step(nullmod, scope=list(lower=nullmod,
upper=fullmod), direction='both')
Best Model
summary(stepwisereg)
Residual Diagnostics
# plot residuals
plot(stepwisereg, which=1)
Scatterplots of Industry and Analyst
#Scatter Analyst vs Fe1
ggplot(data=forecasts)+geom_point(mapping =
aes(x=analyst, y=fe1, color=alt))+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
ylab("forecast error 1")+
ggtitle("Analyst vs. Fe1 by alt") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p1
#Boxplot Fe2
#Scatter Industry vs Fe1
ggplot(data=forecasts)+geom_point(mapping =
aes(x=industry, y=fe1, color=alt))+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
ylab("forecast error 1")+
ggtitle("Industry vs. Fe1 by alt") +
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(size=10))->p2
grid.arrange(p1, p2, ncol = 2)