The S and P and Amazon for Regression

Let’s summarize the data with a summary of the active data set and the numerical summaries.

> summary(Stocks)

      AMZN               X.GSPC                 WF           
 Min.   :-0.021180   Min.   :-1.502e-02   Min.   :-0.046349  
 1st Qu.:-0.007817   1st Qu.:-4.201e-03   1st Qu.:-0.010938  
 Median :-0.001385   Median :-5.807e-05   Median :-0.003903  
 Mean   :-0.001811   Mean   :-1.754e-04   Mean   :-0.002527  
 3rd Qu.: 0.004518   3rd Qu.: 2.504e-03   3rd Qu.: 0.006061  
 Max.   : 0.031468   Max.   : 2.514e-02   Max.   : 0.074884  
      YHOO           
 Min.   :-0.0388915  
 1st Qu.:-0.0081726  
 Median :-0.0009268  
 Mean   :-0.0011672  
 3rd Qu.: 0.0068134  
 Max.   : 0.0335509

> library(abind, pos=29)
> library(e1071, pos=30)


Attaching package: 'e1071'

The following objects are masked _by_ 'package:timeDate':

    kurtosis, skewness

> numSummary(Stocks[,c("AMZN", "WF", "X.GSPC", "YHOO")], statistics=c("mean", 
+   "sd", "se(mean)", "IQR"), quantiles=c(0,.25,.5,.75,1))

                mean          sd     se(mean)         IQR  n
AMZN   -0.0018112658 0.009058656 0.0010675728 0.012335107 72
WF     -0.0025267544 0.019386591 0.0022847317 0.016998839 72
X.GSPC -0.0001754316 0.006164777 0.0007265259 0.006705006 72
YHOO   -0.0011671601 0.014195213 0.0016729219 0.014985969 72

First, a scatterplot of the data with the least squares line (options). We want the S & P on the x axis and Amazon on the y.

> scatterplot(AMZN~X.GSPC, reg.line=lm, smooth=FALSE, spread=FALSE, 
+   boxplots=FALSE, span=0.5, ellipse=FALSE, levels=c(.5, .9), 
+   xlab="Daily Returns (S&P)", ylab="Daily Returns (AMZN)", data=Stocks)

To see all of the data, we can examine a scatterplot matrix. The rows of the scatterplot matrix determine the y axis and the columns determine the x axis. For example, our single plot of AMZN on the y and the S and P on the x is the third plot in the top row. We will estimate three regressions, each one representing a scatterplot in the X.GSPC column starting with Amazon.

> scatterplotMatrix(~AMZN+WF+X.GSPC+YHOO, reg.line=lm, smooth=FALSE, 
+   spread=FALSE, span=0.5, ellipse=FALSE, levels=c(.5, .9), id.n=0, diagonal = 
+   'density', data=Stocks)

We can use a regression to tell us the characteristics of the line. To estimate a regression, we need statistics, fit models, linear regression, then the S and P as the explanatory variable and Amazon as the response.

> AMZN.Reg <- lm(AMZN~X.GSPC, data=Stocks)
> summary(AMZN.Reg)


Call:
lm(formula = AMZN ~ X.GSPC, data = Stocks)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0179448 -0.0032506  0.0003742  0.0040905  0.0160188 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0016364  0.0007903  -2.071   0.0421 *  
X.GSPC       0.9967614  0.1290441   7.724 5.88e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.006703 on 70 degrees of freedom
Multiple R-squared:  0.4601,    Adjusted R-squared:  0.4524 
F-statistic: 59.66 on 1 and 70 DF,  p-value: 5.881e-11

The regression output contains a basic five number summary of the residuals under Residuals:. Below the table of coefficients, we get the standard error of the residuals. How much variation is there in the leftovers? We are also given R-squared. That’s the amount of variance explained. Of the total variance in Amazon, how much covaries with the S and P? Here, that answer is 46%.

The coefficients give us the details of the regression. The estimate is the point estimate; the single best numerical guess. The standard error represents the uncertainty [the standard deviation] of the slope and/or intercept. We are also given a t-statistic that represents:

\[ t = \frac{Estimate - 0}{Std. Error} \]

The reported t measures the number of standard errors between the estimate and zero; the Pr(|t|) measures the two-sided probability that the estimate is zero. But everything involving t is only valid if the residuals from the regression model are normal. Let’s now turn to that.

In the R Commander, to isolate the residuals, we go to Models then to Add observation statistics to data. We will add the residuals and the fitted values to the data.

> Stocks<- within(Stocks, {
+   fitted.AMZN.Reg <- fitted(AMZN.Reg)
+   residuals.AMZN.Reg <- residuals(AMZN.Reg) 
+ })

Let me construct a plot to show this. The black dots are just the data of the scatterplot – the true values. The blue dots are the fitted values. For each daily return, they are the intercept plus the slope times that day’s return for the S and P 500. The red arrows show the residual. The difference between the fitted and actual values.

> plot(y=Stocks$AMZN, x=Stocks$X.GSPC, xlab="Pct. Chg. in S and P", ylab="Pct. Chg. in AMZN")
> points(Stocks$X.GSPC, with(Stocks, lm(AMZN~X.GSPC)$fitted.values), col="blue")
> arrows(Stocks$X.GSPC, with(Stocks, lm(AMZN~X.GSPC)$fitted.values), Stocks$X.GSPC, Stocks$AMZN, col="red", length=0.1)

Are these residuals normal? Let’s look at them in a density plot.

> densityPlot( ~ residuals.AMZN.Reg, data=Stocks, bw="SJ", adjust=1, 
+   kernel="gaussian")

The could be normal, though the left tail [the lower tail] is longer than the right. There is a single peak but it may not be symmetric. We can also compare it to a normal distribution using the quantile comparison plot.

> with(Stocks, qqPlot(residuals.AMZN.Reg, dist="norm", id.method="y", id.n=2, 
+   labels=rownames(Stocks)))

2016-07-27 2016-09-02 
         1          2

If the residuals were perfectly normal, every dot would be on the solid line. If the residuals are plausibly normal, almost all of the dots should fall inside the dashed confidence interval. Here, in the left end, there are a few that are outside the bounds. This is the longer left tail we saw in the density plot. These values are farther than they should be in a normal. First, a diversion on to the actual plot.

> with(Stocks, qqPlot(residuals.AMZN.Reg, dist="norm", id.method="y", id.n=2, 
+   labels=rownames(Stocks)))

2016-07-27 2016-09-02 
         1          2

But even this is not definitive; five points in every 100 should be outside of the lines with 95% confidence. That’s what it means. We want something quantitative. We have tests for this. If you have the R Commander Plugin HH active, you can use quantile comparison plot with test. Or we can draw the plot, look at it, and then examine the normality of the residuals using statistics then summaries then tests of normality. We will use the Shapiro-Wilk test first; the null hypothesis is that the residuals are normal. The alternative is non-normal. If the p-value is greater than 0.05 or 0.1, then the residuals are workably normal. Let’s see….

> library(nortest, pos=31)
> with(Stocks, shapiro.test(residuals.AMZN.Reg))


    Shapiro-Wilk normality test

data:  residuals.AMZN.Reg
W = 0.97514, p-value = 0.1645

These are plausibly normal. That means that we can utilize the t distribution for the estimates and generate predictions for the average and the outcomes.

The confidence intervals for the slope and intercept are in Models then Confidence Intervals….

How do we interpret them?

> Confint(AMZN.Reg, level=0.95)

                Estimate        2.5 %        97.5 %
(Intercept) -0.001636402 -0.003212623 -6.018179e-05
X.GSPC       0.996761360  0.739391053  1.254132e+00

If the S and P is unchanged, with 95% confidence, Amazon’s daily returns should range between -0.0032 and -0.00006018. With 95% confidence, Amazon loses money, on average, when the broader market is unchanged. For each one percent daily change in the S and P 500, Amazon’s daily returns increase by between 0.739 and 1.254 percent. It could be 1. It surely cannot be zero.

The final common use of the tool is prediction. What happens to Amazon if the S and P increases by 1 percent? Let’s use Models and Prediction Intervals to calculate it.

> .NewData <- data.frame(X.GSPC=0.01, row.names="1")
> .NewData  # Newdata

  X.GSPC
1   0.01

> predict(AMZN.Reg, newdata=.NewData, interval="confidence", level=.95, 
+   se.fit=FALSE)

          fit         lwr        upr
1 0.008331211 0.005274934 0.01138749

The confidence interval gives us an interval for the average. With 95% confidence, if the S and P increases by one percent in a day, then Amazon will increase, on average, between 0.0052 and 0.0114. Or one half to 1.14 percent. On average.

But normal residuals imply that the data, themselves, are normal. This allows us to predict the average, and to predict the full range of outcomes. What will happen to Amazon if the S and P increases by 1 percent?

> .NewData <- data.frame(X.GSPC=0.01, row.names="1")
> .NewData  # Newdata

  X.GSPC
1   0.01

> predict(AMZN.Reg, newdata=.NewData, interval="prediction", level=.95, 
+   se.fit=FALSE)

          fit          lwr        upr
1 0.008331211 -0.005382868 0.02204529

With 95% confidence, if the S and P increases by one percent, Amazon could drop by one half of one percent or it could gain 2.2 percent. That’s not on average, that’s overall. Though it averages positive returns with a 1 percent increase in the S and P, there will be days that it loses. As much as one half of one percent when the broader market is up 1 percent.

All this is, of course, given these data and past performance does not guarantee future returns.

The S and P and Amazon for Regression

Your Name

AUTOMATIC