Reading Stocks

My R code for running the analysis for all assignments follows. The URL for downloading a stock's data is determined solely by the stock symbol and a few fixed parameters. I normally just edit the URL, but I thought you might find it easier to fill in the parameters below.

So, for any stock, I can read in five years of data ending at any given date. Remember, the rows arrive ordered from most recent to oldest, so sort them.

Unit 1 Assignment

baseURL="http://chart.finance.yahoo.com/table.csv?"
stock="F"  #stock symbol
sm=0       #start month, 0==Jan
sd=13      #start day
sy=2012    #start year
em=0       #end month
ed=12      #end day
ey=2017    #end year
freq="d"   #frequency
#I will paste this together, so that you can just edit the parameters above.  

mystock=paste(baseURL,"s=",stock,"&a=",sm,"&b=",sd,"&c=",sy,"&d=",em,"&e=",ed,"&f=",ey,"&g=",freq,"&ignore=.csv",sep="")
b=read.csv(mystock)  #read in the data
b=b[order(as.Date(b$Date, format="%Y-%m-%d")),]  #sort the dataframe by date
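
If you pull several stocks, the paste step above can be wrapped in a small function. The sketch below is only an illustration: build_stock_url is a hypothetical helper name (not part of the original code), and the toy data frame is made up to show the sorting step without touching the network. It only builds the string; it does not check that the endpoint responds.

```r
# Hypothetical helper: wraps the paste() call above so only the parameters change.
build_stock_url <- function(stock, sm, sd, sy, em, ed, ey, freq = "d",
                            baseURL = "http://chart.finance.yahoo.com/table.csv?") {
  paste(baseURL, "s=", stock, "&a=", sm, "&b=", sd, "&c=", sy,
        "&d=", em, "&e=", ed, "&f=", ey, "&g=", freq, "&ignore=.csv", sep = "")
}

# Same parameters as the Ford example above
url_F <- build_stock_url("F", 0, 13, 2012, 0, 12, 2017)

# Made-up rows, newest first, to show the sort by date
toy <- data.frame(Date = c("2017-01-12", "2017-01-11", "2017-01-10"),
                  Adj.Close = c(3, 2, 1))
toy <- toy[order(as.Date(toy$Date, format = "%Y-%m-%d")), ]
```

After the sort, the oldest observation is in the first row, which matters later when the first observation is used as a baseline.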

Plots and Descriptives

It is trivial to have all of the plots and descriptives run at once.

library(psych)    #describe() comes from the psych package
par(mfrow=c(2,2)) #set up the graph space as 2 x 2
boxplot(b$Adj.Close, horizontal=T, xlab="Adj. Close $", main=stock)
hist(b$Adj.Close, xlab="Adj. Close $", main=stock)
describe(b$Adj.Close)
##    vars    n  mean   sd median trimmed  mad min   max range  skew kurtosis
## X1    1 1258 12.35 1.98  12.82   12.54 1.82 7.4 15.52  8.12 -0.76    -0.26
##      se
## X1 0.06
plot(b$Adj.Close~seq(1:length(b$Adj.Close)), type="l", xlab="Observation #", main=stock)

Just looking at the boxplot and histogram, you can see that this stock, Ford, has noticeable negative skewness. It has a small range and is asymmetric. The line plot of Adjusted Close by sequence number (a time plot) is revealing. During the early observations, the whole auto industry was struggling. It was not until May 2012 that Ford’s bonds were upgraded from junk bond status, so it was still struggling to gain its footing. After this time, the stock has performed steadily. Since it is a dividend-producing stock, this makes sense. We would expect dividend-producing stocks to have less volatility, because investors pay for the dividend, not for potential increases in the stock price.

Looking at the statistics, we see evidence of negative skewness, as the mean ($12.35) is lower than the median ($12.82). (Also, the skewness statistic is negative.) We can see that the standard deviation (s) is $1.98. The coefficient of variation helps us understand the size of the standard deviation relative to the mean by dividing the standard deviation by the mean: ($1.98 / $12.35) x 100 = 16.0%. We could use this value to compare the “riskiness” of investments.
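
The coefficient of variation is a one-liner in R. This is a minimal sketch: cv is a hypothetical helper name, the prices vector is made up, and the Ford figure simply reuses the rounded mean and sd from the describe() output above.

```r
# Hypothetical helper: coefficient of variation, in percent
cv <- function(x) 100 * sd(x) / mean(x)

prices <- c(10, 12, 11, 13, 14)   # made-up prices, for illustration only
cv_prices <- cv(prices)

# From the rounded describe() output above (sd = 1.98, mean = 12.35)
cv_ford <- 100 * 1.98 / 12.35
round(cv_ford, 1)
```

Because the CV is unitless, it can be compared across stocks that trade at very different price levels.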

From a business decision-making perspective, we have some initial evidence that Ford adjusted closing prices have stabilized over the long term. Perhaps the stability of the adjusted closing price over the last few years shows that Ford has recovered reasonably well from its junk bond experience. While additional research is warranted before adding it to a business (or personal) portfolio, we have no reason to exclude Ford, particularly if we are looking for stable, dividend-producing investments. (The recent dividend yields are pretty good.)

Unit 2 Assignment

Confidence Interval

In this unit, we simply create a 95% CI for the population mean. Recall that the formula for the confidence interval based on the t-distribution is as follows. \[\overline{x}\pm t_{(df,.975)}\times {\frac s{\sqrt{n}}} \]

ci=t.test(b$Adj.Close, conf.level=.95)$conf.int  #the argument is conf.level, not level
result=c(ci[1], ci[2])
names(result)=c("lower","upper")
result
##    lower    upper 
## 12.24057 12.45949

We would expect 95% of the intervals constructed using the same sample size from the same population to contain the true population mean. If this interval is one of those 95%, then the true population mean is between $12.24 and $12.46. It seems reasonable that we could pin down the mean stock price to an interval only $.22 wide. If the interval here were large, we would have another indicator of risk. We also might be interested in using this mean and the associated variance to generate simulations of stock returns in portfolio analysis. And fundamentally, the interval just tells us what we should expect. Be careful here: this interval is constructed from a sequential sample. If there are dependencies over time, then using the interval alone is dangerous. If we stuck with stable data (e.g., Ford after the junk bond status), this would be better!
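
To see that t.test() is doing nothing more than the formula above, the interval can be built by hand. This is a sketch on simulated data (the vector x is made up, not the Ford series).

```r
set.seed(1)
x <- rnorm(50, mean = 12, sd = 2)   # made-up "prices", for illustration only

n     <- length(x)
se    <- sd(x) / sqrt(n)            # standard error of the mean
tcrit <- qt(.975, df = n - 1)       # t critical value for a 95% interval

manual  <- c(lower = mean(x) - tcrit * se, upper = mean(x) + tcrit * se)
builtin <- t.test(x, conf.level = .95)$conf.int

manual
builtin
```

The two results agree, so either route works; the manual version just makes each piece of the formula visible.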

Sample Size

Next, we want to know the sample size required to bracket the mean adjusted daily closing price within 50 cents. We always assume an alpha level of .05 unless told otherwise! Since we are assuming that the sample standard deviation is the population standard deviation, we have the following formula. \[N=\lceil{\frac{Z^{2}\times\sigma^{2}}{\epsilon^2}}\rceil\]

N=ceiling((qnorm(.975)*sd(b$Adj.Close))^2 / .50^2)

In this case, a sample size of 61 would suffice (using the standard deviation of about $1.98 shown above), assuming that this sequential sample is representative of the rest of the observations. We really shouldn’t make that assumption unless the data are stable, though. Why?
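
The same formula can be packaged so that the margin of error and alpha are easy to vary. This is a minimal sketch: sample_size is a hypothetical helper name, and sigma = 2 is a round, made-up value rather than the Ford standard deviation.

```r
# Hypothetical helper: sample size needed to estimate a mean within +/- eps
sample_size <- function(sigma, eps, alpha = .05) {
  z <- qnorm(1 - alpha / 2)          # two-sided critical value
  ceiling((z * sigma)^2 / eps^2)     # always round up
}

n_50c <- sample_size(sigma = 2, eps = .50)   # within 50 cents
n_25c <- sample_size(sigma = 2, eps = .25)   # within 25 cents
c(n_50c, n_25c)
```

Halving the margin of error roughly quadruples the required sample size, because eps is squared in the denominator.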

Unit 3 Assignment

Hypothesis Test 1, Mean Adjusted Closing Price

In this unit, we first simply conduct a hypothesis test regarding whether the mean adjusted closing price of our stock exceeds the base price. Why would this be interesting? It would indicate that the mean price is at least statistically better than the originating price. Think about what that might mean… In the case of this example, the first observation in the data set is $9.86. So our test value (call it \(X_1\)) is $9.86.

\[H_0: \mu \le X_1\] \[H_a: \mu > X_1\]

\[ \alpha=.05\] We assume that the test statistic follows a t-distribution.
Then the test statistic is as follows, where \(\mu_0 = X_1\) is the hypothesized value. \[ t_{df}=\frac{\overline{x}-\mu_0}{s_e}\] The degrees of freedom for a one-sample t-test are n-1. The larger the number of degrees of freedom, the closer the t-distribution comes to the normal distribution.

\[s_e=\frac{s}{\sqrt{n}}\]

myt=t.test(b$Adj.Close,mu=b$Adj.Close[1],alternative="greater", conf.level=.95)  #t.test has no alpha argument; alpha=.05 corresponds to conf.level=.95
myt
## 
##  One Sample t-test
## 
## data:  b$Adj.Close
## t = 44.629, df = 1257, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 9.860075
## 95 percent confidence interval:
##  12.25819      Inf
## sample estimates:
## mean of x 
##  12.35003

The results of our hypothesis test show a very small p-value. This means that the sample mean lies far from our fixed reference value. Since our p-value is near zero (2.2e-16 means move the decimal point 16 places to the left) and is below our alpha value of .05 (standard), we reject the null hypothesis. Clearly, the mean adjusted closing price of Ford is greater than the baseline price. Also, the conclusion would have been identical had we chosen an alpha level of .01. Why? Because our p-value would still be below .01. Again, think about what this might mean. The stock is not so stable as to be hovering only around a specific observation for five years. Had we stripped away the portion where Ford was having financial struggles, the results might have been different.
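
The t statistic and one-sided p-value reported by t.test() can be reproduced directly from the formulas above. This sketch uses simulated data and a made-up baseline mu0, not the Ford series.

```r
set.seed(2)
x   <- rnorm(40, mean = 12, sd = 2)   # made-up sample, for illustration only
mu0 <- 10                             # made-up baseline value

n      <- length(x)
se     <- sd(x) / sqrt(n)                            # standard error
t_stat <- (mean(x) - mu0) / se                       # test statistic
p_val  <- pt(t_stat, df = n - 1, lower.tail = FALSE) # upper-tail p-value

fit <- t.test(x, mu = mu0, alternative = "greater")
c(manual_t = t_stat, builtin_t = unname(fit$statistic))
c(manual_p = p_val,  builtin_p = fit$p.value)
```

For a "greater" alternative the p-value is the area in the upper tail, which is why lower.tail = FALSE is used.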

Hypothesis Test 2, Mean Daily Volume

We want to do a second hypothesis test, this time about the volume of stock trading. Volume may be volatile for stocks that are trending, but it might also be stable over the long haul (e.g., dividend-producing stocks).

Testing to see if volume has changed would make sense if we wanted to evaluate stability over two time frames. For example, if we wanted to do a pre-post analysis after a sales / marketing event, then we would test two samples (one before and one after) to see if we garnered more interest if not higher prices.

In this case, we will test our mean volume against the starting volume. However, the test is two-tailed, meaning we want to know if volume increased or decreased. Let’s define our variable as Y this time. And let’s investigate if our mean volume differs from our starting volume, \[Y_1\]. If so, then we have evidence of increase / decrease.

\[H_0: \mu = Y_1\] \[H_a: \mu \ne Y_1\] \[\alpha = .05\] The test statistic remains the t.

myt2=t.test(b$Volume,mu=b$Volume[1],alternative="two.sided", conf.level=.95)  #alpha=.05 corresponds to conf.level=.95
myt2
## 
##  One Sample t-test
## 
## data:  b$Volume
## t = -21.228, df = 1257, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 46366700
## 95 percent confidence interval:
##  34977071 36904204
## sample estimates:
## mean of x 
##  35940638

Once again, we have a very small p-value (near zero), meaning we need to reject the null hypothesis. Mean volume is not equal to the baseline volume. Further, we can infer that volume has actually declined, because the mean of the data is smaller than the starting volume. Recall that the baseline value comes from a time when Ford was under pressure. Some were trying to get out before losing everything. Others were trying to get in at a bargain (but most were escaping, hence the downward pressure on the stock then). Obviously, changing the alpha level makes no difference, because p is so close to zero.

Unit 4 Assignments

Download yet another stock

First, we have to download another stock, one that begins with the first letter of our first name. I chose Lockheed Martin.

baseURL="http://chart.finance.yahoo.com/table.csv?"
stock2="LMT"  #stock symbol
sm=0       #start month, 0==Jan
sd=13      #start day
sy=2012    #start year
em=0       #end month
ed=12      #end day
ey=2017    #end year

#I will paste this together, so that you can just edit the parameters above.  

mystock2=paste(baseURL,"s=",stock2,"&a=",sm,"&b=",sd,"&c=",sy,"&d=",em,"&e=",ed,"&f=",ey,"&g=d&ignore=.csv",sep="")
b2=read.csv(mystock2)  #read in the data
b2=b2[order(as.Date(b2$Date, format="%Y-%m-%d")),]  #sort the dataframe by date

Hypothesis Test 1, Paired / Matched t-test

To conduct a matched t-test, we simply need to subtract one stock's values from the other's, day by day. Here we are comparing the differences in the volumes of the two stocks on a daily basis. Clearly, there could be an order-of-magnitude issue here. You would ONLY want to do this if you either normalized the volumes (e.g., standardized them) or had stocks with similar volume levels. Let’s assume we believe the volumes are similar and start with an orderly approach.

\[ H_0: \mu_1-\mu_2=0\] \[H_a: \mu_1-\mu_2 \ne 0\] \[\alpha=.05\]

This is still a t-test, and we can either run a one-sample t-test on the daily differences or just tell R that the observations are paired.
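
The two routes are equivalent, which a small simulated example makes concrete. The vectors v1 and v2 below are made up; they stand in for two matched daily series.

```r
set.seed(3)
v1 <- rnorm(30, mean = 100, sd = 10)      # made-up series 1
v2 <- v1 + rnorm(30, mean = 5, sd = 3)    # made-up series 2, matched day by day

paired  <- t.test(v1, v2, paired = TRUE)  # tell R the observations are paired
onesamp <- t.test(v1 - v2, mu = 0)        # one-sample test on the differences

c(paired = unname(paired$statistic), one_sample = unname(onesamp$statistic))
c(paired = paired$p.value,           one_sample = onesamp$p.value)
```

Both calls give the same t statistic, degrees of freedom, and p-value, because the paired test is defined on the differences.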

myt3=t.test(b$Volume, b2$Volume, paired=T)
myt3
## 
##  Paired t-test
## 
## data:  b$Volume and b2$Volume
## t = 70.152, df = 1257, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  33293347 35209059
## sample estimates:
## mean of the differences 
##                34251203

Obviously, the volumes are different, because the p-value is less than alpha at all conventional levels. We should expect this if we are testing two different stocks without normalizing the volume!

Hypothesis Test 2, Two-sample t-test

Next, we have a more reasonable hypothesis test. We want to test whether the mean adjusted closing price for the last three years is higher than the mean adjusted closing price for the first two years. Notice that we are interested in an increase, so this belongs in our alternative hypothesis. If we or our company own (and are not shorting) a long-term stock that doesn’t pay dividends, that should be our hope! Again, we start with the basics. Note that the mean adjusted close of the last three years is represented by \(\mu_1\) here. The order in which you pass the groups to the test definitely matters!

\[ H_0: \mu_1-\mu_2 \le 0\]

\[ H_a: \mu_1-\mu_2>0\] \[\alpha=.05\] We have a t-distribution from two samples. It might be an equal-variance or an unequal-variance t-test, so we would typically investigate that using an F test first. The null hypothesis for this F test is as follows. \[H_0: \sigma^2_1 = \sigma^2_2\] The alternative is then \[H_a: \sigma^2_1 \ne \sigma^2_2\] Note: this implies that the ratio of variances is 1 under the null. We use the sample variances in place of the unknown population variances to conduct the test, relying on the fact that the ratio of the two sample variances is F distributed under the null. If we reject the null hypothesis, we should use the unequal-variance t-test; otherwise, the equal-variance version is appropriate.

We will make that choice post hoc. But first, let’s split the groups and run the tests.

g1=b$Adj.Close[1:502]
g2=b$Adj.Close[503:1258]
var.test(g1,g2)
## 
##  F test to compare two variances
## 
## data:  g1 and g2
## F = 5.1323, num df = 501, denom df = 755, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  4.379733 6.029981
## sample estimates:
## ratio of variances 
##           5.132311
t.test(g1, g2, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  g1 and g2
## t = -21.257, df = 1256, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.269516 -1.886001
## sample estimates:
## mean of x mean of y 
##  11.10139  13.17915
t.test(g1,g2,var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  g1 and g2
## t = -18.535, df = 632.01, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.297888 -1.857629
## sample estimates:
## mean of x mean of y 
##  11.10139  13.17915

From the results of the F test, we can see that the null hypothesis that the variances are equal is rejected, because the p-value is below .05. This means we should use the unequal-variance t-test (also known as the Welch test).
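
The F statistic that var.test() reports is just the ratio of the two sample variances, with n-1 degrees of freedom for each group. The sketch below checks this on simulated groups (early and late are made-up vectors, not the Ford data).

```r
set.seed(4)
early <- rnorm(60, mean = 0, sd = 3)   # made-up group with larger variance
late  <- rnorm(80, mean = 0, sd = 1)   # made-up group with smaller variance

f_manual <- var(early) / var(late)     # ratio of sample variances
df1 <- length(early) - 1
df2 <- length(late) - 1

# Two-sided p-value: double the smaller tail area
p_lower  <- pf(f_manual, df1, df2)
p_manual <- 2 * min(p_lower, 1 - p_lower)

ft <- var.test(early, late)
c(manual_F = f_manual, builtin_F = unname(ft$statistic))
```

A ratio far from 1 in either direction is evidence against equal variances, which is why the test is two-sided.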

The results of this test are also clear. Reject the null hypothesis that the mean adjusted closing price for years 3 through 5 is less than or equal to that for years 1 and 2. Note that the code above passes the first two years as the first group and runs the default two-sided test, so the t statistic is negative; the one-sided p-value in the direction of our alternative is half the reported two-sided value and remains tiny (well below .05 or even .001). The latter period has the higher mean adjusted closing price ($13.18 versus $11.10). So the stock has seen real gains over its mean for the first two years. That is nice to know. We would definitely want to investigate further, but we would surmise that if a company invested five years ago in a long-term position, they would at least have paper gains!
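
To make the call match the stated hypotheses exactly, the later period can be passed first with alternative="greater". This is a sketch on simulated data (early and late are made-up vectors); it also shows how the one-sided result relates to the default two-sided one.

```r
set.seed(5)
early <- rnorm(50, mean = 11, sd = 2.5)   # made-up "first two years"
late  <- rnorm(75, mean = 13, sd = 1)     # made-up "last three years"

# Default call, earlier group first: two-sided, t is negative
two_sided <- t.test(early, late, var.equal = FALSE)

# Matches H_a: mu_1 - mu_2 > 0 with mu_1 = later period
one_sided <- t.test(late, early, alternative = "greater", var.equal = FALSE)

c(two_sided_t = unname(two_sided$statistic),
  one_sided_t = unname(one_sided$statistic))
c(two_sided_p = two_sided$p.value, one_sided_p = one_sided$p.value)
```

Swapping the groups flips the sign of t, and when the observed difference is in the direction of the alternative, the one-sided p-value is half the two-sided one.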

Hypothesis Test 3, ANOVA

Finally, we come to the last hypothesis test of the week, the ANOVA. This requires us to download monthly data and evaluate it. Yes, we are all about data.

baseURL="http://chart.finance.yahoo.com/table.csv?"
stock="F"  #stock symbol
sm=0       #start month, 0==Jan
sd=13      #start day
sy=2012    #start year
em=0       #end month
ed=12      #end day
ey=2017    #end year
freq="m"   #frequency
#I will paste this together, so that you can just edit the parameters above.  

mystock3=paste(baseURL,"s=",stock,"&a=",sm,"&b=",sd,"&c=",sy,"&d=",em,"&e=",ed,"&f=",ey,"&g=",freq,"&ignore=.csv",sep="")
mo=read.csv(mystock3)  #read in the data
mo=mo[order(as.Date(mo$Date, format="%Y-%m-%d")),]  #sort the dataframe by date
mo$Month=as.factor(substr(mo$Date,6,7))  #create a monthly indicator

Ok, with the 5 years of monthly data, we can now investigate whether there are seasonal effects by month! As usual, we start with our hypotheses.

\[H_0: \mu_1=\mu_2=\dots=\mu_{12}\] \[H_a: \text{Not all } \mu_i \text{ are equal}\] As usual, \[\alpha=.05\]

Now, we can actually do some boxplots to help us!

boxplot(mo$Adj.Close~mo$Month,col=seq(1:12), main="Mean Adj. Close by Month")

We have some months that appear to have distributions that are above / below others, but it’s hard to tell if those differences are statistically significant. Ok, so it’s time to run the ANOVA.

myaov=aov(mo$Adj.Close~mo$Month)
summary(myaov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## mo$Month    11   3.89   0.354   0.078      1
## Residuals   49 222.89   4.549

The results are not statistically significant at the .05 level. We FAIL to reject the null. Why? The F statistic is only 0.078, and its p-value (the Pr(>F) column) rounds to 1, which is far greater than .05. Even had the alpha level been .10, the result would have been the same. So we cannot reject the null that there is NO difference in the mean monthly adjusted closing prices. Further, we should have expected that! Why? We have very little power given the number of observations per month and the number of groups we are including in our model. We have 12 months and only 60 or 61 observations (depending on how the data are pulled). That means we have only about 5 observations per month! It’s tough to do anything with that!

Now, suppose the ANOVA had been significant and we had rejected the null. What then? We would have to run post-hoc tests to determine which means are actually different. Post-hocs are just touched on in this course, so I will leave it to you to investigate.
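
As a starting point for that investigation, one common post-hoc choice in base R is Tukey's honest significant difference, available as TukeyHSD(). The choice of Tukey here is my assumption, not something the course prescribes, and the data frame below is made up (three "months", with the third deliberately shifted upward) so that there is something to find.

```r
set.seed(6)
toy <- data.frame(
  Month     = factor(rep(c("01", "02", "03"), each = 5)),
  Adj.Close = c(rnorm(5, 10, 1), rnorm(5, 10, 1), rnorm(5, 15, 1))  # made-up prices
)

fit <- aov(Adj.Close ~ Month, data = toy)   # one-way ANOVA
summary(fit)

posthoc <- TukeyHSD(fit)                    # all pairwise comparisons of months
posthoc
```

Each row of the result is one pair of months, with the estimated difference, a confidence interval, and an adjusted p-value; pairs whose interval excludes zero are the ones that differ.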