Abstract

In this project, I use economic indicators to predict the S&P 500 stock market index. Three predictor variables are selected and a multiple regression model is fitted.

Data

The data were selected from the 100+ datasets and FRED economic data, available at http://research.stlouisfed.org/fred2/release?rid=26. The data set contains quarterly values of the S&P 500 index, the consumer price index, the unemployment rate and the purchasing managers’ index. There are 39 observations per variable, gathered from 2005 to 2014. We have one dependent variable and three independent variables, described below.

Dependent Variable:

S&P500 index

The S&P 500 is regarded as a gauge of the large-cap U.S. equities market. It includes 500 leading companies in leading industries of the U.S. economy.

Independent Variables:

No.1: CPI

The Consumer Price Index (CPI) is a measure of the average monthly change in the prices of goods and services.

No.2: UR

The unemployment rate (UR) for all persons aged 15 to 64 in the United States, recorded as a percentage.

No.3: PMI

A Purchasing Managers’ Index (PMI) above 50 percent indicates that the manufacturing economy is generally expanding; below 50 percent that it is generally declining.

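# Read the quarterly data set (whitespace-delimited text with a header row)
data<-read.table("D:\\RPI QFRA\\QFRA S2\\applied regression\\project22.txt",header=TRUE)
# Keep only the numeric columns used in the regressions
data1<-data[c("SP500","CPI","UR","PMI")]
head(data)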
##         Date   SP500     CPI       UR  PMI
## 1 2005-04-01 1181.97 193.667 5.169840 51.8
## 2 2005-07-01 1224.17 196.600 5.025390 54.0
## 3 2005-10-01 1230.47 198.433 5.018598 56.3
## 4 2006-01-01 1283.66 199.467 4.792543 55.0
## 5 2006-04-01 1280.80 201.267 4.736297 53.6
## 6 2006-07-01 1288.34 203.167 4.695755 53.0

Null hypothesis:

The variation observed in the S&P 500 cannot be explained by the variation in any of the three predictors considered in this study (CPI, PMI and UR); that is, all three regression coefficients equal zero. Low p-values in the regressions below are evidence against this hypothesis.

Here is the summary of the data set.

attach(data);
summary(data);
##          Date        SP500             CPI              UR        
##  2005-04-01: 1   Min.   : 807.7   Min.   :193.7   Min.   : 4.516  
##  2005-07-01: 1   1st Qu.:1214.4   1st Qu.:209.2   1st Qu.: 5.022  
##  2005-10-01: 1   Median :1318.3   Median :217.4   Median : 7.089  
##  2006-01-01: 1   Mean   :1356.2   Mean   :218.3   Mean   : 7.097  
##  2006-04-01: 1   3rd Qu.:1492.6   3rd Qu.:229.4   3rd Qu.: 8.956  
##  2006-07-01: 1   Max.   :2009.3   Max.   :237.5   Max.   :10.081  
##  (Other)   :33                                                    
##       PMI       
##  Min.   :35.50  
##  1st Qu.:50.90  
##  Median :52.70  
##  Mean   :52.37  
##  3rd Qu.:56.00  
##  Max.   :59.10  
## 

Next, we fit a simple linear regression of the dependent variable, the S&P 500 index, on each independent variable in turn.

Model: SP500 vs CPI

plot(CPI,SP500, main="SP500 vs CPI", xlim = c(180,250))
newx<-seq(180,250)
ress<-lm(SP500~CPI)
# 95% prediction interval over a grid of CPI values
x<-predict(ress,newdata=data.frame(CPI=newx),interval = "prediction", level = 0.95,type="response")
abline(ress,lwd=2)             # fitted regression line
lines(newx,x[,"upr"], lty=1)   # upper prediction bound
lines(newx,x[,"lwr"], lty=1)   # lower prediction bound

Model: SP500 vs UR

plot(UR,SP500, main="SP500 vs UR", xlim = c(4,11))
newx<-seq(4,11)
resa<-lm(SP500~UR)
# 95% prediction interval over a grid of UR values
x<-predict(resa,newdata=data.frame(UR=newx),interval = "prediction", level = 0.95,type="response")
abline(resa,lwd=2)             # fitted regression line
lines(newx,x[,"upr"], lty=1)   # upper prediction bound
lines(newx,x[,"lwr"], lty=1)   # lower prediction bound

Model: SP500 vs PMI

plot(PMI,SP500, main="SP500 vs PMI", xlim = c(35,60))
newx<-seq(35,60)
resp<-lm(SP500~PMI)
# 95% prediction interval over a grid of PMI values
x<-predict(resp,newdata=data.frame(PMI=newx),interval = "prediction", level = 0.95,type="response")
abline(resp,lwd=2)             # fitted regression line
lines(newx,x[,"upr"], lty=1)   # upper prediction bound
lines(newx,x[,"lwr"], lty=1)   # lower prediction bound

As can be seen from the three plots above, about 95% of the sample points fall between the upper and lower bounds of the 95% prediction interval. At the 95% confidence level, points within the two bounds are consistent with the fitted model. Points outside the bounds are flagged as potential outliers.
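The bands drawn above come from `predict()` with a prediction interval, which is wider than a confidence interval for the mean response. The following is a minimal sketch of that difference. It uses synthetic data (the variables `x` and `y` and all numbers are made up for illustration, not the project's series).

```r
# Synthetic data only: these values are NOT the project's S&P 500 series.
set.seed(42)
x <- seq(190, 240, length.out = 39)          # hypothetical predictor
y <- -1300 + 12 * x + rnorm(39, sd = 200)    # hypothetical response
fit <- lm(y ~ x)

# Evaluate both kinds of interval on the same grid
grid <- data.frame(x = seq(180, 250, by = 5))
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

# Width of each band at every grid point
ci_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pi_width <- pred_band[, "upr"] - pred_band[, "lwr"]

# Share of the observed points that fall inside the prediction band
obs_band <- predict(fit, newdata = data.frame(x = x),
                    interval = "prediction", level = 0.95)
inside <- mean(y >= obs_band[, "lwr"] & y <= obs_band[, "upr"])

cbind(grid, ci_width, pi_width)
inside
```

The prediction band is wider at every grid point because it adds the residual variance of a single new observation to the uncertainty of the fitted line.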

Result

There are three approaches to building the multiple regression model: entry-wise, hierarchical and sequential. First, we put all the independent variables into the multiple linear regression.

Entry-wise

And here is the result.

res<-lm(SP500~CPI+UR+PMI)
summary(res)
## 
## Call:
## lm(formula = SP500 ~ CPI + UR + PMI)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.635  -50.172    6.319   55.880  101.302 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3022.5755   198.3407 -15.239  < 2e-16 ***
## CPI            19.1448     0.9337  20.505  < 2e-16 ***
## UR           -119.2579     6.0264 -19.789  < 2e-16 ***
## PMI            19.9767     2.0668   9.666 2.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.61 on 35 degrees of freedom
## Multiple R-squared:  0.9528, Adjusted R-squared:  0.9488 
## F-statistic: 235.7 on 3 and 35 DF,  p-value: < 2.2e-16

Interpretation

CPI, UR and PMI together explain 94.88% of the variation in the S&P 500, as evidenced by the adjusted R^2 value. All three independent variables in the model are significant at the 0.1% level. Hence, from this approach, we conclude that CPI, UR and PMI are able to explain the variation in the S&P 500 index, and it is reasonable to use them to predict the index.
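As a check on the figures quoted above, the adjusted R^2 and the overall F statistic can be recomputed from the multiple R^2 alone. The sketch below takes R^2 = 0.9528 and three predictors from the output, and n = 39 from the 35 residual degrees of freedom plus four estimated coefficients. Small differences from the printed values are expected because the inputs are rounded.

```r
# Inputs copied from the summary(res) output above
r2 <- 0.9528   # multiple R-squared
n  <- 39       # observations: 35 residual df + 4 coefficients
p  <- 3        # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic and its p-value on (p, n - p - 1) degrees of freedom
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value)
```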

Hierarchical

To build a hierarchical model, the independent variables are added one at a time in a theoretically motivated order. We first study the unemployment rate, then PMI and finally CPI. To some extent, the unemployment rate reflects the condition of the economy: a higher unemployment rate should be considered a negative signal for the economy.

Model1:

res1<-lm(SP500~UR);
summary(res1);
## 
## Call:
## lm(formula = SP500 ~ UR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -477.73 -178.26  -37.25   88.27  582.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1744.01     161.94  10.770 5.85e-13 ***
## UR            -54.65      22.03  -2.481   0.0178 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263.8 on 37 degrees of freedom
## Multiple R-squared:  0.1426, Adjusted R-squared:  0.1194 
## F-statistic: 6.154 on 1 and 37 DF,  p-value: 0.01779

Interpretation

From the result, the unemployment rate does have a negative effect. It explains 14.26% of the variation in the S&P 500 index, and its coefficient is significant at the 5% significance level. As the unemployment rate increases by one percentage point, the S&P 500 index decreases by 54.65 points.

Next, PMI reflects the purchasing activity of industry managers. A higher PMI indicates stronger purchasing, and thus that the industries are in good condition.

Model2:

res2<-lm(SP500~UR+PMI);
summary(res2)
## 
## Call:
## lm(formula = SP500 ~ UR + PMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -362.00 -174.72  -22.63   99.29  449.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  363.293    390.817   0.930  0.35878    
## UR           -62.016     18.997  -3.265  0.00241 ** 
## PMI           27.361      7.239   3.780  0.00057 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 226.3 on 36 degrees of freedom
## Multiple R-squared:  0.3862, Adjusted R-squared:  0.3521 
## F-statistic: 11.32 on 2 and 36 DF,  p-value: 0.000153
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
md <- scatterplot3d(UR,PMI,SP500,pch = 21, main = "Regression plane",bg = 'blue',xlab = "UR", ylab = "PMI", zlab = "SP500",axis = TRUE);
md$plane3d(res2)

Interpretation

The adjusted R-square has now increased to 35.21%: UR and PMI together explain 35.21% of the variation in the S&P 500 index. The coefficient on UR moves from -54.65 to -62.016 because of suppression. PMI has a positive effect on the S&P 500 index: a one-unit increase in PMI is associated with a 27.361-point increase in the index, holding UR constant. The low p-values indicate that both coefficients differ from zero. Thus, this multiple regression is better than Model 1 above. The 3-D plot above shows the fitted regression plane.

Model3:

res3<-lm(SP500~UR+CPI+PMI);
summary(res3);
## 
## Call:
## lm(formula = SP500 ~ UR + CPI + PMI)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.635  -50.172    6.319   55.880  101.302 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3022.5755   198.3407 -15.239  < 2e-16 ***
## UR           -119.2579     6.0264 -19.789  < 2e-16 ***
## CPI            19.1448     0.9337  20.505  < 2e-16 ***
## PMI            19.9767     2.0668   9.666 2.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.61 on 35 degrees of freedom
## Multiple R-squared:  0.9528, Adjusted R-squared:  0.9488 
## F-statistic: 235.7 on 3 and 35 DF,  p-value: < 2.2e-16

Interpretation

Finally, we put all three independent variables in the regression. As we concluded in the Entry-wise section, the model improves and explains the variation of the S&P 500 index better: the adjusted R-square jumps from 35.21% to 94.88%. CPI has a positive effect as well. CPI and PMI appear to have almost the same per-unit effect on the index, as evidenced by their coefficients (19.1448 and 19.9767). In conclusion, this model is better than Model 2, and we should choose it for prediction.
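The improvement from Model 2 to Model 3 can also be expressed as a partial F test for the added CPI term. The sketch below rebuilds that test from the rounded residual standard errors and degrees of freedom printed in the two summaries, so the result is approximate. The variable names are illustrative.

```r
# Values copied from the summary() outputs of Model 2 and Model 3
sigma2 <- 226.3; df2 <- 36   # SP500 ~ UR + PMI
sigma3 <- 63.61; df3 <- 35   # SP500 ~ UR + CPI + PMI

# Residual sum of squares = (residual standard error)^2 * residual df
rss2 <- sigma2^2 * df2
rss3 <- sigma3^2 * df3

# Partial F statistic for adding one predictor (CPI) to Model 2
f_partial <- ((rss2 - rss3) / (df2 - df3)) / (rss3 / df3)
p_partial <- pf(f_partial, df2 - df3, df3, lower.tail = FALSE)

# For a single added term, F should be close to the squared t value of CPI
t_cpi <- 20.505
c(F = f_partial, t_squared = t_cpi^2, p = p_partial)
```

With the original data loaded, `anova(res2, res3)` would report the same comparison directly.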

Sequential

To build a sequential model, we add variables one at a time based on their correlation with the dependent variable. We add the most correlated independent variable first, then continue to add the other variables step by step. Finally, we choose the best model according to explanatory power.
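The sequential procedure can be sketched as a small helper that ranks predictors by absolute correlation with the response and then fits the nested models in that order. The function `order_by_correlation` and the data frame `toy` are hypothetical names introduced here, and the data are synthetic, not the project's.

```r
# Rank predictors by |correlation| with the response (hypothetical helper)
order_by_correlation <- function(df, response) {
  predictors <- setdiff(names(df), response)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[response]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Synthetic data: y depends strongly on a, moderately on b, not on c
set.seed(1)
n <- 500
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$y <- 3 * toy$a - 1.5 * toy$b + rnorm(n)

entry_order <- names(order_by_correlation(toy, "y"))

# Fit the nested models in entry order and record adjusted R-squared
adj_r2 <- sapply(seq_along(entry_order), function(k) {
  f <- reformulate(entry_order[1:k], response = "y")
  summary(lm(f, data = toy))$adj.r.squared
})
names(adj_r2) <- entry_order

entry_order
adj_r2
```

Ranking on the absolute value matters because a strongly negative correlation is as informative as a strongly positive one.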

plot(data1, pch=21, cex=1, bg='ivory4', main="Scatter plot matrix of SP500, CPI, UR, and PMI");

cor(data1)
##            SP500       CPI         UR       PMI
## SP500  1.0000000 0.5510486 -0.3776209 0.4521841
## CPI    0.5510486 1.0000000  0.4720091 0.2012348
## UR    -0.3776209 0.4720091  1.0000000 0.1026232
## PMI    0.4521841 0.2012348  0.1026232 1.0000000

Interpretation

From the scatter plot, both PMI and CPI are positively correlated with the S&P 500, while UR is negatively correlated with it. The correlation matrix confirms this. Thus, according to the absolute correlations with the dependent variable, we add the independent variables in this order: CPI (0.551) -> PMI (0.452) -> UR (-0.378). Here is the result for the relationship between CPI and S&P 500:

sres1<-lm(SP500~CPI);
summary(sres1);
## 
## Call:
## lm(formula = SP500 ~ CPI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -476.79 -189.33   48.73  187.42  424.90 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1299.394    662.219  -1.962 0.057296 .  
## CPI            12.166      3.029   4.017 0.000278 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 237.7 on 37 degrees of freedom
## Multiple R-squared:  0.3037, Adjusted R-squared:  0.2848 
## F-statistic: 16.13 on 1 and 37 DF,  p-value: 0.0002775

Interpretation

The adjusted R-square is 28.48%. Because of the low p-value, we reject the hypothesis that CPI is unrelated to the S&P 500 index. A one-unit increase in CPI is associated with a 12.166-point increase in the S&P 500 index.
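The p-value behind that conclusion follows from the estimate and its standard error. The sketch below recomputes the t statistic, the two-sided p-value and a 95% confidence interval for the CPI slope from the rounded values in the output above, so the last digits may differ slightly from the printed ones.

```r
# Values copied from the summary(sres1) output above
est <- 12.166   # CPI coefficient estimate
se  <- 3.029    # its standard error
df  <- 37       # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df)

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se

c(t = t_val, p = p_val, lower = ci[1], upper = ci[2])
```

An interval that excludes zero leads to the same decision as the t test at the 5% level.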

Then, PMI is added.

sres2<-lm(SP500~PMI+CPI);
summary(sres2);
## 
## Call:
## lm(formula = SP500 ~ PMI + CPI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -345.11 -199.60   12.21  148.45  365.77 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1981.763    658.329  -3.010 0.004747 ** 
## PMI            19.615      7.115   2.757 0.009101 ** 
## CPI            10.586      2.848   3.716 0.000684 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219 on 36 degrees of freedom
## Multiple R-squared:  0.4251, Adjusted R-squared:  0.3931 
## F-statistic: 13.31 on 2 and 36 DF,  p-value: 4.713e-05
library(scatterplot3d)
md <- scatterplot3d(PMI,CPI,SP500,pch = 21, main = "Regression plane",bg = 'blue',xlab = "PMI", ylab = "CPI", zlab = "SP500",axis = TRUE);
md$plane3d(sres2)

Interpretation

The adjusted R-square increases to 39.31% in this multiple regression model. Again, we reject the hypothesis that CPI and PMI are unrelated to the index, because both are significant at the 1% significance level. Thus, this multiple regression model is preferred.

Finally, all three variables are used.

summary(res3)
## 
## Call:
## lm(formula = SP500 ~ UR + CPI + PMI)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.635  -50.172    6.319   55.880  101.302 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3022.5755   198.3407 -15.239  < 2e-16 ***
## UR           -119.2579     6.0264 -19.789  < 2e-16 ***
## CPI            19.1448     0.9337  20.505  < 2e-16 ***
## PMI            19.9767     2.0668   9.666 2.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.61 on 35 degrees of freedom
## Multiple R-squared:  0.9528, Adjusted R-squared:  0.9488 
## F-statistic: 235.7 on 3 and 35 DF,  p-value: < 2.2e-16

Interpretation

The result is the same as Model 3 in the Hierarchical section. Because all the coefficients on the independent variables are significant and the adjusted R-square increases substantially, the final model is:

SP500 = -3022.5755 - 119.2579 UR + 19.1448 CPI + 19.9767 PMI

Holding the other variables constant, a one-unit increase in PMI is associated with a 19.9767-point increase in the S&P 500, a one-unit increase in CPI with a 19.1448-point increase, and a one-unit increase in UR with a 119.2579-point decrease.
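To make the fitted equation concrete, the sketch below wraps the reported coefficients in a small function (`predict_sp500` is a hypothetical name) and evaluates it at the sample means from the data summary. Because an ordinary least squares fit with an intercept passes through the point of means, the result should land close to the mean S&P 500 value of 1356.2, up to rounding.

```r
# Final model with the coefficients reported above (hypothetical helper)
predict_sp500 <- function(ur, cpi, pmi) {
  -3022.5755 - 119.2579 * ur + 19.1448 * cpi + 19.9767 * pmi
}

# Prediction at the sample means taken from summary(data)
at_means <- predict_sp500(ur = 7.097, cpi = 218.3, pmi = 52.37)

# Effect of a one-point rise in UR, holding CPI and PMI fixed
delta_ur <- predict_sp500(ur = 8.097, cpi = 218.3, pmi = 52.37) - at_means

c(at_means = at_means, delta_ur = delta_ur)
```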

layout(matrix(c(2,1), 2, 1, byrow = TRUE),widths=c(3,1), heights=c(3,3))
model <- resid(res3)
plot(fitted(res3), model, pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(model, main="Model Residual Histogram",xlab = "Residuals")

boxplot(model, main="Boxplot of Residuals");

From the box plot, we observe that the residual distribution is slightly off-center and skewed. Because the sample is small, the histogram is inconclusive. A follow-up project could formally assess the normality assumption for the residuals.
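One possible form of such a normality assessment is a normal Q-Q plot together with a Shapiro-Wilk test. The following is a minimal sketch on synthetic data (the data frame `toy` and its columns are made up for illustration). With the project's data, `resid(res3)` would take the place of `res`.

```r
# Synthetic regression with 39 observations; NOT the project's data
set.seed(7)
toy <- data.frame(x1 = rnorm(39), x2 = rnorm(39))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(39)
fit <- lm(y ~ x1 + x2, data = toy)
res <- resid(fit)

# Normal Q-Q plot: points near the line suggest approximately normal residuals
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, col = "red", lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw
```

With a sample this small the test has limited power, so the Q-Q plot and the test are best read together.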

Summary

The variation in the S&P 500 can be explained by UR, CPI and PMI: together, the three variables explain 94.88% of the variation in the index (adjusted R-square). In further study, we will gather more data and assess the assumptions of the model.