In this project, I use several economic indicators to predict the S&P 500 stock market index. Three predictor variables are selected and a multiple regression model is fitted.
The data were selected from the 100+ datasets and FRED economic data, accessible at http://research.stlouisfed.org/fred2/release?rid=26. The data set contains quarterly values of the S&P 500 index, the consumer price index, the unemployment rate and the purchasing managers’ index. There are 39 quarterly observations per variable (the regression output below reports 35 residual degrees of freedom for four estimated coefficients), covering 2005 to 2014. There is one dependent variable and three independent variables, described below.
The S&P 500 is regarded as a gauge of the large-cap U.S. equities market and includes 500 leading companies in leading industries of the US economy.
The Consumer Price Index (CPI) is a measure of the average monthly change in the prices of goods and services.
The unemployment rate (UR) covers all persons aged 15 to 64 in the United States and is recorded as a percentage.
A Purchasing Managers’ Index (PMI) above 50 percent indicates that the manufacturing economy is generally expanding; below 50 percent that it is generally declining.
data<-read.table("D:\\RPI QFRA\\QFRA S2\\applied regression\\project22.txt",header=T)
data1<-data[c("SP500","CPI","UR","PMI")]
head(data);
## Date SP500 CPI UR PMI
## 1 2005-04-01 1181.97 193.667 5.169840 51.8
## 2 2005-07-01 1224.17 196.600 5.025390 54.0
## 3 2005-10-01 1230.47 198.433 5.018598 56.3
## 4 2006-01-01 1283.66 199.467 4.792543 55.0
## 5 2006-04-01 1280.80 201.267 4.736297 53.6
## 6 2006-07-01 1288.34 203.167 4.695755 53.0
The variation observed in the S&P 500 can be explained by the variation in any of the three predictors considered in this study: CPI, PMI and UR.
Here is the summary of the data set.
attach(data);
summary(data);
## Date SP500 CPI UR
## 2005-04-01: 1 Min. : 807.7 Min. :193.7 Min. : 4.516
## 2005-07-01: 1 1st Qu.:1214.4 1st Qu.:209.2 1st Qu.: 5.022
## 2005-10-01: 1 Median :1318.3 Median :217.4 Median : 7.089
## 2006-01-01: 1 Mean :1356.2 Mean :218.3 Mean : 7.097
## 2006-04-01: 1 3rd Qu.:1492.6 3rd Qu.:229.4 3rd Qu.: 8.956
## 2006-07-01: 1 Max. :2009.3 Max. :237.5 Max. :10.081
## (Other) :33
## PMI
## Min. :35.50
## 1st Qu.:50.90
## Median :52.70
## Mean :52.37
## 3rd Qu.:56.00
## Max. :59.10
##
Next, we fit simple linear regressions to describe the relationship between each independent variable and the dependent variable, the S&P 500 index.
Model: SP500 vs CPI
plot(CPI,SP500, main="SP500 vs CPI", xlim = c(180,250))
newx<-seq(180,250)
ress<-lm(SP500~CPI)
x<-predict(ress,newdata=data.frame(CPI=newx),interval = c("predict"), level = 0.95,type="response")
abline(ress,lwd=2)
lines(newx,x[,3], lty=1)
lines(newx,x[,2], lty=1)
Model: SP500 vs UR
plot(UR,SP500, main="SP500 vs UR", xlim = c(4,11))
newx<-seq(4,11)
resa<-lm(SP500~UR)
x<-predict(resa,newdata=data.frame(UR=newx),interval = c("predict"), level = 0.95,type="response")
abline(resa,lwd=2)
lines(newx,x[,3], lty=1)
lines(newx,x[,2], lty=1)
Model: SP500 vs PMI
plot(PMI,SP500, main="SP500 vs PMI", xlim = c(35,60))
newx<-seq(35,60)
resp<-lm(SP500~PMI)
x<-predict(resp,newdata=data.frame(PMI=newx),interval = c("predict"), level = 0.95,type="response")
abline(resp,lwd=2)
lines(newx,x[,3], lty=1)
lines(newx,x[,2], lty=1)
As can be seen from the three plots above, most of the sample points fall between the upper and lower bounds of the 95% prediction intervals. At the 95% confidence level, points within the two bounds are considered reasonable, while points outside the bounds are potential outliers.
There are three approaches to building the multiple regression model: entry-wise, hierarchical and sequential. First, in the entry-wise approach, we put all the independent variables into the multiple linear regression.
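The bands drawn with `interval = "predict"` are prediction intervals, which are wider than confidence intervals for the mean. The sketch below illustrates the difference. It is only an illustration: it uses just the six quarters printed by `head(data)` and passes them through `data =` rather than `attach()`, so the numbers will not match the full-sample plots.

```r
# First six quarters, copied from the head(data) output above.
d <- data.frame(
  SP500 = c(1181.97, 1224.17, 1230.47, 1283.66, 1280.80, 1288.34),
  CPI   = c(193.667, 196.600, 198.433, 199.467, 201.267, 203.167)
)

fit  <- lm(SP500 ~ CPI, data = d)
newx <- data.frame(CPI = seq(194, 203, by = 3))

# Interval for the mean response versus interval for a single new observation.
conf_int <- predict(fit, newdata = newx, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = newx, interval = "prediction", level = 0.95)

# Width of each band at the same CPI values; the prediction band is wider
# because it also accounts for the residual variance of a new point.
widths <- cbind(newx,
                ci_width = conf_int[, "upr"] - conf_int[, "lwr"],
                pi_width = pred_int[, "upr"] - pred_int[, "lwr"])
print(widths)
```

Roughly 95% of new observations are expected to fall inside a prediction band, which is why it is the appropriate band for flagging unusual points.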
And here is the result.
res<-lm(SP500~CPI+UR+PMI)
summary(res)
##
## Call:
## lm(formula = SP500 ~ CPI + UR + PMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.635 -50.172 6.319 55.880 101.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3022.5755 198.3407 -15.239 < 2e-16 ***
## CPI 19.1448 0.9337 20.505 < 2e-16 ***
## UR -119.2579 6.0264 -19.789 < 2e-16 ***
## PMI 19.9767 2.0668 9.666 2.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.61 on 35 degrees of freedom
## Multiple R-squared: 0.9528, Adjusted R-squared: 0.9488
## F-statistic: 235.7 on 3 and 35 DF, p-value: < 2.2e-16
CPI, UR and PMI together explain 94.88% of the variation in the S&P 500, as shown by the adjusted R^2 value. All three independent variables are significant at the 0.1% level. Hence, from this approach, we conclude that CPI, UR and PMI are able to explain the variation in the S&P 500 index, and it is reasonable to use them to predict it.
To build a hierarchical model, the independent variables are added one at a time in a theoretically motivated order. We first study the unemployment rate, then PMI and finally CPI. To some extent, the unemployment rate reflects the condition of the economy: a higher unemployment rate should be considered a negative signal for the economy.
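As a quick check on the output above, the adjusted R^2 can be recomputed from the multiple R^2. This is a minimal sketch: `adj_r2` is a hypothetical helper name, and the sample size n = 39 is inferred from the 35 residual degrees of freedom plus four estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 39  # 35 residual degrees of freedom + 4 estimated coefficients

# Multiple R-squared of SP500 ~ CPI + UR + PMI, copied from summary(res).
full <- adj_r2(0.9528, n, p = 3)
print(round(full, 4))
```

The adjustment penalizes each additional predictor, so it is the more suitable figure when comparing models with different numbers of variables.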
Model1:
res1<-lm(SP500~UR);
summary(res1);
##
## Call:
## lm(formula = SP500 ~ UR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -477.73 -178.26 -37.25 88.27 582.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1744.01 161.94 10.770 5.85e-13 ***
## UR -54.65 22.03 -2.481 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.8 on 37 degrees of freedom
## Multiple R-squared: 0.1426, Adjusted R-squared: 0.1194
## F-statistic: 6.154 on 1 and 37 DF, p-value: 0.01779
The result confirms that the unemployment rate has a negative effect. It explains 14.26% of the variation in the S&P 500 index, and its coefficient is significant at the 5% level. A one-percentage-point increase in the unemployment rate is associated with a 54.65-point decrease in the S&P 500 index. Next, PMI reflects the purchasing activity of industry managers: a higher PMI indicates stronger purchasing, and thus industries in good condition.
Model2:
res2<-lm(SP500~UR+PMI);
summary(res2)
##
## Call:
## lm(formula = SP500 ~ UR + PMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -362.00 -174.72 -22.63 99.29 449.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 363.293 390.817 0.930 0.35878
## UR -62.016 18.997 -3.265 0.00241 **
## PMI 27.361 7.239 3.780 0.00057 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 226.3 on 36 degrees of freedom
## Multiple R-squared: 0.3862, Adjusted R-squared: 0.3521
## F-statistic: 11.32 on 2 and 36 DF, p-value: 0.000153
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
md <- scatterplot3d(UR,PMI,SP500,pch = 21, main = "Regression plane",bg = 'blue',xlab = "UR", ylab = "PMI", zlab = "SP500",axis = TRUE);
md$plane3d(res2)
The adjusted R-squared increases to 35.21%: UR and PMI together explain 35.21% of the variation in the S&P 500 index. The coefficient on UR changes from -54.65 to -62.016 because of suppression. PMI has a positive effect: a one-unit increase in PMI is associated with a 27.361-point increase in the S&P 500 index, holding UR constant. Both coefficients are significantly different from zero, as shown by their low p-values. Thus, this multiple regression is better than Model 1 above. The 3-D plot produced by the code above shows the fitted regression plane.
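The claim that Model 2 improves on Model 1 can also be checked with a partial F-test for the added predictor. The sketch below works only from the figures printed in the two summaries above, so it is subject to rounding; the variable names are illustrative.

```r
# Values copied from the two summaries above.
r2_model1 <- 0.1426   # SP500 ~ UR
r2_model2 <- 0.3862   # SP500 ~ UR + PMI
df_resid2 <- 36       # residual degrees of freedom of Model 2
q         <- 1        # number of predictors added (PMI)

# Partial F statistic for the added predictor and its p-value.
f_partial <- ((r2_model2 - r2_model1) / q) / ((1 - r2_model2) / df_resid2)
p_value   <- pf(f_partial, q, df_resid2, lower.tail = FALSE)

# With a single added predictor, F equals the squared t value of that predictor.
t_pmi <- 3.780
print(c(F = f_partial, t_squared = t_pmi^2, p = p_value))
```

With the fitted objects available, `anova(res1, res2)` should report the same test computed from unrounded values.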
Model3:
res3<-lm(SP500~UR+CPI+PMI);
summary(res3);
##
## Call:
## lm(formula = SP500 ~ UR + CPI + PMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.635 -50.172 6.319 55.880 101.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3022.5755 198.3407 -15.239 < 2e-16 ***
## UR -119.2579 6.0264 -19.789 < 2e-16 ***
## CPI 19.1448 0.9337 20.505 < 2e-16 ***
## PMI 19.9767 2.0668 9.666 2.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.61 on 35 degrees of freedom
## Multiple R-squared: 0.9528, Adjusted R-squared: 0.9488
## F-statistic: 235.7 on 3 and 35 DF, p-value: < 2.2e-16
Finally, we put all three independent variables in the regression. As concluded in the entry-wise section, the model improves and explains the variation of the S&P 500 index better. The adjusted R-squared jumps from 35.21% to 94.88%, a very large improvement. CPI has a positive effect as well. CPI and PMI have almost the same per-unit effect on the index, as their coefficients (19.1448 and 19.9767) show. In conclusion, this model is better than Model 2, and we should choose it for prediction.
To build a sequential model, we add variables one at a time based on their correlation with the dependent variable. We add the most correlated independent variable first, then continue to add the other variables step by step. Finally, we choose the best model according to how much of the variation each one explains.
plot(data1, pch=21, cex=1, bg='ivory4', main="Scatterplot matrix of SP500, CPI, UR, and PMI");
cor(data1)
## SP500 CPI UR PMI
## SP500 1.0000000 0.5510486 -0.3776209 0.4521841
## CPI 0.5510486 1.0000000 0.4720091 0.2012348
## UR -0.3776209 0.4720091 1.0000000 0.1026232
## PMI 0.4521841 0.2012348 0.1026232 1.0000000
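One way to turn the correlation matrix above into an entry order is to rank the predictors by the absolute value of their correlation with SP500. The sketch below hard-codes the first row of the matrix; `entry_order` is an illustrative name.

```r
# Correlations of each predictor with SP500, copied from the matrix above.
r <- c(CPI = 0.5510486, UR = -0.3776209, PMI = 0.4521841)

# Rank predictors by absolute correlation, strongest first.
entry_order <- names(sort(abs(r), decreasing = TRUE))
print(entry_order)

# In a one-predictor regression, R-squared is the squared correlation.
print(round(r^2, 4))
```

The squared correlation for UR reproduces the multiple R-squared of 0.1426 reported earlier for the UR-only model, which is a useful consistency check on the table.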
The scatter plot shows that both PMI and CPI are positively correlated with the S&P 500, while UR is negatively correlated with it. The correlation matrix confirms this. Thus, ranking by absolute correlation with the S&P 500, we add the independent variables in this order: CPI (0.551) -> PMI (0.452) -> UR (-0.378). Here is the result for the relationship between CPI and S&P 500:
sres1<-lm(SP500~CPI);
summary(sres1);
##
## Call:
## lm(formula = SP500 ~ CPI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -476.79 -189.33 48.73 187.42 424.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1299.394 662.219 -1.962 0.057296 .
## CPI 12.166 3.029 4.017 0.000278 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 237.7 on 37 degrees of freedom
## Multiple R-squared: 0.3037, Adjusted R-squared: 0.2848
## F-statistic: 16.13 on 1 and 37 DF, p-value: 0.0002775
The adjusted R-squared is 28.48%. Because of the low p-value, we reject the hypothesis that the S&P 500 index is unrelated to CPI. A one-unit increase in CPI is associated with a 12.166-point increase in the S&P 500 index.
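The t value and p-value reported for CPI can be reproduced from the estimate and its standard error. This is a minimal sketch using the rounded figures from the summary above, so the last digits may differ slightly from R's output.

```r
estimate  <- 12.166   # CPI slope from the summary above
std_error <- 3.029    # its standard error
df_resid  <- 37       # residual degrees of freedom

# t statistic and two-sided p-value for the null hypothesis slope = 0.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(t = t_value, p = p_value))
```

A p-value this far below 0.05 is evidence against a zero slope, that is, against the hypothesis of no linear relationship between CPI and the index.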
Then, PMI is added.
sres2<-lm(SP500~PMI+CPI);
summary(sres2);
##
## Call:
## lm(formula = SP500 ~ PMI + CPI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -345.11 -199.60 12.21 148.45 365.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1981.763 658.329 -3.010 0.004747 **
## PMI 19.615 7.115 2.757 0.009101 **
## CPI 10.586 2.848 3.716 0.000684 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219 on 36 degrees of freedom
## Multiple R-squared: 0.4251, Adjusted R-squared: 0.3931
## F-statistic: 13.31 on 2 and 36 DF, p-value: 4.713e-05
library(scatterplot3d)
md <- scatterplot3d(PMI,CPI,SP500,pch = 21, main = "Regression plane",bg = 'blue',xlab = "PMI", ylab = "CPI", zlab = "SP500",axis = TRUE);
md$plane3d(sres2)
The adjusted R-squared increases to 39.31% in this multiple regression model. Again, we reject the hypothesis that the index is unrelated to CPI and PMI, because both coefficients are significant at the 1% level. Thus, this multiple regression model is preferred.
Finally, all three variables are used.
summary(res3)
##
## Call:
## lm(formula = SP500 ~ UR + CPI + PMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.635 -50.172 6.319 55.880 101.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3022.5755 198.3407 -15.239 < 2e-16 ***
## UR -119.2579 6.0264 -19.789 < 2e-16 ***
## CPI 19.1448 0.9337 20.505 < 2e-16 ***
## PMI 19.9767 2.0668 9.666 2.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.61 on 35 degrees of freedom
## Multiple R-squared: 0.9528, Adjusted R-squared: 0.9488
## F-statistic: 235.7 on 3 and 35 DF, p-value: < 2.2e-16
This is the same as Model 3 in the hierarchical section. All coefficients on the independent variables are significant and the adjusted R-squared increases substantially, so the final model is:
$$\widehat{SP500} = -3022.5755 - 119.2579\,UR + 19.1448\,CPI + 19.9767\,PMI$$
Holding the other variables constant, a one-unit increase in PMI is associated with a 19.9767-point increase in the S&P 500, a one-unit increase in CPI with a 19.1448-point increase, and a one-unit increase in UR with a 119.2579-point decrease.
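Using the coefficients reported for the three-variable model, a point prediction can be sketched as follows. `predict_sp500` is a hypothetical helper written for illustration. It uses the rounded coefficients from the summary output, so results differ slightly from what the fitted object would give.

```r
# Coefficients copied from summary(res3).
b <- c(intercept = -3022.5755, UR = -119.2579, CPI = 19.1448, PMI = 19.9767)

# Hypothetical helper: point prediction of the S&P 500 from the fitted equation.
predict_sp500 <- function(UR, CPI, PMI) {
  b[["intercept"]] + b[["UR"]] * UR + b[["CPI"]] * CPI + b[["PMI"]] * PMI
}

# First quarter shown in head(data): 2005-04-01, observed SP500 = 1181.97.
fitted_q1 <- predict_sp500(UR = 5.169840, CPI = 193.667, PMI = 51.8)
resid_q1  <- 1181.97 - fitted_q1
print(c(fitted = fitted_q1, residual = resid_q1))
```

With the fitted object in memory, `predict(res3, newdata = data.frame(UR = 5.169840, CPI = 193.667, PMI = 51.8))` should give the same prediction at full precision.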
layout(matrix(c(2,1), 2, 1, byrow = TRUE),widths=c(3,1), heights=c(3,3))
model <- resid(res3)
plot(fitted(res3), model, pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0,lwd=2,col="red")
hist(model, main="Model Residual Histogram",xlab = "Residuals")
boxplot(model, main="Boxplot of Residuals");
From the box plot, we observe that the residuals are slightly skewed. Because the sample is small, the histogram is not conclusive. A follow-up project could formally assess the normality assumption of the residuals.
The variation in the S&P 500 can be explained by UR, CPI and PMI: together, the three variables explain 94.88% of the variation in the index (adjusted R-squared). In further study, we will gather more data and assess the assumptions of the model.
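One possible way to assess residual normality is a Shapiro-Wilk test. The sketch below is only a stand-in: it fits a CPI-only model to the six quarters printed by `head(data)`, because the full data file is not reproduced here, so its result says nothing about the final model.

```r
# First six quarters, copied from the head(data) output above.
d <- data.frame(
  SP500 = c(1181.97, 1224.17, 1230.47, 1283.66, 1280.80, 1288.34),
  CPI   = c(193.667, 196.600, 198.433, 199.467, 201.267, 203.167)
)

# Stand-in model; with the full data this would be the three-variable model.
fit <- lm(SP500 ~ CPI, data = d)
r   <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(r)
print(sw$p.value)
```

A small p-value would indicate a departure from normality. With the full model, `shapiro.test(resid(res3))` together with `qqnorm()` and `qqline()` on the same residuals would give both a formal and a visual check.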