Looking at the state of the world today, we are seeing multiple major events unfolding at once. One of the more important questions in times like these is how those events will affect our surroundings. A key part of this is that we need stability in our lives, and that stability depends on how strong the economy is. A good way to measure economic strength is to follow the stock market through indices such as the S&P 500, the Dow Jones, or the Nasdaq.
setwd("/Users/myron/Documents/R/CSV Files")
library(readr)
library(ggplot2)
GDP <- read_csv("GDP.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## `Quarter 1` = col_double(),
## `Quarter 2` = col_double(),
## `Quarter 3` = col_double(),
## `Quarter 4` = col_double()
## )
# Keep only the four quarterly columns (drop the unnamed label column)
gdp1 <- GDP[, 2:5]
View(gdp1)
head(gdp1)
## # A tibble: 6 x 4
## `Quarter 1` `Quarter 2` `Quarter 3` `Quarter 4`
## <dbl> <dbl> <dbl> <dbl>
## 1 0.035 0.045 0.04 0.042
## 2 0.023 0.029 0.033 0.031
## 3 0.05 0.03 0.03 0.04
## 4 0.04 0.04 0.05 0.06
## 5 0.09 0.08 0.09 0.09
## 6 0.1 0.01 0.05 0.07
One of the ways we can measure the stock market is through the multitude of variables that contribute to its growth or decline. Many variables affect the peaks and valleys of the market, and some are more important than others. With all these variables, we need to make sure we keep the ones that contribute most directly to the measurement of the market and throw away the rest as noise. One way to do this is to use Principal Component Analysis (PCA) on all four quarters of last year and the variables within those quarters, and use the result to project where prices will be by the end of the fourth quarter of this year.

When we plot the variables from the PCA, the biggest variation from last year appears between the first and second quarters. The gap is so wide because, at the beginning of last year, there were major issues weighing on gross domestic product. For example, between January and June of last year the US was embroiled in a trade war with China, the second-largest economy in the world. The United States felt it needed to change its trade deals not just with China but with the world, so the administration decided to impose tariffs on all countries it felt had unfair deals, and China was one of the biggest players in the administration's eyes. Once the tariffs were implemented, revenue coming into the United States increased, but China retaliated by targeting American farmers in the Midwest, which ultimately affected our meat and dairy exports. This explains the dramatic drop across the first and second quarters; the volatility is directly related to these issues. After the first two quarters, however, the variation narrows in quarters 3 and 4, mainly because the market calmed on good news that the trade-war conditions were being met and that a disastrous political upheaval in Washington, DC had been averted.
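The code that produced the component scores below is not shown in the report, so here is a minimal sketch of how such a PCA could be run on the four quarterly columns; the use of prcomp() with centering and scaling, and the biplot for viewing the variation between quarters, are assumptions rather than the exact calls used.

# Minimal PCA sketch (assumed approach; the report's actual code is hidden).
# prcomp() centers and scales the four quarterly columns before extracting
# the principal components. Any missing values would need handling first.
gdp_pca <- prcomp(gdp1, center = TRUE, scale. = TRUE)

# Component scores for each observation (the PC1-PC4 matrix shown below)
gdp_pca$x

# Proportion of variance captured by each component
summary(gdp_pca)

# Biplot of observations and quarterly loadings, used to see where the
# variation between quarters is largest
biplot(gdp_pca)

If the first two components capture most of the variance, the quarter-to-quarter spread discussed above can be read directly off the biplot; the component scores printed in the report follow.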
## PC1 PC2 PC3 PC4
## [1,] -1.1603979 0.1563365 0.14865800 -0.003876989
## [2,] -1.2362580 0.1632702 0.13484129 -0.003793614
## [3,] -1.1871320 0.1099155 0.09517416 -0.011649057
## [4,] -1.1099983 0.1877939 0.11455703 -0.038742701
## [5,] -0.8443119 0.2061732 0.09957001 0.006993710
## [6,] -1.0579925 0.1094893 -0.05900596 -0.037355933
## [7,] -0.8833676 0.1446648 0.11535774 -0.025527983
## [8,] -0.9734705 0.1117749 0.03642073 0.004852643
## [9,] -0.3795095 -0.8119833 -0.72702943 0.464627088
## [10,] 1.8115809 -0.2211746 -0.53350158 -0.647387098
## [11,] 2.6629530 -1.1930698 0.63524176 0.023740721
## [12,] 4.3579045 1.0368093 -0.06028374 0.268119214
## [1] NA
## [1] NA
## [1] NA
## [1] NA NA
## [1] 0.70 0.29
## [1] 0.495
## [1] 0.495
## [1] 0.2899138
## [1] 0.70 NA 0.29 NA
## [1] 0.700 0.495 0.290 0.495
## [1] 0.700 0.495 0.290 0.495
## [1] 0.7000000 0.2899138 0.2900000 0.2899138
Another way to approach prediction is with the random forest method. Once we have cleaned the data set and found the mean, median, and standard deviation, we can run a random forest algorithm, which tells us which attributes in the data set are the most predictive. Before this method could be used, however, I had to clean the data to make it as accurate as possible; I used a missing value ratio approach to compute an average value to fill in the missing entries. Once the data was cleaned, the random forest method could be applied.

Like the PCA, the random forest shows major volatility driven by the trade war and the impending impeachment: hesitation around US companies, imports, and tariffs, along with the disruption to our political system, causes major swings. We can also see, however, that after the first couple of volatile quarters the stock market levels off fairly steadily, almost unchanging. If we want to use this as a predictive model, we must be careful about its accuracy: projecting prices for next year looks wildly inaccurate, with a prediction rate of only 25 percent. We also know that once a trade deal is reached and tariffs are removed on both sides, the stock market will shoot up and people's confidence in the market will be strong, yet those expectations are not reflected in this method. I would therefore view it as a less accurate means of prediction than the Principal Component Analysis method.
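Neither the imputation nor the random forest call is shown in the report; the sketch below illustrates one plausible version of that workflow. The column names without spaces (matching the parsing output further down), the mean imputation, and the choice of Quarter1 as the response are assumptions, and the report itself evidently frames this as a classification problem, since it quotes a 25 percent prediction rate.

library(randomForest)

# Sketch of the cleaning and random-forest step (assumed approach).
# Start from the quarterly data and rename the columns without spaces.
gdp2 <- as.data.frame(gdp1)
names(gdp2) <- c("Quarter1", "Quarter2", "Quarter3", "Quarter4")

# Replace any missing quarterly values with the column mean (simple imputation)
for (col in names(gdp2)) {
  gdp2[[col]][is.na(gdp2[[col]])] <- mean(gdp2[[col]], na.rm = TRUE)
}

# Fit a random forest predicting Quarter 1 from the other quarters
set.seed(123)
rf_fit <- randomForest(Quarter1 ~ ., data = gdp2, importance = TRUE)

# Which quarters contribute most to the prediction
importance(rf_fit)
varImpPlot(rf_fit)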
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## Quarter1 = col_double(),
## Quarter2 = col_double(),
## Quarter3 = col_double(),
## Quarter4 = col_double()
## )
## Classes 'tbl_df', 'tbl' and 'data.frame': 12 obs. of 4 variables:
## $ Quarter1: num 0.035 0.023 0.05 0.04 0.09 0.1 0.09 0.1 0.6 0.7 ...
## $ Quarter2: num 0.045 0.029 0.03 0.04 0.08 0.01 0.08 0.05 0.045 0.495 ...
## $ Quarter3: num 0.04 0.033 0.03 0.05 0.09 0.05 0.07 0.06 0.04 0.29 ...
## $ Quarter4: num 0.042 0.031 0.04 0.06 0.09 0.07 0.09 0.07 0.042 0.495 ...
##
## 0.197375566666666 0.2564763 0.513985266666664
## 0.035 0 1 0
## 0.09 1 0 0
## 0.875 0 0 1
## 0.9 0 0 1
## [1] 0.25
Stepwise regression is the last method we can use to analyze the stock market. What we see is that quarter 2 still shows volatility on a plane similar to our principal component analysis: in the selection output, quarter 2 earns three stars, while quarters 3 and 4 slow down and even out with one and two stars respectively. This could serve as a second primary method, but we must keep in mind that stepwise regression is very useful for small sample sizes and somewhat less accurate for big-picture scenarios. Because we are examining huge amounts of data regarding the US stock market, we need predictions that go beyond a sample-sized view. Out of the multitude of algorithms used here, this tells us we would want to go with the PCA for the most accurate prediction.
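The Call lines in the output below confirm that the selection was run with regsubsets() from the leaps package and that the final model was fit with lm(), but the Training split itself is hidden, so the createDataPartition() step and its proportion in this sketch are assumptions. It reuses the cleaned data frame gdp2 from the random-forest sketch above.

library(caret)
library(leaps)

# Assumed train/test split (the report's actual split is not shown)
set.seed(123)
train_idx <- createDataPartition(gdp2$Quarter1, p = 0.6, list = FALSE)
Training  <- gdp2[train_idx, ]

# Sequential-replacement ("seqrep") subset selection over the other quarters
models <- regsubsets(Quarter1 ~ ., data = Training, nvmax = 5, method = "seqrep")
summary(models)

# Fit the two-variable model highlighted by the selection output
fit <- lm(Quarter1 ~ Quarter2 + Quarter4, data = Training)
summary(fit)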
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v stringr 1.4.0
## v tidyr 1.0.2 v forcats 0.5.0
## v purrr 0.3.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks randomForest::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x randomForest::margin() masks ggplot2::margin()
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## Warning: package 'leaps' was built under R version 3.6.3
## Warning: package 'MASS' was built under R version 3.6.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## Quarter1 = col_double(),
## Quarter2 = col_double(),
## Quarter3 = col_double(),
## Quarter4 = col_double()
## )
## Subset selection object
## Call: regsubsets.formula(Quarter1 ~ ., data = Training, nvmax = 5,
## method = "seqrep")
## 3 Variables (and intercept)
## Forced in Forced out
## Quarter2 FALSE FALSE
## Quarter3 FALSE FALSE
## Quarter4 FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: 'sequential replacement'
## Quarter2 Quarter3 Quarter4
## 1 ( 1 ) "*" " " " "
## 2 ( 1 ) "*" " " "*"
## 3 ( 1 ) "*" "*" "*"
##
## Call:
## lm(formula = Quarter1 ~ Quarter2 + Quarter4, data = Training)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.22552 0.01791 -0.10541 0.06760 0.26895 -0.05521 0.03168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.245 0.112 2.187 0.094 .
## Quarter2 14.239 7.781 1.830 0.141
## Quarter4 -13.209 7.737 -1.707 0.163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1892 on 4 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.7293
## F-statistic: 9.082 on 2 and 4 DF, p-value: 0.03257