By Akul Mahajan
The main objective of this project is to explore the New York Stock Exchange dataset from Kaggle and analyse the stock prices of Apple.
Investing in stocks is one of the best ways to grow your money. The difficulty is that no stock is good all the time: investing at a bad time can be disastrous, while investing in a low-value stock at the right time can bring you huge profits.
To determine whether a stock is good or bad, we need to look at the patterns that show how a good stock behaves. Here we will analyse the patterns in the stock price of Apple.
Solutions Overview: After the cleaning and exploratory analysis, we will draw a conclusion about the trends of a good stock based on the analysis.
library(readr)
library(dplyr)
library(DT)
library(knitr)
The data has been taken from the New York Stock Exchange dataset on Kaggle. The dataset contains 7 variables and 851264 observations. It holds the daily opening and closing prices of stocks of companies listed on the NYSE from January 2010 to December 2016.
directory <- "/Users/akul/Desktop/Kaggle -NYSE/prices.csv"
fundamentals_dir <- "/Users/akul/Desktop/Kaggle -NYSE/fundamentals.csv"
nyse_data <- read_csv(directory)
fundamentals_data <- read_csv(fundamentals_dir)
The imported dataset has 7 variables and 851264 observations.
dim(nyse_data)
## [1] 851264 7
Names of variables in the dataset.
names(nyse_data)
## [1] "date" "symbol" "open" "close" "low" "high" "volume"
Overview of the imported dataset.
glimpse(nyse_data)
## Observations: 851,264
## Variables: 7
## $ date <dttm> 2016-01-05, 2016-01-06, 2016-01-07, 2016-01-08, 2016-0...
## $ symbol <chr> "WLTW", "WLTW", "WLTW", "WLTW", "WLTW", "WLTW", "WLTW",...
## $ open <dbl> 123.43, 125.24, 116.38, 115.48, 117.01, 115.51, 116.46,...
## $ close <dbl> 125.84, 119.98, 114.95, 116.62, 114.97, 115.55, 112.85,...
## $ low <dbl> 122.31, 119.94, 114.93, 113.50, 114.09, 114.50, 112.59,...
## $ high <dbl> 126.25, 125.54, 119.74, 117.44, 117.33, 116.06, 117.07,...
## $ volume <dbl> 2163600, 2386400, 2489500, 2006300, 1408600, 1098000, 9...
The first part of the data cleaning process is checking whether any variable names appear in the rows. The overview of the dataset shows that the data looks fairly clean. The next step is checking the data for missing values.
sum(is.na(nyse_data))
Here, we find there are no missing values in the dataset.
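The single total above says whether anything is missing, but not where. As a minimal sketch, assuming a small made-up data frame (toy_prices is a hypothetical stand-in, not the real nyse_data), a per-column count can be obtained like this:

```r
# Toy stand-in for nyse_data; the values are made up for illustration only
toy_prices <- data.frame(
  symbol = c("AAPL", "AAPL", "MSFT"),
  open   = c(100.0, NA, 30.5),
  close  = c(101.2, 102.0, NA)
)

# Total number of missing cells, the same check as used above
total_missing <- sum(is.na(toy_prices))

# Missing cells per column, which shows where any gaps are located
missing_by_column <- colSums(is.na(toy_prices))

print(total_missing)
print(missing_by_column)
```

On the real data, a per-column count of all zeros would agree with the finding that there are no missing values.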
Now we move on to selecting the variables of interest. Since our analysis is based on the stock price of ‘Apple Inc’, we create a new dataset containing only this information. In addition, a new column named fluctuation is introduced, which gives the rise or drop in the price of the stock for that day (close minus open).
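# Add the daily close-minus-open change, keep only Apple rows, and sort by date
nyse_apple <- nyse_data %>% mutate(fluctuation = close - open) %>%
  filter(symbol == "AAPL") %>% arrange(date)
# Rename the second column so that the ticker column matches nyse_data
names(fundamentals_data)[2] <- "symbol"
fundamentals_data_apple <- filter(fundamentals_data, symbol == "AAPL")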
dim(nyse_apple)
## [1] 1762 8
dim(fundamentals_data_apple)
## [1] 4 79
The new table created for the stock prices of Apple has 1762 observations and 8 variables.
A new table, fundamentals_data_apple, is also created. It has 4 observations and 79 variables and contains the features on the basis of which the stock prices are evaluated.
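Renaming a column by position works only as long as the column order of the file never changes. A minimal sketch of renaming by name instead is shown here on a toy data frame. The column name "Ticker Symbol" and the table toy_fundamentals are assumptions made for illustration, so the actual header in fundamentals.csv should be checked first:

```r
library(dplyr)

# Toy stand-in for fundamentals_data; column names and values are assumed
toy_fundamentals <- data.frame(
  id = 1:3,
  `Ticker Symbol` = c("AAPL", "AAPL", "MSFT"),
  `Total Revenue` = c(10, 12, 8),
  check.names = FALSE
)

# Rename by name rather than by position, then keep only the Apple rows
toy_apple <- toy_fundamentals %>%
  rename(symbol = `Ticker Symbol`) %>%
  filter(symbol == "AAPL")

print(toy_apple)
```

If the assumed column is absent, rename() stops with an error, which is easier to notice than a silently mislabelled column.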
Next, we compute summary statistics of the data, which let us check whether any abnormal values are present.
kable(summary(nyse_apple))
| date | symbol | open | close | low | high | volume | fluctuation |
|---|---|---|---|---|---|---|---|
| Min. :2010-01-04 00:00:00 | Length:1762 | Min. : 90.0 | Min. : 90.28 | Min. : 89.47 | Min. : 90.7 | Min. : 11475900 | Min. :-30.11999 |
| 1st Qu.:2011-09-30 18:00:00 | Class :character | 1st Qu.:115.2 | 1st Qu.:115.19 | 1st Qu.:114.00 | 1st Qu.:116.4 | 1st Qu.: 49174775 | 1st Qu.: -1.97000 |
| Median :2013-07-04 00:00:00 | Mode :character | Median :318.2 | Median :318.24 | Median :316.55 | Median :320.6 | Median : 80503850 | Median : 0.04499 |
| Mean :2013-07-02 22:20:17 | NA | Mean :313.1 | Mean :312.93 | Mean :309.83 | Mean :315.9 | Mean : 94225776 | Mean : -0.14925 |
| 3rd Qu.:2015-04-05 00:00:00 | NA | 3rd Qu.:470.9 | 3rd Qu.:472.59 | 3rd Qu.:467.97 | 3rd Qu.:478.1 | 3rd Qu.:121081625 | 3rd Qu.: 1.70001 |
| Max. :2016-12-30 00:00:00 | NA | Max. :702.4 | Max. :702.10 | Max. :699.57 | Max. :705.1 | Max. :470249500 | Max. : 30.76001 |
The summary table shows that there are no negative values in the open, close, high, low, or volume columns. This rules out the most obvious abnormal values, although some outliers may remain.
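One common way to look for such outliers is the 1.5 × IQR rule. The sketch below applies it to a made-up vector of daily fluctuations. The values and the 1.5 multiplier are illustrative assumptions, not results from the Apple data:

```r
# Made-up daily fluctuations; two values are deliberately extreme
fluctuation <- c(-1.2, 0.4, 0.1, -0.3, 1.1, 0.7, -0.8, 30.8, -30.1, 0.2)

# First and third quartiles, and the interquartile range
quartiles <- unname(quantile(fluctuation, probs = c(0.25, 0.75)))
iqr_width <- quartiles[2] - quartiles[1]

# Fences: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence <- quartiles[1] - 1.5 * iqr_width
upper_fence <- quartiles[2] + 1.5 * iqr_width

is_outlier <- fluctuation < lower_fence | fluctuation > upper_fence
print(fluctuation[is_outlier])
```

A flagged value is not necessarily an error. For stock prices, a large move can be a genuine market event, so flagged rows are best inspected rather than dropped automatically.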
datatable(head(nyse_apple,50))
To identify trends in the stock, we need to treat the data as a time series. We can then study how those trends relate to the factors in the fundamentals_data_apple dataset, which contains more than 70 attributes that can influence the price of the stock. A correlation study between the predictors and the response variable is therefore crucial for eliminating unnecessary variables. For the modelling process, the data will be split in an 80:20 ratio into training and test sets.
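For time-ordered data, one reasonable way to make the 80:20 split is chronologically, so that the test set contains only dates later than the training set. The sketch below assumes this chronological approach, which is a choice made here for illustration, and uses a small made-up table (toy_series) in place of nyse_apple:

```r
# Toy daily series; dates and prices are made up for illustration
toy_series <- data.frame(
  date  = as.Date("2016-01-04") + 0:9,
  close = c(105.3, 102.7, 100.7, 96.4, 96.9, 98.5, 99.9, 97.4, 99.5, 97.1)
)

# Make sure the rows are in date order before splitting
toy_series <- toy_series[order(toy_series$date), ]

# The first 80% of rows form the training set, the remaining 20% the test set
n_train   <- floor(0.8 * nrow(toy_series))
train_set <- toy_series[seq_len(n_train), ]
test_set  <- toy_series[(n_train + 1):nrow(toy_series), ]

print(nrow(train_set))
print(nrow(test_set))
```

A random split would mix future observations into the training data, which can make a forecasting model look better than it is.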
General plots of the predictors against the response variable, together with correlation statistics, will provide deeper insight into the parameter selection process, which is not yet fully defined. In the end, I plan to employ a neural network to extend the analysis and predict stock prices.
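As a rough sketch of what a correlation-based screening step could look like, the example below ranks candidate predictors by the absolute value of their correlation with the response. The table toy_features, its column names, and its values are all hypothetical, and this is only one possible approach to parameter selection:

```r
# Toy table: close is the response, the other columns are candidate predictors
toy_features <- data.frame(
  close     = c(10, 12, 14, 16, 18, 20),
  revenue   = c(1.0, 1.3, 1.4, 1.7, 1.8, 2.1),
  headcount = c(5, 3, 6, 2, 7, 4)
)

# Pearson correlation of every predictor with the response
predictors   <- setdiff(names(toy_features), "close")
correlations <- sapply(
  toy_features[predictors],
  function(x) cor(x, toy_features$close)
)

# Rank predictors by the strength of the correlation, strongest first
ranked <- sort(abs(correlations), decreasing = TRUE)
print(ranked)
```

With only a handful of rows per company in the fundamentals table, such correlations would be very noisy, so they should be read as a first filter rather than as firm evidence.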