Introduction

The main objective of this project is to explore the New York Stock Exchange dataset from Kaggle and analyse the stock prices of Apple.

Investing in stocks can be a good way to grow money, but no stock performs well all the time. Investing at a bad time can be disastrous, while investing in a low-value stock at the right time can yield large profits.

To determine whether a stock is good or bad, we need to identify the patterns that characterise how a good stock behaves. Here we analyse the patterns in the stock price of Apple.

Solutions Overview: After cleaning and exploratory analysis, we will draw conclusions about the trends of a good stock based on the analysis.

Packages Required:

dplyr: used for data manipulation
DT: used to display data on screen
readr: used to provide a fast and friendly way to read data
knitr: used to render the summary table with kable

library(readr)
library(dplyr)
library(DT)
library(knitr)

Data Preparation

Data Import

The data is taken from the New York Stock Exchange dataset on Kaggle. The prices dataset contains 7 variables and 851,264 observations. It records the daily opening and closing prices, along with the low, high, and volume, for the stocks of companies listed on the NYSE over the period from 2010 to 2016.

directory <- "/Users/akul/Desktop/Kaggle -NYSE/prices.csv"
fundamentals_dir <- "/Users/akul/Desktop/Kaggle -NYSE/fundamentals.csv"
nyse_data <- read_csv(directory)
fundamentals_data <- read_csv(fundamentals_dir)

The imported dataset has 7 variables and 851,264 observations.

dim(nyse_data)

## [1] 851264      7

Names of the variables in the dataset.

names(nyse_data)

## [1] "date"   "symbol" "open"   "close"  "low"    "high"   "volume"

Overview of the imported dataset.

glimpse(nyse_data)

## Observations: 851,264
## Variables: 7
## $ date   <dttm> 2016-01-05, 2016-01-06, 2016-01-07, 2016-01-08, 2016-0...
## $ symbol <chr> "WLTW", "WLTW", "WLTW", "WLTW", "WLTW", "WLTW", "WLTW",...
## $ open   <dbl> 123.43, 125.24, 116.38, 115.48, 117.01, 115.51, 116.46,...
## $ close  <dbl> 125.84, 119.98, 114.95, 116.62, 114.97, 115.55, 112.85,...
## $ low    <dbl> 122.31, 119.94, 114.93, 113.50, 114.09, 114.50, 112.59,...
## $ high   <dbl> 126.25, 125.54, 119.74, 117.44, 117.33, 116.06, 117.07,...
## $ volume <dbl> 2163600, 2386400, 2489500, 2006300, 1408600, 1098000, 9...

Data Cleaning

The first part of the data cleaning process involves checking whether any variable names appear in the rows. From the overview of the dataset, we can see that the data looks fairly clean. The next step is to check the data for missing values.

sum(is.na(nyse_data))

Here, we find there are no missing values in the dataset.
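A single total does not show where missing values sit when they do exist. As a minimal sketch, assuming a small made-up data frame (toy_prices is hypothetical and stands in for nyse_data), the same check can be broken down by column with base R:

```r
# Toy data frame standing in for nyse_data (values are made up for illustration)
toy_prices <- data.frame(
  symbol = c("AAPL", "AAPL", "MSFT", "MSFT"),
  open   = c(100.5, NA, 30.2, 30.8),
  close  = c(101.0, 102.3, NA, 31.1),
  stringsAsFactors = FALSE
)

# Total number of missing cells, the same check used on nyse_data
total_missing <- sum(is.na(toy_prices))

# Number of missing cells in each column
missing_by_column <- colSums(is.na(toy_prices))

print(total_missing)
print(missing_by_column)
```

Running colSums(is.na(nyse_data)) on the real data would give one count per variable, which is useful if the total is ever non-zero.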

Now we move on to selecting the variables of interest. Since our analysis is based on the stock price of ‘Apple Inc’, we create a new dataset containing only this information. In addition, a new column named fluctuation is introduced, which gives the rise or drop in the price of the stock for that day (close minus open).

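# Keep only Apple rows, add the daily open-to-close change, and sort by date
nyse_apple <- nyse_data %>%
  filter(symbol == "AAPL") %>%
  mutate(fluctuation = close - open) %>%
  arrange(date)
# Rename the second column to "symbol" so it matches nyse_data
names(fundamentals_data)[2] <- "symbol"
fundamentals_data_apple <- filter(fundamentals_data, symbol == "AAPL")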

dim(nyse_apple)

## [1] 1762    8

dim(fundamentals_data_apple)

## [1]  4 79

The new table created for the stock prices of Apple has 1762 observations and 8 variables.

A second table, fundamentals_data_apple, is also created. It has 4 observations and 79 variables, and contains the features on the basis of which the stock prices are evaluated.

Now we take the summary statistics of the data, which let us check whether any abnormal values are present.

kable(summary(nyse_apple))

date	symbol	open	close	low	high	volume	fluctuation
Min. :2010-01-04 00:00:00	Length:1762	Min. : 90.0	Min. : 90.28	Min. : 89.47	Min. : 90.7	Min. : 11475900	Min. :-30.11999
1st Qu.:2011-09-30 18:00:00	Class :character	1st Qu.:115.2	1st Qu.:115.19	1st Qu.:114.00	1st Qu.:116.4	1st Qu.: 49174775	1st Qu.: -1.97000
Median :2013-07-04 00:00:00	Mode :character	Median :318.2	Median :318.24	Median :316.55	Median :320.6	Median : 80503850	Median : 0.04499
Mean :2013-07-02 22:20:17	NA	Mean :313.1	Mean :312.93	Mean :309.83	Mean :315.9	Mean : 94225776	Mean : -0.14925
3rd Qu.:2015-04-05 00:00:00	NA	3rd Qu.:470.9	3rd Qu.:472.59	3rd Qu.:467.97	3rd Qu.:478.1	3rd Qu.:121081625	3rd Qu.: 1.70001
Max. :2016-12-30 00:00:00	NA	Max. :702.4	Max. :702.10	Max. :699.57	Max. :705.1	Max. :470249500	Max. : 30.76001

From the summary table it is clear that there are no negative values in the open, close, low, high, and volume columns. This rules out obviously invalid values, although some outliers may still be present.
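One common way to look for such outliers is the 1.5 × IQR rule. This is an illustrative technique, not a method used in the analysis above. The sketch below applies it to a hypothetical vector of daily fluctuations. The values are made up and are not taken from nyse_apple.

```r
# Hypothetical daily fluctuations (close - open); not taken from the dataset
fluctuation <- c(-1.2, 0.4, 0.9, -0.7, 1.1, 0.2, -0.5, 12.0, -0.3, 0.6)

# First and third quartiles and the interquartile range
q1  <- quantile(fluctuation, 0.25, names = FALSE)
q3  <- quantile(fluctuation, 0.75, names = FALSE)
iqr <- q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
is_outlier <- fluctuation < lower | fluctuation > upper

print(fluctuation[is_outlier])
```

The same steps could be applied to nyse_apple$fluctuation. Whether a flagged day is a data error or a genuine large move would still need to be judged separately.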

Data Preview

datatable(head(nyse_apple,50))

Proposed Exploratory Data Analysis

Because we wish to identify trends in the stock, the data must first be converted into time-series form. We will then study how those trends relate to the factors in the fundamentals_data_apple dataset, which contains more than 70 attributes that may influence the stock price. A correlation study between the predictors and the response variable is therefore crucial for eliminating unnecessary variables. For the modelling process, the data will be split in an 80:20 ratio into training and test sets.
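As a rough sketch of the time-series conversion and the 80:20 split, the example below uses a small made-up data frame (toy_apple is hypothetical). It assumes the split is chronological, so the model is tested on dates later than those it was trained on. The frequency of 252 trading days per year is also an assumption, not something fixed by the dataset.

```r
# Hypothetical closing prices on ten trading days (made-up values)
toy_apple <- data.frame(
  date  = as.Date("2016-01-04") + c(0, 1, 2, 3, 4, 7, 8, 9, 10, 11),
  close = c(105.3, 102.7, 100.7, 96.4, 96.9, 98.5, 99.9, 97.4, 99.5, 97.1)
)

# Make sure rows are in chronological order before splitting
toy_apple <- toy_apple[order(toy_apple$date), ]

# Chronological 80:20 split: the earliest 80% of rows train, the rest test
n_train   <- floor(0.8 * nrow(toy_apple))
train_set <- toy_apple[seq_len(n_train), ]
test_set  <- toy_apple[(n_train + 1):nrow(toy_apple), ]

# Base-R ts object for the training closes, assuming about 252 trading days a year
close_ts <- ts(train_set$close, frequency = 252)

print(nrow(train_set))
print(nrow(test_set))
```

A random split would mix future and past observations, which is why a chronological cut is assumed here. The final choice of split remains open.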

General plots between the predictor and response variables, together with correlation statistics, will provide deeper insight into the parameter selection process, which is not yet fully defined. In the end, I plan to employ a neural network to extend the analysis and predict stock prices.
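One possible form of the correlation screening is sketched below. The data frame toy, its column names, and the 0.5 cut-off are all hypothetical, chosen only to show the mechanics. The actual selection criteria have not been decided.

```r
# Made-up response (price) and candidate predictors, for illustration only
toy <- data.frame(
  price   = c(10, 12, 14, 16, 18, 20),
  revenue = c(1.0, 1.3, 1.4, 1.7, 1.9, 2.0),
  noise   = c(3, 1, 3, 1, 3, 1),
  debt    = c(9, 8, 8, 6, 5, 5)
)

# Pearson correlation of each predictor with the response
predictors   <- setdiff(names(toy), "price")
correlations <- sapply(toy[predictors], function(x) cor(x, toy$price))

# Keep predictors whose absolute correlation meets the hypothetical threshold
threshold <- 0.5
selected  <- names(correlations)[abs(correlations) >= threshold]

print(round(correlations, 3))
print(selected)
```

With only 4 rows in fundamentals_data_apple, correlations computed on the real data would be based on very few points, so any screening of this kind should be read with caution.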

DATA WRANGLING PROJECT