1 Explaination

This project is purposed to give investors some stock insights about S&P Stock Index.

2 Dataset Overview

The dataset that I analyze is stock index S&P 500 from 2013 to 2018. It contains 505 stocks in 6 columns:

  1. date -> stock transaction date
  2. open -> opening stock price
  3. high -> highest stock price
  4. low close -> lowest stock price
  5. volume -> volume transaction
  6. name -> stock name

source: kaggle

##         date  open  high   low close   volume Name
## 1 2013-02-08 15.07 15.12 14.63 14.75  8407500  AAL
## 2 2013-02-11 14.89 15.01 14.26 14.46  8882000  AAL
## 3 2013-02-12 14.45 14.51 14.10 14.27  8126000  AAL
## 4 2013-02-13 14.30 14.94 14.25 14.66 10259500  AAL
## 5 2013-02-14 14.94 14.96 13.16 13.99 31879900  AAL
## 6 2013-02-15 13.93 14.61 13.93 14.50 15628000  AAL

I will create a copy of imported dataframe, so If I did something wrong I can revert the dataframe back without re-import dataframe again.

stock = data.frame(stock_old)

3 EDA

3.1 check data type

## 'data.frame':    619040 obs. of  7 variables:
##  $ date  : chr  "2013-02-08" "2013-02-11" "2013-02-12" "2013-02-13" ...
##  $ open  : num  15.1 14.9 14.4 14.3 14.9 ...
##  $ high  : num  15.1 15 14.5 14.9 15 ...
##  $ low   : num  14.6 14.3 14.1 14.2 13.2 ...
##  $ close : num  14.8 14.5 14.3 14.7 14 ...
##  $ volume: int  8407500 8882000 8126000 10259500 31879900 15628000 11354400 14725200 11922100 6071400 ...
##  $ Name  : chr  "AAL" "AAL" "AAL" "AAL" ...

as we can see there are, 2 columns with inappropriate data type: date and name.

so I will change data type of column date to date and column name to factor

3.2 change data type

## 'data.frame':    619040 obs. of  7 variables:
##  $ date  : Date, format: "2013-02-08" "2013-02-11" ...
##  $ open  : num  15.1 14.9 14.4 14.3 14.9 ...
##  $ high  : num  15.1 15 14.5 14.9 15 ...
##  $ low   : num  14.6 14.3 14.1 14.2 13.2 ...
##  $ close : num  14.8 14.5 14.3 14.7 14 ...
##  $ volume: int  8407500 8882000 8126000 10259500 31879900 15628000 11354400 14725200 11922100 6071400 ...
##  $ Name  : Factor w/ 505 levels "A","AAL","AAP",..: 2 2 2 2 2 2 2 2 2 2 ...

3.3 add missing day

lubridate::wday(stock$date, label = T)[1:7]
## [1] Fri Mon Tue Wed Thu Fri Tue
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

in order to make line graph, the date must be available sequentially but as we can see there are no date on Saturday and Sunday

so I decide to make a date range then fill the value with the previous value

in order to make the date range, we need to get the min and max date value

## [1] "min date: 2013-02-08"
## [1] "max date: 2018-02-07"

3.4 check empty value

##   date   open   high    low  close volume   Name 
##      0 303101 303098 303098 303090 303090      0
NA_df <- stock[!complete.cases(stock),]

head(NA_df)
##          date open high low close volume Name
## 2  2013-02-09   NA   NA  NA    NA     NA    A
## 3  2013-02-10   NA   NA  NA    NA     NA    A
## 9  2013-02-16   NA   NA  NA    NA     NA    A
## 10 2013-02-17   NA   NA  NA    NA     NA    A
## 11 2013-02-18   NA   NA  NA    NA     NA    A
## 16 2013-02-23   NA   NA  NA    NA     NA    A

3.4.1 Null data inspection

unique(weekdays(NA_df[1:20, "date"]))
## [1] "Saturday" "Sunday"   "Monday"   "Friday"

Turns out after I did a null data inspection, I got 2 problems on the dataframe: 1. After fill missing date, the value of open, high, low, and close columns for the corresponding added date is NA. 2. There are some value is missing on weekdays, this might be because on these days, there are some special events, for ex: international holiday.

After the inspection, I decided to use fill function to fill NA stock prices with price the day before

stock <- stock %>% 
  fill(open, high, low, close, volume, .direction="down")
anyNA(stock)
## [1] FALSE

There are no NA value on dataframe, so it ready to be analyzed

4 Data Summary

##       date                 open              high              low         
##  Min.   :2013-02-08   Min.   :   1.62   Min.   :   1.69   Min.   :   1.50  
##  1st Qu.:2014-05-10   1st Qu.:  40.25   1st Qu.:  40.64   1st Qu.:  39.86  
##  Median :2015-08-09   Median :  62.78   Median :  63.34   Median :  62.21  
##  Mean   :2015-08-09   Mean   :  82.91   Mean   :  83.69   Mean   :  82.15  
##  3rd Qu.:2016-11-08   3rd Qu.:  94.17   3rd Qu.:  94.96   3rd Qu.:  93.36  
##  Max.   :2018-02-07   Max.   :2044.00   Max.   :2067.99   Max.   :2035.11  
##                                                                            
##      close             volume               Name       
##  Min.   :   1.59   Min.   :        0   A      :  1826  
##  1st Qu.:  40.27   1st Qu.:  1069482   AAL    :  1826  
##  Median :  62.81   Median :  2087317   AAP    :  1826  
##  Mean   :  82.94   Mean   :  4390521   AAPL   :  1826  
##  3rd Qu.:  94.22   3rd Qu.:  4364796   ABBV   :  1826  
##  Max.   :2049.00   Max.   :618237630   ABC    :  1826  
##                                        (Other):911174

Insight:

  • This stock data is collected from February, 8th 2013 util February, 7th 2018

  • If we compare median and mean of all stock price (open, high, low, and close) respectively. It seems the value is pretty equal, this is because we generate statistic descriptive from all stocks.

  • Based on my research on investopedia.com, S&P 500 is one of most-widely watched index globally. But as we can see there’s one or some stock that have no volume transaction

  • The value of 1st quarter is increased over time to 3rd quarter across all stock prices

5 Business questions

In order to give investors insight about this index, I will make several business questions:

  1. What’s top 5 stocks in terms of price increase?
## # A tibble: 5 × 2
##   Name  price_increase
##   <fct> <chr>         
## 1 DE    $99.71        
## 2 COST  $99.46        
## 3 PNC   $98.39        
## 4 IBM   $97.95        
## 5 CELG  $97.14

Insight:

As we can see, top 5 stocks in price growth are: - Deere & Company (DE) - Costco Wholesale Corporation (COST) - Financial Services Group Inc (PNC) - IBM - Celgene Corporation (CELG)

  1. What’s the stock that had the highest volume for each year?
## # A tibble: 6 × 3
## # Groups:   year [6]
##    year Name  volume_sum  
##   <dbl> <fct> <chr>       
## 1  2013 XLNX  $997,577,275
## 2  2014 KSS   $995,961,962
## 3  2015 CMS   $998,214,711
## 4  2016 AET   $994,782,004
## 5  2017 DISH  $996,217,505
## 6  2018 AMT   $99,070,512

Insight:

The analysis result says that, Bank of America Corp (BAC) is the most stock traded in terms of volume. 5 times in a row from 2013 - 2017. But in 2018, General Electric Co (GE) took over that record

  1. What’s top 5 stocks for momentum on the last 10 days?

Cited from investopedia.com Momentum is one of indicator to measure rice or fall in stock price. This article also tell us that bull market (increasing stock price) is tend to last longer than bear market (falling stock price). This metric will help traders to decide weather they want to buy a stock or not using momentum.

The formula for momentum is:

\(V - V{x}\)

where: \(V\) = latest price \(V{x}\) = closing price \(x\) = number of days ago

src: https://www.investopedia.com/articles/technical/081501.asp#:~:text=Market%20momentum%20is%20measured%20by,plotted%20around%20a%20zero%20line.

last_date <- stock %>% 
  slice_tail %>% 
  select(date)

last_date
##         date
## 1 2018-02-07
closing_price_10_days_ago <- stock %>%
  filter(date == last_date$date-10) %>% 
  group_by(Name) %>%
  select(Name, close) %>% 
  arrange(-close) %>% 
  head(5)

closing_price_10_days_ago
## # A tibble: 5 × 2
## # Groups:   Name [5]
##   Name  close
##   <fct> <dbl>
## 1 PCLN  1950.
## 2 AMZN  1402.
## 3 GOOGL 1188.
## 4 GOOG  1176.
## 5 AZO    797.

Insight: - Booking Holdings Inc (PCLN) - Amazon (AMZN) - GOOGL (Google) - Alphabet Inc Class C (GOOG) - Autozone Inc (AZO)

Those stocks are top 5 in terms of momentum on thes last 10 days.

The interesting fact about this finding is GOOGL, AMZN, and GOOG are stocks in sector technology. So, in 2018 technology sector stocks are already traded frequently.