MATH2349 Semester 2, 2019

Required packages

library("gdata")
library("dplyr")
library("stringr")
library("lubridate")
library("tidyr")
library("outliers")
library("forecast")
library("infotheo")

Executive Summary

This assignment uses three datasets from Australian Securities Exchange (ASX). The first dataset has the ASX companies, their ASX Code and the industries they belong to, and the second and third datasets have the share price of these ASX companies at the last trading day of financial year 2018 and 2019.

In order to analyse the share price changes between 2018 and 2019, this assignment has taken the below steps:
1. Join the three datasets together which includes activities like changing variable names and trimming the spaces of the key values, so that the datasets can be joined properly.
2. Data type conversion. As all the datasets are scraped from web directly, the data types are mostly chat or factor. This step does the necessary conversion (e.g.: convert Share Price variable from to numeric, Business Date variable to date type) and order the Industry Group factor variable.
3. Tidy & Manipulate Data. Remove duplicate variables due to the join from the merged dataset, filter the dataset to focus on “Ordinary” security (stock shares) and mutate/calculate the delta price of the share between year 2018 and 2019.
4. Scan and handle missing value and inconsistency. As each ASX company and their share price can be very different, this assignment has to remove observations with missing share price value rather than imputing the price from other companies. Also, observations with inconsistent business date also need to be excluded, otherwise the delta price won’t reflect the price changes between the two years.
5. Scan and handle outliers. Scan each numeric variable and handle the outliers found.
6. Data transformation. Focusing on variable of Delta.Percent (share price change by percentage), this step handles the right skewness of the data and also discretise the data by equal-frequency binning.

Data

Data Source:
“Complete List + GICS Industry Group”. 2019. “https://www.marketindex.com.au/data-downloads”
“30 June 2019 - Closing Share Prices”. 2019. “https://www.marketindex.com.au/data-downloads”
“30 June 2018 - Closing Share Prices”. 2019. “https://www.marketindex.com.au/data-downloads”

Data Description:
1. Dataset “Complete List + GICS Industry Group”:
* Overall: Contains the ASX Listed Company Names, their ASX Code and the industries they belong to.
* Variable - “Company.Name”: Full Name of the ASX Listed company.
* Variable - “ASX.Code”: ASX Code of the company.
* Variable - “Industry.Group”: The industry group the ASX Company belongs to.
2. Dataset “30 June 2019 - Closing Share Prices.” and “30 June 2018 - Closing Share Prices.”
* Overall: The two datasets have the same structure/varibales and contain the closing share prices of financial year 2019 and 2018.
* Variable - “ASX.Code”: ASX Code of the company.
* Variable - “Security.Group.Code”: Code of the financial security group.
* Variable - “Issuer.Underlying.Issuer.Full.Name”: Full name of the company that issues the financial security.
* Variable - “Abbreviated.Product.Description”: Short description of the product/share/security.
* Variable - “Last.Price….”: Price of the last trading day of the financial year for the company’s product.
* Variable - “Business.Date”: Date of the last trading day of the financial year for the company’s product.

# Step 1: Scrape data from the data source. Using read.csv() from Base R to scrape online CSV file, and read.xls() from gdata package to scrape online XLS file.

url1 <- "https://www.asx.com.au/asx/research/ASXListedCompanies.csv"
ASXCompany <- read.csv(url1,skip = 2,strip.white = TRUE,stringsAsFactors = FALSE)

url2 <- "http://www.marketindex.com.au/sites/default/files/data-downloads/30-june-2018.xls"
ASX2018 <- read.xls(url2, sheet = 1, perl = "C:/Strawberry/perl/bin/perl.exe")

url3 <- "http://www.marketindex.com.au/sites/default/files/data-downloads/30-june-2019.xls"
ASX2019 <- read.xls(url3, sheet = 1, perl = "C:/Strawberry/perl/bin/perl.exe")

# Step 2: Rename some column names to make the following operation easier and prepare for the dataset join.
names(ASXCompany)[names(ASXCompany) == "ASX.code"] <- "ASX.Code"
names(ASXCompany)[names(ASXCompany) == "GICS.industry.group"] <- "Industry.Group"
names(ASXCompany)[names(ASXCompany) == "Company.name"] <- "Company.Name"

names(ASX2018)[names(ASX2018) == "Last.Price...."] <- "2018.Last.Price"
names(ASX2018)[names(ASX2018) == "Business.Date"] <- "2018.Business.Date"
names(ASX2018)[names(ASX2018) == "Abbreviated.Product.Description"] <- "Product.Description"

names(ASX2019)[names(ASX2019) == "Last.Price...."] <- "2019.Last.Price"
names(ASX2019)[names(ASX2019) == "Business.Date"] <- "2019.Business.Date"
names(ASX2019)[names(ASX2019) == "Abbreviated.Product.Description"] <- "Product.Description"

# step 3: Remove the leading and trailing space of varibale "ASX.Code" in the two datasets, so that it can be used as key to join with the "ASXCompany" dataset. Create/Muate a new variable of "Company.Code" to store the trimmed variable.
ASX2018 <- ASX2018 %>% mutate(Company.Code = str_trim(ASX.Code, side = "both"))
ASX2019 <- ASX2019 %>% mutate(Company.Code = str_trim(ASX.Code, side = "both"))

# Step 4: Join the three datasets to dataset "ASXALL" which contains data that come from all three datasets.
ASXAll <- ASXCompany %>% left_join(ASX2018, by=c("ASX.Code"="Company.Code")) %>% left_join(ASX2019, by=c("ASX.Code"="Company.Code"))

# Step 5: Provide outputs (head of data sets) as the assigment instruction requires.
head(ASXCompany)

head(ASX2018)

head(ASX2019)

head(ASXAll)

Understand

# Step 1: Convert "2018.Last.Price" and "2019.Last.Price" from factor to numeric type.
ASXAll$"2018.Last.Price" <- as.numeric(as.character(ASXAll$"2018.Last.Price"))

## Warning: NAs introduced by coercion

ASXAll$"2019.Last.Price" <- as.numeric(as.character(ASXAll$"2019.Last.Price"))

## Warning: NAs introduced by coercion

# Step 2: Convert "2018.Business.Date" and "2019.Business.Date" from factor to date type.
ASXAll$"2018.Business.Date" <- ymd(ASXAll$"2018.Business.Date")

## Warning: 6 failed to parse.

ASXAll$"2019.Business.Date" <- ymd(ASXAll$"2019.Business.Date")

## Warning: 17 failed to parse.

# Step 3: Convert "Industry.Group" from chr to ordered factor
ASXAll$"Industry.Group" <- factor(ASXAll$"Industry.Group", ordered = TRUE)

# Step 4: Check the types and structure of the dataset "ASXAll" after the above type conversion.
str(ASXAll)

## 'data.frame':    2188 obs. of  15 variables:
##  $ Company.Name                        : chr  "MOQ LIMITED" "1300 SMILES LIMITED" "1414 DEGREES LIMITED" "1ST GROUP LIMITED" ...
##  $ ASX.Code                            : chr  "MOQ" "ONT" "14D" "1ST" ...
##  $ Industry.Group                      : Ord.factor w/ 26 levels "Automobiles & Components"<..: 22 12 3 12 5 19 17 9 15 7 ...
##  $ ASX.Code.y                          : Factor w/ 5700 levels "1AD   ","1AG   ",..: 2920 3470 NA 5 4638 4719 4851 4706 1499 7 ...
##  $ Security.Group.Code.x               : Factor w/ 6 levels "CNV","COY OPTION",..: 3 3 NA 3 3 3 3 3 3 3 ...
##  $ Issuer.Underlying.Issuer.Full.Name.x: Factor w/ 2302 levels " ISHARES S&P MID-CAP ETF                                    ",..: 2 4 NA 5 6 7 8 9 10 11 ...
##  $ Product.Description.x               : Factor w/ 3279 levels "0.01C PD 24.99C UNPD",..: 2721 2721 NA 2721 2721 2735 2733 2721 2721 2721 ...
##  $ 2018.Last.Price                     : num  0.24 6.4 NA 0.038 0.002 ...
##  $ 2018.Business.Date                  : Date, format: "2018-06-28" "2018-06-28" ...
##  $ ASX.Code.y.y                        : Factor w/ 5912 levels "14D   ","14DO  ",..: 3032 3572 1 7 4751 4823 4944 4809 1534 11 ...
##  $ Security.Group.Code.y               : Factor w/ 6 levels "CNV","COY OPTION",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Issuer.Underlying.Issuer.Full.Name.y: Factor w/ 2416 levels " MOQ LIMITED                                                ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Product.Description.y               : Factor w/ 3458 levels " 5.5%   LN 15-NOV-18",..: 3152 3152 3152 3152 3152 2949 3224 3152 3152 3152 ...
##  $ 2019.Last.Price                     : num  0.195 6.35 0.295 0.068 0.001 1.03 1.18 0.11 0.001 0.98 ...
##  $ 2019.Business.Date                  : Date, format: "2019-06-28" "2019-06-28" ...

Tidy & Manipulate Data I

# Step 1: Check the dataset before the above operations.
ASXAll

# Step 2: Remove duplicate variables due to the earlier data join from the dataset
ASXAll <- ASXAll %>% select(-c("ASX.Code.y", "ASX.Code.y.y", "Security.Group.Code.y", "Issuer.Underlying.Issuer.Full.Name.y", "Product.Description.y"))

# Step 3: Rename some columns to make following operation easier
names(ASXAll)[names(ASXAll) == "Security.Group.Code.x"] <- "Security.Group.Code"
names(ASXAll)[names(ASXAll) == "Issuer.Underlying.Issuer.Full.Name.x"] <- "Issuer.Name"
names(ASXAll)[names(ASXAll) == "Product.Description.x"] <- "Product.Description"

# Step 4: Filter the dataset and keep only data for ordinary share (Product.Description = Ordinary)
ASXAll <- ASXAll %>% filter(str_trim(Product.Description) == "ORDINARY")

# Step 5: Check the dataset after the above operations.
ASXAll

As displayed above, the data now conforms with the tidy data principle that each variable has its column, each observation has its row and each value has its cell.

Tidy & Manipulate Data II

# Step 1: Mutate/Calculate the Delta Price of the share between year 2018 and 2019. Delta.Price: the price changes between year 2019 and 2018; Delta.Percent: Share price change by percentage.
ASXAll <- ASXAll %>% mutate(Delta.Price = ASXAll$`2019.Last.Price` - ASXAll$`2018.Last.Price`)
ASXAll <- ASXAll %>% mutate(Delta.Percent = (ASXAll$`2019.Last.Price` - ASXAll$`2018.Last.Price`) / ASXAll$`2018.Last.Price`)

# Step 2: Check data output
ASXAll

Scan I

# Step 1: Scan and check if and how many missing values there are in each column
colSums(is.na(ASXAll))

##        Company.Name            ASX.Code      Industry.Group 
##                   0                   0                   0 
## Security.Group.Code         Issuer.Name Product.Description 
##                   0                   0                   0 
##     2018.Last.Price  2018.Business.Date     2019.Last.Price 
##                   3                   3                  11 
##  2019.Business.Date         Delta.Price       Delta.Percent 
##                  11                  13                  13

# Step 2: Keep only the observations without missing values using complete.cases(). 
ASXAll <- ASXAll[complete.cases(ASXAll),]

# Step 3: Handle data inconsistency. The Last Price of the share in the year should be based on a date of the same year. 
ASXAll <- ASXAll[(ASXAll$`2018.Business.Date` >= as.Date("2018-01-01") & ASXAll$`2019.Business.Date` >= as.Date("2019-01-01")),]

# Step 4: Check data output
ASXAll

Removed all the observations with missing values because the missing values found are only in the Price columns and if the price values are missing, it won’t make sense (won’t even be possible) to calculate the delta price betwee the two years. It also doesn’t make sense to impute the share price from either other shares or the other years of the same share as each company is unique and has different financial performance year over year, so observations with missing price value needs to be excluded.
Also removed the observations with share price value not come from the same but an earlier year (e.g.: “2019.Last.Price” has “Business Date” of an earlier year), because the main purpose here is to compare the share price performance between 2018 and 2019.

Scan II

# Step 1: Scan for outliers for all the four numeric variables using z-score.
ASXAll$`2018.Last.Price` %>% scores(type="z") %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2235 -0.2207 -0.2084  0.0000 -0.1312 17.7359

ASXAll$`2019.Last.Price` %>% scores(type="z") %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2218 -0.2198 -0.2101  0.0000 -0.1394 19.2706

ASXAll$Delta.Price %>% scores(type="z") %>% summary()

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -31.95401  -0.01340   0.00796   0.00000   0.01324   6.65668

ASXAll$Delta.Percent %>% scores(type="z") %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3858 -0.2156 -0.1186  0.0000 -0.0259 22.7995

# Step 2: Remove the outliers according to the z-score result
ASXAll <- ASXAll[-which(abs(ASXAll$`2018.Last.Price` %>% scores(type = "z")) > 3),]
ASXAll <- ASXAll[-which(abs(ASXAll$`2019.Last.Price` %>% scores(type = "z")) > 3),]
ASXAll <- ASXAll[-which(abs(ASXAll$Delta.Price %>% scores(type = "z")) >3 ),]
ASXAll <- ASXAll[-which(abs(ASXAll$Delta.Percent %>% scores(type = "z")) > 3),]

# Step 3: Check data output
ASXAll

z-score has been used to detect outliers, because we would like to have a nomalized distribution of data. Using z-score, it basically assures that the dataset has a standard deviation between 3 and -3, which means the data is neither too high of too low.

Transform

# Step 1: Check variable Delta.Percent of the dataset, which is the main variable in the dataset that reflects the share price changes.
hist(ASXAll$Delta.Percent)

# Step 2: As the histogram shows the variable data is right skewed, use reciprocal transformation to reduce the right skewness.
hist(1/(ASXAll$Delta.Percent))

# Step 3: Bin/Discretize the variable of Delta.Percent with equal frequency binning.
Price_Binned <- ASXAll %>% select("Delta.Percent") %>% discretize(disc="equalfreq")

# Step 4: Check the variable after binning
ASXAll %>% select("Delta.Percent") %>% bind_cols(Price_Binned) %>% head(10)

# Step 5: Check data output
ASXAll

Reciprocal transformation is used because the data is not only right skewed, but also has negative and zero values, for which typical tranformation like log, square root won’t apply for.
Equal frequency binning is used to categorize the share price change to reflect the level of investment return of different shares.

NOTE: Follow the order outlined above in the report as possible as you can. Note that sometimes the order of the tasks may be different than the order given here. Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )