library("gdata")
library("dplyr")
library("stringr")
library("lubridate")
library("tidyr")
library("outliers")
library("forecast")
library("infotheo")
This assignment uses three datasets from Australian Securities Exchange (ASX). The first dataset has the ASX companies, their ASX Code and the industries they belong to, and the second and third datasets have the share price of these ASX companies at the last trading day of financial year 2018 and 2019.
In order to analyse the share price changes between 2018 and 2019, this assignment has taken the below steps:
1. Join the three datasets together which also includes activities like changing variable names and trimming the spaces of the key values, so that the datasets can be joined properly.
2. Data type conversion. As all the datasets are scraped from web directly, the data types are mostly of char or factor type. This step does the necessary conversion (e.g.: convert Share Price variable to numeric, Business Date variable to date type) and order the Industry Group factor variable.
3. Tidy & Manipulate Data. Remove duplicate variables due to the join from the merged dataset, filter the dataset to focus on “Ordinary Security” (stock shares) and mutate/calculate the delta price of the share between year 2018 and 2019.
4. Scan and handle missing value and inconsistency. As each ASX company and their share price can be very different, this assignment has to remove observations with missing share price value rather than imputing the price from other companies which won’t make sense as the financial performance of one company can’t be mutated from that of other companies. Also, observations with inconsistent business date need to be excluded, otherwise the delta price won’t reflect the price changes between the two years.
5. Scan and handle outliers. Scan each numeric variable and handle the outliers found.
6. Data transformation. Focusing on the variable of “Delta.Percent (share price change by percentage)”, this step handles the right skewness of the data and also discretise the data by using equal-frequency binning.
Data Source:
“Complete List + GICS Industry Group”. 2019. “https://www.marketindex.com.au/data-downloads”
“30 June 2019 - Closing Share Prices”. 2019. “https://www.marketindex.com.au/data-downloads”
“30 June 2018 - Closing Share Prices”. 2019. “https://www.marketindex.com.au/data-downloads”
Data Description:
1. Dataset “Complete List + GICS Industry Group”:
* Overall: Contains the ASX Listed Company Names, their ASX Code and the industries they belong to.
* Variable - “Company.Name”: Full Name of the ASX Listed company.
* Variable - “ASX.Code”: ASX Code of the company.
* Variable - “Industry.Group”: The industry group the ASX Company belongs to.
2. Dataset “30 June 2019 - Closing Share Prices.” and “30 June 2018 - Closing Share Prices.”
* Overall: The two datasets have the same structure/varibales and contain the closing share prices of financial year 2019 and 2018.
* Variable - “ASX.Code”: ASX Code of the company.
* Variable - “Security.Group.Code”: Code of the financial security group.
* Variable - “Issuer.Underlying.Issuer.Full.Name”: Full name of the company that issues the financial security.
* Variable - “Abbreviated.Product.Description”: Short description of the product/share/security.
* Variable - “Last.Price….”: Price of the last trading day of the financial year for the company’s product.
* Variable - “Business.Date”: Date of the last trading day of the financial year for the company’s product.
# Step 1: Scrape data from the data source. Using read.csv() from Base R to scrape online CSV file, and read.xls() from gdata package to scrape online XLS file.
url1 <- "https://www.asx.com.au/asx/research/ASXListedCompanies.csv"
ASXCompany <- read.csv(url1,skip = 2,strip.white = TRUE,stringsAsFactors = FALSE)
url2 <- "http://www.marketindex.com.au/sites/default/files/data-downloads/30-june-2018.xls"
ASX2018 <- read.xls(url2, sheet = 1, perl = "C:/Strawberry/perl/bin/perl.exe")
url3 <- "http://www.marketindex.com.au/sites/default/files/data-downloads/30-june-2019.xls"
ASX2019 <- read.xls(url3, sheet = 1, perl = "C:/Strawberry/perl/bin/perl.exe")
# Step 2: Rename some column names to make the following operation easier and prepare for the dataset join.
names(ASXCompany)[names(ASXCompany) == "ASX.code"] <- "ASX.Code"
names(ASXCompany)[names(ASXCompany) == "GICS.industry.group"] <- "Industry.Group"
names(ASXCompany)[names(ASXCompany) == "Company.name"] <- "Company.Name"
names(ASX2018)[names(ASX2018) == "Last.Price...."] <- "2018.Last.Price"
names(ASX2018)[names(ASX2018) == "Business.Date"] <- "2018.Business.Date"
names(ASX2018)[names(ASX2018) == "Abbreviated.Product.Description"] <- "Product.Description"
names(ASX2019)[names(ASX2019) == "Last.Price...."] <- "2019.Last.Price"
names(ASX2019)[names(ASX2019) == "Business.Date"] <- "2019.Business.Date"
names(ASX2019)[names(ASX2019) == "Abbreviated.Product.Description"] <- "Product.Description"
# step 3: Remove the leading and trailing space of varibale "ASX.Code" in the two datasets, so that it can be used as key to join with the "ASXCompany" dataset. Create/Muate a new variable of "Company.Code" to store the trimmed variable.
ASX2018 <- ASX2018 %>% mutate(Company.Code = str_trim(ASX.Code, side = "both"))
ASX2019 <- ASX2019 %>% mutate(Company.Code = str_trim(ASX.Code, side = "both"))
# Step 4: Join the three datasets to dataset "ASXALL" which contains data that come from all three datasets.
ASXAll <- ASXCompany %>% left_join(ASX2018, by=c("ASX.Code"="Company.Code")) %>% left_join(ASX2019, by=c("ASX.Code"="Company.Code"))
# Step 5: Provide outputs (head of data sets) as the assigment instruction requires.
head(ASXCompany)
head(ASX2018)
head(ASX2019)
head(ASXAll)
# Step 1: Convert "2018.Last.Price" and "2019.Last.Price" from factor to numeric type.
ASXAll$"2018.Last.Price" <- as.numeric(as.character(ASXAll$"2018.Last.Price"))
ASXAll$"2019.Last.Price" <- as.numeric(as.character(ASXAll$"2019.Last.Price"))
# Step 2: Convert "2018.Business.Date" and "2019.Business.Date" from factor to date type.
ASXAll$"2018.Business.Date" <- ymd(ASXAll$"2018.Business.Date")
ASXAll$"2019.Business.Date" <- ymd(ASXAll$"2019.Business.Date")
# Step 3: Convert "Industry.Group" from chr to ordered factor
ASXAll$"Industry.Group" <- factor(ASXAll$"Industry.Group", ordered = TRUE)
# Step 4: Issuer.Underlying.Issuer.Full.Name should be of char type
ASXAll$Issuer.Underlying.Issuer.Full.Name.x <- as.character(ASXAll$Issuer.Underlying.Issuer.Full.Name.x)
ASXAll$Issuer.Underlying.Issuer.Full.Name.y <- as.character(ASXAll$Issuer.Underlying.Issuer.Full.Name.y)
# Step 5: Check the types and structure of the dataset "ASXAll" after the above type conversion.
str(ASXAll)
## 'data.frame': 2191 obs. of 15 variables:
## $ Company.Name : chr "MOQ LIMITED" "1300 SMILES LIMITED" "1414 DEGREES LIMITED" "1ST GROUP LIMITED" ...
## $ ASX.Code : chr "MOQ" "ONT" "14D" "1ST" ...
## $ Industry.Group : Ord.factor w/ 26 levels "Automobiles & Components"<..: 22 12 3 12 5 4 19 17 9 15 ...
## $ ASX.Code.y : Factor w/ 5700 levels "1AD ","1AG ",..: 2920 3470 NA 5 4638 NA 4719 4851 4706 1499 ...
## $ Security.Group.Code.x : Factor w/ 6 levels "CNV","COY OPTION",..: 3 3 NA 3 3 NA 3 3 3 3 ...
## $ Issuer.Underlying.Issuer.Full.Name.x: chr " MOQ LIMITED " "1300 SMILES LIMITED " NA "1ST GROUP LIMITED " ...
## $ Product.Description.x : Factor w/ 3279 levels "0.01C PD 24.99C UNPD",..: 2721 2721 NA 2721 2721 NA 2735 2733 2721 2721 ...
## $ 2018.Last.Price : num 0.24 6.4 NA 0.038 0.002 ...
## $ 2018.Business.Date : Date, format: "2018-06-28" "2018-06-28" ...
## $ ASX.Code.y.y : Factor w/ 5912 levels "14D ","14DO ",..: 3032 3572 1 7 4751 NA 4823 4944 4809 1534 ...
## $ Security.Group.Code.y : Factor w/ 6 levels "CNV","COY OPTION",..: 3 3 3 3 3 NA 3 3 3 3 ...
## $ Issuer.Underlying.Issuer.Full.Name.y: chr " MOQ LIMITED " "1300 SMILES LIMITED " "1414 DEGREES LIMITED " "1ST GROUP LIMITED " ...
## $ Product.Description.y : Factor w/ 3458 levels " 5.5% LN 15-NOV-18",..: 3152 3152 3152 3152 3152 NA 2949 3224 3152 3152 ...
## $ 2019.Last.Price : num 0.195 6.35 0.295 0.068 0.001 NA 1.03 1.18 0.11 0.001 ...
## $ 2019.Business.Date : Date, format: "2019-06-28" "2019-06-28" ...
# Step 1: Check the dataset before the above operations.
ASXAll
# Step 2: Remove duplicate variables due to the earlier data join from the dataset
ASXAll <- ASXAll %>% select(-c("ASX.Code.y", "ASX.Code.y.y", "Security.Group.Code.y", "Issuer.Underlying.Issuer.Full.Name.y", "Product.Description.y"))
# Step 3: Rename some columns to make following operation easier
names(ASXAll)[names(ASXAll) == "Security.Group.Code.x"] <- "Security.Group.Code"
names(ASXAll)[names(ASXAll) == "Issuer.Underlying.Issuer.Full.Name.x"] <- "Issuer.Name"
names(ASXAll)[names(ASXAll) == "Product.Description.x"] <- "Product.Description"
# Step 4: Filter the dataset and keep only data for ordinary share (Product.Description = Ordinary) which is the focus of this report.
ASXAll <- ASXAll %>% filter(str_trim(Product.Description) == "ORDINARY")
# Step 5: Check the dataset after the above operations.
ASXAll
# Step 1: Mutate/Calculate the Delta Price and also the Delta Price in Percentage of the share between year 2018 and 2019. Delta.Price: the price changes between year 2019 and 2018; Delta.Percent: Share price change by percentage.
ASXAll <- ASXAll %>% mutate(Delta.Price = ASXAll$`2019.Last.Price` - ASXAll$`2018.Last.Price`)
ASXAll <- ASXAll %>% mutate(Delta.Percent = (ASXAll$`2019.Last.Price` - ASXAll$`2018.Last.Price`) / ASXAll$`2018.Last.Price`)
# Step 2: Check data output
ASXAll
Removed all the observations with missing values because the missing values found are only in the Price columns and if the price values are missing, it won’t make sense (won’t even be possible) to calculate the delta price betwee the two years. It also doesn’t make sense to impute the share price from either other shares or the other years of the same share as each company is unique and has different financial performance year over year, so observations with missing price value needs to be excluded.
Also removed the observations with share price value not come from the same but an earlier year (e.g.: “2019.Last.Price” has “Business Date” of an earlier year), because the main purpose here is to compare the share price performance between 2018 and 2019.
# Step 1: Scan and check if and how many missing values there are in each column
colSums(is.na(ASXAll))
## Company.Name ASX.Code Industry.Group
## 0 0 0
## Security.Group.Code Issuer.Name Product.Description
## 0 0 0
## 2018.Last.Price 2018.Business.Date 2019.Last.Price
## 3 3 11
## 2019.Business.Date Delta.Price Delta.Percent
## 11 13 13
# Step 2: Keep only the observations without missing values using complete.cases().
ASXAll <- ASXAll[complete.cases(ASXAll),]
# Step 3: Handle data inconsistency. The Last Price of the share in the year should be based on a date of the same year.
ASXAll <- ASXAll[(ASXAll$`2018.Business.Date` >= as.Date("2018-01-01") & ASXAll$`2019.Business.Date` >= as.Date("2019-01-01")),]
# Step 4: Check data output after above operations
ASXAll
z-score has been used to detect outliers, because we would like to have a nomalized distribution of data to evaluate the general financial growth of the ASX companies. Using z-score, it basically assures that the dataset has a standard deviation between 3 and -3, which means the data is neither too high of too low.
Also, the total number of outliers of all the 4 variables is 120 out of the dataset of 1585 observations. These companies may have experienced dramatic financial changes due to different unique and/or one-off reasons (e.g.: merge/aquisition, legal issue etc.) and may not reflect the ASX financial growth as a whole anyway. Therefore removing these outliers should be a safe practice and actually will improve the quality of this analysis.
# Step 1: Scan for outliers for all the four numeric variables using z-score.
ASXAll$`2018.Last.Price` %>% scores(type="z") %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2237 -0.2208 -0.2085 0.0000 -0.1314 17.7249
ASXAll$`2019.Last.Price` %>% scores(type="z") %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2219 -0.2199 -0.2102 0.0000 -0.1396 19.2588
ASXAll$Delta.Price %>% scores(type="z") %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -31.93385 -0.01362 0.00796 0.00000 0.01324 6.65249
ASXAll$Delta.Percent %>% scores(type="z") %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.38571 -0.21546 -0.11870 0.00000 -0.02625 22.78554
# Step 2: Remove the outliers according to the z-score result
ASXAll <- ASXAll[-which(abs(ASXAll$`2018.Last.Price` %>% scores(type = "z")) > 3),]
ASXAll <- ASXAll[-which(abs(ASXAll$`2019.Last.Price` %>% scores(type = "z")) > 3),]
ASXAll <- ASXAll[-which(abs(ASXAll$Delta.Price %>% scores(type = "z")) >3 ),]
ASXAll <- ASXAll[-which(abs(ASXAll$Delta.Percent %>% scores(type = "z")) > 3),]
# Step 3: Check data output
ASXAll
Reciprocal transformation is used because the data is not only right skewed, but also has negative and zero values, for which typical tranformation like log, square root won’t apply for.
Equal frequency binning is used to categorize the share price change to reflect the level of investment return of different shares.
# Step 1: Check variable Delta.Percent of the dataset, which is the main variable in the dataset that reflects the share price changes.
hist(ASXAll$Delta.Percent)
# Step 2: As the histogram shows the variable data is right skewed, use reciprocal transformation to reduce the right skewness.
hist(1/(ASXAll$Delta.Percent))
# Step 3: Bin/Discretize the variable of Delta.Percent with equal frequency binning.
Price_Binned <- ASXAll %>% select("Delta.Percent") %>% discretize(disc="equalfreq")
# Step 4: Check the variable after binning
ASXAll %>% select("Delta.Percent") %>% bind_cols(Price_Binned) %>% head(10)
# Step 5: Check data output
ASXAll